10.6. Congestion ManagementCongestion management is an important component of the input frame-processing task. An overloaded CPU can become unstable and introduce a big latency into the system. The section "Interrupts" in Chapter 9 explained why the interrupts generated by a high load can cripple the system. For this reason, congestion management mechanisms are needed to make sure the system's stability is not compromised under high network load. Common ways to reduce the CPU load under high traffic loads include:
The second point is what we cover in this section. A similar optimization applies to the egress path: if a device driver does not have resources to accept new frames for transmission (that is, if the device is out of memory), it would be a waste of CPU time to have the kernel pushing new frames down to the driver for transmission. This point is discussed in Chapter 11 in the section "Enabling and Disabling Transmissions." In both cases, reception and transmission, the kernel provides a set of functions to set, clear, and retrieve the status of the receive and transmit queues, which allows device drivers (on reception) and the core kernel (on transmission) to perform the optimizations just mentioned. A good indication of the congestion level is the number of frames that have been received and are waiting to be processed. When a device driver uses NAPI, it is up to the driver to implement any congestion control mechanism. This is because ingress frames are kept in the NIC's memory or in the receive ring managed by the driver, and the kernel cannot keep track of traffic congestion. In contrast, when a device driver does not use NAPI, frames are added to per-CPU queues (softnet_data->input_pkt_queue) and the kernel keeps track of the congestion level of the queues. In this section, we cover this latter case. Queue theory is a complex topic, and this book is not the place for the mathematical details. I will content myself with one simple point: the current number of frames in the queue does not necessarily represent the real congestion level. An average queue length is a better guide to the queue's status. Keeping track of the average keeps the system from wrongly classifying a burst of traffic as congestion. In the Linux network stack, average queue length is reported by two fields of the softnet_data structure, cng_level and avg_blog, that were introduced in "softnet_data Structure" in Chapter 9. Being an average, avg_blog could be both bigger and smaller than the length of input_pkt_queue at any time. The former represents recent history and the latter represents the present situation. Because of that, they are used for two different purposes:
Let's go back to the computation and use of the congestion level. avg_blog and cng_level are updated inside get_sample_stats, which is called by netif_rx. At the moment, few device drivers use the feedback from netif_rx. The most common use of this feedback is to update statistics local to the device drivers. For a more interesting use of the feedback, see drivers/net/tulip/de2104x.c: when netif_rx returns NET_RX_DROP, a local variable drop is set to 1, which causes the main loop to start dropping the frames in the receive ring instead of processing them. So long as the ingress queue input_pkt_queue is not full, it is the job of the device driver to use the feedback from netif_rx to handle congestion. When the situation gets worse and the input queue fills in, the kernel comes into play and uses the softnet_data->throttle flag to disable frame reception for the CPU. (Remember that there is a softnet_data structure for each CPU.) 10.6.1. Congestion Management in netif_rxLet's go back to netif_rx and look at some of the code that was omitted from the previous section of this chapter. The following two excerpts include some of the code shown previously, along with new code that shows when a CPU is placed in the throttle state. if (queue->input_pkt_queue.qlen <= netdev_max_backlog) { if (queue->input_pkt_queue.qlen) { if (queue->throttle) goto drop; ... ... ... return queue->cng_level; } ... ... ... } if (!queue->throttle) { queue->throttle = 1; _ _get_cpu_var(netdev_rx_stat).throttled++; } softnet_data->throttle is cleared when the queue gets empty. To be exact, it is cleared by netif_rx when the first frame is queued into an empty queue. It could also happen in process_backlog, as we will see in the section "Backlog Processing: The process_backlog Poll Virtual Function." 10.6.2. Average Queue Length and Congestion-Level ComputationThe value of avg_blog and cng_level is always updated within get_sample_stats. The latter can be invoked in two different ways:
The first approach ends up running get_sample_stats more often than the second approach under medium and high traffic load. In both cases, the formula used to compute avg_blog should be simple and quick, because it could be invoked frequently. The formula used takes into account the recent history and the present: new_value_for_avg_blog = (old_value_of_avg_blog + current_value_of_queue_len) / 2 How much to weight the present and the past is not a simple problem. The preceding formula can adapt quickly to changes in the congestion level, since the past (the old value) is given only 50% of the weight and the present the other 50%. get_sample_stats also updates cng_level, basing it on avg_blog through the mapping shown earlier in Figure 9-4 in Chapter 9. If the RAND_LIE symbol is defined, the function performs an extra operation in which it can randomly decide to set cng_level one level higher. This random adjustment requires more time to calculate but, oddly enough, can cause the kernel to perform better under one specific scenario. Let's spend a few more words on the benefits of random lies. Do not confuse this behavior with Random Early Detection (RED). In a system with only one interface, it does not really make sense to drop random frames here and there if there is no congestion; it would simply lower the throughput. But let's suppose we have multiple interfaces sharing an input queue and one device with a traffic load much higher than the others. Since the greedy device fills the shared ingress queue faster than the other devices, the latter will often find no space in the ingress queue and therefore their frames will be dropped.[*] The greedy device will also see some of its frames dropped, but not proportionally to its load. When a system with multiple interfaces experiences congestion, it should drop ingress frames across all the devices proportionally to their loads. The RAND_LIE code adds some fairness when used in this context: dropping extra frames randomly should end up dropping them proportionally to the load.
|