10.6. Congestion Management

Congestion management is an important component of the input frame-processing task. An overloaded CPU can become unstable and introduce a big latency into the system. The section "Interrupts" in Chapter 9 explained why the interrupts generated by a high load can cripple the system. For this reason, congestion management mechanisms are needed to make sure the system's stability is not compromised under high network load. Common ways to reduce the CPU load under high traffic loads include:


Reducing the number of interrupts if possible

This is accomplished by coding drivers either to process several frames with a single interrupt (see the section "Processing Multiple Frames During an Interrupt" in Chapter 9), or to use NAPI.


Discarding frames as early as possible in the ingress path

If code knows that a frame is going to be dropped by higher layers, it can save CPU time by dropping the frame quickly. For instance, if a device driver knew that the ingress queue was full, it could drop a frame right away instead of relaying it to the kernel and having the latter drop it.

The second point is what we cover in this section.

A similar optimization applies to the egress path: if a device driver does not have the resources to accept new frames for transmission (that is, if the device is out of memory), it would be a waste of CPU time to have the kernel push new frames down to the driver. This point is discussed in Chapter 11 in the section "Enabling and Disabling Transmissions."

In both cases, reception and transmission, the kernel provides a set of functions to set, clear, and retrieve the status of the receive and transmit queues, which allows device drivers (on reception) and the core kernel (on transmission) to perform the optimizations just mentioned.
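
On the transmit side, for instance, the usual pattern is built around netif_stop_queue, netif_queue_stopped, and netif_wake_queue, which are covered in Chapter 11. Here is a minimal sketch of that pattern; the my_priv structure and the TX-ring helpers are made-up names used only for illustration:

    /* Hedged sketch of the common transmit-side pattern (not from any real driver). */
    static int my_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        struct my_priv *priv = netdev_priv(dev);

        my_put_frame_on_tx_ring(priv, skb);

        if (my_tx_ring_full(priv))
            netif_stop_queue(dev);    /* tell the kernel to stop pushing frames */

        return 0;
    }

    /* Called from the TX-completion interrupt handler. */
    static void my_tx_complete(struct net_device *dev)
    {
        struct my_priv *priv = netdev_priv(dev);

        my_reclaim_tx_descriptors(priv);

        if (netif_queue_stopped(dev) && !my_tx_ring_full(priv))
            netif_wake_queue(dev);    /* resources are available again */
    }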

A good indication of the congestion level is the number of frames that have been received and are waiting to be processed. When a device driver uses NAPI, it is up to the driver to implement any congestion control mechanism. This is because ingress frames are kept in the NIC's memory or in the receive ring managed by the driver, and the kernel cannot keep track of traffic congestion. In contrast, when a device driver does not use NAPI, frames are added to per-CPU queues (softnet_data->input_pkt_queue) and the kernel keeps track of the congestion level of the queues. In this section, we cover this latter case.

Queueing theory is a complex topic, and this book is not the place for the mathematical details. I will content myself with one simple point: the current number of frames in the queue does not necessarily represent the real congestion level. An average queue length is a better guide to the queue's status. Keeping track of the average keeps the system from wrongly classifying a burst of traffic as congestion. In the Linux network stack, average queue length is reported by two fields of the softnet_data structure, cng_level and avg_blog, that were introduced in "softnet_data Structure" in Chapter 9.
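
As a reminder, the fields of softnet_data involved in this section look roughly like the following (a simplified excerpt; see Chapter 9 for the full definition):

    struct softnet_data {
        int                 throttle;        /* frame reception disabled for this CPU */
        int                 cng_level;       /* congestion level returned by netif_rx */
        int                 avg_blog;        /* average length of input_pkt_queue */
        struct sk_buff_head input_pkt_queue; /* shared ingress queue (non-NAPI drivers) */
        ... ... ...
    };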

Being an average, avg_blog could be both bigger and smaller than the length of input_pkt_queue at any time. The former represents recent history and the latter represents the present situation. Because of that, they are used for two different purposes:

  • By default, every time a frame is queued into input_pkt_queue, avg_blog is updated and an associated congestion level is computed and saved into cng_level. The latter is used as the return value of netif_rx, so the device driver that called this function is given feedback about the queue's status and can change its behavior accordingly.

  • The number of frames in input_pkt_queue cannot exceed a maximum size (netdev_max_backlog). When that size is reached, subsequent frames are dropped because the CPU is clearly overwhelmed.

Let's go back to the computation and use of the congestion level. avg_blog and cng_level are updated inside get_sample_stats, which is called by netif_rx.

At the moment, few device drivers use the feedback from netif_rx. The most common use of this feedback is to update statistics local to the device drivers. For a more interesting use of the feedback, see drivers/net/tulip/de2104x.c: when netif_rx returns NET_RX_DROP, a local variable drop is set to 1, which causes the main loop to start dropping the frames in the receive ring instead of processing them.
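
The pattern looks roughly like the following sketch; the names (my_rx_loop, my_fetch_frame, and so on) are made up for illustration, and the real code is in de2104x.c:

    /* Hedged sketch of a non-NAPI receive loop acting on netif_rx() feedback. */
    static void my_rx_loop(struct my_priv *priv)
    {
        int drop = 0;

        while (my_rx_ring_has_frames(priv)) {
            struct sk_buff *skb = my_fetch_frame(priv);

            if (drop) {
                /* The backlog is full: free the frame right away instead of
                 * handing it to the kernel, which would drop it anyway. */
                dev_kfree_skb_irq(skb);
                continue;
            }

            if (netif_rx(skb) == NET_RX_DROP)
                drop = 1;    /* drop the rest of the ring on this pass */
        }
    }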

So long as the ingress queue input_pkt_queue is not full, it is the job of the device driver to use the feedback from netif_rx to handle congestion. When the situation gets worse and the input queue fills up, the kernel comes into play and uses the softnet_data->throttle flag to disable frame reception for the CPU. (Remember that there is a softnet_data structure for each CPU.)

10.6.1. Congestion Management in netif_rx

Let's go back to netif_rx and look at some of the code that was omitted from the previous section of this chapter. The following two excerpts include some of the code shown previously, along with new code that shows when a CPU is placed in the throttle state.

    if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
        if (queue->input_pkt_queue.qlen) {
            if (queue->throttle)
                goto drop;
            ... ... ...
            return queue->cng_level;
        }
        ... ...    ...
    }

    if (!queue->throttle) {
        queue->throttle = 1;
        __get_cpu_var(netdev_rx_stat).throttled++;
    }

softnet_data->throttle is cleared when the queue empties. To be exact, it is cleared by netif_rx when the first frame is queued into an empty queue. It can also be cleared in process_backlog, as we will see in the section "Backlog Processing: The process_backlog Poll Virtual Function."
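
The relevant logic in netif_rx looks roughly like this (a simplified sketch of the 2.6 sources):

    if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
        if (!queue->input_pkt_queue.qlen) {             /* queue is empty */
            if (queue->throttle)
                queue->throttle = 0;                    /* lift the throttle */
            netif_rx_schedule(&queue->backlog_dev);     /* schedule backlog processing */
            /* ... the frame is then queued as usual ... */
        }
    }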

10.6.2. Average Queue Length and Congestion-Level Computation

The values of avg_blog and cng_level are always updated within get_sample_stats, which can be invoked in two different ways:

  • Every time a new frame is received (netif_rx). This is the default.

  • With a periodic timer. To use this technique, one has to define the OFFLINE_SAMPLE symbol; that is why, in netif_rx, the execution of get_sample_stats depends on whether OFFLINE_SAMPLE is defined. This option is disabled by default (see the sketch after this list).
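
When OFFLINE_SAMPLE is defined, the sampling is driven by a kernel timer along the following lines (a simplified sketch; samp_timer is assumed to be a periodic timer declared in net/core/dev.c):

    #ifdef OFFLINE_SAMPLE
    static void sample_queue(unsigned long dummy)
    {
        int next_tick = 1;                  /* re-arm the timer at the next jiffy */
        int cpu = smp_processor_id();

        get_sample_stats(cpu);              /* update avg_blog and cng_level */
        next_tick += jiffies;
        mod_timer(&samp_timer, next_tick);
    }
    #endif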

The first approach ends up running get_sample_stats more often than the second approach under medium and high traffic load.

In both cases, the formula used to compute avg_blog should be simple and quick, because it could be invoked frequently. The formula used takes into account the recent history and the present:

new_value_for_avg_blog = (old_value_of_avg_blog + current_value_of_queue_len) / 2

How much to weight the present and the past is not a simple problem. The preceding formula can adapt quickly to changes in the congestion level, since the past (the old value) is given only 50% of the weight and the present the other 50%.

get_sample_stats also updates cng_level, basing it on avg_blog through the mapping shown earlier in Figure 9-4 in Chapter 9. If the RAND_LIE symbol is defined, the function performs an extra operation in which it can randomly decide to set cng_level one level higher. This random adjustment requires more time to calculate but, oddly enough, can cause the kernel to perform better under one specific scenario.
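
Putting the pieces together, get_sample_stats does roughly the following (a simplified sketch of the 2.6 sources; no_cong, lo_cong, and mod_cong are the thresholds behind the mapping of Figure 9-4):

    static void get_sample_stats(int cpu)
    {
        struct softnet_data *sd = &per_cpu(softnet_data, cpu);
        int blog = sd->input_pkt_queue.qlen;
        int avg_blog = sd->avg_blog;

        /* 50% past, 50% present: the formula shown above, done with shifts. */
        avg_blog = (avg_blog >> 1) + (blog >> 1);

        if (avg_blog > mod_cong) {
            sd->cng_level = NET_RX_CN_HIGH;
    #ifdef RAND_LIE
            /* Randomly lie one level higher: the fuller the queue,
             * the more likely the lie. */
            if (net_random() % netdev_max_backlog < avg_blog)
                sd->cng_level = NET_RX_DROP;
    #endif
        } else if (avg_blog > lo_cong) {
            sd->cng_level = NET_RX_CN_MOD;
    #ifdef RAND_LIE
            if (net_random() % netdev_max_backlog < avg_blog)
                sd->cng_level = NET_RX_CN_HIGH;
    #endif
        } else if (avg_blog > no_cong)
            sd->cng_level = NET_RX_CN_LOW;
        else
            sd->cng_level = NET_RX_SUCCESS;     /* no congestion */

        sd->avg_blog = avg_blog;
    }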

Let's spend a few more words on the benefits of random lies. Do not confuse this behavior with Random Early Detection (RED).

In a system with only one interface, it does not really make sense to drop random frames here and there if there is no congestion; it would simply lower the throughput. But let's suppose we have multiple interfaces sharing an input queue and one device with a traffic load much higher than the others. Since the greedy device fills the shared ingress queue faster than the other devices, the latter will often find no space in the ingress queue and therefore their frames will be dropped.[*] The greedy device will also see some of its frames dropped, but not proportionally to its load. When a system with multiple interfaces experiences congestion, it should drop ingress frames across all the devices proportionally to their loads. The RAND_LIE code adds some fairness when used in this context: dropping extra frames randomly should end up dropping them proportionally to the load.

[*] When sharing a queue, it is up to the users to behave fairly with others, but that's not always possible. NAPI does not encounter this problem because each device using NAPI has its own queue. However, non-NAPI drivers still using the shared input queue input_pkt_queue have to live with the possibility of overloading by other devices.

