
10.4. Notifying the Kernel of Frame Reception: NAPI and netif_rx

In version 2.5 (then backported to a late revision of 2.4 as well), a new API for handling ingress frames was introduced into the Linux kernel, known (for lack of a better name) as NAPI. Since few devices have been upgraded to NAPI, there are two ways a Linux driver can notify the kernel about a new frame:


By means of the old function netif_rx

This is the approach used by those devices that follow the technique described in the section "Processing Multiple Frames During an Interrupt" in Chapter 9. Most Linux device drivers still use this approach.


By means of the NAPI mechanism

This is the approach used by those devices that follow the technique described in the variation introduced at the end of the section "Processing Multiple Frames During an Interrupt" in Chapter 9. This is new in the Linux kernel, and only a few drivers use it. drivers/net/tg3.c was the first one to be converted to NAPI.

A few device drivers allow you to choose between the two types of interfaces when you configure the kernel options with tools such as make xconfig.

The following piece of code comes from vortex_rx, which still uses the old function netif_rx, and you can expect most of the network device drivers not yet using NAPI to do something similar:

    skb = dev_alloc_skb(pkt_len + 5);
        ... ... ...
    if (skb != NULL) {
        skb->dev = dev;
        skb_reserve(skb, 2);    /* Align IP on 16 byte boundaries */
                ... ... ...
                /* copy the DATA into the sk_buff structure */
                ... ... ...
        skb->protocol = eth_type_trans(skb, dev);
        netif_rx(skb);
        dev->last_rx = jiffies;
            ... ... ...
    }

First, the sk_buff data structure is allocated with dev_alloc_skb (see Chapter 2), and the frame is copied into it. Note that before copying, the code reserves two bytes to align the IP header to a 16-byte boundary. Each network device driver is associated with a given interface type; for instance, the Vortex device driver drivers/net/3c59x.c is associated with a specific family of Ethernet cards. Therefore, the driver knows the length of the link layer's header and how to interpret it. Given a header length of 16*k+n, the driver can force an alignment to a 16-byte boundary by simply calling skb_reserve with an offset of 16-n. An Ethernet header is 14 bytes, so k=0, n=14, and the offset requested by the code is 2 (see the definition of NET_IP_ALIGN and the associated comment in include/linux/skbuff.h).
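
As a quick illustration of that arithmetic, here is a minimal sketch (not code from the driver) that computes the required headroom for an arbitrary link layer header length; for Ethernet it yields 2, the value of NET_IP_ALIGN.

    /* Sketch only: headroom needed so that the L3 header, which follows a
     * link-layer header of hdr_len = 16*k + n bytes, starts on a 16-byte
     * boundary. For Ethernet, hdr_len = 14, so the result is 2 (NET_IP_ALIGN). */
    static unsigned int l3_align_headroom(unsigned int hdr_len)
    {
        unsigned int n = hdr_len % 16;

        return n ? 16 - n : 0;
    }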

Note also that at this stage, the driver does not make any distinction between different L3 protocols. It aligns the L3 header to a 16-byte boundary regardless of the type. The L3 protocol is probably IP because of IP's widespread usage, but that is not guaranteed at this point; it could be NetWare's IPX or something else. The alignment is useful regardless of the L3 protocol to be used.

eth_type_trans, which is used to extract the protocol identifier skb->protocol, is described in Chapter 13.[*]

[*] Different device types use different functions; for instance, eth_type_trans is used by Ethernet devices and tr_type_trans by Token Ring interfaces.

Depending on the complexity of the driver's design, the block shown may be followed by other housekeeping tasks, but we are not interested in those details in this book. The most important part of the function is the notification to the kernel about the frame's reception.

10.4.1. Introduction to the New API (NAPI)

Even though most NIC device drivers have not yet been converted to NAPI, the new infrastructure has been integrated into the kernel, and even the interface between netif_rx and the rest of the kernel has to take NAPI into account. Instead of introducing the old approach (pure netif_rx) first and then talking about NAPI, we will first see NAPI and then show how the old drivers keep their old interface (netif_rx) while sharing some of the new infrastructure mechanisms.

NAPI mixes interrupts with polling and gives higher performance under high traffic load than the old approach, by significantly reducing the load on the CPU. The kernel developers backported that infrastructure to the 2.4 kernels.

In the old model, a device driver generates an interrupt for each frame it receives. Under a high traffic load, the time spent handling interrupts can lead to a considerable waste of resources.

The main idea behind NAPI is simple: instead of using a pure interrupt-driven model, it uses a mix of interrupts and polling. If new frames are received when the kernel has not finished handling the previous ones yet, there is no need for the driver to generate other interrupts: it is just easier to have the kernel keep processing whatever is in the device input queue (with interrupts disabled for the device), and re-enable interrupts once the queue is empty. This way, the driver reaps the advantages of both interrupts and polling:

  • Asynchronous events, such as the reception of one or more frames, are indicated by interrupts so that the kernel does not have to check continuously if the device's ingress queue is empty.

  • If the kernel knows there is something left in the device's ingress queue, there is no need to waste time handling interrupt notifications. Simple polling is enough.

From the kernel processing point of view, here are some of the advantages of the NAPI approach:


Reduced load on the CPU (because there are fewer interrupts)

Given the same workload (i.e., number of frames per second), the load on the CPU is lower with NAPI. This is especially true at high workloads. At low workloads, you may actually have slightly higher CPU usage with NAPI, according to tests posted by the kernel developers on the kernel mailing list.


More fairness in the handling of devices

We will see later how devices that have something in their ingress queues are accessed fairly in a round-robin fashion. This ensures that devices with low traffic can experience acceptable latencies even when other devices are much more loaded.

10.4.2. net_device Fields Used by NAPI

Before looking at NAPI's implementation and use, I need to describe a few fields of the net_device data structure, mentioned in the section "softnet_data Structure" in Chapter 9.

Four new fields have been added to this structure for use by the NET_RX_SOFTIRQ softirq when dealing with devices whose drivers use the NAPI interface. The other devices do not use these fields directly, but they share them indirectly through the net_device structure embedded in the softnet_data structure as its backlog_dev field.


poll

A virtual function used to dequeue buffers from the device's ingress queue. The queue is a private one for devices using NAPI, and softnet_data->input_pkt_queue for others. See the section "Backlog Processing: The process_backlog Poll Virtual Function."


poll_list

List of devices that have new frames in the ingress queue waiting to be processed. These devices are known as being in polling state. The head of the list is softnet_data->poll_list. Devices in this list have interrupts disabled and the kernel is currently polling them.


quota


weight

quota is an integer that represents the maximum number of buffers that can be dequeued by the poll virtual function in one shot. Its value is incremented in units of weight and it is used to enforce some sort of fairness among different devices. Lower quotas mean lower potential latencies and therefore a lower risk of starving other devices. On the other hand, a low quota increases the amount of switching among devices, and therefore overall overhead.

For devices associated with non-NAPI drivers, the default value of weight is 64, stored in weight_p at the top of net/core/dev.c. The value of weight_p can be changed via /proc.

For devices associated with NAPI drivers, the default value is chosen by the drivers. The most common value is 64, but 16 and 32 are used, too. Its value can be tuned via sysfs.

For both the /proc and sysfs interfaces, see the section "Tuning via /proc and sysfs Filesystems" in Chapter 12.

The section "Old Versus New Driver Interfaces" describes how and when elements are added to poll_list, and the section "Backlog Processing: The process_backlog Poll Virtual Function" describes when the poll method extracts elements from the list and how quota is updated based on the value of weight.

Devices using NAPI initialize these four fields and other net_device fields according to the initialization model described in Chapter 8. For the fake backlog_dev devices, introduced in the section "Initialization of softnet_data" in Chapter 9 and described later in this chapter, the initialization is taken care of by net_dev_init (described in Chapter 5).
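
As a sketch of what that initialization looks like in a driver's setup code, assuming a hypothetical NAPI driver (my_poll and my_setup_napi are invented names used only for illustration):

    /* Sketch, hypothetical driver: fill in the NAPI-related net_device
     * fields described above during device setup. */
    static int my_poll(struct net_device *dev, int *budget);   /* poll method */

    static void my_setup_napi(struct net_device *dev)
    {
        dev->poll   = my_poll;   /* dequeues frames from the device's RX ring  */
        dev->weight = 64;        /* the most common default; tunable via sysfs */
        /* quota is not set here: it is replenished in units of dev->weight
         * by the softirq code, as described above. */
    }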

10.4.3. net_rx_action and NAPI

Figure 10-1 shows what happens each time the kernel polls for incoming network traffic. In the figure, you can see the relationships among the poll_list list of devices in polling state, the poll virtual function, and the software interrupt handler net_rx_action. The following sections will go into detail on each aspect of that diagram, but it is important to understand how the parts interact before moving to the source code.

Figure 10-1. net_rx_action function and NAPI overview


We already know that net_rx_action is the function associated with the NET_RX_SOFTIRQ softirq. For the sake of simplicity, let's suppose that after a period of very low activity, a few devices start receiving frames and that these somehow trigger the execution of net_rx_action (how they do so is not important for now).

net_rx_action browses the list of devices in polling state and calls the associated poll virtual function for each device to process the frames in the ingress queue. I explained earlier that devices in that list are consulted in a round-robin fashion, and that there is a maximum number of frames they can process each time their poll method is invoked. If they cannot clear the queue during their slot, they have to wait for their next slot to continue. This means that net_rx_action keeps calling the poll method provided by the device driver for a device with something in its ingress queue until the latter empties out. At that point, there is no need anymore for polling, and the device driver can re-enable interrupt notifications for the device. It is important to underline that interrupts are disabled only for those devices in poll_list, which applies only to devices that use NAPI and do not share backlog_dev.

net_rx_action limits its own running time: it reschedules itself for execution when it passes a given limit of execution time or number of processed frames. This is enforced to make net_rx_action behave fairly in relation to other kernel tasks. At the same time, each device limits the number of frames processed by each invocation of its poll method to be fair in relation to other devices. When a device cannot clear out its ingress queue, it has to wait until the next call of its poll method.
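
The following is only a simplified sketch of that loop, written to mirror the description above; the exact source, including locking, reference counting, and statistics, is analyzed later in the section "Processing the NET_RX_SOFTIRQ: net_rx_action."

    /* Simplified sketch of the net_rx_action logic described above
     * (locking, reference counting, and statistics omitted). */
    static void net_rx_action_sketch(struct softnet_data *queue)
    {
        int budget = netdev_max_backlog;  /* limit on frames processed in one run */
        unsigned long start_time = jiffies;

        while (!list_empty(&queue->poll_list)) {
            struct net_device *dev;

            if (budget <= 0 || jiffies - start_time > 1) {
                /* Out of budget or time: let the softirq run again later
                 * (the real code holds local_irq_disable() here). */
                __raise_softirq_irqoff(NET_RX_SOFTIRQ);
                return;
            }

            dev = list_entry(queue->poll_list.next, struct net_device, poll_list);

            if (dev->quota <= 0 || dev->poll(dev, &budget)) {
                /* Queue not emptied within the quota: move the device to the
                 * tail of poll_list and give it a fresh quota. */
                list_move_tail(&dev->poll_list, &queue->poll_list);
                dev->quota += dev->weight;
            }
            /* Otherwise dev->poll emptied the queue, removed the device from
             * poll_list (netif_rx_complete), and re-enabled its interrupts. */
        }
    }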

10.4.4. Old Versus New Driver Interfaces

Now that the meaning of the NAPI-related net_device fields and the high-level idea behind NAPI should be clear, we can get closer to the source code.

Figure 10-2 shows the difference between a NAPI-aware driver and the others with regard to how the driver tells the kernel about the reception of new frames.

From the device driver perspective, there are only two differences between NAPI and non-NAPI. The first is that NAPI drivers must provide a poll method, described in the section "net_device Fields Used by NAPI." The second difference is the function called to schedule a frame: non-NAPI drivers call netif_rx, whereas NAPI drivers call __netif_rx_schedule, defined in include/linux/netdevice.h. (The kernel provides a wrapper function named netif_rx_schedule, which checks to make sure that the device is running and that the softirq is not already scheduled, and then it calls __netif_rx_schedule. These checks are done with netif_rx_schedule_prep. Some drivers call netif_rx_schedule, and others call netif_rx_schedule_prep explicitly and then __netif_rx_schedule if needed.)
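
For reference, the wrapper and its check look roughly like this in the 2.6 kernels discussed here (simplified; the exact definitions live in include/linux/netdevice.h):

    /* Roughly the 2.6 wrappers: schedule the device for polling only if it
     * is running and not already on a poll_list. */
    static inline int netif_rx_schedule_prep(struct net_device *dev)
    {
        return netif_running(dev) &&
               !test_and_set_bit(__LINK_STATE_RX_SCHED, &dev->state);
    }

    static inline void netif_rx_schedule(struct net_device *dev)
    {
        if (netif_rx_schedule_prep(dev))
            __netif_rx_schedule(dev);
    }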

As shown in Figure 10-2, both types of drivers queue the input device to a polling list (poll_list), schedule the NET_RX_SOFTIRQ software interrupt for execution, and therefore end up being handled by net_rx_action. Even though both types of drivers ultimately call __netif_rx_schedule (non-NAPI drivers do so within netif_rx), the NAPI devices offer potentially much better performance for the reasons we saw in the section "Notifying Drivers When Frames Are Received" in Chapter 9.

Figure 10-2. NAPI-aware drivers versus non-NAPI-aware devices


An important detail in Figure 10-2 is the net_device structure that is passed to __netif_rx_schedule in the two cases. Non-NAPI devices use the one that is built into the CPU's softnet_data structure, and NAPI devices use net_device structures that refer to themselves.
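
To make the driver side of Figure 10-2 concrete, here is a heavily simplified sketch of a NAPI receive path. The names prefixed with my_ are invented for illustration, error handling is omitted, and the poll prototype is the one used by the 2.6 kernels covered in this chapter; note that the device passes its own net_device to the scheduling routine, as just discussed.

    /* Sketch of a NAPI driver's receive path (illustrative names only). */
    static irqreturn_t my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        struct net_device *dev = dev_id;

        if (my_rx_pending(dev)) {
            my_disable_rx_interrupts(dev);  /* no more RX interrupts for now */
            netif_rx_schedule(dev);         /* add dev to poll_list, raise softirq */
        }
        return IRQ_HANDLED;
    }

    static int my_poll(struct net_device *dev, int *budget)
    {
        int limit = min(*budget, dev->quota);
        /* Hand each frame to the stack with netif_receive_skb; return count. */
        int done  = my_process_rx_ring(dev, limit);

        *budget    -= done;
        dev->quota -= done;

        if (my_rx_ring_empty(dev)) {
            netif_rx_complete(dev);         /* remove dev from poll_list */
            my_enable_rx_interrupts(dev);   /* back to interrupt mode */
            return 0;                       /* queue emptied */
        }
        return 1;                           /* more work left: stay in poll_list */
    }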

10.4.5. Manipulating poll_list

We saw in the previous section that any device (including the fake one, backlog_dev) is added to the poll_list list with a call to netif_rx_schedule or __netif_rx_schedule.

The reverse operation, removing a device from the list, is done with netif_rx_complete or __netif_rx_complete (the second one assumes interrupts are already disabled on the local CPU). We will see when these two routines are called in the section "Processing the NET_RX_SOFTIRQ: net_rx_action."

A device can also temporarily disable and re-enable polling with netif_poll_disable and netif_poll_enable, respectively. This does not mean that the device driver has decided to revert to an interrupt-based model. Polling might be disabled on a device, for instance, when the device needs to be reset by the device driver to apply some kind of hardware configuration changes.

I already said that netif_rx_schedule filters requests for devices that are already in the poll_list (i.e., that have the __LINK_STATE_RX_SCHED flag set). For this reason, if a driver sets that flag but does not add the device to poll_list, it basically disables polling for the device: the device will never be added to poll_list. This is how netif_poll_disable works: if __LINK_STATE_RX_SCHED was not set, it simply sets it and returns. Otherwise, it waits for it to be cleared and then sets it.

static inline void netif_poll_disable(struct net_device *dev)
{
    while (test_and_set_bit(__LINK_STATE_RX_SCHED, &dev->state)) {
        /* No hurry. */
        current->state = TASK_INTERRUPTIBLE;
        schedule_timeout(1);
    }
}
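
Its counterpart, netif_poll_enable, simply clears the flag so that the device can be scheduled for polling again; it is roughly the following one-liner in include/linux/netdevice.h:

    static inline void netif_poll_enable(struct net_device *dev)
    {
        /* Clearing __LINK_STATE_RX_SCHED allows netif_rx_schedule_prep
         * to succeed again for this device. */
        clear_bit(__LINK_STATE_RX_SCHED, &dev->state);
    }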

