10.5. Old Interface Between Device Drivers and Kernel: First Part of netif_rx

The netif_rx function, defined in net/core/dev.c, is normally called by device drivers when new input frames are waiting to be processed; its job is to schedule the softirq that runs shortly afterward to dequeue and handle the frames. Figure 10-3 shows what the function checks for and the flow of its events. The figure is practically longer than the code, but it is useful for understanding how netif_rx reacts to its context.
netif_rx is usually called by a driver while in interrupt context, but there are exceptions, notably when the function is called by the loopback device. For this reason, netif_rx disables interrupts on the local CPU when it starts, and re-enables them when it finishes.
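The protection used for this is the standard local_irq_save/local_irq_restore pair, which disables interrupts only on the executing CPU. The following is a minimal sketch of the pattern; the body is illustrative, not the actual netif_rx code:

    unsigned long flags;

    local_irq_save(flags);      /* disable interrupts on the local CPU and remember the previous state */
    /* ... safely manipulate this CPU's softnet_data (e.g., its input_pkt_queue) ... */
    local_irq_restore(flags);   /* re-enable interrupts only if they were enabled before */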
When looking at the code, one should keep in mind that different CPUs can run netif_rx concurrently. This is not a problem, since each CPU is associated with a private softnet_data structure that maintains state information. Among other things, the CPU's softnet_data structure includes a private input queue (see the section "softnet_data Structure" in Chapter 9).

Figure 10-3. netif_rx function

This is the function's prototype:

    int netif_rx(struct sk_buff *skb)

Its only input parameter is the buffer received by the device, and the output value is an indication of the congestion level (you can find details in the section "Congestion Management"). The main tasks of netif_rx, whose detailed flowchart is depicted in Figure 10-3, include:

- Initializing some of the sk_buff data structure fields (such as the time the frame was received).

- Storing the received frame onto the CPU's private input queue and, when necessary, scheduling the NET_RX_SOFTIRQ softirq that will process it.

- Updating the statistics used to track the CPU's congestion level.

- Dropping the frame if the CPU is congested.
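To make the calling convention concrete, here is a minimal, hypothetical sketch of how a 2.6-era driver's receive interrupt handler might hand a frame to netif_rx. The names my_rx_interrupt and my_fetch_frame are made up; netif_rx, eth_type_trans, dev->last_rx, and the NET_RX_* return codes are the real kernel interfaces discussed in this chapter.

    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>
    #include <linux/interrupt.h>
    #include <linux/skbuff.h>

    static struct sk_buff *my_fetch_frame(struct net_device *dev);  /* hypothetical: builds an skb from the RX ring */

    static irqreturn_t my_rx_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        struct net_device *dev = dev_id;
        struct sk_buff *skb;
        int status;

        skb = my_fetch_frame(dev);
        if (!skb)
            return IRQ_NONE;

        skb->dev = dev;
        skb->protocol = eth_type_trans(skb, dev);   /* set skb->protocol and strip the Ethernet header */

        status = netif_rx(skb);                     /* queue the frame on this CPU's input queue */
        dev->last_rx = jiffies;                     /* per-device timestamp (see the next section) */

        if (status == NET_RX_DROP) {
            /* the backlog queue was full or the CPU is throttled; a real
             * driver would update its private rx_dropped counter here */
        }
        return IRQ_HANDLED;
    }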
Figure 10-4 shows an example of a system with several CPUs and devices. Each CPU has its own instance of softnet_data, which includes the private input queue where netif_rx will store ingress frames, and the completion_queue where buffers are sent when they are not needed anymore (see the section "Processing the NET_TX_SOFTIRQ: net_tx_action" in Chapter 11). The figure shows an example where CPU 1 receives an RxComplete interrupt from eth0. The associated driver stores the ingress frame into CPU 1's queue. CPU m receives a DMADone interrupt from ethn saying that the transmitted buffer is not needed anymore and can therefore be moved to the completion_queue.
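For reference, the 2.6-era softnet_data structure looks roughly like the following (abridged from include/linux/netdevice.h; field order and minor members may differ between releases):

    struct softnet_data {
        int                  throttle;          /* nonzero while the CPU is dropping ingress frames */
        int                  cng_level;         /* congestion level returned by netif_rx */
        int                  avg_blog;          /* average backlog, used to compute cng_level */
        struct sk_buff_head  input_pkt_queue;   /* per-CPU ingress queue filled by netif_rx */
        struct list_head     poll_list;         /* devices with input frames waiting to be processed */
        struct net_device   *output_queue;      /* devices with frames to transmit */
        struct sk_buff      *completion_queue;  /* buffers to be freed by net_tx_action */
        struct net_device    backlog_dev;       /* fake device representing non-NAPI ingress traffic */
    };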
10.5.1. Initial Tasks of netif_rx

netif_rx starts by saving the time the function was invoked (which also represents the time the frame was received) into the stamp field of the buffer structure:

    if (skb->stamp.tv_sec == 0)
        net_timestamp(&skb->stamp);

Saving the timestamp has a CPU cost; therefore, net_timestamp initializes skb->stamp only if there is at least one interested user for that field. Interest in the field can be advertised by calling net_enable_timestamp. Do not confuse this assignment with the one done by the device driver right before or after it calls netif_rx:

    netif_rx(skb);
    dev->last_rx = jiffies;

Figure 10-4. CPU's ingress queues

The device driver stores in the net_device structure the time its most recent frame was received, and netif_rx stores the time the frame was received in the buffer itself. Thus, one timestamp is associated with a device and the other one is associated with a frame. Note, moreover, that the two timestamps use two different precisions. The device driver stores the timestamp of the most recent frame in jiffies, which in kernel 2.6 comes with a precision of 10 or 1 ms, depending on the architecture (for instance, before 2.6, the i386 used the value 10, but starting with 2.6 the value is 1). netif_rx, however, gets its timestamp by calling get_fast_time, which returns a far more precise value.

The ID of the local CPU is retrieved with smp_processor_id() and is stored in the local variable this_cpu:

    this_cpu = smp_processor_id();

The local CPU ID is needed to retrieve the data structure associated with that CPU in a per-CPU vector, as in the following code in netif_rx:

    queue = &__get_cpu_var(softnet_data);

The preceding line stores in queue a pointer to the softnet_data structure associated with the local CPU, that is, the CPU serving the interrupt triggered by the device driver that called netif_rx. Now netif_rx updates the total number of frames received by the CPU, including both the ones accepted and the ones discarded (because there was no space in the queue, for instance):

    netdev_rx_stat[this_cpu].total++;

Each device driver also keeps statistics, storing them in the private data structure that dev->priv points to. These statistics, which include the number of received frames, the number of dropped frames, etc., are kept on a per-device basis (see Chapter 2), whereas the ones updated by netif_rx are kept on a per-CPU basis.

10.5.2. Managing Queues and Scheduling the Bottom Half

The input queue is managed by softnet_data->input_pkt_queue. Each input queue has a maximum length given by the global variable netdev_max_backlog, whose value is 300. This means that each CPU can have up to 300 frames in its input queue waiting to be processed, regardless of the number of devices in the system.
Common sense would say that the value of netdev_max_backlog should depend on the number of devices and their speeds. However, this is hard to keep track of in an SMP system where the interrupts are distributed dynamically among the CPUs: it is not obvious which device will talk to which CPU. Thus, the value of netdev_max_backlog is chosen through trial and error. In the future, we could imagine it being set dynamically in a manner reflecting the types and number of interfaces. Its value is already configurable by the system administrator, as described in the section "Tuning via /proc and sysfs Filesystems" in Chapter 12. The performance trade-off is as follows: an unnecessarily large value is a waste of memory, and a slow system may simply never be able to catch up; a value that is too small, on the other hand, could reduce the performance of the device because a burst of traffic could lead to many dropped frames. The optimal value depends a lot on the system's role (host, server, router, etc.).

In previous kernels, when the softnet_data per-CPU data structure was not present, a single input queue, called backlog, was shared by all devices, with the same size of 300 frames. The main gain with softnet_data is not that n CPUs leave room on the queues for n*300 frames, but rather that there is no need for locking among CPUs, because each has its own queue.

The following code controls the conditions under which netif_rx inserts its new frame on a queue, and the conditions under which it schedules the queue to be run:

    if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
        if (queue->input_pkt_queue.qlen) {
            if (queue->throttle)
                goto drop;
    enqueue:
            dev_hold(skb->dev);
            __skb_queue_tail(&queue->input_pkt_queue, skb);
    #ifndef OFFLINE_SAMPLE
            get_sample_stats(this_cpu);
    #endif
            local_irq_restore(flags);
            return queue->cng_level;
        }

        if (queue->throttle)
            queue->throttle = 0;

        netif_rx_schedule(&queue->backlog_dev);
        goto enqueue;
    }
    ...
    drop:
        __get_cpu_var(netdev_rx_stat).dropped++;
        local_irq_restore(flags);
        kfree_skb(skb);
        return NET_RX_DROP;
    }

The first if statement determines whether there is space. If the queue is full and the statement returns a false result, the CPU is put into a throttle state, which means that it is overloaded by input traffic and therefore is dropping all further frames. The code instituting the throttle is not shown here, but appears in the following section on congestion management.

If there is space on the queue, however, that is not sufficient to ensure that the frame is accepted. The CPU could already be in the throttle state (as determined by the third if statement), in which case the frame is dropped. The throttle state can be lifted only when the queue is empty, which is what the second if statement tests for: when there is data on the queue and the CPU is in the throttle state, the frame is dropped, but when the queue is empty and the CPU is in the throttle state, the throttle state is lifted.
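Because the nested if statements read somewhat backward, the four possible outcomes can be restated in simplified form; this is a condensed rewrite of the logic above, not the kernel code:

    if (queue->input_pkt_queue.qlen > netdev_max_backlog) {
        /* no space left: drop the frame (the code that sets queue->throttle
         * is shown in the "Congestion Management" section) */
    } else if (queue->input_pkt_queue.qlen == 0) {
        /* empty queue: lift the throttle state if it was set, schedule
         * NET_RX_SOFTIRQ via netif_rx_schedule, and enqueue the frame */
    } else if (queue->throttle) {
        /* queue not empty and CPU throttled: drop the frame */
    } else {
        /* queue not empty and CPU not throttled: just enqueue the frame;
         * the softirq has already been scheduled */
    }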
The dev_hold(skb->dev) call increases the reference count for the device so that the device cannot be removed until this buffer has been completely processed. The corresponding decrement, done by dev_put, takes place inside net_rx_action, which we will analyze later in this chapter.

If all tests are satisfactory, the buffer is queued into the input queue with __skb_queue_tail(&queue->input_pkt_queue, skb), the IRQ's status is restored for the CPU, and the function returns. Queuing the frame is extremely fast because it does not involve any memory copying, just pointer manipulation: input_pkt_queue is a list of pointers, and __skb_queue_tail adds the pointer to the new buffer to the list, without copying the buffer.

The NET_RX_SOFTIRQ software interrupt is scheduled for execution with netif_rx_schedule. Note that netif_rx_schedule is called only when the new buffer is added to an empty queue. The reason is that if the queue is not empty, NET_RX_SOFTIRQ has already been scheduled and there is no need to do it again. In the section "Pending softirq Handling" in Chapter 9, we saw how the kernel takes care of scheduled software interrupts. In the upcoming section "Processing the NET_RX_SOFTIRQ: net_rx_action," we will see the internals of the NET_RX_SOFTIRQ softirq's handler.
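Before moving on, it is worth sketching what netif_rx_schedule itself does in 2.6-era kernels. Roughly (and omitting details such as quota handling), it checks whether the device passed in, here the fake backlog_dev, is already scheduled; if not, it puts the device on this CPU's poll_list and raises NET_RX_SOFTIRQ. The following is a simplified sketch, not the verbatim kernel source:

    static inline void netif_rx_schedule(struct net_device *dev)
    {
        /* netif_rx_schedule_prep() returns true only if the device is up and
         * not already on a poll list; this is why enqueuing onto a non-empty
         * input_pkt_queue does not need to schedule the softirq again */
        if (netif_rx_schedule_prep(dev)) {
            unsigned long flags;

            local_irq_save(flags);
            dev_hold(dev);                          /* hold a reference while the device sits on the poll list */
            list_add_tail(&dev->poll_list,
                          &__get_cpu_var(softnet_data).poll_list);
            __raise_softirq_irqoff(NET_RX_SOFTIRQ); /* run net_rx_action soon */
            local_irq_restore(flags);
        }
    }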