11.1. Enabling and Disabling Transmissions

In the section "Congestion Management" in Chapter 10, we learned about some conditions under which frame reception must be disabled, either on a single device or globally. Something similar applies to frame transmission as well.

The status of the egress queue is represented by the flag __LINK_STATE_XOFF in net_device->state. Its value can be manipulated and checked with the following functions, defined in include/linux/netdevice.h:

netif_start_queue
    Enables transmission for the device.

netif_stop_queue
    Disables transmission for the device.

netif_queue_stopped
    Returns the status of the egress queue: enabled or disabled.
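The three helpers amount to setting, clearing, and testing a single bit. The following is only a userspace sketch of that idea, not the kernel code: the netdev_model structure and the model_* names are hypothetical, and plain bit arithmetic replaces the atomic bit operations (such as test_and_set_bit and clear_bit) that the kernel code in this chapter uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit number standing in for __LINK_STATE_XOFF (value chosen for the model). */
enum { MODEL_LINK_STATE_XOFF = 0 };

/* Hypothetical stand-in for net_device: only the state word is modeled. */
struct netdev_model {
    unsigned long state;
};

/* Role of netif_start_queue: enable transmission by clearing the XOFF bit. */
static void model_start_queue(struct netdev_model *dev)
{
    dev->state &= ~(1UL << MODEL_LINK_STATE_XOFF);
}

/* Role of netif_stop_queue: disable transmission by setting the XOFF bit. */
static void model_stop_queue(struct netdev_model *dev)
{
    dev->state |= 1UL << MODEL_LINK_STATE_XOFF;
}

/* Role of netif_queue_stopped: report whether transmission is disabled. */
static bool model_queue_stopped(const struct netdev_model *dev)
{
    return (dev->state >> MODEL_LINK_STATE_XOFF) & 1UL;
}
```

Because the model is single-threaded, it does not need atomicity; real driver code can run concurrently with interrupt handlers and other CPUs, which is why the kernel relies on atomic bit operations instead.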
Only device drivers enable and disable transmission of devices.

Why stop and start a queue once the device is running? One reason is that a device can temporarily use up its memory, thus causing a transmission attempt to fail. In the past, the transmitting function (which I introduce later in the section "dev_queue_xmit Function") would have to deal with this problem by putting the frame back into the queue (requeuing it). Now, thanks to the __LINK_STATE_XOFF flag, this extra processing can be avoided. When the device driver realizes that it does not have enough space to store a frame of maximum size (MTU), it stops the egress queue with netif_stop_queue. In this way, it is possible to avoid wasting resources with future transmissions that the kernel already knows will fail.

The following example of this throttling at work is taken from vortex_start_xmit (the hard_start_xmit method used by the drivers/net/3c59x.c driver):

    outsl(ioaddr + TX_FIFO, skb->data, (skb->len + 3) >> 2);
    dev_kfree_skb (skb);
    if (inw(ioaddr + TxFree) > 1536) {
        netif_start_queue (dev);    /* AKPM: redundant? */
    } else {
        /* Interrupt us when the FIFO has room for max-sized packet. */
        netif_stop_queue(dev);
        outw(SetTxThreshold + (1536>>2), ioaddr + EL3_CMD);
    }

Shortly after the transmission by outsl, the code checks whether there is space for a frame of maximum size (1536), and uses netif_stop_queue to stop the device's egress queue if there is not. This is a relatively crude technique used to avoid transmission failures due to a shortage of memory. Of course, the transmission of a frame of 300 bytes would succeed when just a little more than 300 bytes are left; therefore, checking for 1,536 bytes could disable transmission unnecessarily. The code could compromise by using a lower value, such as 500, but in the end, the gain would not be that big and there could be failures when bigger frames arrive while transmission is enabled.
To cover all eventualities, vortex_start_xmit calls netif_start_queue when there is enough memory on the device. The "redundant?" comment in the code refers to the practice of restarting the queue on two types of interrupts. The driver requests a restart of the queue when the device indicates that it has finished transmitting, and when it indicates that there is enough space in its memory for another frame. Probably, the queue would be restarted promptly if the driver did so on only one of these interrupts, but that's not guaranteed. So the request to restart the queue is issued under both circumstances.

You may wonder when and how the queue will be re-enabled in the previous scenario. In the case of the Vortex driver, the SetTxThreshold command sent by vortex_start_xmit instructs the device to generate an interrupt when a given amount of memory (the size of the MTU, in this case) becomes available. This is the piece of code that handles such an interrupt:

    static void vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        ... ... ...
        if (status & TxAvailable) {
            if (vortex_debug > 5)
                printk(KERN_DEBUG " TX room bit was handled.\n");
            /* There's room in the FIFO for a full-sized packet. */
            outw(AckIntr | TxAvailable, ioaddr + EL3_CMD);
            netif_wake_queue (dev);
        }
        ... ... ...
    }

The bits of the status variable represent the reasons why the interrupt was generated by the card. The TxAvailable bit indicates that space is available and that it's therefore safe to wake up the device (this is called waking the queue, and is carried out by netif_wake_queue). Values such as EL3_CMD are simply offsets from ioaddr used by the driver to read or write the network card registers at the right positions.

Note that the egress queue is re-enabled with netif_wake_queue instead of netif_start_queue.
netif_wake_queue, which we will see later in more detail, not only enables the egress queue but also asks the kernel to check whether anything in that queue is waiting to be transmitted. The reason is that during the time the queue was disabled, there could have been transmission attempts. In this case, they would have failed, and those frames that could not be sent would have been put back into the egress queue.

11.1.1. Scheduling a Device for Transmission

When describing the ingress path, we saw that when a device receives a frame, its driver invokes a kernel function (the one invoked depends on whether the driver uses NAPI) that adds the device to a polling list and schedules the NET_RX_SOFTIRQ for execution.

Something very similar happens on the egress path. To transmit frames, the kernel provides the dev_queue_xmit function, described later in its own section. This function dequeues a frame from the device's egress queue and feeds it to the device's hard_start_xmit method. However, dev_queue_xmit might not be able to transmit for various reasons: for instance, because the device's egress queue is disabled, as we saw in the previous section, or because the lock on the device queue is already taken. To handle the latter case, the kernel provides a function called __netif_schedule that schedules a device for transmission (somewhat similar to what netif_rx_schedule does on the reception path). This function is never called directly, but through two wrappers shown later in this section.

Here is the function's definition from include/linux/netdevice.h:

    static inline void __netif_schedule(struct net_device *dev)
    {
        if (!test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) {
            unsigned long flags;
            struct softnet_data *sd;

            local_irq_save(flags);
            sd = &__get_cpu_var(softnet_data);
            dev->next_sched = sd->output_queue;
            sd->output_queue = dev;
            raise_softirq_irqoff(NET_TX_SOFTIRQ);
            local_irq_restore(flags);
        }
    }

__netif_schedule accomplishes two main tasks:

- It adds the device to the head of the per-CPU output_queue list, unless the __LINK_STATE_SCHED flag shows that the device is already scheduled.
- It schedules the NET_TX_SOFTIRQ softirq for execution.
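Since it does not make sense to schedule a device for transmission if transmission is disabled on the device, the kernel provides two functions to be used instead, both wrappers around __netif_schedule:

netif_schedule
    Schedules the device for transmission, but only if transmission is enabled on it.

netif_wake_queue
    Enables transmission for the device and schedules the device for transmission.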
Note that a call to netif_wake_queue is equivalent to a call to both netif_start_queue and netif_schedule. I said in the section "Enabling and Disabling Transmissions" that it is the responsibility of the driver, not higher-layer functions, to disable and enable transmission on devices. Usually, high-level functions schedule transmissions on devices, and device drivers disable and re-enable the queue when required, such as to handle a shortage of memory. Therefore, it should not come as a surprise that netif_wake_queue is the one used by device drivers, and netif_schedule is the one used elsewhere (for example, by net_tx_action and Traffic Control).
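The relationship between __netif_schedule and its two wrappers can be sketched in a few lines of userspace C. This is a simplified model built from the description in this section, not the kernel's implementation: the structures and model_* names are hypothetical, plain bit arithmetic replaces atomic bit operations, and a boolean stands in for raising NET_TX_SOFTIRQ. The wake operation is modeled exactly as the text describes it, as a start followed by a schedule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STATE_XOFF  0x1UL   /* models __LINK_STATE_XOFF  */
#define STATE_SCHED 0x2UL   /* models __LINK_STATE_SCHED */

struct dev_model {
    unsigned long state;
    struct dev_model *next_sched;
};

/* Models the per-CPU softnet_data fields used on the egress path. */
struct softnet_model {
    struct dev_model *output_queue;
    bool tx_softirq_pending;    /* stands in for raising NET_TX_SOFTIRQ */
};

/* Role of __netif_schedule: link the device at the head, once. */
static void model_do_schedule(struct softnet_model *sd, struct dev_model *dev)
{
    if (dev->state & STATE_SCHED)
        return;                 /* already on the list */
    dev->state |= STATE_SCHED;
    dev->next_sched = sd->output_queue;
    sd->output_queue = dev;
    sd->tx_softirq_pending = true;
}

/* Role of netif_schedule: schedule only if transmission is enabled. */
static void model_schedule(struct softnet_model *sd, struct dev_model *dev)
{
    if (!(dev->state & STATE_XOFF))
        model_do_schedule(sd, dev);
}

/* Role of netif_wake_queue: enable transmission, then schedule. */
static void model_wake_queue(struct softnet_model *sd, struct dev_model *dev)
{
    dev->state &= ~STATE_XOFF;
    model_schedule(sd, dev);
}
```

The SCHED bit is what keeps a device from being linked into the list twice, and inserting at the head is what produces the ordering discussed later for net_tx_action.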
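A device driver uses netif_wake_queue in the following cases:

- When the device signals, typically through an interrupt such as TxAvailable in the Vortex driver, that it again has enough memory to handle a frame of maximum size.
- When the driver's tx_timeout routine, invoked by the watchdog timer described in the section "Watchdog timer," resets the card after transmission has been disabled for too long.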
11.1.2. Queuing Discipline Interface

Almost all devices use a queue to schedule egress traffic, and the kernel can use algorithms known as queuing disciplines to arrange the frames in the most efficient order for transmission. Although a detailed discussion of Traffic Control and its queuing disciplines is outside the scope of this book, in this section I'll provide a brief overview of the interface between device drivers and the transmission layer discussed in this chapter.

Each Traffic Control queuing discipline can provide different function pointers to be called by higher layers to accomplish different tasks. Among the most important functions are:

enqueue
    Adds an element to the queue.

dequeue
    Extracts an element from the queue.

requeue
    Puts back on the queue an element that was previously extracted (for example, because of a transmission failure).
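The code in this chapter calls three of these function pointers: q->enqueue(skb, q), q->dequeue(q), and q->ops->requeue(skb, q). As an illustration only, here is a minimal userspace FIFO that exposes the same three operations with the same argument order. The fifo_qdisc and frame types, the QLEN capacity, and the return values are assumptions made for the sketch; the kernel's struct Qdisc is considerably richer.

```c
#include <assert.h>
#include <stddef.h>

#define QLEN 8  /* capacity of the model queue (arbitrary) */

struct frame { int id; };       /* hypothetical stand-in for sk_buff */

struct fifo_qdisc {
    struct frame *slot[QLEN];
    size_t head;                /* index of the next frame to dequeue */
    size_t count;               /* number of queued frames */
    int (*enqueue)(struct frame *f, struct fifo_qdisc *q);
    struct frame *(*dequeue)(struct fifo_qdisc *q);
    int (*requeue)(struct frame *f, struct fifo_qdisc *q);
};

/* Append at the tail; returns -1 when the queue is full. */
static int fifo_enqueue(struct frame *f, struct fifo_qdisc *q)
{
    if (q->count == QLEN)
        return -1;
    q->slot[(q->head + q->count) % QLEN] = f;
    q->count++;
    return 0;
}

/* Remove from the head; returns NULL when nothing is waiting. */
static struct frame *fifo_dequeue(struct fifo_qdisc *q)
{
    struct frame *f;

    if (q->count == 0)
        return NULL;
    f = q->slot[q->head];
    q->head = (q->head + 1) % QLEN;
    q->count--;
    return f;
}

/* Put a frame back at the head so that it is the next one dequeued. */
static int fifo_requeue(struct frame *f, struct fifo_qdisc *q)
{
    if (q->count == QLEN)
        return -1;
    q->head = (q->head + QLEN - 1) % QLEN;
    q->slot[q->head] = f;
    q->count++;
    return 0;
}

static void fifo_init(struct fifo_qdisc *q)
{
    q->head = 0;
    q->count = 0;
    q->enqueue = fifo_enqueue;
    q->dequeue = fifo_dequeue;
    q->requeue = fifo_requeue;
}
```

The point of requeue placing the frame at the head rather than the tail is that a frame whose transmission failed keeps its position relative to the frames queued after it.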
Whenever a device is scheduled for transmission, the next frame to transmit is selected by the qdisc_run function, which indirectly calls the dequeue virtual function of the associated queuing discipline.

Once again, the real job is actually done by another function, qdisc_restart. The qdisc_run function, defined in include/linux/pkt_sched.h, is simply a wrapper that filters out requests for devices whose egress queues are disabled:

    static inline void qdisc_run(struct net_device *dev)
    {
        while (!netif_queue_stopped(dev) &&
               qdisc_restart(dev) < 0)
            /* NOTHING */;
    }

11.1.2.1. qdisc_restart function

We saw earlier the common cases where a device is scheduled for transmission. Sometimes it is because something in the egress queue is waiting to be transmitted. But at other times, the device is scheduled because the queue has been disabled for a while and therefore there could be something waiting in the queue from previous failed transmission attempts. The driver does not know whether anything has actually arrived; it must schedule the device in case data is waiting. If in fact no data is waiting, the subsequent call to the dequeue method fails. Even if data is waiting, the call can fail because complex queuing disciplines may decide not to transmit any of the data. Therefore, qdisc_restart, defined in net/sched/sch_generic.c, takes various actions based on the return value of the dequeue method.

    int qdisc_restart(struct net_device *dev)
    {
        struct Qdisc *q = dev->qdisc;
        struct sk_buff *skb;

        if ((skb = q->dequeue(q)) != NULL) {

The dequeue function is called at the very start. Let's suppose it succeeded. Transmitting a frame requires the acquisition of two locks:

- The lock that protects the queue (dev->queue_lock). It is already held when qdisc_restart is called.
- The lock on the driver's transmit routine (dev->xmit_lock). It is not needed when the driver sets NETIF_F_LLTX, because such a driver does its own locking.
Note that qdisc_restart does not release the queue_lock immediately after dequeuing a buffer, because the function might have to requeue the buffer right away if it fails to acquire the lock on the driver. The function releases queue_lock when it has the driver lock in hand, and reacquires queue_lock before returning. Ultimately, dev_queue_xmit will take care of releasing it.

When the driver does not support NETIF_F_LLTX and the driver lock is already taken (i.e., spin_trylock returns 0), transmission fails. If qdisc_restart fails to grab the lock on the driver, it means that another CPU is transmitting through the same device. All that qdisc_restart can do in this case is put the frame back into the queue and reschedule the device for transmission, since it does not want to wait. If the function is running on the same CPU that is holding the lock, a loop (i.e., a bug in the code) has been detected and the frame is dropped; otherwise, it is just a collision.

        if (!spin_trylock(&dev->xmit_lock)) {
    collision:
            ...
            goto requeue;
        }
        ...
    requeue:
        q->ops->requeue(skb, q);
        netif_schedule(dev);

Once the driver lock is successfully acquired, the lock on the queue is released so that other CPUs can access the queue. Sometimes, there is no need to acquire the driver lock because NETIF_F_LLTX is set. In either case, qdisc_restart is ready to start its real job.

        if (!netif_queue_stopped(dev)) {
            int ret;
            if (netdev_nit)
                dev_queue_xmit_nit(skb, dev);

            ret = dev->hard_start_xmit(skb, dev);
            if (ret == NETDEV_TX_OK) {
                if (!nolock) {
                    dev->xmit_lock_owner = -1;
                    spin_unlock(&dev->xmit_lock);
                }
                spin_lock(&dev->queue_lock);
                return -1;
            }
            if (ret == NETDEV_TX_LOCKED && nolock) {
                spin_lock(&dev->queue_lock);
                goto collision;
            }
        }

We saw in the previous section that qdisc_run has already checked the status of the egress queue with netif_queue_stopped, but here qdisc_restart checks it again. The second check is not superfluous.
Consider this scenario: when qdisc_run called netif_queue_stopped, the lock on the driver was not taken yet. By the time the lock is taken, another CPU could have sent something and the card could have run out of buffer space. Therefore, netif_queue_stopped may have returned FALSE before but would now return TRUE.

netdev_nit represents the number of protocol sniffers registered. If any are registered, dev_queue_xmit_nit is used to deliver a copy of the frame to each. (We saw something similar for reception in netif_receive_skb in Chapter 10.)

Finally we get to the invocation of the device driver's virtual function for frame transmission. The function provided by the device driver is dev->hard_start_xmit, which is defined for each device at initialization time (see Chapter 8). The NETDEV_TX_XXX values returned by hard_start_xmit routines are listed in include/linux/netdevice.h. Here is how qdisc_restart handles the two values that appear in the code above:

NETDEV_TX_OK
    The transmission succeeded. qdisc_restart releases the driver lock if it holds it, reacquires queue_lock, and returns -1.

NETDEV_TX_LOCKED
    A driver that does its own locking (NETIF_F_LLTX) found its lock already taken. qdisc_restart treats this as a collision and requeues the frame.
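In summary, transmission fails and a frame must be put back onto the queue when one of the following conditions is true:

- The egress queue is disabled (netif_queue_stopped(dev) is true).
- Another CPU is holding the lock on the driver.
- The driver failed the transmission (hard_start_xmit did not return NETDEV_TX_OK).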
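The interplay between qdisc_run and qdisc_restart described above can be modeled in userspace. The sketch below is a deliberately simplified, single-threaded model with no locking: the txdev_model structure, the fifo_room counter, and the model_* names are hypothetical. Only the return value -1 after a successful transmission comes from the code shown in this section; returning 0 for an empty queue and 1 after a requeue are conventions chosen for the model.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FRAMES 16

struct txdev_model {
    int queue[MAX_FRAMES];  /* egress queue, oldest frame at index head */
    int head, count;
    int fifo_room;          /* frames the hypothetical NIC can still accept */
    bool stopped;           /* plays the role of __LINK_STATE_XOFF */
    int sent;
};

/* Driver side: accept one frame and stop the queue when the NIC is full. */
static bool model_hard_start_xmit(struct txdev_model *dev, int frame)
{
    (void)frame;            /* the payload is irrelevant to the model */
    if (dev->fifo_room == 0)
        return false;       /* no room: the caller must requeue */
    dev->fifo_room--;
    dev->sent++;
    if (dev->fifo_room == 0)
        dev->stopped = true;
    return true;
}

/* -1: a frame was sent; 0: the queue was empty; 1: the frame was requeued. */
static int model_qdisc_restart(struct txdev_model *dev)
{
    int frame;

    if (dev->count == 0)
        return 0;
    frame = dev->queue[dev->head];
    dev->head = (dev->head + 1) % MAX_FRAMES;
    dev->count--;

    /* Second check of the queue status, as in qdisc_restart. */
    if (!dev->stopped && model_hard_start_xmit(dev, frame))
        return -1;

    /* Transmission failed: put the frame back at the head. */
    dev->head = (dev->head + MAX_FRAMES - 1) % MAX_FRAMES;
    dev->queue[dev->head] = frame;
    dev->count++;
    return 1;
}

/* Same loop shape as qdisc_run: keep going while frames are being sent. */
static void model_qdisc_run(struct txdev_model *dev)
{
    while (!dev->stopped && model_qdisc_restart(dev) < 0)
        ;
}
```

With five frames queued and room for three on the device, model_qdisc_run sends three frames, the driver stops the queue, and the remaining two frames stay queued in their original order until the queue is woken again.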
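See Figure 11-2 for details of the qdisc_restart function.

11.1.3. dev_queue_xmit Function

This function is the interface to the device driver that performs a transmission. As shown in Figure 9-2 in Chapter 9, dev_queue_xmit can lead to the execution of the driver transmit function hard_start_xmit through two alternate paths:

- Interfacing to Traffic Control: the frame goes through the queuing discipline of the device, via qdisc_run (see the section "Queueful devices").
- Invoking hard_start_xmit directly: this is done for devices that do not have a queue (see the section "Queueless devices").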
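We will look at these cases soon, but let's start with the checks and tasks common to both.

When dev_queue_xmit is called, all the information required to transmit the frame, such as the outgoing device, the next hop, and its link layer address, is ready. Parts VI and VII describe how those parameters are initialized. Figures 11-3(a) and 11-3(b) describe dev_queue_xmit.

dev_queue_xmit receives only an sk_buff structure as input. This contains all the information the function needs. skb->dev, for instance, is the outgoing device, and skb->data points to the beginning of the payload, whose length is skb->len.

    int dev_queue_xmit(struct sk_buff *skb)

The main tasks of dev_queue_xmit are:

- Check whether the frame consists of fragments and whether the device can handle them through scatter/gather DMA; combine the fragments if the device cannot.
- Make sure the L4 checksum is computed, unless the device computes it in hardware.
- Select which frame to transmit (the one passed as input may not be the one sent next, because there may be a queue to honor).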
In the following code, the data payload is a list of fragments when skb_shinfo(skb)->frag_list is non-NULL; otherwise, the payload is a single block. If there are fragments, the code checks whether scatter/gather DMA is a feature supported by the device, and if not, combines the fragments into a single buffer itself. The function must also combine the fragments if any of them are stored in a memory area whose address is too big to be addressed by the device (that is, if illegal_highdma(dev, skb) is true).

    if (skb_shinfo(skb)->frag_list &&
        !(dev->features&NETIF_F_FRAGLIST) &&
        __skb_linearize(skb, GFP_ATOMIC)) {
        goto out_kfree_skb;
    }

    if (skb_shinfo(skb)->nr_frags &&
        (!(dev->features&NETIF_F_SG) || illegal_highdma(dev, skb)) &&
        __skb_linearize(skb, GFP_ATOMIC)) {
        goto out_kfree_skb;
    }

The defragmentation of fragments is done by __skb_linearize, which can fail for one of the following reasons:
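The L4 checksum can be calculated both in software and in hardware. Not all network cards can compute the checksum in hardware; the ones that can will set the associated bit flag in net_device->features during device initialization. This tells higher network layers that they do not need to worry about checksumming. The checksum must instead be calculated in software if:

- There is no support for hardware checksumming.
- The card can compute the checksum in hardware only for packets that use IP (NETIF_F_IP_CSUM), but the packet being transmitted does not use IP.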
The software checksum is calculated with skb_checksum_help:

    if (skb->ip_summed == CHECKSUM_HW &&
        (!(dev->features & (NETIF_F_HW_CSUM | NETIF_F_NO_CSUM)) &&
         (!(dev->features & NETIF_F_IP_CSUM) ||
          skb->protocol != htons(ETH_P_IP))))
            if (skb_checksum_help(skb, 0))
                goto out_kfree_skb;

[Figure 11-2. qdisc_restart function]

[Figure 11-3a. dev_queue_xmit function]

Once the checksum has been handled, all the headers are ready; the next step is to decide which frame to transmit.

At this point, the behavior depends on whether the device uses the Traffic Control infrastructure and therefore has a queuing discipline assigned. Yes, this may come as a surprise. The function has just processed one buffer (defragmenting and checksumming it if needed) but depending on whether a queuing discipline is used and which one is used, and on the status of the outgoing queue, this buffer may not be the one that will actually be sent next.

11.1.3.1. Queueful devices

When it exists, the queuing discipline of the device is accessible through dev->qdisc. The input frame is queued with the enqueue virtual function, and one frame is then dequeued and transmitted via qdisc_run, described in detail in the section "Queuing Discipline Interface."

    local_bh_disable();
    q = rcu_dereference(dev->qdisc);
    ...
    if (q->enqueue) {
        spin_lock(&dev->queue_lock);
        rc = q->enqueue(skb, q);
        qdisc_run(dev);
        spin_unlock_bh(&dev->queue_lock);
        rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
        goto out;
    }

[Figure 11-3b. dev_queue_xmit function]

Note that both enqueuing and dequeuing are protected by the queue_lock lock on the queue. Softirqs are also disabled with local_bh_disable, which also takes care of disabling preemption as required by read-copy-update (RCU).

11.1.3.2. Queueless devices

Some devices, such as the loopback device, do not have a queue: whenever a frame is transmitted, it is immediately delivered.
(But because there is no place to requeue them, frames are dropped if something goes wrong; they are not given a second chance.) If you look at loopback_xmit in drivers/net/loopback.c, you will see at the end a direct call to netif_rx, bypassing all the queuing business. We saw in Chapter 10 that netif_rx is the API called by non-NAPI device drivers to put an incoming frame into the input queue and signal higher layers about the event. Since there is no input queue for the loopback device, the transmission function accomplishes two tasks: transmit on one side and receive on the other, as shown in Figure 11-4.

[Figure 11-4. (a) Queueful device transmission; (b) loopback transmission]

The last part of dev_queue_xmit is used to handle devices without a queuing discipline and therefore without an egress queue. It closely resembles the behavior of qdisc_run covered in the section "Queuing Discipline Interface." There are, however, two differences in the case where no queue is used:
11.1.4. Processing the NET_TX_SOFTIRQ: net_tx_action

We saw in Chapter 10 that the net_rx_action function is the handler associated with NET_RX_SOFTIRQ software interrupts. It is triggered by device drivers (and by itself under some specific conditions) and handles the part of the input frame processing that is postponed by device drivers to the "after interrupt handling phase." In this way, the code executed in interrupt context by the driver does only what is strictly necessary (copy the data in memory and signal the kernel about its existence by generating a software interrupt) and does not force the rest of the system to wait long; later on, the software interrupt takes care of that part of the frame processing that can wait.

net_tx_action works in a similar way. It can be triggered with raise_softirq_irqoff(NET_TX_SOFTIRQ) by devices in two different contexts, to accomplish two main tasks:

- By netif_wake_queue, when transmission is enabled on a device. In this case, it makes sure that frames waiting to be sent are actually sent when all the required conditions are met.
- By dev_kfree_skb_irq, when a transmission has completed and the device driver signals that the associated buffer can be released. In this case, it deallocates the sk_buff structures associated with successfully transmitted buffers.
The reason for the second task is as follows. We know that when code from the device driver runs in interrupt context, it needs to be as quick as possible. Releasing a buffer can take time, so it is deferred by asking the net_tx_action softirq to take care of it. Instead of using dev_kfree_skb, device drivers use dev_kfree_skb_irq. While the former deallocates the sk_buff (which actually consists of the buffer going back into a per-CPU cache), the latter simply adds the pointer to the buffer being released to the completion_queue list of the softnet_data structure associated with the CPU and lets net_tx_action do the real job later.

Let's see how net_tx_action accomplishes its two tasks.

It starts by deallocating all the buffers that have been added to the completion_queue list by the device drivers' calls to dev_kfree_skb_irq. Because net_tx_action is running outside interrupt context, a device driver could add elements to the list at any time, so net_tx_action must disable interrupts while accessing the softnet_data structure. To keep interrupts disabled as little as possible, it clears the list by setting completion_queue to NULL and saves the pointer to the list in a local variable clist, which no one else can access (note also that each CPU has its own list). This way, it can walk through the list and free each element with __kfree_skb, while drivers can continue adding new elements to completion_queue.

    if (sd->completion_queue) {
        struct sk_buff *clist;

        local_irq_disable();
        clist = sd->completion_queue;
        sd->completion_queue = NULL;
        local_irq_enable();

        while (clist != NULL) {
            struct sk_buff *skb = clist;
            clist = clist->next;

            BUG_TRAP(!atomic_read(&skb->users));
            __kfree_skb(skb);
        }
    }

The second half of the function, which transmits frames, works similarly: it uses a local variable to remain safe from hardware interrupts.
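The "detach the list, then walk it" pattern used for completion_queue can be tried out in isolation. The following userspace sketch is only a model: the buf_model and softnet_cq types and the model_* names are hypothetical, a flag stands in for the actual deallocation, and since a userspace program has no interrupts to mask, comments mark the spot where the kernel code disables them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sk_buff: a list link and a "freed" marker. */
struct buf_model {
    struct buf_model *next;
    int freed;
};

/* Models the completion_queue field of the per-CPU softnet_data. */
struct softnet_cq {
    struct buf_model *completion_queue;
};

/* Role of dev_kfree_skb_irq: push the buffer at the head in O(1). */
static void model_defer_free(struct softnet_cq *sd, struct buf_model *b)
{
    b->next = sd->completion_queue;
    sd->completion_queue = b;
}

/* Role of the first half of net_tx_action; returns the number released. */
static int model_reap(struct softnet_cq *sd)
{
    struct buf_model *clist;
    int released = 0;

    /* In the kernel, only these two assignments run with interrupts off. */
    clist = sd->completion_queue;
    sd->completion_queue = NULL;

    /* The private list can now be walked while producers keep queuing. */
    while (clist != NULL) {
        struct buf_model *b = clist;

        clist = clist->next;
        b->next = NULL;
        b->freed = 1;       /* stands in for the real deallocation */
        released++;
    }
    return released;
}
```

The design choice is that the critical section has a constant cost no matter how many buffers are pending; the potentially long walk happens on a list that nothing else can reach.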
Note that for each device, before transmitting anything, net_tx_action needs to grab the lock on the output device's queue (dev->queue_lock). If it fails to grab the lock (because another CPU holds it), it simply reschedules the device for transmission with netif_schedule.

    if (sd->output_queue) {
        struct net_device *head;

        local_irq_disable();
        head = sd->output_queue;
        sd->output_queue = NULL;
        local_irq_enable();

        while (head) {
            struct net_device *dev = head;
            head = head->next_sched;

            smp_mb__before_clear_bit();
            clear_bit(__LINK_STATE_SCHED, &dev->state);

            if (spin_trylock(&dev->queue_lock)) {
                qdisc_run(dev);
                spin_unlock(&dev->queue_lock);
            } else {
                netif_schedule(dev);
            }
        }
    }

We already saw in the section "Queuing Discipline Interface" how qdisc_run works. Devices are handled in a sequential order starting from the head of the list. Because the netif_schedule function (calling __netif_schedule internally) adds elements at the head of the list, devices are served in Last In, First Out (LIFO) order, which in some conditions may be unfair.

That completes the net_tx_action function; let's look at some contexts where it can be invoked to free buffers. Some functions that need to release a buffer can be invoked in different contexts, inside or outside interrupt context.
A wrapper is available to handle these cases elegantly:

    static inline void dev_kfree_skb_any(struct sk_buff *skb)
    {
        if (in_irq() || irqs_disabled())
            dev_kfree_skb_irq(skb);
        else
            dev_kfree_skb(skb);
    }

The dev_kfree_skb_irq function runs when the calling function is in interrupt context, and looks like this:

    static inline void dev_kfree_skb_irq(struct sk_buff *skb)
    {
        if (atomic_dec_and_test(&skb->users)) {
            struct softnet_data *sd;
            unsigned long flags;

            local_irq_save(flags);
            sd = &__get_cpu_var(softnet_data);
            skb->next = sd->completion_queue;
            sd->completion_queue = skb;
            raise_softirq_irqoff(NET_TX_SOFTIRQ);
            local_irq_restore(flags);
        }
    }

A buffer can be freed only if there are no other references to it (that is, if skb->users is 0).

Let's see an example of how the execution of net_tx_action is triggered by an indirect call to raise_softirq_irqoff(NET_TX_SOFTIRQ) by a device driver. (Another example can be found in the section "Enabling and Disabling Transmissions.")

Among the interrupt types handled by the vortex_interrupt function in drivers/net/3c59x.c we introduced earlier is an interrupt raised by the device to tell the driver that a DMA transfer from the CPU to the device is completed (DMADone). Since the buffer has been transferred to the device, the sk_buff structure can now be freed. Because the interrupt handler is running in interrupt context, the driver calls dev_kfree_skb_irq.

    if (status & DMADone) {
        if (inw(ioaddr + Wn7_MasterStatus) & 0x1000) {
            outw(0x1000, ioaddr + Wn7_MasterStatus); /* Ack the event. */
            pci_unmap_single(VORTEX_PCI(vp), vp->tx_skb_dma,
                             (vp->tx_skb->len + 3) & ~3, PCI_DMA_TODEVICE);
            dev_kfree_skb_irq(vp->tx_skb); /* Release the transferred buffer */
            if (inw(ioaddr + TxFree) > 1536) {
                netif_wake_queue(dev);
            } else {
                /* Interrupt when FIFO has room for max-sized packet. */
                outw(SetTxThreshold + (1536>>2), ioaddr + EL3_CMD);
                netif_stop_queue(dev);
            }
        }
    }

11.1.4.1. Watchdog timer

We saw in the section "Enabling and Disabling Transmissions" that transmission can be disabled by a device driver when certain conditions are met. The disabling of transmission is supposed to be temporary, so when transmission is not re-enabled within a reasonable amount of time, the kernel assumes the device is experiencing some problems and should be restarted.

This is achieved by a per-device timer that is started with dev_watchdog_up when the device is activated with dev_activate. The timer regularly expires, makes sure everything is OK with the device, and restarts itself. When it detects a problem (the device's egress queue is disabled, so netif_queue_stopped returns TRUE, and too much time has passed since the last frame transmission took place), the timer's handler invokes a routine registered by the device driver, which resets the NIC.

Here are the net_device fields used to implement this mechanism:
When the timer expires, the kernel handler dev_watchdog takes action by calling the function to which tx_timeout points. The latter normally resets the card and restarts the interface scheduler with netif_wake_queue. The proper value for watchdog_timeo depends on the interface. If the driver does not set it, it defaults to 5 seconds. The parameters to take into account when defining the value are:
The value of watchdog_timeo is usually defined as a multiple of HZ, the number of timer ticks per second, so that a value of HZ represents 1 second. HZ is a constant whose value depends on the platform (it is defined in the architecture-dependent file include/asm-XXX/param.h). As you can see in Table 11-1, even devices of the same type may take different values for the timeout. The table lists only a few examples; it is not a complete list.
The watchdog timer mechanism is provided by the Traffic Control code. However, advanced device drivers may implement their own watchdog timers, too. See drivers/net/e1000/e1000_main.c for an example.
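The check made each time the watchdog timer expires can be sketched as follows. This is a userspace model under stated assumptions, not the kernel's dev_watchdog: the wd_dev structure and model_* names are hypothetical, MODEL_HZ is an assumed tick rate, time is passed in explicitly as a tick count, and the field called trans_start is assumed here to hold the tick at which the last transmission started.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_HZ 100UL  /* ticks per second; an assumed value */

struct wd_dev {
    bool queue_stopped;             /* what netif_queue_stopped would report */
    unsigned long trans_start;      /* tick of the last transmission start */
    unsigned long watchdog_timeo;   /* timeout, expressed in ticks */
    int resets;                     /* how many times the NIC was reset */
};

/* Role of the driver's tx_timeout routine: reset and wake the queue. */
static void model_tx_timeout(struct wd_dev *dev)
{
    dev->resets++;
    dev->queue_stopped = false;
}

/* Role of the timer handler; returns true if the driver routine ran. */
static bool model_dev_watchdog(struct wd_dev *dev, unsigned long now)
{
    /* Unsigned subtraction stays correct when the tick counter wraps. */
    if (dev->queue_stopped &&
        now - dev->trans_start > dev->watchdog_timeo) {
        model_tx_timeout(dev);
        return true;
    }
    return false;
}
```

With watchdog_timeo set to 5 * MODEL_HZ, a device whose queue has been stopped for more than five seconds' worth of ticks is reset, while a device whose queue is running is left alone no matter how long ago it last transmitted.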