
11.1. Enabling and Disabling Transmissions

In the section "Congestion Management" in Chapter 10, we learned about some conditions under which frame reception must be disabled, either on a single device or globally. Something similar applies to frame transmission as well.

The status of the egress queue is represented by the flag __LINK_STATE_XOFF in net_device->state. Its value can be manipulated and checked with the following functions, defined in include/linux/netdevice.h:[*]

[*] The other flags in the list are described in Chapters 8 and 10.


netif_start_queue

Enables transmission for the device. It is usually called when the device is activated and can be called again later if needed to restart a stopped device.


netif_stop_queue

Disables transmission for the device. Any attempt to transmit something on the device will be denied. Later in this section is an example of a common case where this function is used.


netif_queue_stopped

Returns the status of the egress queue: enabled or disabled. This function is simply:

static inline int netif_queue_stopped(const struct net_device *dev)
{
    return test_bit(__LINK_STATE_XOFF, &dev->state);
}

Only device drivers enable and disable transmission of devices.
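
For reference, netif_start_queue and netif_stop_queue are equally thin wrappers around the same flag. This is how they look in 2.6-era include/linux/netdevice.h (minor details vary across kernel versions):

static inline void netif_start_queue(struct net_device *dev)
{
    clear_bit(__LINK_STATE_XOFF, &dev->state);
}

static inline void netif_stop_queue(struct net_device *dev)
{
    set_bit(__LINK_STATE_XOFF, &dev->state);
}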

Why stop and start a queue once the device is running? One reason is that a device can temporarily use up its memory, thus causing a transmission attempt to fail. In the past, the transmitting function (which I introduce later in the section "dev_queue_xmit Function") would have to deal with this problem by putting the frame back into the queue (requeuing it). Now, thanks to the __LINK_STATE_XOFF flag, this extra processing can be avoided. When the device driver realizes that it does not have enough space to store a frame of maximum size (MTU), it stops the egress queue with netif_stop_queue. In this way, it is possible to avoid wasting resources with future transmissions that the kernel already knows will fail. The following example of this throttling at work is taken from vortex_start_xmit (the hard_start_xmit method used by the drivers/net/3c59x.c driver):

    outsl(ioaddr + TX_FIFO, skb->data, (skb->len + 3) >> 2);
    dev_kfree_skb (skb);
    if (inw(ioaddr + TxFree) > 1536) {
        netif_start_queue (dev);    /* AKPM: redundant? */
    } else {
        /* Interrupt us when the FIFO has room for max-sized packet. */
        netif_stop_queue(dev);
        outw(SetTxThreshold + (1536>>2), ioaddr + EL3_CMD);
    }

Shortly after the transmission by outsl, the code checks whether there is space for a frame of maximum size (1536), and uses netif_stop_queue to stop the device's egress queue if there is not. This is a relatively crude technique used to avoid transmission failures due to a shortage of memory. Of course, the transmission of a frame of 300 bytes would succeed when just a little more than 300 bytes are left; therefore, checking for 1,536 bytes could disable transmission unnecessarily. The code could compromise by using a lower value, such as 500, but in the end, the gain would not be that big and there could be failures when bigger frames arrive while transmission is enabled.

To cover all eventualities, the code calls netif_start_queue when there is enough memory on the device. The "redundant?" comment in the code refers to the practice of restarting the queue on two types of interrupts: the driver requests a queue restart both when the device indicates that it has finished transmitting and when it indicates that there is enough space in its memory for another frame. The queue would probably be restarted promptly if the driver did so on only one of these interrupts, but that's not guaranteed, so the request to restart the queue is issued under both circumstances.

The code also sends a SetTxThreshold command to the device, which instructs the device to generate an interrupt when a given amount of memory (the size of the MTU, in this case) becomes available.

You may wonder when and how the queue will be re-enabled in the previous scenario. As just noted, the Vortex driver asks the device to generate an interrupt when enough memory for a maximum-sized frame becomes available. This is the piece of code that handles such an interrupt:

static void vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
           ... ... ...
        if (status & TxAvailable) {
            if (vortex_debug > 5)
                printk(KERN_DEBUG "    TX room bit was handled.\n");
            /* There's room in the FIFO for a full-sized packet. */
            outw(AckIntr | TxAvailable, ioaddr + EL3_CMD);
            netif_wake_queue (dev);
        }
           ... ... ...
}

The bits of the status variable represent the reasons why the interrupt was generated by the card. The TxAvailable bit indicates that space is available and that it's therefore safe to wake up the device (this is called waking the queue, and is carried out by netif_wake_queue). Values such as EL3_CMD are simply offsets from ioaddr used by the driver to read or write the network card registers at the right positions.

Note that the egress queue is re-enabled with netif_wake_queue instead of netif_start_queue. That function, which we will see later in more detail, not only enables the egress queue but also asks the kernel to check whether anything in that queue is waiting to be transmitted. The reason is that there could have been transmission attempts while the queue was disabled; they would have failed, and the frames that could not be sent would have been put back into the egress queue.

11.1.1. Scheduling a Device for Transmission

When describing the ingress path, we saw that when a device receives a frame, its driver invokes a kernel function (the one invoked depends on whether the driver uses NAPI) that adds the device to a polling list and schedules the NET_RX_SOFTIRQ for execution.

Something very similar happens on the egress path. To transmit frames, the kernel provides the dev_queue_xmit function, described later in its own section. This function dequeues a frame from the device's egress queue and feeds it to the device's hard_start_xmit method. However, dev_queue_xmit might not be able to transmit for various reasons: for instance, because the device's egress queue is disabled, as we saw in the previous section, or because the lock on the device queue is already taken. To handle the latter case, the kernel provides a function called __netif_schedule that schedules a device for transmission (somewhat similar to what netif_rx_schedule does on the reception path). This function is never called directly, but through two wrappers shown later in this section.

Here is the function's definition from include/linux/netdevice.h:

static inline void __netif_schedule(struct net_device *dev)
{
    if (!test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) {
        unsigned long flags;
        struct softnet_data *sd;
 
        local_irq_save(flags);
        sd = &__get_cpu_var(softnet_data);
        dev->next_sched = sd->output_queue;
        sd->output_queue = dev;
        raise_softirq_irqoff(NET_TX_SOFTIRQ);
        local_irq_restore(flags);
    }
}

__netif_schedule accomplishes two main tasks:

  • It adds the device to the head of the output_queue list. This list is the counterpart to the poll_list list used by reception. There is one output_queue for each CPU, just as there is one poll_list for each CPU. However, output_queue is used by both NAPI and non-NAPI devices, and poll_list is used only to handle non-NAPI devices. The devices in the output_queue list are linked together with the net_device->next_sched pointer. You will see in the section "Processing the NET_TX_SOFTIRQ: net_tx_action" how that list is used.

    We already saw in the section "softnet_data Structure" in Chapter 9 that output_queue represents a list of devices that have something to send (because they failed on previous attempts, as described in the section "Queuing Discipline Interface") or whose egress queues have been re-enabled after having been disabled for a while. Because __netif_schedule may be called both inside and outside interrupt context, it disables interrupts while adding the input device to the output_queue list.

  • It schedules the NET_TX_SOFTIRQ softirq for execution. __LINK_STATE_SCHED is used to mark devices that are in the output_queue list because they have something to send. (__LINK_STATE_SCHED is the counterpart of the reception path's __LINK_STATE_RX_SCHED.) Note that if the device was already scheduled for transmission, __netif_schedule would not do anything.

Since it does not make sense to schedule a device for transmission if transmission is disabled on the device, the kernel provides two functions to be used instead, both wrappers around __netif_schedule:


netif_schedule

Simply makes sure transmission is enabled on the device before scheduling it for transmission:

static inline void netif_schedule(struct net_device *dev)
{
    if (!test_bit(__LINK_STATE_XOFF, &dev->state))
        __netif_schedule(dev);
}


netif_wake_queue

Enables transmission for the device and, if transmission was previously disabled, schedules the device for transmission. This scheduling is needed because there could have been transmission attempts while the device queue was disabled. We saw an example of its use in the previous section.

static inline void netif_wake_queue(struct net_device *dev)
{
    ...
    if (test_and_clear_bit(__LINK_STATE_XOFF, &dev->state))
        __netif_schedule(dev);
}

test_and_clear_bit clears the __LINK_STATE_XOFF flag if it is set, and returns the old value.

Note that a call to netif_wake_queue is equivalent to a call to both netif_start_queue and netif_schedule. I said in the section "Enabling and Disabling Transmissions" that it is the responsibility of the driver, not higher-layer functions, to disable and enable transmission on devices. Usually, high-level functions schedule transmissions on devices, and device drivers disable and re-enable the queue when required, such as to handle a shortage of memory. Therefore, it should not come as a surprise that netif_wake_queue is the one used by device drivers, and netif_schedule is the one used elsewhere (for example, by net_tx_action[*] and Traffic Control).

[*] net_tx_action schedules a device for transmission when it cannot grab the dev->queue_lock lock on the device's egress queue and therefore cannot transmit.

A device driver uses netif_wake_queue in the following cases:

  • We will see in the section "Watchdog timer" that device drivers use a watchdog timer to recover from a transmission that hangs. In such a situation, the virtual function net_device->tx_timeout usually resets the card. During that black hole in which the device is not usable, there could be other transmission attempts, so the driver needs to first enable the device's queue and then schedule the device for transmission. The same applies to interrupts that signal error conditions (look at drivers/net/3c59x.c for some examples).

  • When (as previously requested by the driver itself) the device signals to the driver that it has enough memory to handle the transmission of a frame of a given size, the device can be awakened. We already saw an example of this practice in the previous section in relation to the TxAvailable interrupt. The reason for using this function, again, is that there could have been transmission attempts while the driver had the queue disabled. A similar consideration applies to the interrupt type that tells the driver when a driver-to-card DMA transfer has completed. (A sketch of this stop/wake pattern follows this list.)
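
The following is a minimal sketch of that pattern for a generic ring-buffer NIC driver. All mydrv_* names are hypothetical; only the netif_* calls, netdev_priv, dev_kfree_skb_irq, and the return values are real kernel interfaces:

/* Hypothetical transmit routine: stop the queue when the TX ring is full. */
static int mydrv_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct mydrv_priv *priv = netdev_priv(dev);

    mydrv_post_to_tx_ring(priv, skb);   /* hypothetical: hand the buffer to the NIC */

    if (mydrv_tx_ring_full(priv))       /* no room left for a maximum-sized frame */
        netif_stop_queue(dev);

    return NETDEV_TX_OK;
}

/* Hypothetical interrupt handler: wake the queue once descriptors are reclaimed. */
static irqreturn_t mydrv_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct net_device *dev = dev_id;
    struct mydrv_priv *priv = netdev_priv(dev);

    mydrv_reclaim_tx_ring(priv);        /* hypothetical: free transmitted buffers
                                           with dev_kfree_skb_irq */

    if (netif_queue_stopped(dev) && !mydrv_tx_ring_full(priv))
        netif_wake_queue(dev);          /* room again: enable and reschedule */

    return IRQ_HANDLED;
}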

11.1.2. Queuing Discipline Interface

Almost all devices use a queue to schedule egress traffic, and the kernel can use algorithms known as queuing disciplines to arrange the frames in the most efficient order for transmission. Although a detailed discussion of Traffic Control and its queuing disciplines is outside the scope of this book, in this section I'll provide a brief overview of the interface between device drivers and the transmission layer discussed in this chapter.

Each Traffic Control queuing discipline can provide different function pointers to be called by higher layers to accomplish different tasks. Among the most important functions are the following (a trimmed view of the structure that holds them appears after the list):


enqueue

Adds an element to the queue


dequeue

Extracts an element from the queue


requeue

Puts back on the queue an element that was previously extracted (e.g., because of a transmission failure)
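
These three hooks are function pointers in the queuing discipline's operations table. A trimmed view of struct Qdisc_ops from 2.6-era kernels (unrelated fields omitted) shows where they live:

struct Qdisc_ops
{
    struct Qdisc_ops        *next;
    struct Qdisc_class_ops  *cl_ops;
    char                    id[IFNAMSIZ];
    int                     priv_size;

    int                     (*enqueue)(struct sk_buff *, struct Qdisc *);
    struct sk_buff *        (*dequeue)(struct Qdisc *);
    int                     (*requeue)(struct sk_buff *, struct Qdisc *);
    unsigned int            (*drop)(struct Qdisc *);
    /* ... init, reset, destroy, change, dump ... */
};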

Whenever a device is scheduled for transmission, the next frame to transmit is selected by the qdisc_run function, which indirectly calls the dequeue virtual function of the associated queuing discipline.

Once again, the real job is actually done by another function, qdisc_restart. The qdisc_run function, defined in include/net/pkt_sched.h, is simply a wrapper that filters out requests for devices whose egress queues are disabled:

static inline void qdisc_run(struct net_device *dev)
{
    while (!netif_queue_stopped(dev) && qdisc_restart(dev) < 0)
        /* NOTHING */;
}

11.1.2.1. qdisc_restart function

We saw earlier the common cases where a device is scheduled for transmission. Sometimes it is because something in the egress queue is waiting to be transmitted. But at other times, the device is scheduled because the queue has been disabled for a while and therefore there could be something waiting in the queue from previous failed transmission attempts. The driver does not know whether anything has actually arrived; it must schedule the device in case data is waiting. If in fact no data is waiting, the subsequent call to the dequeue method fails. Even if data is waiting, the call can fail because complex queuing disciplines may decide not to transmit any of the data. Therefore, qdisc_restart, defined in net/sched/sch_generic.c, takes various actions based on the return value of the dequeue method.

int qdisc_restart(struct net_device *dev)
{
    struct Qdisc *q = dev->qdisc;
    struct sk_buff *skb;
 
    if ((skb = q->dequeue(q)) != NULL) {

The dequeue function is called at the very start. Let's suppose it succeeded. Transmitting a frame requires the acquisition of two locks:

  • The lock that protects the queue (dev->queue_lock). This is acquired by the caller of qdisc_restart (dev_queue_xmit).

  • The lock on the driver's transmit routine hard_start_xmit (dev->xmit_lock). The lock is managed by this function. When the device driver already implements its own locking, it indicates this by setting the NETIF_F_LLTX flag (lockless transmission feature) in dev->features to tell the upper layers that there is no need to acquire the dev->xmit_lock lock as well. The use of NETIF_F_LLTX allows the kernel to optimize the transmit data path by not acquiring dev->xmit_lock when it is not needed. Of course, there is no need to acquire the lock if the queue is empty.

Note that qdisc_restart does not release the queue_lock immediately after dequeuing a buffer, because the function might have to requeue the buffer right away if it fails to acquire the lock on the driver. The function releases queue_lock when it has the driver lock in hand, and reacquires queue_lock before returning. Ultimately, dev_queue_xmit will take care of releasing it.

When the driver does not support NETIF_F_LLTX and the driver lock is already taken (i.e., spin_trylock returns 0), transmission fails. If qdisc_restart fails to grab the lock on the driver, it means that another CPU is transmitting through the same device. All that qdisc_restart can do in this case is put the frame back into the queue and reschedule the device for transmission, since it does not want to wait. If the function is running on the same CPU that is holding the lock, a loop (i.e., a bug in the code) has been detected and the frame is dropped; otherwise, it is just a collision.

            if (!spin_trylock(&dev->xmit_lock)) {
            collision:
                ...
                goto requeue;
            }
            ...
requeue:
        q->ops->requeue(skb, q);
        netif_schedule(dev);

Once the driver lock is successfully acquired, the lock on the queue is released so that other CPUs can access the queue. Sometimes, there is no need to acquire the driver lock because NETIF_F_LLTX is set. In either case, qdisc_restart is ready to start its real job.

            if (!netif_queue_stopped(dev)) {
                int ret;
                if (netdev_nit)
                    dev_queue_xmit_nit(skb, dev);
 
                ret = dev->hard_start_xmit(skb, dev);
                if (ret == NETDEV_TX_OK) {
                    if (!nolock) {
                        dev->xmit_lock_owner = -1;
                        spin_unlock(&dev->xmit_lock);
                    }
                    spin_lock(&dev->queue_lock);
                    return -1;
                }
                if (ret == NETDEV_TX_LOCKED && nolock) {
                    spin_lock(&dev->queue_lock);
                    goto collision;
                }
            }

We saw in the previous section that qdisc_run has already checked the status of the egress queue with netif_queue_stopped, but here qdisc_restart checks it again. The second check is not superfluous. Consider this scenario: when qdisc_run called netif_queue_stopped, the lock on the driver was not taken yet. By the time the lock is taken, another CPU could have sent something and the card could have run out of buffer space. Therefore, netif_queue_stopped may have returned FALSE before but would now return TRUE.

netdev_nit represents the number of protocol sniffers registered. If any are registered, dev_queue_xmit_nit is used to deliver a copy of the frame to each. (We saw something similar for reception in netif_receive_skb in Chapter 10.)

Finally we get to the invocation of the device driver's virtual function for frame transmission. The function provided by the device driver is dev->hard_start_xmit, which is defined for each device at initialization time (see Chapter 8). The NETDEV_TX_XXX values returned by hard_start_xmit routines are listed in include/linux/netdevice.h. Here is how qdisc_restart handles them:


NETDEV_TX_OK

The transmission succeeded. The buffer is not released yet (kfree_skb is not issued). We will see in the section "Processing the NET_TX_SOFTIRQ: net_tx_action" that the driver does not release the buffer itself but asks the kernel to do so by means of the NET_TX_SOFTIRQ softirq. This provides more efficient memory handling than if each driver did its own freeing.


NETDEV_TX_BUSY

The driver has discovered that the NIC lacks sufficient room in its transmit buffer pool. When this condition is detected, the driver often calls netif_stop_queue too (see the section "Enabling and Disabling Transmissions").


NETDEV_TX_LOCKED

The driver is locked. This return value is used only by drivers that support NETIF_F_LLTX.

In summary, transmission fails and a frame must be put back onto the queue when one of the following conditions is true:

  • The queue is disabled (netif_queue_stopped(dev) is true).

  • Another CPU is holding the lock on the driver.

  • The driver failed (hard_start_xmit did not return NETDEV_TX_OK).

See Figure 11-2 for details of the qdisc_restart function.
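
In condensed form, the decision logic just described looks like this (a simplified sketch paraphrasing net/sched/sch_generic.c; the NETIF_F_LLTX case and the xmit_lock_owner bookkeeping are omitted):

int qdisc_restart(struct net_device *dev)        /* sketch, not verbatim */
{
    struct Qdisc *q = dev->qdisc;
    struct sk_buff *skb;

    if ((skb = q->dequeue(q)) == NULL)
        return q->q.qlen;                        /* nothing to send; >= 0 stops qdisc_run */

    if (!spin_trylock(&dev->xmit_lock))          /* another CPU owns the driver */
        goto requeue;

    spin_unlock(&dev->queue_lock);               /* let other CPUs use the queue */
    if (!netif_queue_stopped(dev) &&
        dev->hard_start_xmit(skb, dev) == NETDEV_TX_OK) {
        spin_unlock(&dev->xmit_lock);
        spin_lock(&dev->queue_lock);
        return -1;                               /* success: qdisc_run may loop again */
    }
    spin_unlock(&dev->xmit_lock);                /* queue stopped or NETDEV_TX_BUSY */
    spin_lock(&dev->queue_lock);

requeue:
    q->ops->requeue(skb, q);                     /* give the frame a second chance */
    netif_schedule(dev);
    return 1;
}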

11.1.3. dev_queue_xmit Function

This function is the interface to the device driver that performs a transmission. As shown in Figure 9-2 in Chapter 9, dev_queue_xmit can lead to the execution of the driver transmit function hard_start_xmit through two alternate paths:


Interfacing to Traffic Control (the QoS layer)

This is done through the qdisc_run function that we already described in the previous section.


Invoking hard_start_xmit directly

This is done only for devices that do not use the Traffic Control infrastructures (i.e., virtual devices).

We will look at these cases soon, but let's start with the checks and tasks common to both.

When dev_queue_xmit is called, all the information required to transmit the frame, such as the outgoing device, the next hop, and its link layer address, is ready. Parts VI and VII describe how those parameters are initialized.

Figures 11-3(a) and 11-3(b) describe dev_queue_xmit.

dev_queue_xmit receives only an sk_buff structure as input. This contains all the information the function needs. skb->dev, for instance, is the outgoing device, and skb->data points to the beginning of the payload, whose length is skb->len.

int dev_queue_xmit(struct sk_buff *skb)

The main tasks of dev_queue_xmit are:

  • Checking whether the frame is composed of fragments and whether the device can handle them through scatter/gather DMA; combining the fragments if the device is incapable of doing so. See Chapter 21 for a discussion of fragmented buffers.

  • Making sure the L4 checksum (that is, TCP/UDP) is computed, unless the device computes the checksum in hardware. See Chapter 18 for more details on checksumming.

  • Selecting which frame to transmit (the one pointed to by the input sk_buff may not be the one to transmit because there is a queue to honor).

In the following code, the data payload is a list of fragments when skb_shinfo(skb)->frag_list is non-NULL; otherwise, the payload is a single block. If there are fragments, the code checks whether scatter/gather DMA is a feature supported by the device, and if not, combines the fragments into a single buffer itself. The function must also combine the fragments if any of them are stored in a memory area whose address is too big to be addressed by the device (that is, if illegal_highdma(dev, skb) is true).[*]

[*] Some devices can use only 16-bit addresses, which constrains the portion of addressable memory.

    if (skb_shinfo(skb)->frag_list &&
        !(dev->features&NETIF_F_FRAGLIST) &&
        __skb_linearize(skb, GFP_ATOMIC)) {
        goto out_kfree_skb;
    }
 
    if (skb_shinfo(skb)->nr_frags &&
        (!(dev->features&NETIF_F_SG) || illegal_highdma(dev, skb)) &&
        __skb_linearize(skb, GFP_ATOMIC)) {
        goto out_kfree_skb;
    }

The joining of the fragments is done by __skb_linearize, which can fail for one of the following reasons:

  • The new buffer required to store the joined fragments could not be allocated.

  • The sk_buff buffer is shared with some other subsystem (that is, the reference count is greater than one). In this case, the function does not actually fail, but generates a warning with a call to BUG( ).

The L4 checksum can be calculated both in software and in hardware.[*] Not all network cards can compute the checksum in hardware; the ones that can will set the associated bit flag in net_device->features during device initialization. This tells higher network layers that they do not need to worry about checksumming. The checksum must instead be calculated in software if:

[*] The algorithm used by each protocol to compute the checksum is analyzed in the associated chapters.

  • There is no support for hardware checksumming.

  • The interface can use hardware checksumming only for TCP/UDP packets over IP, but the packet being transmitted does not use IP or uses another L4 protocol over IP.

The software checksum is calculated with skb_checksum_help:

    if (skb->ip_summed == CHECKSUM_HW &&
        (!(dev->features & (NETIF_F_HW_CSUM | NETIF_F_NO_CSUM)) &&
         (!(dev->features & NETIF_F_IP_CSUM) ||
          skb->protocol != htons(ETH_P_IP))))
        if (skb_checksum_help(skb, 0))
            goto out_kfree_skb;

Figure 11-2. qdisc_restart function


Figure 11-3a. dev_queue_xmit function


Once the checksum has been handled, all the headers are ready; the next step is to decide which frame to transmit.

At this point, the behavior depends on whether the device uses the Traffic Control infrastructure and therefore has a queuing discipline assigned. Yes, this may come as a surprise. The function has just processed one buffer (defragmenting and checksumming it if needed) but depending on whether a queuing discipline is used and which one is used, and on the status of the outgoing queue, this buffer may not be the one that will actually be sent next.

11.1.3.1. Queueful devices

When it exists, the queuing discipline of the device is accessible through dev->qdisc. The input frame is queued with the enqueue virtual function, and one frame is then dequeued and transmitted via qdisc_run, described in detail in the section "Queuing Discipline Interface."

    local_bh_disable( );

Figure 11-3b. dev_queue_xmit function


    q = rcu_dereference(dev->qdisc);
    ...
    if (q->enqueue) {
        spin_lock(&dev->queue_lock);
 
        rc = q->enqueue(skb, q);
 
        qdisc_run(dev);
 
        spin_unlock_bh(&dev->queue_lock);
        rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
        goto out;
    }

Note that both enqueuing and dequeuing are protected by the queue_lock lock on the queue. Softirqs are also disabled with local_bh_disable, which also takes care of disabling preemption as required by read-copy-update (RCU).

11.1.3.2. Queueless devices

Some devices, such as the loopback device, do not have a queue: whenever a frame is transmitted, it is immediately delivered. (But because there is no place to requeue them, frames are dropped if something goes wrong; they are not given a second chance.) If you look at loopback_xmit in drivers/net/loopback.c, you will see at the end a direct call to netif_rx, bypassing all the queuing business. We saw in Chapter 10 that netif_rx is the API called by non-NAPI device drivers to put an incoming frame into the input queue and signal higher layers about the event. Since there is no input queue for the loopback device, the transmission function accomplishes two tasks: transmit on one side and receive on the other, as shown in Figure 11-4.

Figure 11-4. (a) Queueful device transmission; (b) loopback transmission
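
As a concrete illustration, here is a trimmed view of loopback_xmit (paraphrasing 2.6-era drivers/net/loopback.c; statistics updates and checksum/TSO handling are omitted):

static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
{
    skb_orphan(skb);                            /* detach the buffer from the sending socket */

    skb->protocol = eth_type_trans(skb, dev);   /* set the L3 protocol from the Ethernet header */
    skb->dev = dev;

    /* ... update the per-CPU loopback statistics ... */

    netif_rx(skb);                              /* "receive" the frame on the same device */
    return 0;
}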


The last part of dev_queue_xmit is used to handle devices without a queuing discipline and therefore without an egress queue. It closely resembles the behavior of qdisc_run covered in the section "Queuing Discipline Interface." There are, however, two differences in the case where no queue is used:

  • When a transmission fails, the driver cannot put the buffer back into any queue because there is no queue, so the buffer is dropped by dev_queue_xmit. If the higher layers are using a reliable protocol such as TCP, the data will eventually be retransmitted; otherwise, it will be lost.

  • The NETIF_F_LLTX feature introduced in the section "qdisc_restart function" is taken care of by the two macros HARD_TX_LOCK and HARD_TX_UNLOCK, shown below. HARD_TX_LOCK uses spin_lock rather than spin_trylock: when the driver lock is already taken, dev_queue_xmit spins, waiting for it to be released.
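
For reference, here is how the two macros look in 2.6-era net/core/dev.c; they simply skip the driver lock entirely when the driver has declared NETIF_F_LLTX:

#define HARD_TX_LOCK(dev, cpu) {                       \
    if ((dev->features & NETIF_F_LLTX) == 0) {         \
        spin_lock(&dev->xmit_lock);                    \
        dev->xmit_lock_owner = cpu;                    \
    }                                                  \
}

#define HARD_TX_UNLOCK(dev) {                          \
    if ((dev->features & NETIF_F_LLTX) == 0) {         \
        dev->xmit_lock_owner = -1;                     \
        spin_unlock(&dev->xmit_lock);                  \
    }                                                  \
}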

11.1.4. Processing the NET_TX_SOFTIRQ: net_tx_action

We saw in Chapter 10 that the net_rx_action function is the handler associated with NET_RX_SOFTIRQ software interrupts. It is triggered by device drivers (and by itself under some specific conditions) and handles the part of the input frame processing that is postponed by device drivers to the "after interrupt handling phase." In this way, the code executed in interrupt context by the driver does only what is strictly necessary (copy the data in memory and signal the kernel about its existence by generating a software interrupt) and does not force the rest of the system to wait long; later on, the software interrupt takes care of that part of the frame processing that can wait.

net_tx_action works in a similar way. It can be triggered with raise_softirq_irqoff(NET_TX_SOFTIRQ) by devices in two different contexts, to accomplish two main tasks:

  • By netif_wake_queue when transmission is enabled on a device. In this case, it makes sure that frames waiting to be sent are actually sent when all the needed conditions are met (for instance, when the device has enough memory).

  • By dev_kfree_skb_irq when a transmission has completed and the device driver uses that routine to signal that the associated buffer can be released. In this case, net_tx_action deallocates the sk_buff structures associated with successfully transmitted buffers.

The reason for the second task is as follows. We know that when code from the device driver runs in interrupt context, it needs to be as quick as possible. Releasing a buffer can take time, so it is deferred by asking the net_tx_action softirq to take care of it. Instead of using dev_kfree_skb, device drivers use dev_kfree_skb_irq. While the former deallocates the sk_buff (which actually consists of the buffer going back into a per-CPU cache), the latter simply adds the pointer to the buffer being released to the completion_queue list of the softnet_data structure associated with the CPU and lets net_tx_action do the real job later.

Let's see how net_tx_action accomplishes its two tasks.

It starts by deallocating all the buffers that have been added to the completion_queue list by the device drivers' calls to dev_kfree_skb_irq. Because net_tx_action is running outside interrupt context, a device driver could add elements to the list at any time, so net_tx_action must disable interrupts while accessing the softnet_data structure. To keep interrupts disabled as little as possible, it clears the list by setting completion_queue to NULL and saves the pointer to the list in a local variable clist, which no one else can access (note also that each CPU has its own list). This way, it can walk through the list and free each element with __kfree_skb, while drivers can continue adding new elements to completion_queue.

    if (sd->completion_queue) {
        struct sk_buff *clist;
 
        local_irq_disable( );
        clist = sd->completion_queue;
        sd->completion_queue = NULL;
        local_irq_enable( );
 
        while (clist != NULL) {
            struct sk_buff *skb = clist;
            clist = clist->next;
 
            BUG_TRAP(!atomic_read(&skb->users));
            __kfree_skb(skb);
        }
    }

The second half of the function, which transmits frames, works similarly: it uses a local variable to remain safe from hardware interrupts. Note that for each device, before transmitting anything, the function needs to grab the lock on the output device's queue (dev->queue_lock). If the function fails to grab the lock (because another CPU holds it), it simply reschedules the device for transmission with netif_schedule.

    if (sd->output_queue) {
        struct net_device *head;
 
        local_irq_disable( );
        head = sd->output_queue;
        sd->output_queue = NULL;
        local_irq_enable( );
 
        while (head) {
            struct net_device *dev = head;
            head = head->next_sched;
 
            smp_mb__before_clear_bit( );
            clear_bit(__LINK_STATE_SCHED, &dev->state);
 
            if (spin_trylock(&dev->queue_lock)) {
                qdisc_run(dev);
                spin_unlock(&dev->queue_lock);
            } else {
                netif_schedule(dev);
            }
        }
    }

We already saw in the section "Queuing Discipline Interface" how qdisc_run works. Devices are handled in sequential order starting from the head of the list. Because the netif_schedule function (which calls __netif_schedule internally) adds elements at the head of the list, devices are served in Last In, First Out (LIFO) order, which in some conditions may be unfair.

That completes the net_tx_action function; let's look at some contexts where it can be invoked to free buffers. A function that wants to release a buffer may itself run either inside or outside interrupt context. A wrapper is available to handle both cases elegantly:

static inline void dev_kfree_skb_any(struct sk_buff *skb)
{
    if (in_irq( ) || irqs_disabled( ))
        dev_kfree_skb_irq(skb);
    else
        dev_kfree_skb(skb);
}

The dev_kfree_skb_irq function runs when the calling function is in interrupt context, and looks like this:

static inline void dev_kfree_skb_irq(struct sk_buff *skb)
{
    if (atomic_dec_and_test(&skb->users)) {
        struct softnet_data *sd;
        unsigned long flags;
 
        local_irq_save(flags);
        sd = &__get_cpu_var(softnet_data);
        skb->next = sd->completion_queue;
        sd->completion_queue = skb;
        raise_softirq_irqoff(NET_TX_SOFTIRQ);
        local_irq_restore(flags);
    }
}

A buffer can be freed only if there are no other references to it (that is, if skb->users is 0).

Let's see an example of how the execution of net_tx_action is triggered by an indirect call to raise_softirq_irqoff(NET_TX_SOFTIRQ) by a device driver. (Another example can be found in the section "Enabling and Disabling Transmissions.")

Among the interrupt types handled by the vortex_interrupt function in drivers/net/3c59x.c, introduced earlier, is one generated by the device to tell the driver that a DMA transfer from the CPU to the device has completed (DMADone). Since the buffer has been transferred to the device, the sk_buff structure can now be freed. Because the interrupt handler is running in interrupt context, the driver calls dev_kfree_skb_irq.

if (status & DMADone) {
    if (inw(ioaddr + Wn7_MasterStatus) & 0x1000) {
        outw(0x1000, ioaddr + Wn7_MasterStatus); /* Ack the event. */
        pci_unmap_single(VORTEX_PCI(vp), vp->tx_skb_dma,
                 (vp->tx_skb->len + 3) & ~3, PCI_DMA_TODEVICE);
        dev_kfree_skb_irq(vp->tx_skb); /* Release the transferred buffer */
        if (inw(ioaddr + TxFree) > 1536) {
            netif_wake_queue(dev);
        } else { /* Interrupt when FIFO has room for max-sized packet. */
            outw(SetTxThreshold + (1536>>2), ioaddr + EL3_CMD);
            netif_stop_queue(dev);
        }
    }
}

11.1.4.1. Watchdog timer

We saw in the section "Enabling and Disabling Transmissions" that transmission can be disabled by a device driver when certain conditions are met. The disabling of transmission is supposed to be temporary, so when transmission is not re-enabled within a reasonable amount of time, the kernel assumes the device is experiencing some problems and should be restarted.

This is achieved by a per-device timer that is started with dev_watchdog_up when the device is activated with dev_activate. The timer regularly expires, makes sure everything is OK with the device, and restarts itself. When it detects a problem (the device's egress queue is disabled, i.e., netif_queue_stopped returns TRUE, and too much time has passed since the last frame transmission took place), the timer's handler invokes a routine registered by the device driver, which resets the NIC.

Here are the net_device fields used to implement this mechanism:


trans_start

This is the timestamp initialized by the device driver when the last frame transmission started.


watchdog_timer

This is the timer started by Traffic Control. The handler executed when the timer expires is dev_watchdog, defined in net/sched/sch_generic.c.


watchdog_timeo

This is the amount of time the watchdog waits before concluding that a transmission is hung. It is initialized by the device driver; when it is set to 0, watchdog_timer is not started.


tx_timeout

This is the routine provided by the device driver that will be invoked by dev_watchdog to reset the device.

When the timer expires, the kernel handler dev_watchdog takes action by calling the function to which tx_timeout points. The latter normally resets the card and restarts the interface scheduler with netif_wake_queue.
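
For illustration, here is a minimal sketch of the driver side of this mechanism. The mydrv_* names are hypothetical; the net_device fields and the netif_wake_queue call are the real interfaces described above:

/* Hypothetical handler invoked by dev_watchdog when a transmission hangs. */
static void mydrv_tx_timeout(struct net_device *dev)
{
    mydrv_reset_hw(dev);          /* hypothetical routine that resets the NIC */
    dev->trans_start = jiffies;   /* restart the "time since last transmission" clock */
    netif_wake_queue(dev);        /* re-enable the queue and reschedule the device */
}

/* At initialization time, the driver registers the handler and its timeout: */
dev->tx_timeout = mydrv_tx_timeout;
dev->watchdog_timeo = 5 * HZ;     /* e.g., five seconds */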

The proper value for watchdog_timeo depends on the interface. If the driver does not set it, it defaults to 5 seconds. The parameters to take into account when defining the value are:


The likelihood of transmission collisions

This is zero for point-to-point links, but can be high on shared and overloaded Ethernet links plugged into hubs.


The interface speed

The slower the interface, the bigger the timeout should be.

The value of watchdog_timeo is usually defined as a multiple of the variable HZ, which represents 1 second. HZ is a global variable whose value depends on the platform (it is defined in the architecture-dependent file include/asm-XXX/param.h). For example, with HZ set to 1000, a watchdog_timeo of 5*HZ corresponds to five seconds. As you can see in Table 11-1, even devices of the same type may use different values for the timeout. The table lists only a few examples; it is not a complete list.

Table 11-1. Transmission timeout used by the most common network cards

Device driver            watchdog_timeo (timeout used)
3c501                    HZ
3c505                    10*HZ
3c509                    (400*HZ)/1000
3c515                    (400*HZ)/1000
3c523                    HZ
3c527                    5*HZ
3c59x                    5*HZ
dl2k                     4*HZ
Natsemi                  2*HZ
Aironet 4500             8*HZ
s2io (10Gbit)            5*HZ
8390                     (20*HZ)/100
8139too                  6*HZ
b44                      5*HZ
tg3                      5*HZ
e100                     2*HZ
e1000                    5*HZ
SIS 900                  4*HZ
Tulip family             4*HZ
Intel EtherExpress 16    2*HZ
SLIP                     20*HZ


The watchdog timer mechanism is provided by the Traffic Control code. However, advanced device drivers may implement their own watchdog timers, too. See drivers/net/e1000_main.c for an example.

