9.2. Notifying Drivers When Frames Are Received

In Chapter 5, I mentioned that devices and the kernel can use two main techniques for exchanging data: polling and interrupts. I also said that a combination of the two is a valid option. This section offers a brief overview of the most common ways for a driver to notify the kernel about the reception of a frame, along with the main pros and cons of each one. Some approaches depend on the availability of specific features on the devices (such as ad hoc timers), and some need changes to the driver, the operating system, or both.

Figure 9-2. Ingress path (frame reception)

This discussion could theoretically apply to any device type, but it best describes devices, such as network cards, that can generate a high number of interactions (that is, the reception of frames).

9.2.1. Polling

With this technique, the kernel constantly checks whether the device has anything to say. It can do that by continually reading a memory register on the device, for instance, or by returning to check it when a timer expires. As you can imagine, this approach can easily waste quite a lot of system resources, and it is rarely employed if the operating system and device can use other techniques such as interrupts. Still, there are cases where polling is the best approach. We will come back to this point later.

9.2.2. Interrupts

Here the device driver, on behalf of the kernel, instructs the device to generate a hardware interrupt when specific events occur. The kernel, interrupted from its other activities, then invokes a handler registered by the driver to take care of the device's needs. When the event is the reception of a frame, the handler queues the frame somewhere and notifies the kernel about it. This technique, which is quite common, still represents the best option under low traffic loads.
Unfortunately, it does not perform well under high traffic loads: forcing an interrupt for each frame received can easily make the CPU waste all of its time handling interrupts.

The code that takes care of an input frame is split into two parts: first the driver copies the frame into an input queue accessible by the kernel, and then the kernel processes it (usually passing it to a handler dedicated to the associated protocol, such as IP). The first part is executed in interrupt context and can preempt the execution of the second part. This means that the code that accepts input frames and copies them into the queue has higher priority than the code that actually processes the frames.

Under a high traffic load, the interrupt code would keep preempting the processing code. The consequence is obvious: at some point the input queue will be full, but since the code that is supposed to dequeue and process those frames does not get a chance to run because of its lower priority, the system collapses. New frames cannot be queued since there is no space, and old frames cannot be processed because there is no CPU time available for them. This condition is called receive-livelock in the literature.

In summary, this technique has the advantage of very low latency between the reception of a frame and its processing, but it does not work well under high loads. Most network drivers use interrupts, and a large section later in this chapter will discuss how they work.

9.2.3. Processing Multiple Frames During an Interrupt

This approach is used by quite a few Linux device drivers. When an interrupt is notified and the driver's handler is executed, the handler keeps downloading frames and queuing them to the kernel input queue, up to a maximum number of frames (or a window of time). Of course, it would be possible to keep doing that until the queue gets empty, but let's remember that device drivers should behave as good citizens.
They have to share the CPU with other subsystems and IRQ lines with other devices. Polite behavior is especially important because interrupts are disabled while the driver handler is running.

Storage limitations also apply, as they did in the previous section. Each device has a limited amount of memory, and therefore the number of frames it can store is limited. If the driver does not process them in a timely manner, the buffers can fill up and new frames (or old ones, depending on the driver's policies) could be dropped. If a loaded device kept processing incoming frames until its queue emptied out, other devices would suffer this form of starvation: their own buffers could fill up and drop frames while they waited for the CPU.

This technique does not require any change to the operating system; it is implemented entirely within the device driver.

There could be other variations on this approach. Instead of keeping all interrupts disabled and having the driver queue frames for the kernel to handle, a driver could disable interrupts only for a device that has frames in its ingress queue and delegate the task of polling the driver's queue to a kernel handler. This is exactly what Linux does with its new interface, NAPI. However, unlike the approach described in this section, NAPI requires changes to the kernel.

9.2.4. Timer-Driven Interrupts

This technique is an enhancement to the previous ones. Instead of having the device asynchronously notify the driver about frame receptions, the driver instructs the device to generate an interrupt at regular intervals. The handler then checks whether any frames have arrived since the previous interrupt, and handles all of them in one shot. Even better would be to have the device generate interrupts at intervals, but only if it has something to say.

Based on the granularity of the timer (which is implemented in hardware by the device itself; it is not a kernel timer), the frames that are received by the device will experience different levels of latency.
For instance, if the device generated an interrupt every 100 ms and frames arrived at uniformly random times, the notification of the reception of a frame would have an average delay of 50 ms and a maximum delay of 100 ms. This delay may or may not be acceptable, depending on the applications running on top of the network connections using the device.
The granularity available to a driver depends on what the device has to offer, since the timer is implemented in hardware. Only a few devices currently provide this capability, so this solution is not available for all the drivers in the Linux kernel. One could simulate that capability by disabling interrupts for the device and using a kernel timer instead. However, one would not have the support of the hardware, and the CPU cannot spend as much of its resources on handling timers as the device can, so one would not be able to schedule the timers nearly as often. This workaround would, in the end, become a polling approach.

9.2.5. Combinations

Each approach described in the previous sections has some advantages and disadvantages. Sometimes it is possible to combine them and obtain something even better. We said that under low load the pure interrupt model guarantees a low latency, but that under high load it performs terribly. On the other hand, the timer-driven interrupt may introduce too much latency and waste too much CPU time under low load, but it helps a lot in reducing CPU usage and solving the receive-livelock problem under high load. A good combination would use the interrupt technique under low load and switch to the timer-driven interrupt under high load. The tulip driver included in the Linux kernel, for instance, can do this (see drivers/net/tulip/interrupt.c).
9.2.6. Example

A balanced approach to processing multiple frames is shown in the following piece of code, taken from the drivers/net/3c59x.c Ethernet driver. It is a selection of key lines from vortex_interrupt, the function registered by the driver as the handler of interrupts from devices in 3Com's Vortex family:

```c
static irqreturn_t vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    /* Budget: maximum number of loop iterations per interrupt. */
    int work_done = max_interrupt_work;

    ioaddr = dev->base_addr;
    ...
    /* Read the reasons for this interrupt from the device. */
    status = inw(ioaddr + EL3_STATUS);
    do {
        ...
        if (status & RxComplete)
            vortex_rx(dev);
        if (--work_done < 0) {
            /* Disable all pending interrupts. */
            ...
            /* The timer will re-enable interrupts (one second, HZ jiffies, from now). */
            mod_timer(&vp->timer, jiffies + 1*HZ);
            break;
        }
        ...
    } while ((status = inw(ioaddr + EL3_STATUS)) & (IntLatch | RxComplete));
    ...
}
```

Other drivers that follow the same model will have something very similar. They will probably call the EL3_STATUS and RxComplete symbols something different, and their implementation of an xxx_rx function may be different, but the skeleton will be very close to the one shown here.

In vortex_interrupt, the driver reads from the device the reasons for the interrupt and stores them in status. Network devices can generate an interrupt for different reasons, and several reasons can be grouped together in a single interrupt. If RxComplete (a symbol specially defined by this driver to mean a new frame has been received) is among those reasons, the code invokes vortex_rx. During its execution, interrupts are disabled for the device. However, the driver can read a hardware register on the card and find out whether a new interrupt was posted in the meantime. The IntLatch flag is true when a new interrupt has been posted (and it is cleared by the driver when it is done processing it).
vortex_interrupt keeps processing incoming frames as long as the register says there is an interrupt pending (IntLatch) that is due to the reception of a frame (RxComplete). This also means that only multiple occurrences of RxComplete interrupts can be handled in one shot. Other types of interrupts, which are much less frequent, can wait.

Finally (and here is where good citizenship enters), the loop terminates once it exhausts its work budget: work_done starts at max_interrupt_work, the maximum amount of input work the handler may do in one invocation, and is decremented on every iteration. This driver uses a default value of 32 and allows that value to be tuned at module load time.