19.1. Main IPv4 Data Structures
This section introduces the major data structures used by the IPv4 protocol. You can refer to Chapter 23 for a detailed description of their fields.
I have not included a picture to show the relationships among the data structures because most of them are independent and do not keep cross-references.
iphdr structure
IP header. The meaning of its fields has already been covered in the section "IP Header" in Chapter 18.
ip_options structure
This structure, defined in include/linux/ip.h, represents the options for a packet that needs to be transmitted or forwarded. The options are stored in this structure because it is easier to read than the corresponding portion of the IP header itself.
ipcm_cookie structure
This structure combines various pieces of information needed to transmit a packet.
ipq structure
Collection of fragments of an IP packet. See the section "Organization of the IP Fragments Hash Table" in Chapter 22.
inet_peer structure
The kernel keeps an instance of this structure for each remote host it has been talking to in the recent past. In the section "Long-Living IP Peer Information" in Chapter 23 you will see how it is used. All instances of inet_peer structures are kept in an AVL tree, a structure optimized for frequent lookups.
ipstats_mib structure
The Simple Network Management Protocol (SNMP) employs a type of object called a Management Information Base (MIB) to collect statistics about systems. A data structure called ipstats_mib keeps statistics about the IP layer. The section "IP Statistics" in Chapter 23 covers this structure in more detail.
in_device structure
The in_device structure stores all the IPv4-related configuration for a network device, such as changes made by a user with the ifconfig or ip command. This structure is linked to the net_device structure via net_device->ip_ptr and can be retrieved with in_dev_get and __in_dev_get. The difference between those two functions is that the first one takes care of all the necessary locking, and the second one assumes the caller has taken care of it already.
Since in_dev_get internally increases a reference count on the in_device structure when it succeeds (i.e., when a device is configured to support IPv4), its caller is supposed to decrement the reference count with in_dev_put when it is done with the structure.
The structure is allocated and linked to the device with inetdev_init, which is called when the first IPv4 address is configured on the device.
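The get/put discipline described for in_dev_get and in_dev_put can be illustrated with a small user-space sketch. Everything here is a hypothetical stand-in: the mock_ structures and functions are not the kernel's, the real structures have many more fields, and the locking that in_dev_get performs is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for in_device and net_device. */
struct mock_in_device {
    int refcnt;                      /* references held on this IPv4 block */
};

struct mock_net_device {
    struct mock_in_device *ip_ptr;   /* NULL when IPv4 is not configured */
};

/* Mimics the contract of in_dev_get: on success, a reference is taken. */
struct mock_in_device *mock_in_dev_get(struct mock_net_device *dev)
{
    struct mock_in_device *in_dev = dev->ip_ptr;

    if (in_dev)
        in_dev->refcnt++;
    return in_dev;
}

/* Mimics the contract of in_dev_put: drops the reference taken by the get. */
void mock_in_dev_put(struct mock_in_device *in_dev)
{
    assert(in_dev->refcnt > 0);
    in_dev->refcnt--;
}

/* Caller pattern: get, use, put. Returns 1 if the device has IPv4 state. */
int mock_device_has_ipv4(struct mock_net_device *dev)
{
    struct mock_in_device *in_dev = mock_in_dev_get(dev);

    if (!in_dev)
        return 0;                    /* nothing to release on failure */
    /* ... read the IPv4 configuration here ... */
    mock_in_dev_put(in_dev);
    return 1;
}
```

The point of the sketch is only that every successful get is paired with exactly one put, and that a failed get requires no put.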
in_ifaddr structure
When configuring an IPv4 address on an interface, the kernel creates an in_ifaddr structure that includes the 4-byte address along with several other fields.
ipv4_devconf structure
The ipv4_devconf data structure, whose fields are exported via /proc in /proc/sys/net/ipv4/conf/, is used to tune the behavior of a network device. There is an instance for each device, plus one that stores the default values (ipv4_devconf_dflt). The meanings of its fields are covered in Chapters 28 and 36.
ipv4_config structure
While ipv4_devconf structures are used to store per-device configuration, ipv4_config stores configuration that applies to the host.
cork structure
The cork structure is used to handle the socket CORK option. We will see in Chapter 21 how its fields are used to maintain some context information across consecutive invocations of ip_append_data and ip_append_page to handle data fragmentation.
19.1.1. Checksum-Related Fields from sk_buff and net_device Structures
We saw the routines used to compute the IP and L4 checksums in the section "Checksums" in Chapter 18. In this section, we will see what fields of the sk_buff buffer structure are used to store information about checksums, how devices tell the kernel about their hardware checksumming capabilities, and how the L4 protocols use such information to decide whether to compute the checksum for ingress and egress packets or to let the network interface cards (NICs) do it.
Because the IP checksum is always computed and verified in software by the kernel, the next subsections concentrate on L4 checksum handling and issues.
19.1.1.1. net_device structure
The net_device->features field specifies the capabilities of the device. Among the various flags that can be set, a few are used to define the device's hardware checksumming capabilities. The list of possible features is in include/linux/netdevice.h inside the definition of net_device itself. Here are the flags used to control checksumming:
NETIF_F_NO_CSUM
The device is so reliable that there is no need to use any L4 checksum. This feature is enabled, for instance, on the loopback device.
NETIF_F_IP_CSUM
The device can compute the L4 checksum in hardware, but only for TCP and UDP over IPv4.
NETIF_F_HW_CSUM
The device can compute the L4 checksum in hardware for any protocol. This feature is less common than NETIF_F_IP_CSUM.
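As a rough illustration of how these flags might be consulted, the following sketch tests a features bitmask against the three checksumming capabilities. The bit values and the helper name sketch_dev_handles_l4_csum are assumptions made for the example; the authoritative definitions are the ones in include/linux/netdevice.h.

```c
#include <stdbool.h>

/* Assumed bit values, for illustration only. */
#define NETIF_F_IP_CSUM 0x2   /* hardware checksum for TCP/UDP over IPv4 only */
#define NETIF_F_NO_CSUM 0x4   /* no checksum needed (e.g., loopback) */
#define NETIF_F_HW_CSUM 0x8   /* hardware checksum for any protocol */

/*
 * Hypothetical helper: returns true when the L4 protocol can skip the
 * software checksum for a packet sent through a device with these features.
 */
bool sketch_dev_handles_l4_csum(unsigned int features, bool tcp_or_udp_over_ipv4)
{
    if (features & (NETIF_F_NO_CSUM | NETIF_F_HW_CSUM))
        return true;          /* applies regardless of the protocol */
    if ((features & NETIF_F_IP_CSUM) && tcp_or_udp_over_ipv4)
        return true;          /* restricted hardware support */
    return false;             /* checksum must be computed in software */
}
```

Under these assumptions, a device advertising only NETIF_F_IP_CSUM would still need a software checksum for any traffic other than TCP or UDP over IPv4.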
19.1.1.2. sk_buff structure
The two fields skb->csum and skb->ip_summed have different meanings depending on whether skb points to a received packet or to a packet to be transmitted out.
When a packet is received, skb->csum may hold its L4 checksum. The oddly named skb->ip_summed field keeps track of the status of the L4 checksum, using the following values defined in include/linux/skbuff.h. These values represent what the device driver tells the L4 layer; once the L4 receive routine receives the buffer, it may change the value of skb->ip_summed.
CHECKSUM_NONE
The checksum in csum is not valid. This can be due to various reasons:
The device does not provide hardware checksumming.
The device computed the hardware checksums and found the frame to be corrupted. At this point, the device driver could discard the frame directly. But some device drivers prefer to set ip_summed to CHECKSUM_NONE and let the software compute and verify the checksum again. This is unfortunate, because after all of the overhead of receiving the packet, all that the kernel does is recheck the checksum and discard the packet (see e1000_rx_checksum in drivers/net/e1000/e1000_main.c). Note that if the input frame is to be forwarded, the router should not discard it due to a wrong L4 checksum (a router is not supposed to look at the L4 checksum). It will be up to the destination host to do it. This is another reason why device drivers do not discard frames that fail the L4 checksum, but let the L4 receive routine verify them.
The checksum needs to be recomputed and reverified. See the section "Changes to the L4 Checksum" in Chapter 18 for the most common reasons.
CHECKSUM_HW
The NIC has computed the checksum on the L4 header and payload and has copied it into the skb->csum field. The software (i.e., the L4 receive routine) needs only to add the checksum on the pseudoheader to skb->csum and to verify the resulting checksum. This flag can be considered a special case of the following flag.
CHECKSUM_UNNECESSARY
The NIC has computed and verified the checksum on the L4 header and payload, as well as on the pseudoheader (the checksum on the pseudoheader may optionally be computed by the device driver in software), so the software is relieved from having to do any L4 checksum verification.
CHECKSUM_UNNECESSARY can also be set, for example, when the probability of an error is very low and it would be a waste of time and CPU power to compute and verify the L4 checksum. One example is the loopback device: since the packets sent through this virtual device never leave the local host, the only possible errors would be due to faulty RAM or bugs in the operating system. This option can therefore be used with such special devices, but the standard behavior is to compute the checksum of each received packet and discard corrupted packets at the receiving end.
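The three receive-side cases can be sketched in user space as follows. This is not the kernel's code: the sketch_ function names, the numeric values given to the CHECKSUM_ constants, and the convention that csum carries the unfolded one's complement sum over the L4 header and payload are all assumptions made for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Status values named in the text; the numeric values are assumptions. */
enum { CHECKSUM_NONE = 0, CHECKSUM_HW = 1, CHECKSUM_UNNECESSARY = 2 };

/* Add the 16-bit big-endian words of data to sum; an odd last byte is zero-padded. */
uint32_t sketch_sum16(const uint8_t *data, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;
    return sum;
}

/* Fold the carries back in to obtain a 16-bit one's complement sum. */
uint16_t sketch_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Sum of the IPv4 pseudoheader words: addresses, protocol, and L4 length. */
uint32_t sketch_pseudo_sum(uint32_t saddr, uint32_t daddr,
                           uint8_t proto, uint16_t l4len)
{
    return (saddr >> 16) + (saddr & 0xFFFF) +
           (daddr >> 16) + (daddr & 0xFFFF) + proto + l4len;
}

/* Returns 1 if the segment is accepted, 0 if its L4 checksum fails. */
int sketch_l4_csum_ok(int ip_summed, uint32_t csum,
                      const uint8_t *l4, size_t len,
                      uint32_t saddr, uint32_t daddr, uint8_t proto)
{
    uint32_t pseudo = sketch_pseudo_sum(saddr, daddr, proto, (uint16_t)len);

    switch (ip_summed) {
    case CHECKSUM_UNNECESSARY:
        return 1;                           /* nothing left to verify */
    case CHECKSUM_HW:
        /* the NIC summed header and payload; add only the pseudoheader */
        return sketch_fold(csum + pseudo) == 0xFFFF;
    default:
        /* CHECKSUM_NONE: compute the whole sum in software */
        return sketch_fold(sketch_sum16(l4, len, pseudo)) == 0xFFFF;
    }
}
```

A segment is valid when the one's complement sum over the pseudoheader, header, and payload (checksum field included) folds to 0xFFFF, which is why the CHECKSUM_HW branch only has to add the pseudoheader to the value supplied by the NIC.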
When a packet is transmitted, csum represents a pointer (or more accurately, an offset) to the place inside the buffer where the hardware card has to put the checksum it will compute, not the checksum itself. This field is therefore used during packet transmission only if the checksum is calculated in hardware. This interaction between L4 and L2, bypassing L3, introduces a couple of additional problems to deal with. For example, a feature such as Network Address Translation (NAT) that manipulates the fields of the IP header used by the L4 layer to compute the so-called checksum on the pseudoheader would invalidate that data structure (see the section "Changes to the L4 Checksum" in Chapter 18).
As in the case of reception, ip_summed represents the status of the L4 checksum. The field is used by the L4 protocols to tell the device whether it needs to take care of checksumming. In particular, this is the meaning of ip_summed during transmissions:
CHECKSUM_NONE
The protocol has already taken care of the checksum; the device does not need to do anything. When you forward an ingress frame, the L4 checksum is already ready because it has been computed by the sender host; therefore, there is no need to compute it. See ip_forward in Chapter 20. When ip_summed is set to CHECKSUM_NONE, csum is meaningless.
CHECKSUM_HW
The protocol has stored into its header the checksum on the pseudoheader only; the device is supposed to complete it by adding the checksum on the L4 header and payload.
ip_summed does not use the CHECKSUM_UNNECESSARY value when transmitting packets (it would be equivalent to CHECKSUM_NONE).
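The division of labor on transmission can be illustrated with a user-space sketch that compares the two paths. The sketch_ names are hypothetical, the "NIC" is simulated by an ordinary function, and the checksum offset (the role the text assigns to csum) is assumed to be even so that it lines up with a 16-bit word.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the 16-bit big-endian words of data to sum; an odd last byte is zero-padded. */
uint32_t sketch_sum16(const uint8_t *data, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;
    return sum;
}

/* Fold the carries back in to obtain a 16-bit one's complement sum. */
uint16_t sketch_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Write a 16-bit value in network byte order at offset off. */
void sketch_put16(uint8_t *l4, size_t off, uint16_t v)
{
    l4[off] = (uint8_t)(v >> 8);
    l4[off + 1] = (uint8_t)(v & 0xFF);
}

/* CHECKSUM_NONE path: the protocol computes the complete checksum itself. */
void sketch_sw_checksum(uint8_t *l4, size_t len, size_t off, uint32_t pseudo)
{
    sketch_put16(l4, off, 0);
    sketch_put16(l4, off, (uint16_t)~sketch_fold(sketch_sum16(l4, len, pseudo)));
}

/* CHECKSUM_HW, protocol side: store only the pseudoheader sum in the field. */
void sketch_l4_prepare_hw(uint8_t *l4, size_t off, uint32_t pseudo)
{
    sketch_put16(l4, off, sketch_fold(pseudo));
}

/*
 * CHECKSUM_HW, simulated device side: sum the L4 header and payload
 * (including the seeded field) and write the complement at offset off.
 */
void sketch_nic_complete(uint8_t *l4, size_t len, size_t off)
{
    sketch_put16(l4, off, (uint16_t)~sketch_fold(sketch_sum16(l4, len, 0)));
}
```

Running sketch_sw_checksum on one copy of a segment, and sketch_l4_prepare_hw followed by sketch_nic_complete on another, should leave the same value in the checksum field. This is why seeding the field with the pseudoheader sum lets the device finish the job without knowing anything about the IP header.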
While the feature flags NETIF_F_XXX_CSUM are initialized by the device driver when the NIC is enabled, the CHECKSUM_XXX flags have to be set for every sk_buff buffer that is received or transmitted. At reception time, it is the device driver that initializes ip_summed correctly based on the NETIF_F_XXX_CSUM device capabilities.
At transmission time, the L3 transmission APIs initialize ip_summed based on the checksumming capabilities of the egress device, which can be derived from the routing table: the routing table cache entry that matches the destination includes information about the egress device, and therefore its checksumming capabilities (see ip_append_data for an example).
Given the meaning of the skb->csum and skb->ip_summed fields and the CHECKSUM_HW flag previously described, you can study, for example, how TCPv4 takes care of the checksum on ingress segments in tcp_v4_checksum_init, and the checksum of egress segments in tcp_v4_send_check.