21.1. Key Functions That Perform Transmission

The two functions listed at the top left of Figure 18-1 in Chapter 18 appear in Figure 21-1, classified by the L4 protocols that invoke them. The reason for two sets of functions is that the right-side L4 protocols (TCP and the Stream Control Transmission Protocol, or SCTP) do a lot of work to prepare for fragmentation; that leaves less work for the IP layer. In contrast, raw IP and the other protocols listed on the left side leave all of the work of fragmentation up to the IP layer.

Figure 21-1 shows the main functions that lie between transmission at L4 and the last step of L3, which is invoking the neighbor function discussed in Chapter 27. At the top of the figure, the most common L4 protocols are shown. UDP and ICMP call one set of L3 functions to carry out transmission, whereas TCP and SCTP call another. When the L3 functions described in this chapter finish their work, they pass packets to dst_output. As for raw IP, when it uses the IP_HDRINCL option it is completely responsible for preparing the IP header, so it bypasses the functions described in this chapter and calls dst_output directly. See the section "Raw Sockets" for more details. The Internet Group Management Protocol (IGMP) also makes a direct call to dst_output (after initializing the IP header on its own). Thus, fragmentation is handled by the two sets of functions as follows:
Other functions are also used during transmission in specific contexts:
These will be pretty easy to understand after you understand how the functions in Figure 21-1 work. It is also possible for an L4 protocol to call dst_output directly; IGMP and raw IP are two protocols that do so (see the section "Raw Sockets"). In this chapter, I briefly cover ip_queue_xmit, but spend more time on ip_append_data/ip_push_pending_frames because they are key parts of the complex task of fragmentation.

21.1.1. Multicast Traffic

As shown in Figure 18-1 in Chapter 18, the egress paths followed by transmitted multicast and unicast traffic are similar; they have more in common than the corresponding ingress paths do. I do not go into detail about multicast in this book, but in this chapter I will point out some differences between unicast and multicast during transmission. For instance, in the section "Building the IP header" we will see that the TTL is initialized differently for multicast traffic. The same is true when forwarding packets.

21.1.2. Relevant Socket Data Structures for Local Traffic

A BSD socket is represented in Linux with an instance of a socket structure. This structure includes a pointer to a sock data structure, which is where the network layer information is stored. The sock data structure is pretty big, but it is well documented in include/net/sock.h. The sock data structure is actually allocated as part of a bigger structure that is specific to the protocol family; for PF_INET sockets that structure is inet_sock, defined in include/linux/ip.h. The first field of inet_sock is a sock instance, and the rest stores PF_INET private information, such as the source and destination IP addresses, the IP options, the packet ID, the TTL, and cork (discussed next).

    struct inet_sock {
        struct sock  sk;
        ...
        struct {
            ...
        } cork;
    };

Given a pointer to a sock data structure, the IP layer uses the inet_sk macro to cast the pointer to the outer inet_sock data structure.
In other words, the base address of the inet_sock and sock structures is the same, a feature commonly exploited in C programs that deal with complex, nested structures. The inet_sock's cork field plays an important role in ip_append_data and ip_append_page: it stores the context information those two functions need to do data fragmentation correctly. Among the information it contains are the options in the IP header (if any) and the fragment length. Whenever a transmission is generated locally (with only a few exceptions), each sk_buff buffer is associated with its sock instance and is linked to it with skb->sk. Different functions are used to set and read the fields of the sock and inet_sock structures. Some of them are called by the functions in Figure 21-1. As far as this chapter is concerned, we need to understand the meaning of only a few of them:
Another data structure that appears in many of the functions in this chapter is the routing table cache entry associated with the packet, rtable. Many functions refer to it through a variable named rt. It contains information such as the outgoing device, the MTU of the outgoing device, and the next-hop gateway. This structure is initialized by ip_route_output_flow and is described in Chapter 36.

21.1.3. The ip_queue_xmit Function

ip_queue_xmit is the function currently used by TCP and SCTP. It receives only two input parameters, and all the information needed to process the packet is accessible (directly or indirectly) through skb.

    int ip_queue_xmit(struct sk_buff *skb, int ipfragok)

Here is what the parameters mean:
The socket associated with skb includes a pointer named opt that refers to a structure we saw in the section "Option Parsing" in Chapter 19. The latter structure contains the options in the IP header in a format that makes them easier for functions at the IP layer to access. This structure is kept in the socket structure because it is the same for every packet sent through that socket; it would be wasteful to rebuild the information for every packet.

    struct sock *sk = skb->sk;
    struct inet_sock *inet = inet_sk(sk);
    struct ip_options *opt = inet->opt;

Among the fields of the opt structure are offsets to the locations in the header where functions can store timestamps and IP addresses requested by IP options. Note that the structure does not cache the IP header itself, but only data that tells us what to write into the header, and where.

21.1.3.1. Setting the route

If the buffer is already assigned the proper routing information (skb->dst), there is no need to consult the routing table. This is possible under some conditions when the buffer is handled by the SCTP protocol:

    rt = (struct rtable *) skb->dst;
    if (rt != NULL)
        goto packet_routed;

In other cases, ip_queue_xmit checks whether a route is already cached in the socket structure and, if one is available, makes sure it is still valid (this is done by __sk_dst_check):

    rt = (struct rtable *)__sk_dst_check(sk, 0);

If the socket does not already have a route for the packet cached, or if the one the IP layer has been using so far has been invalidated in the meantime, such as by an update from a routing protocol, ip_queue_xmit needs to look for a new route with ip_route_output_flow and store the result in the sk data structure. The destination is represented by the daddr variable. First, this variable is set to the final destination of the packet (inet->daddr), which is the proper value if the IP header includes no Source Route option.
However, ip_queue_xmit then checks for a Source Route option and, if one exists, sets the daddr variable to the next hop in the source route (inet->faddr). In the case of a Strict Source Route option, the next hop found by ip_route_output_flow has to match exactly the next hop in the source route list.

    if (rt == NULL) {
        u32 daddr;

        daddr = inet->daddr;
        if (opt && opt->srr)
            daddr = opt->faddr;

        {
            struct flowi fl = { .oif = sk->sk_bound_dev_if,
                                .nl_u = { .ip4_u =
                                          { .daddr = daddr,
                                            .saddr = inet->saddr,
                                            .tos = RT_CONN_FLAGS(sk) } },
                                .proto = sk->sk_protocol,
                                .uli_u = { .ports =
                                           { .sport = inet->sport,
                                             .dport = inet->dport } } };

            if (ip_route_output_flow(&rt, &fl, sk, 0))
                goto no_route;
        }
        __sk_dst_set(sk, &rt->u.dst);
        tcp_v4_setup_caps(sk, &rt->u.dst);
    }

Refer to Chapter 36 for details on the flowi data structure, and to Chapter 33 for details on the ip_route_output_flow routine. The call to tcp_v4_setup_caps saves the features provided by the egress device in the socket sk; we can ignore this call during our discussion. The packet is dropped if ip_route_output_flow fails. If the route is found, it is stored with __sk_dst_set in the sk data structure so that it can be used directly next time, and the routing table does not have to be consulted again. If for some reason the route is invalidated again, a future call to ip_queue_xmit will use ip_route_output_flow once more to find a new one. As the following code shows, the packet is dropped if the IP header carries the Strict Source Routing option and the next hop provided by that option does not match the next hop returned by the routing table:[*]
    skb->dst = dst_clone(&rt->u.dst);

    packet_routed:
        if (opt && opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
            goto no_route;

dst_clone is called to increment the reference count on the data structure assigned to skb->dst. When a packet is dropped, an error code is returned to the upper layer and the associated SNMP statistics are updated. Note that in this case the function does not need to send any ICMP message to the source (we are the source). Instead, if everything is OK, we have all the information needed to transmit the packet and it is time to build the IP header.

21.1.3.2. Building the IP header

So far, skb contains only the IP payload; generally, this is the header and payload from the L4 layer, either TCP or SCTP. These protocols always allocate buffers whose size can handle worst-case scenarios with regard to the addition of the lower-layer headers. In this way they reduce the chances that IP or any other lower layer will have to do memory copies or buffer reallocation to handle the addition of headers that do not fit in the free space. When ip_queue_xmit receives skb, skb->data points to the beginning of the L3 payload, which is where the L4 protocol writes its own data. The L3 header lies before this pointer. So skb_push is used here to move skb->data back so that it points to the beginning of the L3 (IP) header; the result is illustrated in Figure 19-2 in Chapter 19. iph is also initialized to the pointer at that location.

    iph = (struct iphdr *) skb_push(skb, sizeof(struct iphdr)
                                         + (opt ? opt->optlen : 0));

The next block initializes a bunch of fields in the IP header. The first assignment sets the value of three fields (version, ihl, and tos) in one shot, because they share a common 16 bits. Thus, the statement sets the Version in the header to 4, the Header Length to 5, and the TOS to inet->tos.
Some of the values used to initialize the IP header are taken from sk and some others from rt, both of which were described earlier in the section "Relevant Socket Data Structures for Local Traffic."

    *((__u16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
    iph->tot_len = htons(skb->len);
    if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
        iph->frag_off = htons(IP_DF);
    else
        iph->frag_off = 0;
    iph->ttl      = ip_select_ttl(inet, &rt->u.dst);
    iph->protocol = sk->sk_protocol;
    iph->saddr    = rt->rt_src;
    iph->daddr    = rt->rt_dst;
    skb->nh.iph   = iph;

If the IP header contains options, the function needs to update the Header Length field iph->ihl, which was previously initialized to its default value, and then call ip_options_build to take care of the options. ip_options_build uses the opt variable, previously initialized to inet->opt, to add the required option fields (such as timestamps) to the IP header. Note that the last parameter to ip_options_build is set to zero, to specify that the header does not belong to a fragment (see the section "IP Options" in Chapter 19).

    if (opt && opt->optlen) {
        iph->ihl += opt->optlen >> 2;
        ip_options_build(skb, opt, inet->daddr, rt, 0);
    }

    mtu = dst_pmtu(&rt->u.dst);

Then ip_select_ident_more sets the IP ID in the header based on whether the packet is likely to be fragmented (see the section "Selecting the IP Header's ID Field" in Chapter 23), and ip_send_check computes the checksum on the IP header. skb->priority is used by Traffic Control to decide which of the outgoing queues to enqueue the packet in; this in turn helps determine how soon it will be transmitted. The value in this function is taken from the sock structure, whereas in ip_forward (which manages nonlocal traffic and therefore has no local socket) its value is derived from a conversion table based on the IP TOS value (see the section "ip_forward Function" in Chapter 20).
    ip_select_ident_more(iph, &rt->u.dst, sk, skb_shinfo(skb)->tso_segs);
    ip_send_check(iph);
    skb->priority = sk->sk_priority;

Finally, Netfilter is consulted to see whether the packet has the right to jump to the following step (dst_output) and continue transmission:

    return NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL,
                   rt->u.dst.dev, dst_output);

21.1.4. The ip_append_data Function

This is the function used by those L4 protocols that want to buffer data for transmission. As stated earlier in this chapter, this function does not transmit data, but places it in conveniently sized buffers for later functions to form into fragments (if necessary) and transmit. Thus, it does not create or manipulate any IP header. To flush and transmit the data buffered by ip_append_data, the L4 layer has to explicitly call ip_push_pending_frames, which also takes care of the IP header. If the L4 layer wants fast response time, it might call ip_push_pending_frames after each call to ip_append_data. But the two functions are provided so that the L4 layer can buffer as much data as possible (up to the size of the PMTU) and then send it all at once to be efficient. As one consequence of its role in preparing packets, ip_append_data buffers data only up to the maximum size of an IP packet. As explained in the section "Packet Fragmentation/Defragmentation" in Chapter 18, this is 64 KB. The main tasks of ip_append_data are:
Given the more complex job of ip_append_data compared to ip_queue_xmit, its more complex prototype should not come as a surprise:

    int ip_append_data(struct sock *sk,
                       int getfrag(void *from, char *to, int offset,
                                   int len, int odd, struct sk_buff *skb),
                       void *from, int length, int transhdrlen,
                       struct ipcm_cookie *ipc, struct rtable *rt,
                       unsigned int flags)

Here is the meaning of the input parameters:
ip_append_data is a long and complex function. The numerous local variables, often with similar names, make it hard to follow. We will therefore break it down into its main steps. Given that there are many different combinations of possible outputs, based on the considerations listed near the beginning of this section, we will focus on a few common cases. By the end, you should be able to derive the other cases by yourself. The next few sections describe what the outcome of ip_append_data should be. After that come several sections describing the initial tasks of the function, finishing with a description of its main loop. The labels hh_len, exthdrlen, fragheaderlen, trailer_len, copy, and length in Figures 21-2 through 21-7 are either input parameters to ip_append_data or local variables used by ip_append_data (in particular, the value of copy shown in the figures is the one passed to getfrag). All of them are expressed in bytes. The labels X, Y, S, S1, and S2 represent the size of a data block expressed in bytes.

21.1.4.1. Basic memory allocation and buffer organization for ip_append_data

It is important to understand how the output of ip_append_data (the fragments to be turned into IP packets) is organized in memory. This section and the following two sections cover the data structures that organize the output data and how they are used. The same explanation applies to data formatted by the L4 layer and passed to ip_queue_xmit: this is done, for instance, by TCP instead of using ip_append_data. In every case, the buffers are eventually handed to dst_output, which appears near the center of Figure 18-1 in Chapter 18. Let's see a few examples. ip_append_data can create one or more sk_buff instances, each representing a distinct IP packet (or IP fragment). This is true regardless of how the data is stored in the sk_buff (i.e., regardless of whether it is fragmented).
Suppose we want to transmit an amount of data that lies within the PMTU (that is, it does not need to be fragmented). Let's also assume that because of the configuration of our host, we need to apply at least one of the protocols of the IPsec suite. Finally, let's suppose for the sake of simplicity that we are not trying to achieve memory optimizations in the way we allocate buffers. The results of ip_append_data (shown in Figure 21-2) in this case are as follows:
By reserving the space needed for all the protocols and layers that will come after the L4 layer, we eliminate the need for time-consuming memory manipulation later. Note also that the pointers to some of the headers (such as h.raw and nh.raw) are initialized; later the associated protocols can fill in their part. The only portion of the packet that is filled in by ip_append_data is the L4 payload. Other parts will be filled in as follows:
Figure 21-2. IP packet that does not need fragmentation, with IPsec

Part VI covers the L2 part of the header. Now let's take a slightly more complex example that requires fragmentation. From the previous example, let's remove IPsec and increase the payload size so that it exceeds the PMTU. Figure 21-3 shows the output.[*]
Figure 21-3. Fragmentation without Scatter/Gather I/O, no MSG_MORE

The object on the bottom left is the buffer that ip_append_data receives as input, and length is another of ip_append_data's input parameters. The two buffers created by the function lie to the right. Note that the first contains a fragment that has the maximum size (PMTU), and the second contains the leftover data. ip_append_data creates as many buffers as necessary based on the PMTU; it happens here that a second one holds all the remaining payload, and that it is smaller than the PMTU. We said previously that ip_append_data does not transmit anything; it just creates buffers to be used later for packet fragments. This means that the L4 layer can potentially invoke ip_append_data again for either of the previous examples and add more data. Let's take the second example and show what happens. Since the second buffer is full, we are forced to allocate a new buffer. This might end up with suboptimal fragmentation; it would be better to have every fragment except the last one fill up to the size of the PMTU. One simple solution to achieve optimal fragmentation, at this point, is to allocate another buffer of maximum size, copy the data there from the second buffer, delete the second buffer, and merge the new data into the new buffer. If there is not enough space, we can allocate a third buffer. But this approach does not offer good performance. It vitiates the essential reason for doing data fragmentation before calling ip_fragment (shown in Figure 18-1 in Chapter 18), which is to avoid extra memory copies. Now it should be clear why the MSG_MORE flag introduced in the section "The ip_append_data Function" can be useful. For example, if in the second example we knew a second call would be coming, we would have allocated the second buffer with the maximum size directly, producing the output in Figure 21-4 (note that the size of the L2 header hh_len is not included in the PMTU).
If ip_append_data is called again before ip_push_pending_frames, it will first try to fill the empty space in the second buffer in Figure 21-4 before allocating a third.

Figure 21-4. Fragmentation without Scatter/Gather I/O, MSG_MORE

21.1.4.2. Memory allocation and buffer organization for ip_append_data with Scatter/Gather I/O

Sometimes it is actually possible to add data to a fragment even if it has not been allocated with the maximum size. That is possible when the device supports Scatter/Gather I/O. This simply means that the L3 layer leaves data in the buffers where the L4 layer placed it, and lets the device combine those buffers to do the transmission. The advantage of Scatter/Gather I/O is that it reduces the overhead of allocating memory and copying data. Consider this: an upper layer may generate many small items of data in successive operations, and the L4 layer may store them in different buffers of kernel memory. The L3 layer is then asked to transmit all of these items in one IP packet. Without Scatter/Gather I/O, the L3 layer has to copy the data into new buffers to make a unified packet. If the device supports Scatter/Gather I/O, the data can stay right where it is until it leaves the host. When Scatter/Gather I/O is in use, the memory area to which skb->data points is used only the first time. The following chunks of data are copied into pages of memory allocated specifically for this purpose. Figures 21-5 and 21-6 compare how the data received by ip_append_data in its second invocation is saved when Scatter/Gather I/O is enabled, versus when it is disabled:
Some ancillary data structures support Scatter/Gather I/O. Each buffer except the first (which is allocated in the same way as when there is no support for Scatter/Gather I/O) is stored in skb_shinfo(skb)->frags. These can be found through pointers in the familiar sk_buff structure. As we saw in Chapter 2, each sk_buff structure includes a field of type skb_shared_info, which can be accessed with the macro skb_shinfo. This structure can be used to increase the size of the buffer by adding memory areas that can be located anywhere, not necessarily adjacent to one another. The nr_frags field helps the IP layer remember how many Scatter/Gather I/O buffers hang off of this packet. Note that this field counts Scatter/Gather I/O buffers, not IP fragments, as the name might suggest. Now we can see why the kernel needs special support on the device side to use this kind of buffer representation: to refer to memory areas that are not contiguous but whose content is supposed to represent a contiguous data fragment, the device must be able to handle that kind of buffer representation. Note that Figure 21-7 shows the simple example where there is one page that contains two adjacent memory areas. But the fragments could easily be nonadjacent, either within a single page or on different pages.

Figure 21-5. ip_append_data with Scatter/Gather I/O

Each element of the frags array is represented by an skb_frag_t structure, which includes a pointer to a memory page, an offset relative to the beginning of the page, and the size of the fragment. Note that since the two fragments in Figure 21-7 are located within the same memory page, their page pointers point to the same memory page. The maximum number of fragments is MAX_SKB_FRAGS, which is defined based on the maximum size of an IP packet (64 KB) and the size of a memory page (which is defined on a per-architecture basis and whose default value on an i386 is 4 KB).

Figure 21-6. ip_append_data without Scatter/Gather I/O

You can find the definitions of all the previously mentioned structures in include/linux/skbuff.h. Figure 21-7 shows the case where there is only one page, but since there could be several pages, the elements of the frags array include a page pointer to the proper page. A fragment cannot span two pages. When the size of a new fragment is bigger than the amount of free space in the current page, the fragment is split into two parts: one goes into the already existing page and fills it, and the second part goes into a new page.

Figure 21-7. Multiple fragments with Scatter/Gather I/O

One important detail to keep in mind is that Scatter/Gather I/O is independent of IP data fragmentation. Scatter/Gather I/O simply allows the code and hardware to work on nonadjacent memory areas as if they were adjacent. Nevertheless, each fragment must still respect the limit on its maximum size (the PMTU). This means that even if PAGE_SIZE is bigger than the PMTU, a new sk_buff will be created when the data in the sk_buff (pointed to by skb->data) plus the data referenced with frags reaches the PMTU. Note also that the same page can hold fragments of data for different IP fragments, as shown in Figure 21-8. Each fragment of data added to the memory page increments the page's reference count. When the IP fragments are finally sent out and the data fragments in the page are released, the reference count is decreased accordingly and the memory page is released (see skb_release_data, which is called indirectly by kfree_skb). The sock structure on the top left of Figure 21-8 includes both a pointer to the last page (sk_sndmsg_page) and an offset (sk_sndmsg_off) inside that page where the next data fragment should be placed.

Figure 21-8. Memory page shared between IP fragments

21.1.4.3. Key routines for handling fragmented buffers

To understand the functions described in this chapter and the ones in Chapter 22, you need to be familiar with the key buffer manipulation routines introduced in Chapter 2, as well as the following ones:
Figure 21-9 shows a couple of examples. Note that skb->len includes the data fragments in frags (updated in ip_append_data) and in frag_list (updated in ip_push_pending_frames). I have omitted the details about the protocol headers because they are not necessary for our discussion.

Figure 21-9. Key functions for fragmented buffers: (a) Scatter/Gather; (b) no Scatter/Gather

I also would like to stress this point once more: the data in the frags vector is an extension of the data in the main buffer, whereas the data in frag_list represents independent buffers (i.e., each one will be transmitted independently as a separate IP fragment).

21.1.4.4. Further handling of the buffers

Whenever ip_append_data allocates a new sk_buff structure to handle a new data fragment (which will become a new IP fragment), it queues the fragment onto a queue called sk_write_queue that is associated with ip_append_data's input socket sk. This queue is the output of the function. Later functions need only add the IP headers to the data fragments and push them down to the L2 layer (to the dst_output routine, to be exact). The sk_write_queue list is managed as a First In, First Out (FIFO) queue, as follows:
Now that we know what kind of output ip_append_data produces, we can look at the code. Once again, keep in mind that the L4 layer can call ip_append_data several times before flushing the buffers with ip_push_pending_frames. Let's suppose that UDP issued three calls to ip_append_data with the following payload sizes: 300, 250, and 200 bytes. Let's also assume the PMTU is 500 bytes. It should be clear that if UDP had sent a single payload of 750 bytes, the IP layer would have created a first fragment of 500 bytes and a second one of 250 bytes.[*] However, the application using that UDP socket might actually want to send three distinct IP packets of sizes 300, 250, and 200 bytes. ip_append_data can be told which way to behave. If the application behind the UDP socket prefers to obtain higher throughput, it uses the MSG_MORE flag to tell ip_append_data to create maximum-size fragments (500 bytes) and the result would be a first fragment of 500 bytes and a second one of 250 bytes. If it does not signal the preference for such buffering, UDP transmits each payload individually (see the section "Putting Together the Transmission Functions").
21.1.4.5. Setting the context

The first block of the ip_append_data function initializes some local variables and possibly changes some of the input parameters. The exact work done depends on whether the function is creating the first IP fragment of a packet (in which case the sk_write_queue queue would be empty) or a later one. With the first element, ip_append_data initializes inet->cork and inet with fields that will be used by the following invocations of ip_append_data (and by ip_push_pending_frames). Among the information saved are the IP options and the routing table cache entry. Caching them saves time during subsequent calls to ip_append_data for the same packet, but is not strictly necessary because ip_append_data's caller will pass the data again in all of the following calls.

    if (skb_queue_empty(&sk->sk_write_queue)) {
        opt = ipc->opt;
        if (opt) {
            if (inet->cork.opt == NULL) {
                inet->cork.opt = kmalloc(sizeof(struct ip_options) + 40,
                                         sk->sk_allocation);
                if (unlikely(inet->cork.opt == NULL))
                    return -ENOBUFS;
            }
            memcpy(inet->cork.opt, opt,
                   sizeof(struct ip_options) + opt->optlen);
            inet->cork.flags |= IPCORK_OPT;
            inet->cork.addr = ipc->addr;
        }
        dst_hold(&rt->u.dst);
        inet->cork.fragsize = mtu = dst_pmtu(&rt->u.dst);
        inet->cork.rt = rt;
        inet->cork.length = 0;
        sk->sk_sndmsg_page = NULL;
        sk->sk_sndmsg_off = 0;
        if ((exthdrlen = rt->u.dst.header_len) != 0) {
            length += exthdrlen;
            transhdrlen += exthdrlen;
        }
    } else {
        rt = inet->cork.rt;
        if (inet->cork.flags & IPCORK_OPT)
            opt = inet->cork.opt;
        transhdrlen = 0;
        exthdrlen = 0;
        mtu = inet->cork.fragsize;
    }

To understand the rest of the function, you need to understand the meaning of the following key variables. Some of them are received as input by ip_append_data; refer to the section "The ip_append_data Function" for their descriptions. It can also be useful to refer back to Figures 21-2 through 21-8.
The way length, exthdrlen, and transhdrlen are initialized may be confusing. I'll explain why their values are changed under some conditions. As we have already seen, only the first fragment needs to include the transport header and the optional external headers. Because of this, transhdrlen and exthdrlen are zeroed after creating the first fragment. As we will see, this can be done right at the beginning of the function if sk_write_queue is not empty, or inside the big while loop before starting a second iteration. Because of this initialization, the value of transhdrlen is used by the function to distinguish between the first fragment and the following ones:
The same logic cannot be applied to exthdrlen, because the L4 header is needed for every IP packet, but many have no external headers because they don't use special features such as IPsec. The variables initialized here have several important uses later:
21.1.4.6. Getting ready for fragment generation

As we will see later, the amount of data copied into each generated fragment may change from fragment to fragment. However, each fragment always includes a fixed portion for the L2 and L3 headers. Figures 21-2 through 21-8 all show this reserved portion. Before proceeding, the function defines the following three local variables:

    hh_len = LL_RESERVED_SPACE(rt->u.dst.dev);

    fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
    maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;

hh_len is the length of the L2 header. When reserving space in the buffer for all the headers that precede the IP header, ip_append_data needs to know how much space is needed by the L2 header. This way, when the device driver initializes its header, it will not need to reallocate space or move data inside the buffer to make space for the L2 header. fragheaderlen is the size of the IP header, including the IP options, and maxfraglen is the maximum size of an IP fragment based on the route's PMTU. As explained in the section "Packet Fragmentation/Defragmentation" in Chapter 18, the maximum size of an IP packet (header plus payload) is 64 KB. This applies not just to individual fragments, but also to the complete packet into which those fragments will be reassembled at the end. Thus, ip_append_data keeps track of all the data received for a particular packet and refuses to go over the 64 KB (0xFFFF) limit.

    if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
        ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
                       mtu - exthdrlen);
        return -EMSGSIZE;
    }

    inet->cork.length += length;

The last initialization is the checksum mode, the value of which is saved in skb->ip_summed. See the section "L4 checksum."

21.1.4.7. Copying data into the fragments: getfrag

ip_append_data can potentially be used by any L4 protocol. One of its tasks is to copy the input data into the fragments it creates.
Different protocols may need to apply different operations to the data copied. One example of such a specialized operation is the computation of the L4 checksum, which is not compulsory for some L4 protocols. Another distinguishing factor can be the origin of the data: user space for locally generated packets, and kernel space for forwarded packets or packets generated by the kernel (e.g., ICMP messages). Instead of having one shared function that takes care of all the possible combinations of protocols and optional operations, it is easier and cleaner to have multiple small functions tailored to each protocol's needs. To keep ip_append_data as generic as possible, it lets each protocol specify the function to use to copy the data by means of the input parameter getfrag. In other words, ip_append_data uses getfrag to copy the input data into the buffers; the result of this copying consists of the memory areas labeled "L4 payload" in Figures 21-2 through 21-9. Table 21-1 lists the functions used by the most common L4 protocols that invoke ip_append_data. Another function, ip_reply_glue_bits, is used by ip_send_reply (see the section "Key Functions That Perform Transmission"). getfrag's main parameters (from, to, offset, and len) drive a simple operation: copy len bytes from from to to + offset, taking into account that from could be a pointer into user-space memory and thus has to be accessed accordingly. getfrag also takes care of the L4 checksum: while copying data into the kernel buffer, it updates skb->csum according to the skb->ip_summed configuration.
In a situation where the origin of the getfrag function's input (user space versus kernel space) is always the same, the function does not need to distinguish between the two cases. For example, ICMP messages are generated inside the kernel, so the ICMP getfrag routine (icmp_glue_bits) always copies from kernel memory.
Let's take a closer look at the generic function ip_generic_getfrag:

    int ip_generic_getfrag(void *from, char *to, int offset, int len,
                           int odd, struct sk_buff *skb)
    {
        struct iovec *iov = from;

        if (skb->ip_summed == CHECKSUM_HW) {
            if (memcpy_fromiovecend(to, iov, offset, len) < 0)
                return -EFAULT;
        } else {
            unsigned int csum = 0;
            if (csum_partial_copy_fromiovecend(to, iov, offset, len, &csum) < 0)
                return -EFAULT;
            skb->csum = csum_block_add(skb->csum, csum, odd);
        }
        return 0;
    }

The section "sk_buff structure" in Chapter 19 explained the meaning of CHECKSUM_HW, and how skb->csum and skb->ip_summed are used. In the section "L4 checksum," we will see how ip_append_data decides whether the L4 checksum should be computed in hardware or software (or not computed at all). In the previous snippet, you can see that ip_generic_getfrag uses two different functions to copy the data (memcpy_fromiovecend and csum_partial_copy_fromiovecend), based on whether the L4 checksum is going to be computed in hardware or must be computed in software.

21.1.4.8. Buffer allocation

ip_append_data chooses the size of the buffers to allocate based on:
The following piece of code decides the size of the buffer to allocate (alloclen) based on the two points just stated. The buffer is created with the maximum size (based on the PMTU) if more data is expected and the device can't handle Scatter/Gather I/O. If either of those conditions is not true, the buffer is made just large enough to hold the current data.

    if ((flags & MSG_MORE) &&
        !(rt->u.dst.dev->features & NETIF_F_SG))
        alloclen = mtu;
    else
        alloclen = datalen + fragheaderlen;

    if (datalen == length)
        alloclen += rt->u.dst.trailer_len;

Note that when ip_append_data generates the last fragment, it needs to take into account the presence of trailers (such as for IPsec). datalen is the amount of data to be copied into the buffer we are allocating. Its value was previously initialized based on three factors: the amount of data left (length), the maximum amount of data that fits into a fragment (derived from maxfraglen), and an optional carry from the previous buffer (fraggap). The last component, fraggap, requires an explanation. With the exception of the last buffer (which holds the last IP fragment), all fragments must respect the rule that the size of the payload of an IP fragment must be a multiple of eight bytes. For this reason, when the kernel allocates a new buffer that is not for the last fragment, it may need to move a piece of data (whose size ranges from 0 to 7 bytes) from the tail of the previous buffer to the head of the newly allocated one. In other words, fraggap is zero unless all of the following are true:
Figure 21-10 shows an example where fraggap is nonzero and alloclen has been initialized to mtu. Note that when the kernel moves the data from the current buffer, skb_prev, to the new one, skb, it also needs to adjust the L4 checksum on both skb_prev and skb (see the section "L4 checksum"). The figure shows the buffers as two flat memory areas for simplicity, but they could just as well be paged (as in Figure 21-5) or nonpaged (as in Figure 21-6): the function used to move the fraggap area, skb_copy_and_csum_bits, can handle both formats. The same function also updates the L4 checksums.

Figure 21-10. Respecting the 8-byte boundary rule on IP fragments

21.1.4.9. Main loop

The while loop that potentially creates extra buffers may look more complex than it actually is. Figure 21-11 summarizes its job.

Figure 21-11. ip_append_data function: main loop

Initially, the value of length represents the amount of data that ip_append_data's caller wants to transmit. However, once the loop is entered, its value represents the amount of data left to handle. This explains why its value is updated at the end of each loop and why ip_append_data loops until length becomes zero. We already know that MSG_MORE indicates whether the L4 layer expects more data, and that NETIF_F_SG indicates whether the device supports Scatter/Gather I/O. These settings have no effect on the first task within the loop, which is to allocate and initialize sk_buff structures within the first if block inside the loop. Also, the first data fragment is always copied into the sk_buff area (see Figure 21-5(a) and Figure 21-6(a)). ip_append_data allocates a new sk_buff structure and queues it to sk_write_queue every time one of the following occurs:
The piece of code that precedes the loop takes care of the first case by forcing allocation when the queue is empty:

    if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL)
        goto alloc_new_skb;

The first part inside the loop handles the second case. First it initializes copy to the amount of space left in the current IP fragment: mtu - skb->len. If the data left to add (length) is greater than the amount of free space, copy, there is a need for one more IP fragment. In that case, copy is updated. To enforce the 8-byte boundary rule, copy is lowered to the closest 8-byte boundary. At this point, the kernel can decide whether it needs to allocate a new buffer (i.e., a new IP fragment). This is the logic associated with the if condition that compares copy against 0:
Every time a loop iteration ends, the function needs to advance the pointer to the data to copy (offset) and update the amount of data left to copy (length). Once the fragment has been queued with __skb_queue_tail, the function may need to restart the loop if any data is left.

21.1.4.10. L4 checksum

We saw in the section "net_device structure" in Chapter 19 that the L3 and L4 checksums can be computed by the egress NIC when its device driver advertises that capability by setting the right flags in dev->features. In particular, skb->ip_summed (and eventually skb->csum) must be initialized to show whether the egress device provides support for L4 hardware checksumming. Refer to the aforementioned section for more details. Whether hardware checksumming can be used is decided when ip_append_data is called for the first fragment (i.e., when transhdrlen is nonzero). Hardware checksumming is applicable only when all of the following conditions are met:
Hardware checksumming might also have to be turned off under other conditions. The first bullet in the previous list requires an explanation. Hardware checksumming does not work when the IP packet is fragmented (as in the example in Figure 21-3). However, because ip_append_data can be called several times before the actual transmission takes place (i.e., before ip_push_pending_frames is called), the IP layer may not know that fragmentation is required when ip_append_data is first called; therefore, the initial decision is based only on the input data (length): if fragmentation is required based on length, hardware checksumming is not used.

    if (transhdrlen &&
        length + fragheaderlen <= mtu &&
        rt->u.dst.dev->features & (NETIF_F_IP_CSUM|NETIF_F_NO_CSUM|NETIF_F_HW_CSUM) &&
        !exthdrlen)
        csummode = CHECKSUM_HW;

The local variable csummode initialized here will be assigned to skb->ip_summed on the first buffer. If there is a need for fragmentation and ip_append_data allocates more buffers accordingly (one for each IP fragment), skb->ip_summed on the subsequent buffers will be set to CHECKSUM_NONE. When getfrag is called to copy the data into the buffers, it also takes care of the L4 checksum if it is passed a buffer with skb->ip_summed initialized to CHECKSUM_NONE (see the section "Copying data into the fragments: getfrag"). Note that ip_append_data checksums only the L4 payloads. In the section "Changes to the L4 Checksum" in Chapter 18, we saw that the L4 checksum must include the L4 header as well as the so-called pseudoheader. If ip_push_pending_frames is called by the L4 layer when sk_write_queue has only one IP fragment and the egress device supports hardware checksumming, the L4 protocol only needs to initialize skb->csum to the right offset and the L4 header's checksum field with the pseudoheader checksum, as we saw in the section "sk_buff structure" in Chapter 19.
If instead the egress device does not support hardware checksumming, or the latter is supported but cannot be used because sk_write_queue has more than one IP fragment, the L4 checksum must be computed in software. In this case, getfrag computes the partial checksums on the L4 payloads while copying data into the buffers, and the L4 protocol combines them later to get the value to put into the L4 header. See the section "Putting Together the Transmission Functions" to see how UDP takes care of the L4 checksum before invoking ip_push_pending_frames. For an example of how a device driver instructs the NIC to compute the L4 hardware checksum when required, see the boomerang_start_xmit routine in drivers/net/3c59x.c and cp_start_xmit in drivers/net/8139cp.c. In both cases, you can also see how a paged skb is handled when setting up the DMA transfers.

21.1.5. The ip_append_page Function

We saw in the section "Copying data into the fragments: getfrag" that a transmission request from user space, made with a call like sendmsg, requires a copy to move the data to transmit from user space to kernel space. This copy is made by the getfrag function passed as an input parameter to ip_append_data. The kernel provides user-space applications with another interface, sendfile, which allows applications to optimize the transmission and avoid the data copy. This interface has been widely publicized as "zero-copy" TCP/UDP. The sendfile interface can be used only when the egress device supports Scatter/Gather I/O. In this case, the logic implemented by ip_append_data can be simplified so that no copy is necessary (i.e., the data the user asked to transmit is left where it is). The kernel just initializes the frag vector with the location of the data buffer received in input, and takes care of the L4 checksum if needed. This simplified logic is what ip_append_page provides.
While ip_append_data receives the location of the data with a void* pointer, ip_append_page receives the location as a pointer to a memory page and an offset within it, which makes it straightforward to initialize one entry of the frag vector. The only piece of code that differs from ip_append_data with regard to Scatter/Gather I/O is the following:

    i = skb_shinfo(skb)->nr_frags;
    if (len > size)
        len = size;
    if (skb_can_coalesce(skb, i, page, offset)) {
        skb_shinfo(skb)->frags[i-1].size += len;
    } else if (i < MAX_SKB_FRAGS) {
        get_page(page);
        skb_fill_page_desc(skb, i, page, offset, len);
    } else {
        err = -EMSGSIZE;
        goto error;
    }
    if (skb->ip_summed == CHECKSUM_NONE) {
        unsigned int csum;
        csum = csum_page(page, offset, len);
        skb->csum = csum_block_add(skb->csum, csum, skb->len);
    }

When adding a new fragment, ip_append_page first tries to merge it with the last fragment already in the frag vector. To do that, it checks, by means of skb_can_coalesce, whether the point where the new fragment would start matches the point where the last one ends. If merging is possible, all it has to do is update the length of that last fragment to include the new data. When merging is not possible, the function initializes the new fragment with skb_fill_page_desc. In this case, it also increments the reference count on the page with get_page. The reference count must be incremented because ip_append_page uses the page it receives as input, and this page could potentially be used by someone else, too.

ip_append_page is currently used only by UDP. We said that TCP does not use the ip_append_data and ip_push_pending_frames functions because it implements the same logic in tcp_sendmsg. The same applies to this zero-copy interface: TCP does not use ip_append_page, but implements the same logic in do_tcp_sendpage. Unlike UDP, TCP allows the application to use the zero-copy interface only if the egress device supports L4 hardware checksumming.[*]
21.1.6. The ip_push_pending_frames Function

As explained near the beginning of this chapter, ip_push_pending_frames works in tandem with ip_append_data and ip_append_page. When the L4 layer decides it is time to wrap up and transmit the fragments queued to sk_write_queue through ip_append_data or ip_append_page (either because of some protocol-specific criterion or because it is explicitly told by the higher-level application to send the data), it simply calls ip_push_pending_frames:

    int ip_push_pending_frames(struct sock *sk)

The function receives a sock structure as input. It needs access to several fields, notably the pointer to the socket's sk_write_queue structure. We saw in the section "Memory allocation and buffer organization for ip_append_data with Scatter Gather I/O" that the data in the packet is organized differently in the sk_buff structure, depending on whether Scatter/Gather I/O is used. The first part of the function queues all the buffers that follow the first one into a list named frag_list that is part of the first element, as shown in Figure 21-12, and updates the len and data_len fields of the buffer at the head of the list to account for all of the fragments. This last operation is performed because it is useful to the ip_fragment routine that comes later in the code path (see Figure 18-1 in Chapter 18, and see Chapter 22). As buffers are queued onto frag_list, they are cleared off of sk_write_queue. It requires very little time to create the new list (no data is copied; only pointers are changed), and the result is to free the sk_write_queue list, which therefore allows the L4 layer to consider the data transmitted. The data is now out of the hands of the L4 layer and completely under the care of the IP layer. Remember, as you look at Figure 21-12, that nr_frags reflects the number of Scatter/Gather I/O buffers, not the number of IP fragments. Two points are worth mentioning about Figure 21-12:
After that, it is time to fill in the IP header. If there are multiple fragments, only the first is going to have its IP header filled in by ip_push_pending_frames; the others will be taken care of later (we will see how in Chapter 22). The setting of the TTL field of the IP header (iph->ttl) depends on whether the destination address is multicast. Usually, a smaller value is used for multicast traffic because multicasting is most often used to deliver streaming (and sometimes interactive) data such as audio and video that can become useless if it is received too late. The default values assigned to the TTL field for multicast and unicast packets are 1 and 64, respectively.[*]
    if (rt->rt_type == RTN_MULTICAST)
        ttl = inet->mc_ttl;
    else
        ttl = ip_select_ttl(inet, &rt->u.dst);
    ...
    iph->ttl = ttl;

Figure 21-12. (a) Before and (b) after removing buffers from the sk_write_queue queue

If there are IP options in the header, ip_options_build is used to take care of them. The last input parameter to ip_options_build is set to zero to tell the API that it is filling in the options of the first fragment. This distinction is needed because the first fragment's IP options are treated differently, as we saw in the section "IP Options" in Chapter 18. The length of the header is also updated to reflect the length of the options.

    if (inet->cork.flags & IPCORK_OPT)
        opt = inet->cork.opt;
    ...
    iph->ihl = 5;
    if (opt) {
        iph->ihl += opt->optlen>>2;
        ip_options_build(skb, opt, inet->cork.addr, rt, 0);
    }

The Don't Fragment flag (IP_DF) of the IP header is set when the socket's configuration enforces the use of that flag on all packets (i.e., IP_PMTUDISC_DO), or when the route rt has PMTU discovery enabled (i.e., IP_PMTUDISC_WANT) and the PMTU is not locked (see the definition of ip_dont_fragment):[*]
    if (inet->pmtudisc != IP_PMTUDISC_DO)
        skb->local_df = 1;
    ...
    if (inet->pmtudisc == IP_PMTUDISC_DO ||
        (skb->len <= dst_mtu(&rt->u.dst) &&
         ip_dont_fragment(sk, &rt->u.dst)))
        df = htons(IP_DF);
    ...
    iph->frag_off = df;

The value just assigned to the df variable, reflecting the packet's Don't Fragment status, determines in turn how the IP packet ID is set. The section "Selecting the IP Header's ID Field" in Chapter 23 goes into more detail on how that ID is computed.

    if (!df) {
        __ip_select_ident(iph, &rt->u.dst, 0);
    } else {
        iph->id = htons(inet->id++);
    }

skb->priority is used by Traffic Control to decide which one of the outgoing queues to enqueue the packet in. See the similar initialization by ip_queue_xmit in the section "Building the IP header."

    iph->version = 4;
    iph->tos = inet->tos;
    iph->tot_len = htons(skb->len);
    iph->protocol = sk->sk_protocol;
    iph->saddr = rt->rt_src;
    iph->daddr = rt->rt_dst;
    ip_send_check(iph);

    skb->priority = sk->sk_priority;
    skb->dst = dst_clone(&rt->u.dst);

Before passing the buffer to dst_output to complete the transmission, the function needs to ask Netfilter for permission to do so. Note that Netfilter is queried only once for all the fragments of a packet. In an earlier version of the kernel (2.4), Netfilter was queried for each fragment. This gave Netfilter the chance to filter IP packets with a higher granularity, but it also forced Netfilter to defragment and refragment packets in case there were filters that examined the L4 or higher layers. The overhead was judged too burdensome for the value it offered. Note that when dst_output is passed a list of sk_buff buffers (as opposed to a single buffer), as shown in Figure 21-12(b), only the first one gets its IP header initialized. We will see in Chapter 22 how such a list is taken care of by ip_fragment.
    err = NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL, skb->dst->dev,
                  dst_output);

Before returning, the function clears the IPCORK_OPT field, which invalidates the contents of the cork structure. This is because later packets to the same destination reuse the cork structure, and the IP layer needs to know when old data should be thrown away.

21.1.7. Putting Together the Transmission Functions

To see how the functions we've been examining, ip_append_data and ip_push_pending_frames, work together, let's focus on a function called by the UDP layer, udp_sendmsg, and see how it calls them.

    int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                    size_t len)
    {
        ...
        struct udp_opt *up = udp_sk(sk);
        ...
        int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
        ...
        err = ip_append_data(sk, ip_generic_getfrag, msg->msg_iov, ulen,
                             sizeof(struct udphdr), &ipc, rt,
                             corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
        if (err)
            udp_flush_pending_frames(sk);
        else if (!corkreq)
            err = udp_push_pending_frames(sk, up);

The local flag corkreq is initialized based on multiple factors, and is passed to ip_append_data to signal whether buffering should be used. Among those factors are:
These two flags have a comparable purpose. After some discussion over which was the best one, in the end both of them were made available in the kernel. udp_sendmsg first calls ip_append_data, and then forces the immediate transmission of the data with udp_push_pending_frames only if corkreq is false. In case ip_append_data fails for any reason, udp_sendmsg flushes the queue with udp_flush_pending_frames, which is a wrapper for the IP function ip_flush_pending_frames. Figure 21-13 shows the internals of udp_push_pending_frames. Note how the L4 checksum is handled according to the logic we saw in the section "L4 checksum."

Figure 21-13. udp_push_pending_frames function

For an example of how to use ip_append_page, you can take a look at udp_sendpage.

21.1.8. Raw Sockets

It is possible for raw sockets (sockets using raw IP) to include the IP header in the data they pass to the IP layer. This means that the IP layer can be asked to send a piece of data that already includes an initialized IP header. To do this, raw IP uses the IP_HDRINCL (header included) option, which can be set, for instance, with the setsockopt system call (see the ip_setsockopt routine). When this option is set, neither ip_push_pending_frames nor ip_queue_xmit is used. Raw IP directly invokes dst_output instead. See the raw_sendmsg and raw_send_hdrinc functions for examples.