18.4. Packet Fragmentation/Defragmentation

Packet fragmentation and defragmentation is one of the main jobs of the IP protocol. The IP protocol defines the maximum size of a packet as 64 KB, which comes from the fact that the len field of the header, which represents the size of the packet in bytes, is a 16-bit value. However, not many interface types can send packets of a size up to 64 KB. This means that when the IP layer needs to transmit a packet whose size is bigger than the MTU of the egress interface, it needs to split the packet into smaller pieces. We will see later in this chapter that the MTU used is not necessarily the one associated with the egress device; it could be, for instance, the one associated with the routing table entry used to route the packet. The latter would depend on several factors, one of which is the egress device's MTU.

Regardless of how the MTU is computed, the fragmentation process creates a series of equal-size fragments, as shown in Figure 18-10. The MF and OFFSET fields shown in the picture are described later in this section. If the MTU does not divide the original size of the packet exactly, the final fragment is smaller than the others.

Figure 18-10. IP packet fragmentation

A fragmented IP packet is normally defragmented by the destination host, but intermediate devices that need to look at the entire IP packet may have to defragment it, too. Two examples of such devices are firewalls and Network Address Translation (NAT) routers.

Some time ago, it was an acceptable solution for the receiver to allocate a buffer the size of the original IP packet and put fragments there as they arrived. In fact, the receiver might just allocate a buffer of the maximum possible size, because the size of the original IP packet was known only after receiving the last fragment.
That simple approach is now avoided because it wastes memory, and a malicious attack could bring a router to its knees just by sending a burst of very small fragments that lie about their original size.

Because every IP packet can be fragmented, and because each fragment can be further fragmented along the path for the same reason, there must be a way for the receiver to understand which IP packet each fragment belongs to, and at what position inside the original IP packet each fragment should be placed. The receiver must also be told the original size of the IP packet to know when it has received all of the fragments.

Several other aspects have to be considered to accomplish fragmentation. When copying the IP header of the original packet into its fragments, the kernel does not copy all of the options, but only those with the copied field set, as described earlier in the section "IP Options." However, when the IP fragments are merged, the resulting IP packet will look like the original one and therefore include all the options again.

Moreover, the IP checksum covers only the IP header (the payload is usually covered by the higher-layer protocols). When fragments are created, the headers are all different, so a checksum has to be computed for each one of them, and checked on the receiving side.

18.4.1. Effect of Fragmentation on Higher Layers

Fragmenting and defragmenting a packet takes both CPU time and memory. For a heavily loaded server, the extra resources involved may be quite significant. Fragmentation also introduces overhead in the bandwidth used for transmission, because each fragment has to contain both the L2 and L3 headers. If the size of the fragments is small, that overhead can be significant.

Higher layers are theoretically unaware of when the L3 layer chooses to fragment a packet.[*]
However even if TCP and UDP are unaware of the fragmentation/defragmentation processes,[
When a sender decides to use a packet size smaller than its available MTU just to avoid fragmentation, it incurs the same overhead of extra headers that fragmentation requires. However, avoiding fragmentation by routers along the way reduces processing considerably along the route and therefore can be critical for improving response time.

18.4.2. IP Header Fields Used by Fragmentation/Defragmentation

Here are the fields of the IP header that are used to handle the fragmentation/defragmentation process. We will see how they are used in Chapter 22.
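As a rough illustration of how the ID, MF, and OFFSET fields cooperate, the following Python sketch splits a payload to fit an MTU and puts it back together. It is not how the kernel does it: the names Fragment, fragment, reassemble, and IP_HEADER_LEN are hypothetical, and the sketch assumes a 20-byte header without options. The one protocol detail it relies on is that the offset field counts 8-byte units, so every fragment except the last carries a multiple of 8 payload bytes.

```python
from dataclasses import dataclass

IP_HEADER_LEN = 20  # assumption: IP header without options


@dataclass
class Fragment:
    ident: int      # IP ID, copied from the original packet
    offset: int     # fragment offset, in 8-byte units
    mf: bool        # More Fragments flag: False only on the last fragment
    payload: bytes


def fragment(ident, payload, mtu):
    """Split payload into fragments that fit mtu (header included)."""
    # Round the per-fragment payload down to a multiple of 8 bytes,
    # because the offset field counts 8-byte units.
    chunk = (mtu - IP_HEADER_LEN) // 8 * 8
    if chunk <= 0:
        raise ValueError("MTU too small to carry any payload")
    frags = []
    for start in range(0, len(payload), chunk):
        piece = payload[start:start + chunk]
        more = start + chunk < len(payload)
        frags.append(Fragment(ident, start // 8, more, piece))
    return frags


def reassemble(frags):
    """Rebuild the payload; return None while fragments are missing."""
    by_offset = {}
    total = None
    for f in frags:
        # A duplicate of the same fragment simply overwrites the earlier copy.
        by_offset[f.offset * 8] = f.payload
        if not f.mf:
            # The original size is known only once the last fragment arrives.
            total = f.offset * 8 + len(f.payload)
    if total is None:
        return None
    data = bytearray()
    for off in sorted(by_offset):
        if off != len(data):
            return None  # hole (or unexpected overlap): keep waiting
        data += by_offset[off]
    return bytes(data) if len(data) == total else None


payload = bytes(4000)
frags = fragment(1000, payload, mtu=1500)
```

With a 4,000-byte payload and an MTU of 1,500, this produces two fragments carrying 1,480 bytes each and a smaller final one carrying 1,040 bytes. Because reassemble places data by offset, fragments can be passed in any order, and duplicates are harmless.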
18.4.3. Examples of Problems with Fragmentation/Defragmentation

Fragmentation is a pretty simple process: the node simply has to choose the right value to fit the MTU. It should not come as a surprise that most of the issues have to do with defragmentation. In the next two sections, we cover two of the most common issues: handling retransmissions and reassembling packets properly, along with the special problem of Network Address Translation (NAT). Another reason not to use fragmentation is that it is incompatible with congestion control algorithms.

18.4.3.1. Retransmissions

I said earlier that an IP packet cannot be delivered to the next-higher layer until it has been completely defragmented. However, this does not mean that fragments are kept in the host's memory indefinitely. Otherwise, it would be very easy to render a host unusable through a simple Denial of Service (DoS) attack. A fragment might not be received for several reasons: for instance, it might be dropped along the way by a router that has run out of memory to store it due to congestion, it might become corrupted and be discarded due to the CRC (error check), or it could be held up by a firewall because the firewall wants to view the header in the first fragment before forwarding any fragments. Therefore, each router and host has a timer that cleans up the resources used by the fragments of an IP packet if some fragments are not received within a given amount of time.

If a sender could tell that a fragment was lost or dropped along the path, it would be nice if the sender could retransmit just the missing fragment. This is completely unfeasible to implement, though. A sender cannot know even whether its packet was fragmented by a router later on in the path, much less what the fragments are. So each sender must simply wait for a higher layer to tell it to resend an entire packet. A retransmitted packet does not reuse the same ID as the original.
However, it is still possible for a host to receive copies of the same IP fragment with the same packet ID, so a host must be able to handle this situation. Note that the same fragment may be received multiple times even without retransmissions: a common example is when there's a loop at the L2 layer. We saw this case in Part IV. This waste provides another good reason to avoid fragmentation at the source and to try to use packet sizes that minimize the likelihood of fragmentation along the way if delays are bad for the application (e.g., in videoconferencing software). Since the kernel cannot swap its data out to disk (it swaps only user-space data), the memory waste due to handling fragments has a heavy impact on router performance. Linux puts a limit on the amount of memory usable by fragments, as described in the section "Tuning via /proc Filesystem" in Chapter 23. Since IP is a connectionless protocol, there is no flow control and it is up to the upper-layer protocols (or the applications) to take care of losses. Some applications, of course, do not care much about the loss of data, and others do. Let's suppose the upper layer detects the loss of some data by some means (for instance, with a timer that expires due to the lack of acknowledgment) and tries a retransmission. Since it is not possible to selectively resend only the missing fragments, the L4 protocol has to retransmit the entire IP packet. Each retransmission can lead to some special conditions that have to be handled by the receiver side (and sometimes by intermediate routers as well when the latter implement some form of firewalling that requires packets to be defragmented). Here are some of them:
Things can get more complicated if you consider that fragments can get fragmented, too.

18.4.3.2. Associating fragments with their IP packets

Because fragments could arrive out of order, defragmentation is a complex process that requires each packet to be recognized and put in its proper place as it arrives. The insert, delete, and merge operations must be easy and quick. To identify the IP packet a fragment belongs to, the kernel takes the following parameters into consideration:
Unfortunately, it is possible for different packets to share all of these parameters. For instance, two different senders could happen to choose the same packet ID for packets that happen to arrive at the same time. One might suppose that the source IP addresses would distinguish the packets, but what if both hosts sat behind a NAT router that put its own IP address on the packets? There is no way the recipient IP layer can distinguish fragments under these conditions. You cannot count on the IP ID field either, because it is a 16-bit field and can therefore wrap around pretty quickly on a fast network. Since the IP ID field plays a central role in the defragmentation process, let's see how IP fragments are organized in memory and how the IP IDs are generated. The most obvious implementation of an IP ID generator would be one that increments a global counter and uses it as the ID each time the IP layer is asked to send a packet. This would assure sequential IDs and easy implementation. This simple model, however, has some problems:
Thus, we have to accept the likelihood that the IP layer occasionally mixes together data from completely different packets. When that happens, only the higher layers can detect that something is wrong and fix the problem, usually with error checking. The following section shows one way in which Linux reduces the likelihood of (but does not solve) the wraparound problem and ID prediction. The section "Selecting the IP Header's ID Field" in Chapter 23 shows the precise algorithm and code.

18.4.3.3. Example of IP ID generation

The wraparound problem is partially addressed by means of multiple, concurrent, global counters. Instead of a global IP ID, the Linux kernel keeps a different one for each destination IP address (up to the maximum number of possible IP destinations). Note that by using multiple IP IDs, you make the IDs take a little longer to wrap around, but eventually they will do so anyway.

Figure 18-11 shows an example. Let's suppose we have traffic addressed to two servers with addresses IP1 and IP2. Let's suppose also that for each IP address we have different independent streams of traffic, such as HTTP, Telnet, and FTP. Because the IP IDs are shared by all the streams of traffic going to the same destination, the packets will have sequential IDs if you look at traffic to the destination as a whole, but the traffic of each application will not have sequential IDs. For instance, the IP packets to destination IP1 that are generated by a Telnet session are not sequential. Note that this is merely the solution chosen by Linux, and is not a standard. Other alternatives are available.

18.4.3.4. Example of unsolvable defragmentation problem: NAT

Despite all manner of cleverness at the IP layer, the rules of fragmentation lead to potential situations that the IP layer cannot solve. Figure 18-12 shows one of them. Let's suppose that R is a router doing NAT for all the hosts on its network.
To be more precise, let's suppose R did masquerading:[*] the source IP addresses in the headers of the IP packets generated by the hosts in the internal network and addressed to the Internet are replaced with router R's IP address, 140.105.1.1.[
Let's also suppose that both PC1 and PC2 need to send some traffic to the same destination server S. What would happen if, by chance, two packets transmitted at more or less the same time had the same IP ID (in this example, 1,000)? Since the router R rewrites the source IP address changing 10.0.0.2 and 10.0.0.3 into 140.105.1.1, server S will think that the two IP packets it received both came from router R. In the absence of fragmentation, this is not a problem because the L4 information (for instance, the port number) distinguishes the two sources. In fact, that is what makes NAT usable in the first place. The problem arises when the two IP packets transmitted by R get fragmented before arriving at server S. In this case, server S receives fragments with the same source and destination IP address (140.105.1.1, 151.41.21.194) and the same IP ID (1,000), and therefore tries to put them together and potentially mixes the fragments of two different IP packets. As a consequence of this, both of the packets will be discarded because they are considered corrupted. In the very worst case, the two packets could have the same length and the overlapping could corrupt the payload without corrupting the L4 headers. The IP checksum covers only the IP header and therefore cannot detect this condition. Depending on the application, the consequences could be serious.

Figure 18-11. Concurrent applications receiving nonconsecutive IP header IDs

After an enumeration of all the problems with fragmentation, we can understand better why the designers of the IPv6 protocol decided to allow IP fragmentation only at the originating hosts, and not at intermediate hosts such as routers.

Figure 18-12. Example where NAT and IP fragmentation could give trouble

18.4.4. Path MTU Discovery

After the long discussion of the pitfalls of packet fragmentation, readers can well appreciate the next IP layer feature I'll discuss, path MTU discovery.
When I described the net_device data structure in Chapter 2, I listed the MTUs of the most common interface types. The scope of the MTU is the LAN that the network interface is connected to. If you transmit an IP packet to another host on the same LAN as the interface you use to transmit, and the size of the packet is bigger than the LAN's MTU, the IP packet will have to be fragmented. However, if you choose a size that fits the MTU, you can ensure that no fragmentation will be required. When the destination host is not on a directly attached LAN, you cannot count on the LAN's MTU to derive whether fragmentation will take place. Here is where path MTU discovery comes in. Path MTU discovery is used to discover the biggest size a packet transmitted to a given destination address can have without being fragmented. That parameter is called the Path MTU (PMTU). Basically, the PMTU is the smallest MTU encountered on any of the links along the route from one host to the other. Since the path between two endpoints can be asymmetric, it follows that there can be two different PMTUs for any given pair of hosts. Each host computes and uses the one appropriate for sending packets to the other. Furthermore, a change of route can lead to a change of PMTU. Since each destination IP address can use a different PMTU, it is cached in the associated routing table cache entry. We will see in Part VII that the routes in the routing table can aggregate several IP addresses; for instance, you can have a route that says that network 10.0.1.0/24 is reachable via gateway 10.0.2.1. The routing table cache, on the other hand, has one single entry for each destination IP address the host has been talking to in the recent past.[*] You may therefore have an entry for host 10.0.1.2 and another one for 10.0.1.3, even though they are reached through the same gateway. Each of those entries has its own PMTU.
You may object that, if those two addresses belong to two hosts within the same LAN, a third host would probably use the same route to reach both hosts and therefore share the same PMTU. It would make sense to keep just one PMTU in the routing table. This is unfortunately not possible. Just because one route is used to reach a bunch of addresses does not necessarily mean that they belong to the same LAN. Routing is a complex subject, and we will cover several aspects of it in Part VII.
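The definition of the PMTU and the per-destination caching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: path_mtu, pmtu_cache, and the link MTU values are hypothetical and do not reflect the kernel's routing cache structures.

```python
def path_mtu(link_mtus):
    """The PMTU is the smallest MTU among the links a path crosses."""
    if not link_mtus:
        raise ValueError("a path needs at least one link")
    return min(link_mtus)


# Hypothetical link MTUs: the two directions between a pair of hosts
# may follow different routes and therefore see different PMTUs.
a_to_b = [1500, 1400, 1500]
b_to_a = [1500, 1500, 1500]

# One cache entry per destination address, even when both destinations
# are covered by the same route (10.0.1.0/24 via 10.0.2.1 in the text).
pmtu_cache = {
    "10.0.1.2": path_mtu([1500, 1500]),
    "10.0.1.3": path_mtu([1500, 1492]),
}
```

Keeping one value per destination, rather than one per route, is what lets the two hosts in this example end up with different PMTUs even though they are reached through the same gateway.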
Each routing table entry is associated with an egress device:[
Directly connected devices include the two endpoints of a telecom cable or devices on an Ethernet LAN. It's particularly important for all devices on the LAN (with no router between them) to share the same MTU for proper operation. If devices are not directly connected (that is, if at least one router lies between them) or if PMTU discovery is disabled, the PMTU by default is set to 576. This is not a random value, but is defined in the original IP RFC 791.[
Let's see how PMTU discovery works. The algorithm simply takes advantage of the IP header's fields used to handle fragmentation/defragmentation and the associated ICMP messages. If you transmit an IP packet with the DF flag set in the header and no one complains, it means that no fragmentation has taken place along the path to the destination, and that the PMTU you used is fine. This does not mean you are using the optimal size. You might well be able to increase the PMTU and still not have fragmentation. A simple example is where two Ethernet LANs are connected by a router. On both sides of the network, the MTU is 1,500, but hosts of each LAN use the MTU of 576 to talk to the hosts of the other LAN because they are not directly connected. This is not optimal. If you increase the size of the packets in a probe to their optimal size, you will be notified with an ICMP message when you cross the real PMTU. The ICMP message will include the MTU of the device that complained so that the kernel can update the local PMTU accordingly. Linux can be configured to handle path MTU discovery in one of the following ways:
When path MTU discovery is enabled, the PMTU associated with a route can change at any time, for instance when the path comes to include routers with a smaller MTU, which results in the source receiving an ICMP FRAGMENTATION NEEDED message (see the discussion of icmp_unreach in Chapter 25). In this case, the PMTU is updated for all the entries in the routing cache with the same destination.[*] Refer to the section "Expiration Criteria" in Chapter 33 for details on how the reception of the ICMP FRAGMENTATION NEEDED message is handled by the routing table. It should be noted that the algorithm always shrinks the PMTU; it never increases it. However, the entries of the routing cache whose PMTU is derived from an ingress ICMP FRAGMENTATION NEEDED message expire after some time, which is equivalent to going back to the (bigger) default PMTU. See the same section just referenced for more details.
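The shrink-only behavior and the expiration of learned values can be illustrated with a small simulation. This is a sketch under stated assumptions, not the kernel's code: Path, PmtuEntry, discover, the link MTUs, and the 600-second lifetime are all hypothetical, and the ICMP FRAGMENTATION NEEDED message is modeled simply as the MTU value returned by the hop that rejects the packet.

```python
class Path:
    """A path modeled as the list of MTUs of the links it crosses."""

    def __init__(self, link_mtus):
        self.link_mtus = link_mtus

    def send_df(self, size):
        """Send a packet with DF set.

        Returns None if it reaches the destination, otherwise the MTU
        reported by the first hop that could not forward it.
        """
        for mtu in self.link_mtus:
            if size > mtu:
                return mtu
        return None


class PmtuEntry:
    """Cached PMTU for one destination (hypothetical structure)."""

    def __init__(self, default_pmtu, lifetime):
        self.default = default_pmtu
        self.pmtu = default_pmtu
        self.lifetime = lifetime
        self.expires = None

    def on_frag_needed(self, reported_mtu, now):
        # The PMTU is only ever shrunk by an ICMP message, never raised.
        if reported_mtu < self.pmtu:
            self.pmtu = reported_mtu
            self.expires = now + self.lifetime

    def current(self, now):
        # A learned value expires, which restores the bigger default.
        if self.expires is not None and now >= self.expires:
            self.pmtu = self.default
            self.expires = None
        return self.pmtu


def discover(path, entry, now=0):
    """Keep sending at the cached PMTU until no hop complains."""
    while True:
        size = entry.current(now)
        reported = path.send_df(size)
        if reported is None:
            return size
        entry.on_frag_needed(reported, now)


path = Path([1500, 1400, 1300, 1500])
entry = PmtuEntry(default_pmtu=1500, lifetime=600)
found = discover(path, entry, now=0)
```

Each rejected probe reports an MTU strictly smaller than the size just tried, so the loop in discover always terminates. Once the entry expires, the next packet is sent at the default size again and the discovery repeats if the path is unchanged.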
The PMTU of a route can also be set manually when adding the route through the ip route command. Even if path MTU discovery was enabled, it is still possible to lock the current PMTU so that it will not be changed. This happens in two main cases:
In Linux, the ip_dont_fragment function (shown in Chapter 22) uses the considerations described here to decide whether a packet should be fragmented when it exceeds the PMTU. The value of the PMTU on a given transmission can also be influenced by the following factors: