20.1. Forwarding

As with many networking activities described in the previous chapter, forwarding is split into two functions: ip_forward and ip_forward_finish. The second is called at the end of the first, if Netfilter allows it. Both functions are defined in net/ipv4/ip_forward.c.

By this time, thanks to the call to ip_route_input in ip_rcv_finish described in Chapter 19, the sk_buff buffer contains all the information needed to forward the packet. Forwarding consists of the following steps:
If the packet cannot be forwarded for some reason, the source host has to be notified with an ICMP message that describes the problem encountered. An ICMP message could also be sent as a warning even if the packet will be forwarded, as when a packet is routed with a suboptimal route and triggers a redirect. In the following sections, we'll examine these and other activities in the ip_forward function.

Interaction with IPsec is a major part of forwarding, and is implemented by xfrm4_xxx routines in ip_forward, which are hooks into the IPsec infrastructure. They are not covered in this book for lack of space. The behavior documented here is how forwarding works when IPsec is not configured, in which case those calls simply become no-ops.

20.1.1. ICMP Redirect

An ICMP redirect message is sent by a host system (usually a router) when it has been asked to do something that another router is better suited to do (see Chapters 25 and 31 for more details). When a packet has been source routed, the router assumes the sender had a good reason for requesting the route and does not second-guess it. It honors the requested route and does not send an ICMP redirect message. This special case is covered in the section "ip_forward Function."

20.1.2. ip_forward Function

As we have seen, ip_forward is invoked by ip_rcv_finish (see Figure 18-1 in Chapter 18) to handle all input packets that are not addressed to the local system. The function receives as an input parameter the buffer skb associated with the packet; all the necessary information is inside that structure. skb->dst, the routing information, was initialized by the call to ip_route_input in ip_rcv_finish earlier in the code path (see Chapter 33 for more details). Figure 20-1 summarizes the internals of the function:

    int ip_forward(struct sk_buff *skb)

Figure 20-1.
ip_forward function

The function revolves around manipulations of skb and of a local variable iph, which represents the packet's IP header and is initialized repeatedly from the iph field of skb. (It has to be reinitialized because the header can be changed by some of the functions called from ip_forward.)

If the Router Alert option was found in the IP header, it is handled now. The function handler for this option is ip_call_ra_chain, which relies on a global list (ip_ra_chain) that contains the list of local sockets that set the IP_ROUTER_ALERT option because they are interested in IP packets that carry the Router Alert IP option. When an ingress IP packet is fragmented, ip_call_ra_chain first defragments the entire IP packet and only then delivers it to the raw sockets of the ip_ra_chain list, as shown in Figure 18-1 in Chapter 18.
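To make the delivery logic concrete, here is a small user-space model of the decision that ip_call_ra_chain makes on behalf of ip_forward. It is only a sketch and not the kernel code: struct ra_listener and call_ra_chain are hypothetical names, and matching listeners by protocol number is an assumption made for illustration. Defragmentation and the actual hand-off to raw sockets are left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry of the ip_ra_chain list. */
struct ra_listener {
    int protocol;               /* protocol the socket wants (assumed criterion) */
    int delivered;              /* number of packets handed to this listener */
    struct ra_listener *next;   /* next registered listener, or NULL */
};

/*
 * Walk the list and hand the packet to every matching listener.
 * Returns 1 if at least one listener took the packet (the caller can
 * then stop processing it), 0 if nobody was interested.
 */
int call_ra_chain(struct ra_listener *chain, int protocol)
{
    int taken = 0;
    struct ra_listener *ra;

    for (ra = chain; ra != NULL; ra = ra->next) {
        if (ra->protocol == protocol) {
            ra->delivered++;
            taken = 1;
        }
    }
    return taken;
}
```

In this model, a forwarding routine would return early when call_ra_chain reports 1 and carry on with its normal checks when it reports 0.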
The functions that manage the alert can be found in net/ipv4/ip_sockglue.c (see, for example, ip_ra_control and how it is called by ip_setsockopt to apply an option to a socket as requested by the user with a call to the setsockopt system call).

If ip_call_ra_chain delivers the packet to at least one interested socket, ip_forward has no further work to do and returns success. If there is no Router Alert option in the header, or if it is present but no interested processes are running (in which case ip_call_ra_chain returns FALSE), ip_forward continues:

    if (IPCB(skb)->opt.router_alert && ip_call_ra_chain(skb))
        return NET_RX_SUCCESS;

The following check is used just to make sure that the packet we're handling was actually addressed to our host at L2. skb->pkt_type is initialized at the L2 layer (see Chapter 13), and defines the type of frame. It is assigned the value PACKET_HOST when the frame is addressed to the L2 address of the receiving interface. If the lower-level functions do their jobs correctly, there should be no need for this check, but we do it just in case an error left us with a packet we should not have received in the first place.

    if (skb->pkt_type != PACKET_HOST)
        goto drop;

Since we are forwarding the packet, we are operating entirely at the L3 layer and it is not our business to worry about the L4 checksum; we use CHECKSUM_NONE to indicate that the current checksum is OK. If some handling changes the IP header or the TCP header or payload later, before transmission, the kernel will recalculate the checksum there.

    skb->ip_summed = CHECKSUM_NONE;

The real forwarding process starts by decrementing the TTL field. The IP protocol definition says that when TTL reaches the value of 0 (which means you received it with value 1 and it became 0 after you decremented it), the packet has to be dropped and a special ICMP message has to be sent to the source to let it know you dropped the packet.

    if (iph->ttl <= 1)
        goto too_many_hops;

Note that the TTL field has not been decremented yet; it will be done a few lines of code later.
The reason for waiting is that the packet may still be shared at this point with other subsystems such as sniffers; the header must be unchanged in that case.

rt points to a data structure of type rtable, which contains all the information needed by the forwarding engine, including the next hop (rt_gateway). If the IP header contains a Strict Source Route option and the next hop (extracted from that option) does not match the one found by the routing subsystem, the Source Routing option fails and the packet is dropped.

    rt = (struct rtable*)skb->dst;
    if (opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
        goto sr_failed;

In this case, another ICMP message is transmitted to the sender.

Once most of the sanity checks have passed, the function updates the packet header a bit and then gives it to ip_forward_finish.

Since we are about to modify the content of the buffer, we need to make a local copy for ourselves. The copy is actually done by skb_cow only if the packet is shared (if the packet is not shared it can be safely modified) or if the space available at the head of the packet is not sufficient to store the L2 header.

    if (skb_cow(skb, LL_RESERVED_SPACE(rt->u.dst.dev)+rt->u.dst.header_len))
        goto drop;

Now the TTL is decremented by ip_decrease_ttl, which also updates the IP checksum.

    ip_decrease_ttl(iph);

If a better next hop is available than the requested one, the originating host is now notified with an ICMP REDIRECT message, but only if the originating host did not request source routing. The opt->srr field indicates that source routing was requested, in which case the originating host doesn't care whether a supposedly better route is found. In Chapter 35 you will see when exactly the RTCF_DOREDIRECT flag is set on a cached route to indicate that the source of the packet should be sent an ICMP REDIRECT message.
    if (rt->rt_flags & RTCF_DOREDIRECT && !opt->srr)
        ip_rt_send_redirect(skb);

The priority field is set here using the Type of Service field of the IP header. The priority will be used later by Traffic Control (the QoS layer).

    skb->priority = rt_tos2priority(iph->tos);

The function terminates by asking Netfilter to execute ip_forward_finish, if there are no filtering rules that forbid forwarding.

    return NF_HOOK(PF_INET, NF_IP_FORWARD, skb, skb->dev, rt->u.dst.dev,
                   ip_forward_finish);

20.1.3. ip_forward_finish Function

If this function is reached, it means the packet has passed all the checks that could stop it and is truly ready to be sent out to another system.

Two possible options from the IP header have been handled so far, as we saw in the section "ip_forward Function": Router Alert and Strict Source Routing. Now we pass the packet to the function ip_forward_options to handle any final work required by the options. It can find out what needs to be done by checking flags (such as opt->rr_needaddr) and offsets (such as opt->rr) initialized earlier by ip_options_compile, which was invoked from ip_rcv_finish. ip_forward_options also recomputes the IP checksum in case it had to update any of the IP header fields. See the section "Option Processing" in Chapter 19.

The packet is finally transmitted with dst_output, described in the next section:

    static inline int ip_forward_finish(struct sk_buff *skb)
    {
        struct ip_options *opt = &(IPCB(skb)->opt);

        IP_INC_STATS_BH(IPSTATS_MIB_OUTFORWDATAGRAMS);

        /* Options are rare, so the branch is marked unlikely. */
        if (unlikely(opt->optlen))
            ip_forward_options(skb);

        return dst_output(skb);
    }

It may seem we are close to the wire, but there are still a couple of tasks to do before having the device driver do the transmission.

20.1.4. dst_output Function

All transmissions, whether generated locally or forwarded from other hosts, pass through dst_output on their way to a destination host, as shown in Figure 18-1 in Chapter 18.
The IP header at this point is finished: it embodies the information needed to transmit as well as any other information the local system was responsible for adding.

    static inline int dst_output(struct sk_buff *skb)
    {
        int err;

        for (;;) {
            err = skb->dst->output(&skb);

            /* Success: the packet has been handed off. */
            if (likely(err == 0))
                return err;
            /* Any error other than NET_XMIT_BYPASS is final. */
            if (unlikely(err != NET_XMIT_BYPASS))
                return err;
        }
    }

dst_output invokes the virtual function output, which has been initialized to ip_output if the destination address is unicast and ip_mc_output if it is multicast. Fragmentation is handled in that function. At the end, ip_finish_output is called to interface with the neighboring subsystem (see Figure 18-1 in Chapter 18). ip_finish_output, described in the section "Interface to the Neighboring Subsystem" in Chapter 21, is invoked only if Netfilter gives the green light (otherwise, the packet is dropped).

Note that the output function can potentially be invoked multiple times if it returns the NET_XMIT_BYPASS value. This is, for instance, a simple mechanism to call a sequence of output routines. The IPsec protocol suite uses it to apply transformations before the real transmission.
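The retry loop in dst_output is easier to follow with a toy model. The sketch below is user-space C, not kernel code: struct packet, the XMIT_* constants, and the three output routines are hypothetical, and the numeric values of the constants are arbitrary. It assumes that a transformation stage installs the next output routine before returning the bypass code, which is one way the chaining described above could be arranged.

```c
#include <stddef.h>

#define XMIT_OK     0   /* stands in for a successful transmission */
#define XMIT_DROP   1   /* stands in for any final error */
#define XMIT_BYPASS 4   /* stands in for NET_XMIT_BYPASS (value is arbitrary) */

struct packet;
typedef int (*output_fn)(struct packet *pkt);

struct packet {
    output_fn output;   /* plays the role of skb->dst->output */
    int transforms;     /* transformation stages applied so far */
    int sent;           /* set to 1 once the final stage has run */
};

/* Final stage: pretend to transmit the packet. */
int final_output(struct packet *pkt)
{
    pkt->sent = 1;
    return XMIT_OK;
}

/* Intermediate stage: transform, install the next stage, ask to be re-run. */
int transform_output(struct packet *pkt)
{
    pkt->transforms++;
    pkt->output = final_output;
    return XMIT_BYPASS;
}

/* A stage that fails with a final error. */
int drop_output(struct packet *pkt)
{
    (void)pkt;
    return XMIT_DROP;
}

/* Same control flow as the loop shown above for dst_output. */
int run_output_chain(struct packet *pkt)
{
    for (;;) {
        int err = pkt->output(pkt);

        if (err == XMIT_OK)
            return err;
        if (err != XMIT_BYPASS)
            return err;
        /* XMIT_BYPASS: loop again and call whatever output is now set. */
    }
}
```

A packet that starts with transform_output passes through both stages and ends with XMIT_OK, while a packet that starts with drop_output leaves the loop on the first iteration with the error code.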