

19.3. IP Options

Because processing IP options adds overhead, they have never been used much. The next sections go through the IP options handled by the Linux kernel, one by one, and show how they are processed.

Here are the main APIs involved with IP option management, all of them defined in net/ipv4/ip_options.c. To understand some of them, remember that not all of the IP options of a packet need to be replicated in all of its fragments.


ip_options_compile

Parses a block of options from an IP header and initializes an instance of an ip_options structure accordingly. This structure will be used later to process the options; it includes flags and pointers that tell the part of the routing subsystem that handles forwarding what has to be written into the IP header options space, and where. ip_options_compile is described in detail in the section "Option Parsing."


ip_options_build

Initializes the portion of an IP header dedicated to the options, based on an input ip_options structure. This function is used when transmitting locally generated packets. Thanks to an input parameter, it can distinguish fragments and treat them accordingly: it omits from the header of each fragment those options that do not have to be copied into that fragment (see the section "IP options" in Chapter 18), and overwrites them with null options instead. It also clears the flags of the ip_options structure (such as opt->rr_needaddr) that are used to signal the need to add a timestamp or an address to the options.


ip_options_fragment

Because the first fragment is the only one that inherits all the options of the original packet, the size of its header is supposed to be greater than or equal to the size of the following ones. Linux simplifies this rule, however: by keeping the same header size for all fragments, it makes the fragmentation process simpler and more efficient. On every fragment but the first, the original header is copied with all its options; the options that do not need to be replicated (those where IPOPT_COPY is not set) are then overwritten with null options (IPOPT_NOOP), and the flags of the ip_options structure associated with them (e.g., ts_needaddr) are cleared. Null options are described later in the section "Option Parsing."

This last operation is exactly the purpose of ip_options_fragment. When we talk about ip_fragment in Chapter 22, we will see that after the first IP fragment has been sent, the kernel calls ip_options_fragment to change the IP header, and recycles the new adapted header thereafter for all of the following fragments.
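
The overwriting step can be illustrated with a minimal user-space sketch. It is not the kernel function: the name sketch_options_fragment and the SK_ constants are hypothetical, the "copied" flag is assumed to be the top bit of the option type byte, and the sketch does not touch any ip_options flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_IPOPT_END   0x00  /* end of option list */
#define SK_IPOPT_NOOP  0x01  /* null option */
#define SK_IPOPT_COPY  0x80  /* "copied" flag: top bit of the type byte */

/*
 * Overwrite in place every option whose "copied" flag is clear with null
 * options, so that the option block keeps its original size.
 * Hypothetical helper, written only to illustrate the idea.
 */
void sketch_options_fragment(uint8_t *opts, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t type = opts[i];
        size_t optlen;

        if (type == SK_IPOPT_END)
            break;                  /* nothing meaningful follows */
        if (type == SK_IPOPT_NOOP) {
            i++;                    /* single-byte option */
            continue;
        }
        if (i + 1 >= len)
            break;                  /* no room for a length byte: stop */
        optlen = opts[i + 1];
        if (optlen < 2 || optlen > len - i)
            break;                  /* malformed length: stop */
        if (!(type & SK_IPOPT_COPY))
            memset(&opts[i], SK_IPOPT_NOOP, optlen);
        i += optlen;
    }
}
```

With a block that holds a Record Route option (type 7, not copied) followed by a loose Source Routing option (type 0x83, copied), only the first one is replaced by null options.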


ip_forward_options

When forwarding a packet, some options may need to be processed. ip_options_compile parses the options and initializes a set of flags in the ip_options structure used to store the result of the parsing. Later, ip_forward_options handles the options according to those flags.


ip_options_get

This function receives a block of options, parses them with ip_options_compile, and stores the result in an ip_options structure it allocates. It can receive the input options from either kernel space or user space; there is an input parameter to specify the source. An example of usage is via the ip_setsockopt function that is used by L4 protocols such as TCP and UDP to set the IP options on a given socket (see the system call setsockopt). ip_options_get takes care of the padding described in the section "'End of option list' and 'No operation' options" in Chapter 18.
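
Padding is needed because the IP header length is expressed in 32-bit words, so the option block must be a multiple of four bytes. The following is a minimal sketch of that step, assuming the pad byte is the "End of option list" option; sketch_pad_options and the SK_ constants are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_IPOPT_END     0x00  /* end of option list, used here as pad byte */
#define SK_MAX_IPOPTLEN  40    /* 60-byte maximum header minus 20 fixed bytes */

/*
 * Pad an option block to the next multiple of four bytes.
 * buf must have room for SK_MAX_IPOPTLEN bytes. Returns the padded length.
 */
size_t sketch_pad_options(uint8_t *buf, size_t optlen)
{
    assert(optlen <= SK_MAX_IPOPTLEN);
    while (optlen & 3)
        buf[optlen++] = SK_IPOPT_END;
    return optlen;
}
```

Because 40 is itself a multiple of four, padding can never push a valid block past the maximum size.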


ip_options_echo

Given an ingress IP packet and its IP options, this function builds the IP options to use to reply back to the sender. For example, the source route options must be reversed on the reply packet. Refer to RFC 1122 (Requirements for Internet Hosts), sections 3.2.1.8, 4.1.3.2, and 4.2.3.8, and to RFC 1812 (Requirements for IP Version 4 Routers).

Some of the places where this routine is invoked include:

  • icmp_reply to reply to an ingress ICMP request

  • icmp_send when an ingress IP packet meets conditions that require the generation of an ICMP message

  • ip_send_reply, which is the generic routine provided by IP to reply to an ingress IP packet

  • TCP to save the options of an ingress SYN segment

Now let's see how the functions are used in practice. Because you have not yet seen the internals of all the functions in Figure 18-1 in Chapter 18, you may not understand everything at this stage. You can come back to this second part of the section once you are familiar with the other functions.

As you saw in Figure 18-1 in Chapter 18, different paths can lead to the transmission of a packet, and they handle the IP options in slightly different ways. I will cover two cases and leave you the others as an exercise.

19.3.1. Option Processing

The options of an ingress IP packet are first parsed with the ip_options_compile function, described in the next section. As mentioned in the previous section, the options are then processed by different routines at different times, depending on whether a packet is to be forwarded, fragmented, etc. Figure 19-3 summarizes where the key routines introduced in the previous section (with a lighter color) are called for ingress packets and for locally generated packets.

When an ingress packet is to be forwarded, ip_rcv_finish calls ip_forward (via dst_input) to take care of the forwarding process. ip_forward handles the Router Alert option, if present, and makes sure that there are no problems with the strict source route option. Then it asks ip_forward_finish to complete the job of forwarding. The latter can behave differently depending on whether the header contains options.

Let's suppose the packet had options. In this case, ip_forward_finish calls ip_forward_options to handle those options that should be processed when forwarding a packet, and then calls dst_output to carry out the actual transmission. As shown in Figure 18-1 in Chapter 18, dst_output ends up calling ip_output when the ingress IP packet needs to be forwarded.

Figure 19-3. (a) Ingress packets; (b) locally generated packets


At this stage, the IP header is ready to be used, because all of the options have been processed. If there was no fragmentation, options processing is finished. However, if the packet needs to be fragmented, ip_output needs to make sure that only the first fragment includes all of the options; the others should have only a subset, according to Table 18-1 in Chapter 18. In this case, ip_output calls ip_fragment. Once the first fragment is done, ip_fragment uses ip_options_fragment to clear the options that are not needed for the subsequent fragments. This way, ip_fragment can keep copying the IP header from the original packet and have all the options correct.

In a locally generated packet, options are handled with ip_options_build. We will see in Chapter 21 how that function is used by ip_queue_xmit and ip_push_pending_frames.

19.3.2. Option Parsing

Parsing, here, means extracting the IP options from the format in which they are stored in an IP packet's header and storing them in a structure called ip_options that is more convenient for program code to handle. Storing them in a dedicated data structure is useful because different options are handled in different parts of the IP code. ip_options_compile only parses the options; it does not process them. We saw in the previous section where options are processed.

The function ip_options_compile is called in two different cases:

  • By ip_rcv_finish to parse and validate the IP options of the input packets. As shown in Figure 18-1 in Chapter 18, ip_rcv_finish is called for all ingress packets, regardless of whether they will be delivered locally or forwarded. When I refer to ingress packets in this section, I am including the case of ingress packets that need to be forwarded because they are not addressed to the local system.

  • By ip_options_get, for example, to parse the input to the setsockopt system call for AF_INET sockets.

Let's now analyze how ip_options_compile parses the options of an IP packet's header. This is the function's prototype:

int ip_options_compile(struct ip_options * opt, struct sk_buff * skb)

The values of the two input parameters let the function know the context in which it is being called:

  • Ingress packet: skb not NULL (in this case, opt is NULL)

  • Packet being transmitted: skb equal to NULL (in this case, opt is non-NULL)

This means that depending on the function's context, the IP header is stored in different places. When transmitting a locally generated packet, opt is not NULL and opt->__data contains the options of an IP header that was previously partially generated by the caller. If instead the function is processing an ingress packet, the header is contained in the skb input buffer and opt is NULL. In this second case, the ip_options structure is stored in skb->cb. ip_options_compile initializes local variables such as optptr according to where the IP header is located (i.e., skb->nh or opt->__data). The value of skb is also often used by ip_options_compile to distinguish between the two previous cases.

In both cases (ingress and transmit), you need to fill in opt. The only choices to make are where to get the input IP header to parse and where to store the result.

    if (!opt) {
        opt = &(IPCB(skb)->opt);
        memset(opt, 0, sizeof(struct ip_options));
        iph = skb->nh.raw;
        opt->optlen = ((struct iphdr *)iph)->ihl*4 - sizeof(struct iphdr);
        optptr = iph + sizeof(struct iphdr);
        opt->is_data = 0;
    } else {
        optptr = opt->is_data ? opt->__data : (unsigned char*)&(skb->nh.iph[1]);
        iph = optptr - sizeof(struct iphdr);
    }
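
The opt->optlen computation in the first branch can be checked in isolation. The sketch below assumes only the standard IPv4 layout (IHL in the low nibble of the first header byte, counted in 32-bit words, with a 20-byte fixed header); sketch_options_len is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define SK_IPHDR_FIXED_LEN 20  /* IPv4 header size without options */

/*
 * Given the first byte of an IPv4 header (version in the high nibble,
 * IHL in the low nibble), return the size in bytes of the option block,
 * or -1 if the IHL is smaller than the fixed header.
 */
int sketch_options_len(uint8_t first_byte)
{
    int ihl = first_byte & 0x0F;   /* header length in 32-bit words */

    if (ihl < 5)
        return -1;
    return ihl * 4 - SK_IPHDR_FIXED_LEN;
}
```

An IHL of 5 means no options at all, and the largest IHL (15) leaves 40 bytes for options.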

If parsing fails, ip_options_compile returns immediately. The caller will handle the event in one of the following ways, depending on whether the options were used by a received or transmitted packet:


Bad option in a received packet

An ICMP message is sent back to the source.


Bad option in a transmitted packet

The application is notified through an error value returned by the function used to transmit the packet.

Among the possible reasons for a parsing failure are:

  • A single option cannot be present more than once in the header. The only exception is the dummy or null option IPOPT_NOOP. The latter can be present any number of times and is usually used to enforce some kind of alignment, either on an individual option or on the payload that follows the options (the null option needs no handling).

  • A header field has been assigned an invalid value, or a value that the current user is not allowed to use. This case applies to locally generated traffic. Only the superuser is allowed to generate IP packets with option or suboption codes not understood by the kernel. The check for the superuser privilege is done by the capable function.

    The original IP RFC says that when receiving an option that is not understood, a router should just ignore it. Linux behaves differently only with locally generated packets (see the earlier reference to capable).

Currently, there are only two single-byte options:

  • End of options (IPOPT_END)

  • Null option (IPOPT_NOOP)

The main for loop simply goes option by option and stores the result of parsing in the output ip_options structure opt. The code inside the loop may look complex, but actually it is very easy to read if you take into consideration the following points:

  • l represents the size of the block of options that has not been parsed yet.[*]

    [*] While reading the code, make sure you do not confuse the variable l, used as the index of the for loop, with the integer 1. They look quite the same and it is easy to lose an hour trying to understand the code if you confuse them. It has already happened to one person.

  • optptr points to the current position on the block of options being analyzed. optptr[1] is the option's length, and optptr[2] is the option pointer (where the option starts). Figure 19-4 shows where the array's elements point. The code that handles each option always starts with two sanity checks based on these parameters.

  • optlen gets initialized to the length of the current option. Do not confuse optlen with opt->optlen, which is the length of the whole block of options. Note that when opt is not NULL, opt->optlen is not initialized by ip_options_compile because that has already been done in ip_options_get.

  • The flag is_changed is used to keep track of when the header has been changed (which requires the checksum to be recomputed).

Figure 19-4. ip_options_compile's local variables' values in the middle of an execution


There cannot be other options after the IPOPT_END option. Therefore, as soon as one is found, whatever follows it is overwritten with more IPOPT_END options.

The basic sanity checks for multibyte options include:

  • The option must be at least four bytes long. Since the header of the option is three bytes long, the pointer field cannot be smaller than 4. The timestamp option, for instance, requires a length of at least five octets, where four are used just by the header (see Figure 18-8 in Chapter 18).

  • Options that reserve space in the header, because they are supposed to be filled in by the next hops or by the destination host, must respect the size required by the option. For instance, the timestamp option is supposed to reserve a space that is a multiple of four bytes (the size of an IPv4 address).

Since the length of each option includes the first two bytes (type and length) and since it starts counting from 1 (not 0), if length is less than 2 or bigger than the block of options left to analyze, there is an error:

        if (optlen<2 || optlen>l) {
            pp_ptr = optptr;
            goto error;
        }

Note that some options (such as TIMESTAMP) have a minimum length bigger than 2, and thus the general check just shown is necessary but not always sufficient. The more specific checks are inside the per-option handlers. When an error is found in the options, a special ICMP message has to be sent back to the sender. This ICMP packet includes the original IP header, eight bytes of the IP payload, and an offset that points to where the error was found. The eight bytes of the IP payload consist of the start of the L4 header and usually include the L4 port numbers; this allows the receiver of the ICMP error message to find the socket associated with the faulty IP packet (more details in Chapter 25). Before returning the error message, the code initializes pp_ptr to point to the place where the problem was found.
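
The skeleton of the loop described so far (single-byte options, the fill after IPOPT_END, and the generic length check) can be sketched in user space as follows. This is an illustration under assumptions, not the kernel code: sketch_walk_options and the SK_ constants are hypothetical names, the error offset is relative to the start of the option block rather than to the IP header, and the l < 2 guard is an addition that keeps the standalone sketch from reading past its buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_IPOPT_END  0x00
#define SK_IPOPT_NOOP 0x01

/*
 * Walk a block of IP options. Returns the number of multibyte options
 * found, or -1 on a malformed length; in that case *err_off holds the
 * offset, within the block, of the offending option.
 */
int sketch_walk_options(uint8_t *opts, int len, int *err_off)
{
    uint8_t *optptr = opts;
    int l = len;                 /* bytes not parsed yet */
    int count = 0;

    while (l > 0) {
        int optlen;

        if (*optptr == SK_IPOPT_END) {
            /* Nothing may follow END: overwrite the rest with END. */
            for (optptr++, l--; l > 0; optptr++, l--)
                *optptr = SK_IPOPT_END;
            break;
        }
        if (*optptr == SK_IPOPT_NOOP) {
            optptr++;
            l--;
            continue;
        }
        if (l < 2) {             /* guard added for this sketch only */
            *err_off = (int)(optptr - opts);
            return -1;
        }
        optlen = optptr[1];
        if (optlen < 2 || optlen > l) {
            *err_off = (int)(optptr - opts);
            return -1;
        }
        count++;                 /* a per-option handler would run here */
        optptr += optlen;
        l -= optlen;
    }
    return count;
}
```

The recorded offset plays the role of pp_ptr: it tells the caller where the problem was found.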

The switch statement uses, as its discriminator, the option type field. Therefore, each option is handled by a different case statement, exactly as was done before for the single-byte options:

        switch (*optptr)

The next sections analyze the multibyte options one by one, and Figures 19-5(a) and 19-5(b) show the big picture. The two obsolete options SEC and SID are recognized but not processed[*] (see RFC 1812).

[*] There are some other IP options, such as the IP MTU Discovery Option (RFC 1063), that were defined but never really used or found useful in past years, and that were therefore made obsolete. IP MTU Discovery in particular has been replaced by path MTU discovery (RFC 1191, covered in the section "Path MTU discovery" in Chapter 18).

19.3.2.1. Option: strict and loose Source Routing

Only one Source Routing option can appear in a header. The flag opt->srr is used to detect that condition: if the following code does not find any error in the option, it sets that flag. If another option of the same type appears later in the header, the error will be detected.

opt->is_strictroute is used to tell the caller whether the Source Routing option was loose or strict.

The section "ip_forward Function" in Chapter 20 shows how packets are dropped if they cannot reach their destinations while respecting the Source Routing rules.

The option is considered faulty if the length of the option (including type and length) is less than 3. This is because the value has to contain the type, length, and pointer fields. At the same time, pointer cannot have a value smaller than 4 because the first three bytes of the option are already used by the type, length, and pointer fields.

When the input skb parameter is NULL, it means that ip_options_compile has been called to parse the options of an outgoing packet (generated locally, not forwarded). In that case, the first IP address in the array of addresses provided by user space is saved in opt->faddr and then removed from the array by shifting the other elements of the array back one position with a memmove operation. This address will be retrieved later by the functions described in Chapter 21 (ip_queue_xmit and the users of ip_append_data), so that they know the destination IP address. An easy-to-follow example of the use of opt->faddr can be found in the function udp_sendmsg.

            if (!skb) {
                if (optptr[2] != 4 || optlen < 7 || ((optlen-3) & 3)) {
                    pp_ptr = optptr + 1;
                    goto error;
                }
                memcpy(&opt->faddr, &optptr[3], 4);
                if (optlen > 7)
                    memmove(&optptr[3], &optptr[7], optlen-7);
            }
            opt->is_strictroute = (optptr[0] == IPOPT_SSRR);
            opt->srr = optptr - iph;
            break;
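
The effect of the memcpy/memmove pair is easier to see on a concrete buffer. The sketch below reproduces only the locally generated case (skb equal to NULL) in user space; sketch_srr_take_first_hop is a hypothetical name, and the first hop is returned in a plain four-byte array instead of opt->faddr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * optptr points at the type byte of a source route option supplied by
 * user space. On success, the first address is copied to faddr, the
 * remaining addresses are shifted back one slot, and 0 is returned.
 * Returns -1 if the option does not have the expected layout.
 */
int sketch_srr_take_first_hop(uint8_t *optptr, uint8_t faddr[4])
{
    int optlen = optptr[1];

    /*
     * The pointer must refer to the first address (4), at least one
     * address must be present (3 + 4 = 7), and the address area must
     * be a whole number of four-byte entries.
     */
    if (optptr[2] != 4 || optlen < 7 || ((optlen - 3) & 3))
        return -1;

    memcpy(faddr, &optptr[3], 4);
    if (optlen > 7)
        memmove(&optptr[3], &optptr[7], (size_t)(optlen - 7));
    return 0;
}
```

Note that the shift leaves the last slot holding a stale copy of the final address; in this sketch nothing rewrites it.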

Figure 19-5a. ip_options_compile overview


Figure 19-5b. ip_options_compile overview


19.3.2.2. Option: Record Route

For the Record Route option, as for Timestamp, the sender reserves the part of the header it will use in advance. Because of this, when processing the option, new elements are added to the header only if there is some room left. If there is space, the ip_options_compile function sets the flag rr_needaddr to tell the routing subsystem to write the IP address of the outgoing interface into the IP header once the routing decision is taken.[*] Note that the list of IP addresses includes the transmitting interface's address if the options belong to a locally generated packet.

[*] This is done by calling ip_options_build. See Chapter 21.

            if (optptr[2] <= optlen) {
                if (optptr[2]+3 > optlen) {
                    pp_ptr = optptr + 2;
                    goto error;
                }

                if (skb) {
                    memcpy(&optptr[optptr[2]-1], &rt->rt_spec_dst, 4);
                    opt->is_changed = 1;
                }
                optptr[2] += 4;
                opt->rr_needaddr = 1;
            }
            opt->rr = optptr - iph;
            break;

Since skb is non-null only when you are processing the options of an ingress packet, this piece of code simply copies the preferred source IP address into the list of addresses being recorded in the header, and updates the flag is_changed, which will force the IP checksum to be updated. See the section "Preferred Source Address Selection" in Chapter 35 for the reason why the rt_spec_dst IP address is used.

Whether the address is written in the block of code shown here, because the packet is being forwarded, or will be written later thanks to the flag rr_needaddr that is set later, the pointer field of the option is moved ahead four bytes (the size of the IP address). This explains why ip_forward_options (which will be executed if the packet we are processing is being forwarded) will have to go back four bytes to write the IP into the right position.
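
The pointer arithmetic is easy to get wrong, so here is a minimal user-space sketch of the slot handling. It is an illustration only: sketch_record_route is a hypothetical name, the address to record is passed in by the caller instead of coming from the routing table, and the two initial sanity checks mirror the generic ones described earlier.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * optptr points at the type byte of a Record Route option.
 * Returns 1 if addr was recorded, 0 if the option is already full,
 * and -1 if the option is malformed.
 */
int sketch_record_route(uint8_t *optptr, const uint8_t addr[4])
{
    int optlen = optptr[1];
    int ptr = optptr[2];            /* 1-based offset of the next free slot */

    if (optlen < 3 || ptr < 4)
        return -1;
    if (ptr > optlen)
        return 0;                   /* no room left: nothing to record */
    if (ptr + 3 > optlen)
        return -1;                  /* a partial slot is an error */

    memcpy(&optptr[ptr - 1], addr, 4);
    optptr[2] += 4;                 /* move the pointer past the new entry */
    return 1;
}
```

After the pointer has been advanced, the entry just reserved starts at optptr[optptr[2]-5], which is why code that fills it in later has to step back.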

19.3.2.3. Option: Timestamp

Because optlen represents the length of the option being analyzed, the if statement simply checks whether any space is left to store the new information. In this case, the length of the option represents the space reserved by the transmitter (not the space used so far).

        if (optptr[2] <= optlen) {
                __u32 *timeptr = NULL;

The handling of the option depends on the suboption specified by the sub-type field in Figure 18-8 in Chapter 18, but the three suboptions are handled in the same general way. Regardless of the subtype, whoever is going to handle the option needs two pieces of information (which will be stored in the ip_options structure):

  • Whether it must record an address, a timestamp, or both

  • Where in the IP header the information has to be written (the offset)

If a timestamp needs to be recorded (this would be true for the IPOPT_TS_TSONLY and IPOPT_TS_TSANDADDR cases), timeptr is initialized to point to the place inside the IP header where it should be written. Note also that timeptr is initialized only when skb is not NULL, which is the case when the option belongs to an ingress packet (as opposed to one that is locally generated).

We already saw in the section "Option Parsing" that ip_options_compile can also be called when handling locally generated packets. In that case, skb would be NULL, so timeptr would not be initialized (i.e., it would be left NULL) and no timestamp would be recorded in the header. There is nothing wrong here, because the timestamp will be put there by ip_options_build. That function will store the timestamp because opt->ts_needtime equals 1.

The only difference between processing an ingress packet to be forwarded and a locally generated packet is that in the former case, a timestamp is added to the IP header and the checksum has to be recomputed (so opt->is_changed needs to be set as well).

When the subcode is IPOPT_TS_PRESPEC, the timestamp has to be added only when the next IP address to match is local to the system. The function used to make that check is inet_addr_type; here are the main return values:


RTN_LOCAL

The IP address belongs to a local interface.


RTN_UNICAST

The IP address is reachable according to the routing table and is unicast.


RTN_MULTICAST

The address is multicast.


RTN_BROADCAST

The address is broadcast.

Since local broadcasts and registered multicast addresses could be considered local (i.e., addresses the system listens to), the following piece of code that checks RTN_UNICAST does exactly what we want: it determines whether the address is local.

        {
            u32 addr;
            memcpy(&addr, &optptr[optptr[2]-1], 4);
            if (inet_addr_type(addr) == RTN_UNICAST)
                break;
            if (skb)
                timeptr = (__u32*)&optptr[optptr[2]+3];
        }
        opt->ts_needtime = 1;

Depending on the suboption being processed, the timestamp has to be written at a different offset within the IP header. The first part initializes timeptr accordingly, and the second part copies the timestamp to the right position. Depending on the suboption, the ts_needtime and ts_needaddr flags are also initialized.

        if (timeptr) {
            struct timeval tv;
            __u32  midtime;
            do_gettimeofday(&tv);
            midtime = htonl((tv.tv_sec % 86400) * 1000 + tv.tv_usec / 1000);
            memcpy(timeptr, &midtime, sizeof(__u32));
            opt->is_changed = 1;
        }
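
The midtime expression converts a seconds/microseconds pair into milliseconds since midnight. A small user-space sketch of the same arithmetic follows; sketch_ms_since_midnight is a hypothetical helper, it takes the two values as plain integers instead of a struct timeval, and it leaves out the htonl conversion that the kernel applies before storing the value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Milliseconds elapsed since midnight, given a time expressed as
 * seconds plus microseconds. The result is in host byte order.
 */
uint32_t sketch_ms_since_midnight(uint64_t sec, uint32_t usec)
{
    /* 86400 seconds per day; the remainder is the time of day. */
    return (uint32_t)((sec % 86400u) * 1000u + usec / 1000u);
}
```

The largest possible result is 86,399,999, which fits comfortably in the 32 bits reserved for each timestamp.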

This last part takes care of the counter overflow we described in the section "Timestamp Option" in Chapter 18.

        unsigned overflow = optptr[3]>>4;
        if (overflow == 15) {
            pp_ptr = optptr + 3;
            goto error;
        }
        opt->ts = optptr - iph;
        if (skb) {
            optptr[3] = (optptr[3]&0xF)|((overflow+1)<<4);
            opt->is_changed = 1;
        }
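
The overflow counter shares a byte with the flags field: it occupies the upper four bits of the fourth byte of the option. The bit manipulation can be tried out with the following sketch, where sketch_ts_bump_overflow is a hypothetical name and the byte is passed in directly rather than as optptr[3].

```c
#include <assert.h>
#include <stdint.h>

/*
 * Increment the 4-bit overflow counter stored in the upper nibble of
 * *oflw_flags, leaving the flags in the lower nibble untouched.
 * Returns -1 (and changes nothing) if the counter is already at 15.
 */
int sketch_ts_bump_overflow(uint8_t *oflw_flags)
{
    unsigned overflow = *oflw_flags >> 4;

    if (overflow == 15)
        return -1;               /* counter would wrap: report an error */
    *oflw_flags = (uint8_t)((*oflw_flags & 0x0F) | ((overflow + 1) << 4));
    return 0;
}
```

Once the counter has reached 15, any further attempt is reported as an error instead of wrapping back to zero.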

19.3.2.4. Option: Router Alert

As we explained in the section "Router Alert Option" in Chapter 18, the last two bytes of this option must be zero. If this option passes the sanity check, ip_options_compile initializes the router_alert field so that later ip_forward will handle it accordingly. (opt->router_alert is simply treated as a Boolean: zero or nonzero.)

            if (optptr[2] == 0 && optptr[3] == 0)
                opt->router_alert = optptr - iph;

19.3.2.5. Handling parsing errors

If the error was found in a locally generated packet (skb==NULL), the function simply returns an error that will have to be handled by the caller. If instead it was found on a received IP packet, an ICMP error message has to be sent back to the source:

error:
    if (skb) {
        icmp_send(skb, ICMP_PARAMETERPROB, 0, htonl((pp_ptr-iph)<<24));
    }
    return -EINVAL;
}
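
The last argument passed to icmp_send deserves a closer look: the offset of the faulty byte is shifted into the most significant byte of a 32-bit value because the ICMP Parameter Problem message carries its pointer in the first byte of the four-byte field that follows the checksum. The sketch below reproduces just that computation in user space; sketch_param_prob_info and sketch_first_wire_byte are hypothetical helpers.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build the 32-bit value, in network byte order, whose first byte on
 * the wire is the offset of pp_ptr from the start of the IP header.
 */
uint32_t sketch_param_prob_info(const uint8_t *iph, const uint8_t *pp_ptr)
{
    uint32_t offset = (uint32_t)(pp_ptr - iph);

    return htonl(offset << 24);
}

/* Return the first byte of a network-order value as sent on the wire. */
uint8_t sketch_first_wire_byte(uint32_t net_value)
{
    uint8_t bytes[4];

    memcpy(bytes, &net_value, sizeof(bytes));
    return bytes[0];
}
```

For a problem found 22 bytes into the header (two bytes into the option block), the pointer byte on the wire is 22 regardless of the host's endianness.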

