嵌入式linux中文站在线图书

Previous Page
Next Page

35.9. Effects of Multipath on Next Hop Selection

In both ip_route_input_slow and ip_route_output_slow, fib_select_multipath is called only when:

  • Multipath support is included in the kernel (CONFIG_IP_ROUTE_MULTIPATH).

  • The routing lookup with fib_lookup returns a route with more than one next hop (fib_nhs> 1).

  • The egress interface was not provided with the search key.

  • The destination address is not a local, broadcast, or multicast address.

The following code shows how fib_select_multipath is called to select the next hop:

#ifdef CONFIG_IP_ROUTE_MULTIPATH
    if (res.fi->fib_nhs > 1 && fl.oif == 0)
        fib_select_multipath(&key, &res);
#endif

We already saw in the section "Next Hop Selection" in Chapter 31 how Linux selects the next hop to use when more than one is available. Let's see now how that algorithm is implemented.

We saw in the section "Organization of Routing Hash Tables" in Chapter 34 that a route is represented by the closely coupled data structures fib_node and fib_info, and that each fib_info includes an array of fib_nh data structures (one for each next hop specified in the route).

First, let's clarify which fields of the fib_info and fib_nh structures are used to decide whether a next hop must be chosen among a pool of available next hops, and if so, which one is chosen.

These are the fields used to store the multipath configuration:


fib_info->fib_nhs

Number of next hops defined by the route.


fib_info->fib_nh

Array of fib_nh structures. The size of the array is given by fib_info->fib_nhs.

The following fields are used to implement the weighted random roundrobin algorithm:


fib_info->fib_power

This is initialized to the sum of the weights (fib_nh->nh_weight) of all the next hops of this fib_info instance, excluding ones that are disabled for some reasons (tagged with the RTNH_DEAD flag). Every time fib_select_multipath is called to select a next hop, fib_power is decremented. Its value is reinitialized when it reaches zero.


fib_nh->nh_weight

Weight of this next hop. When not explicitly configured, it is set to a default value of 1. As we will see, this value is used to make fib_select_multipath select the next hops proportional to their weights (relative to fib_info->fib_power).


fib_nh->nh_power

Tokens allowing this next hop to be selected. This value is first initialized to fib_nh->nh_weight when fib_info->fib_power is initialized. Its value is decremented every time this next hop is selected by fib_select_multipath. When the value reaches zero, this next hop is no longer selected until nh_power is reinitialized to fib_nh->nh_weight (which happens when fib_info->fib_power is reinitialized).

Now we'll look at how the implementation works.

Everything starts with all the next hops having a number of tokens (nh_power) that is the same as their weights. This number, as we've seen, is 1 by default. The change_nexthops loop sets the next hops' nh_power field while accumulating the total weights in the function's local variable, power.

    spin_lock_bh(&fib_multipath_lock);
    if (fi->fib_power <= 0) {
        int power = 0;
        change_nexthops(fi) {
            if (!(nh->nh_flags&RTNH_F_DEAD)) {
                power += nh->nh_weight;
                nh->nh_power = nh->nh_weight;
            }
        } endfor_nexthops(fi);

fib_info->fib_power is initialized to the sum of the next hop's weight. Because it is decremented each time fib_select_multipath makes a decision (in code shown later in this section), each next hop will be selected a number of times equal to its weight by the time fib_power reaches 0. This also ensures that by the time fib_info->fib_power reaches 0, each next hop has been selected a number of times proportional to its weight.

        fi->fib_power = power;
        if (power <= 0) {
            spin_unlock_bh(&fib_multipath_lock);
            res->nh_sel = 0;
            return;
        }
    }

The selection of a next hop by fib_select_multipath is pseudorandom: every time fib_select_multipath is called, it generates a random number w ranging from zero to fib_info->fib_power-1, and then browses all the next hops until it finds one that has a number of tokens (fib_nh->nh_power) greater than or equal to w. Note that w is reduced at each loop, making each loop more likely to find a next hop that matches this condition.

    w = jiffies % fi->fib_power;
    change_nexthops(fi) {
        if (!(nh->nh_flags&RTNH_F_DEAD) && nh->nh_power) {
            if ((w -= nh->nh_power) <= 0) {
                nh->nh_power--;
                fi->fib_power--;
                res->nh_sel = nhsel;
                spin_unlock_bh(&fib_multipath_lock);
                return;
            }
        }
    } endfor_nexthops(fi);
    res->nh_sel = 0;
    spin_unlock_bh(&fib_multipath_lock);

35.9.1. Multipath Caching

Figure 35-4 shows when the fib_select_multipath routine described in the previous section is called for both ingress and egress traffic, as well as how support for multipath caching influences the way the routing cache is populated by ip_mkroute_input and ip_mkroute_output. Let's analyze the ingress and egress cases separately.


Ingress traffic

When the kernel does not have support for multipath caching, ip_mkroute_input calls fib_select_multipath when the conditions listed in the previous sections are met, and selects one next hop according to the logic described earlier.

When the kernel has support for multipath caching, it does not select one next hop with fib_select_multipath. Instead, it loops over all the next hops of the Multipath route and adds an entry to the cache for each one. For each route, it also calls multipath_set_nhinfo, described in the section "Interface Between the Routing Cache and Multipath" in Chapter 33. That function can be used by the caching algorithm to update the local information it uses to select the next hop. For example, the weighted random algorithm uses the function to populate its database of next hops (see the section "Weighted Random Algorithm" in Chapter 33).


Egress traffic

As shown in Figure 35-4, the egress case is pretty similar to the ingress case. The only difference is that even when the kernel supports Multipath caching, fib_select_multipath is called and the latter invocation of ip_mkroute_output overrides the selection made by fib_select_multipath.

In both casesres->nh_sel, that is, the result of the next hop selectionis initialized to the last next hop of the multipath route. For subsequent packets, the selection will be done at cache lookup time. See the section "Multipath Caching" in Chapter 33.


Previous Page
Next Page