35.9. Effects of Multipath on Next Hop SelectionIn both ip_route_input_slow and ip_route_output_slow, fib_select_multipath is called only when:
The following code shows how fib_select_multipath is called to select the next hop: #ifdef CONFIG_IP_ROUTE_MULTIPATH if (res.fi->fib_nhs > 1 && fl.oif == 0) fib_select_multipath(&key, &res); #endif We already saw in the section "Next Hop Selection" in Chapter 31 how Linux selects the next hop to use when more than one is available. Let's see now how that algorithm is implemented. We saw in the section "Organization of Routing Hash Tables" in Chapter 34 that a route is represented by the closely coupled data structures fib_node and fib_info, and that each fib_info includes an array of fib_nh data structures (one for each next hop specified in the route). First, let's clarify which fields of the fib_info and fib_nh structures are used to decide whether a next hop must be chosen among a pool of available next hops, and if so, which one is chosen. These are the fields used to store the multipath configuration:
The following fields are used to implement the weighted random roundrobin algorithm:
Now we'll look at how the implementation works. Everything starts with all the next hops having a number of tokens (nh_power) that is the same as their weights. This number, as we've seen, is 1 by default. The change_nexthops loop sets the next hops' nh_power field while accumulating the total weights in the function's local variable, power. spin_lock_bh(&fib_multipath_lock); if (fi->fib_power <= 0) { int power = 0; change_nexthops(fi) { if (!(nh->nh_flags&RTNH_F_DEAD)) { power += nh->nh_weight; nh->nh_power = nh->nh_weight; } } endfor_nexthops(fi); fib_info->fib_power is initialized to the sum of the next hop's weight. Because it is decremented each time fib_select_multipath makes a decision (in code shown later in this section), each next hop will be selected a number of times equal to its weight by the time fib_power reaches 0. This also ensures that by the time fib_info->fib_power reaches 0, each next hop has been selected a number of times proportional to its weight. fi->fib_power = power; if (power <= 0) { spin_unlock_bh(&fib_multipath_lock); res->nh_sel = 0; return; } } The selection of a next hop by fib_select_multipath is pseudorandom: every time fib_select_multipath is called, it generates a random number w ranging from zero to fib_info->fib_power-1, and then browses all the next hops until it finds one that has a number of tokens (fib_nh->nh_power) greater than or equal to w. Note that w is reduced at each loop, making each loop more likely to find a next hop that matches this condition. w = jiffies % fi->fib_power; change_nexthops(fi) { if (!(nh->nh_flags&RTNH_F_DEAD) && nh->nh_power) { if ((w -= nh->nh_power) <= 0) { nh->nh_power--; fi->fib_power--; res->nh_sel = nhsel; spin_unlock_bh(&fib_multipath_lock); return; } } } endfor_nexthops(fi); res->nh_sel = 0; spin_unlock_bh(&fib_multipath_lock); 35.9.1. Multipath CachingFigure 35-4 shows when the fib_select_multipath routine described in the previous section is called for both ingress and egress traffic, as well as how support for multipath caching influences the way the routing cache is populated by ip_mkroute_input and ip_mkroute_output. Let's analyze the ingress and egress cases separately.
In both casesres->nh_sel, that is, the result of the next hop selectionis initialized to the last next hop of the multipath route. For subsequent packets, the selection will be done at cache lookup time. See the section "Multipath Caching" in Chapter 33. ![]() |