31.2. Concepts Behind Multipath Routing
Multipath is a feature that allows an administrator to specify multiple next hops for a given route's destination. In environments with substantial requirements, there are several reasons for doing this. A router could just use one ISP most of the time, and switch to the other when the first one fails for some reason. Another application of multipath is to keep a path on standby and enable it only when bandwidth requirements surpass a predefined threshold.
Figure 31-3 shows a topology where the network on the left is connected to the Internet via router RT, which is configured to use two uplinks simultaneously via two different ISPs.
Figure 31-3. Example of topology with multipath
Let's suppose we want to have RT use both RT1 and RT2 as default gateways, keeping them always available. On RT, we could define a multipath route, simply by providing the route with more than one next hop. The following user-space command, using the newer IPROUTE2 package, would enable multipath:
ip route add default scope global nexthop via 100.100.100.1 weight 1 nexthop via 184.108.40.206 weight 2
Note that even if the route includes multiple next hops, the route is still considered a single route. Therefore, given a route (in our example, the default route 0.0.0.0/0) with more than one next hop, the kernel needs a mechanism to select the next hop to use each time the route matches a route lookup. There are different ways to do that, each one with its pros and cons. For an interesting analysis of the most common algorithms for multipath routing, I suggest you read RFCs 2991 and 2992.
Linux provides flexibility among algorithms by allowing the administrator to assign each next hop a weight with the weight keyword. The number of times a next hop is selected is proportional to its weight in relation to all the other next hops. If all the next hops are assigned the same weight, the algorithm falls back to the so-called equal cost multipath algorithm.
Note, however, that the granularity used to distribute traffic among the next hops is measured not in packets, but in the number of routing cache entries. This is because once a next hop is selected, an entry is added to the cache. Because the routing subsystem always consults the cache before invoking any check on routing tables, subsequent packets belonging to the same traffic flow (aggregate of traffic) will be handled straight from the cache. As explained in Chapter 36, a flow is a collection of packets that match a set of criteria. These consist mainly of the source or destination addresses, the ingress or egress devices, and the IP TOS field. You will see in the section "Per-Flow, Per-Connection, and Per-Packet distribution" that when Multipath support for the cache is enabled, traffic can also be distributed on a per-connection basis instead of on a per-flow basis.
From purely a throughput point of view, this granularity may be suboptimal, because different flows may have very different bandwidth requirements, and therefore the kernel may be unfair even when all of the next hops are configured with the same weightand what is worse, the unfairness would not be deterministic. So Linux provides an option that allows you to use per-packet rather than per-flow granularity (see the section "Equalizer algorithm"). However, in most cases, given the high number of flows that usually traverse a router, the next hops are likely to get, on average, a load that is proportional to their weights.
31.2.1. Next Hop Selection
The selection of the next hop is based on a weighted round-robin algorithm.
We saw in the previous section a sample user-space command that specified a weight for each next hop. Usually, an administrator assigns a weight to each path to indicate whether it is preferred. That is the weight used by the round-robin algorithm. The method used to define the weight is an administrative issue based on criteria such as bandwidth and cost, so I will not go into detail about it.
The easiest way to select the next hops, proportionally to their weights, would be to simply have each one consume its tokens one by one and then restart. For instance, if we had two next hops with weights of 3 and 5, respectively, we could select the first one three times, the second one five times, and then again the first one three times, etc. But the distribution of traffic with this approach could be too bursty.
Therefore, Linux adds a randomness component to the selection of the next hop. Given the weight Wi for the ith next hop, and given the sum W of all the next hops, Linux selects a next hop randomly W times, and each next hop is selected a number of times equal to its weight Wi. The randomness introduced is not too accurate, but it is an acceptable approximation, and it falls back to a simple sequential selection (from first to last next hop) when all the next hops are assigned the same weight 1.
Here is how it is implemented. The kernel defines the round-robin budget as the sum of all the next hop weights. The budget (number of tokens) of each next hop is initialized to the value of its weight. At each round, the kernel generates a random value ranging from 0 to the total round-robin budget. Then it browses the next-hop list until it finds one with a budget greater than or equal to the generated random value. After each next-hop selection, it decrements both the round-robin budget and the selected next hop's budget.
Note that it is possible for none of the next hops to match on the first round. Imagine a case with three next hops whose weights are 1, 2, and 3. The total budget would be 6. Valid random values are the ones in the range 0 to 5. However, values 4 and 5 would not select any next hop because none has a budget that big. When this happens, the kernel subtracts the weight of each nonmatching next hop from the total budget and checks again.
Let's continue our example to show how this works. Suppose our random number was 5. We start browsing the list of next hops. The first one has a budget of 1, which is not sufficient. We therefore do not select it, and reduce our requirement from the following next hop by lowering the random value to 5-1, or 4. The following next hop has a budget of 2, which again is not sufficient. So we lower the random value again to 4-2, or 2. The last next hop has a budget of 3, which is greater than or equal to 2 and therefore is selected. This, by the way, is the worst case in terms of performance: the last of the next hops is the one selected.
Next hops of a multipath route can be temporarily unavailable (see the section "Effects of Multipath on Next Hop Selection" in Chapter 35). These, of course, will not be taken into consideration by the next-hop selection algorithm.
31.2.2. Cache Support for Multipath
By default, the routing cache does not support multipath. Therefore, as we saw in the section "Concepts Behind Multipath Routing," once the algorithm in the section "Next Hop Selection" has chosen one of the next hops, it will be used for all subsequent traffic matching the same lookup key because a route is added to the cache with a reference to that next hop.
Starting with version 2.6.12, the Linux kernel comes with an option that allows the user to enable multipath support for the cache, and also allows the system administrator to select what algorithm to use to distribute traffic between the different next hops specified by a given multipath route.
Here are the available algorithms:
When you configure a route with IPROUTE2's ip route command, you can use the new mpath keyword to select the algorithm to use. This is an example of a route configured to use the round-robin algorithm:
ip route add 10.0.1.0/24 mpath rr nexthop via 192.168.1.1 weight 1 nexthop via 192.168.1.2 weight 2
When the mpath keyword is not provided, multipath caching is kept disabled on the route.
The weighted random and device round-robin algorithms are described in more detail in the next subsections.
220.127.116.11. Weighted random algorithm
Assume we have a multipath route with four next hops, assigned the weights 1, 1, 2, and 4. Let's align the four weights along a line as shown in Figure 31-4. The sum of the weights is 8, so if you generate a random number in the range 0-8 you can unequivocally identify a next hop in the line. For example, the value 2.8 would select the third-next hop. It should be clear that the next hops are selected proportionally to their weights.
Figure 31-4. Example of weighted random selection
18.104.22.168. Device round-robin algorithm
The next hops of a multipath route can be reachable through a single device, each can be reachable through a different device, or you can have a hybrid situation. These three cases are shown in Figure 31-5.
A pure round-robin algorithm would distribute traffic equally to the various next hops, but not necessarily equally to the various devices associated with those next hops. For example, a multipath route with three next hops, two of which share the same egress device, as in Figure 31-5(c), would load one of the two egress devices twice as much as the other.
Thus, the goal of the device round-robin algorithm is to distribute traffic equally among a pool of devices, instead of on a per-multipath-route basis. All traffic that matches any route configured to use this algorithm is considered a single aggregate of traffic to distribute equally among devices. Therefore, the decision concerning which device to use for a given multipath route depends not only on the devices previously used to route traffic with the same multipath route, but also on the devices used by other multipath routes.
Note that while a pure round-robin algorithm assumes that the bottleneck in the forwarding path is the target routers' CPUs, device round robin aims at optimizing the use of the devices' bandwidths, giving less importance to the target CPUs.
Figure 31-5. Different ways to assign next hops to interfaces
31.2.3. Per-Flow, Per-Connection, and Per-Packet Distribution
Given a multipath route, traffic matching the route could be distributed between the next hops on a per-flow, per-connection, or per-packet basis:
Per-connection and per-flow distribution are needed for connection-oriented protocols such as TCP to work correctly, but per-packet distribution could work well with connectionless protocols such as the User Datagram Protocol (UDP).
When there is no support for Multipath caching, Linux always distributes traffic between the different next hops of a Multipath route on a per-flow basis proportionally to the weights of the flows, as seen in the section "Concepts Behind Multipath Routing."
When multipath caching is enabled, traffic is distributed differently depending on where it originates:
22.214.171.124. Equalizer algorithm
The Linux kernel at times has offered per-packet distribution, called equalization. Here is an example of a command (not implemented in the current Linux kernel) that asks for an equalized route through the eql option:
# ip route add eql 100.100.100.0/24 nexthop via 10.0.0.2 nexthop via 10.0.0.3 # ip route list 100.100.100.0/24 equalize nexthop via 10.0.0.2 dev eth0 weight 1 nexthop via 10.0.0.3 dev eth0 weight 1 ...
Given how much time has passed since support for this option was announced as being in the works, it is not likely to be added anytime soon, probably because there is no need for it.