Multipath Routing in Linux - part 2


Last time we looked at what multipath routing is, what some potential use cases for it are, and how to set it up. In this post we dig deeper into the multipath routing implementation in the Linux network routing subsystem and focus on the logic behind distributing packet flows between alternative (equal-cost) paths.

Two stacks, two stories

While you might expect multipath routing in Linux to work the same for IPv4 and IPv6, that is not the case. As we saw in the "History dive" in part 1, support for ECMP (Equal-Cost Multipath) routing in the ipv4 and ipv6 stacks [1] was introduced at different times and by different developers. Hence the implementations differ, and this affects how multipath routing can be configured and how it operates.

In the rest of this post I will concentrate solely on the state of Multipath Routing in Linux v4.11. However, because it is interesting to see how the code has evolved over time, here is a timeline recapping the points in kernel development that affected multipath routing:

  • 1997-Nov v2.1.68 Initial support for Multipath Routing for IPv4. Included support for weighted ECMP (which I talk about later). Next-hop selected in a pseudo-random fashion (jiffies value used as a random number).
  • 2012-Sep v3.6 IPv4 routing cache removed, which made Multipath Routing unusable with connection-oriented protocols like TCP as different next-hops could be selected for packets belonging to the same flow. See this discussion at ServerFault.
  • 2013-Feb v3.8 Initial support for Multipath Routing for IPv6. Packets belonging to one flow always routed toward the same next-hop thanks to path selection based on the flow hash. Weighted ECMP is not supported.
  • 2016-Jan v4.4 Multipath routing for IPv4 switched to flow hash-based path selection, making connection-oriented protocols usable with ECMP again.

It is also worth mentioning that between v2.6.12 (2005-Jun) and v2.6.23 (2007-Oct) there was experimental route cache support for Multipath Routing, but it is not relevant to the current (v4.11) implementation since the routing cache has been removed [2].

The ECMP algorithm - mapping packets to next-hops

It is important to keep in mind that the existing Multipath Routing implementation in Linux is designed to distribute flows of packets over multiple paths, not individual packets. Selecting the route on a per-packet basis does not play well with TCP, IP fragments, or Path MTU Discovery.

To associate a packet with a flow, the network stack computes a hash over a subset of packet header fields. The resulting hash value is what drives the next-hop selection. In Linux v4.11 the selection of fields depends on the IP protocol version and on whether we are forwarding the packet or routing locally generated traffic. The fields that serve as input to the hashing function are listed below (a toy hashing sketch follows the list):

  • for forwarded IPv4 packets (L3 hash)

    { Source Address, Destination Address }
    
  • for locally generated IPv4 packets (L4 hash) [3]

    { Source Address, Destination Address, Protocol, Source Port, Destination Port }
    
  • for forwarded IPv6 packets (L3 hash)

    { Source Address, Destination Address, Flow Label, Next Header (protocol) }
    
  • for locally generated IPv6 packets (L4 hash)

    { Source Address, Destination Address, Flow Label, Next Header (protocol), Source Port, Destination Port }
    

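To make this more tangible, here is a toy user-space sketch of such a flow hash for IPv4, covering the L3 and L4 variants. It is purely illustrative: the struct and the mixing function are inventions of this post, whereas the kernel hashes the same kind of key with its own jhash-based helpers. The only point is that identical header fields always produce the same hash, so every packet of a flow ends up on the same next-hop.

/* Toy sketch (not kernel code): hashing an IPv4 flow key. */
#include <stdint.h>
#include <stdio.h>

struct flow_key {
    uint32_t saddr;   /* Source Address */
    uint32_t daddr;   /* Destination Address */
    uint8_t  proto;   /* Protocol (L4 hash only) */
    uint16_t sport;   /* Source Port (L4 hash only) */
    uint16_t dport;   /* Destination Port (L4 hash only) */
};

/* Simple 32-bit mixer standing in for the kernel's jhash. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9e3779b1u;
    return h ^ (h >> 16);
}

static uint32_t flow_hash(const struct flow_key *k, int l4)
{
    uint32_t h = mix(k->saddr, k->daddr);              /* L3 fields */

    if (l4)                                            /* L4 fields on top */
        h = mix(h, ((uint32_t)k->proto << 16) ^
                   (((uint32_t)k->sport << 16) | k->dport));
    return h;
}

int main(void)
{
    struct flow_key k = { 0xc0a80102, 0x02020202, 6, 40000, 80 };

    printf("L3 hash: %08x, L4 hash: %08x\n", flow_hash(&k, 0), flow_hash(&k, 1));
    return 0;
}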
However, with the recently released Linux v4.12 the selection of fields has changed a bit for IPv4 [4]. An L3 hash is now used by default for both forwarded and locally generated traffic, but the user can opt into the L4 hash, in both the forward and the local output path, with a new sysctl - net.ipv4.fib_multipath_hash_policy (0 selects the L3 hash, 1 the L4 hash).

How this flow-hash value is mapped to one of the available next-hops is where the Linux ipv4 and ipv6 stacks differ. The ipv4 stack implements the so-called Hash-Threshold algorithm, which splits the range of possible hash values into adjacent regions, each corresponding to one of the available next-hops. The ipv6 stack, on the other hand, employs the somewhat simpler Modulo-N algorithm, which computes the next-hop index from the hash value by modulo division.

Figure 1. Next-hop mapping with Hash-Threshold vs Modulo-N algorithm

These mapping schemes are implemented by fib_select_multipath() and rt6_multipath_select() routines for ipv4 and ipv6, respectively.
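To illustrate the difference, here is a simplified user-space sketch of the two selection schemes. It is modeled loosely on the kernel routines named above, but the array of upper bounds and the helper names are made up for this post; they are not kernel API.

/* Simplified sketch of the two mapping schemes (not the kernel code). */
#include <limits.h>
#include <stdio.h>

/* Hash-Threshold (ipv4 style): walk the adjacent hash ranges and pick the
 * first next-hop whose upper bound covers the flow hash. */
static int select_hash_threshold(const int *nh_upper_bound, int nhops, int hash)
{
    for (int i = 0; i < nhops; i++)
        if (hash <= nh_upper_bound[i])
            return i;
    return nhops - 1;    /* unreachable: the last bound is always INT_MAX */
}

/* Modulo-N (ipv6 style): next-hop index is the hash modulo the path count. */
static int select_modulo_n(unsigned int hash, int nhops)
{
    return hash % nhops;
}

int main(void)
{
    /* Upper bounds for 3 equal-weight next-hops, as dumped with gdb below. */
    int bounds[3] = { 715827882, 1431655764, INT_MAX };
    unsigned int hash = 1234567890;    /* an example flow hash */

    printf("Hash-Threshold picks next-hop %d\n",
           select_hash_threshold(bounds, 3, (int)(hash & INT_MAX)));
    printf("Modulo-N picks next-hop %d\n", select_modulo_n(hash, 3));
    return 0;
}

Note that the two schemes may well pick different next-hops for the same hash value; what matters is only that each scheme is, on its own, consistent for a given flow.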

With some setup and a bit of pointer-chasing in gdb we can see [5] how the Linux ipv4 stack splits up the hash value range {0..INT_MAX} between the available next-hops. Because the ranges are adjacent, only the upper bound of each range needs to be stored.

# for i in {1..3}; do
>   ip link add test$i type dummy
>   ip link set dev test$i up
>   ip addr add dev test$i 192.168.$i.2/24
> done
# ip route add 2.2.2.2/32 \
>   nexthop via 192.168.1.1 \
>   nexthop via 192.168.2.1 \
>   nexthop via 192.168.3.1
# ip route show 2.2.2.2
2.2.2.2
        nexthop via 192.168.1.1 dev test1 weight 1
        nexthop via 192.168.2.1 dev test2 weight 1
        nexthop via 192.168.3.1 dev test3 weight 1
# gdb /usr/lib/debug/usr/lib/modules/`uname -r`/vmlinux /proc/kcore
...
(gdb) p *(struct fib_info *)fib_info_hash[14].first
$41 = {
  ...
  fib_nhs = 3,
  ...
  fib_nh = 0xffff88003d1d9070
}
(gdb) set $fib_info = (struct fib_info *)fib_info_hash[14].first
(gdb) p $fib_info.fib_nh[0].nh_upper_bound
$32 = {
  counter = 715827882  # <-- INT_MAX/3
}
(gdb) p $fib_info.fib_nh[1].nh_upper_bound
$33 = {
  counter = 1431655764 # <-- INT_MAX/3*2
}
(gdb) p $fib_info.fib_nh[2].nh_upper_bound
$34 = {
  counter = 2147483647 # <-- INT_MAX
}
(gdb)

Both algorithms, Hash-Threshold and Modulo-N, are deterministic, in the sense that the same next-hop is always selected for the same flow, as long as the set of next-hops has not changed in the meantime. This is desirable in practice, as it lets the same client always talk to the same server.

It is also worth noting that the Hash-Threshold and the Modulo-N algorithms differ in how disruptive [6] they are to packet flows when the set of next-hops changes. In this regard, the Modulo-N algorithm used by the ipv6 stack is more disruptive: a larger percentage of potential flows is affected (re-routed) when a next-hop is added to or removed from the set of available paths.
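A quick way to quantify the difference is to count, across the hash range, how many flows would land on a different next-hop after a fourth equal-weight path is appended to a set of three. The standalone experiment below (using the simplified equal-split model from the earlier sketch, not the kernel code) reports roughly 3/4 of the flows re-routed under Modulo-N versus about half under Hash-Threshold.

/* Standalone experiment: fraction of flows re-routed when a 4th
 * equal-weight next-hop joins a set of 3. Not kernel code. */
#include <limits.h>
#include <stdio.h>

/* Hash-Threshold with equal regions: index of the region within {0..INT_MAX}
 * that the hash falls into. */
static int hash_threshold(unsigned int hash, int nhops)
{
    return (int)(((unsigned long long)(hash & INT_MAX) * nhops) >> 31);
}

static int modulo_n(unsigned int hash, int nhops)
{
    return hash % nhops;
}

int main(void)
{
    unsigned long long moved_thr = 0, moved_mod = 0, total = 0;

    /* Sample the hash space with a stride instead of walking all 2^31 values. */
    for (unsigned int hash = 0; hash < (unsigned int)INT_MAX; hash += 101) {
        total++;
        if (hash_threshold(hash, 3) != hash_threshold(hash, 4))
            moved_thr++;
        if (modulo_n(hash, 3) != modulo_n(hash, 4))
            moved_mod++;
    }
    /* Expect ~50% for Hash-Threshold and ~75% for Modulo-N. */
    printf("Hash-Threshold: %.0f%% of flows re-routed\n", 100.0 * moved_thr / total);
    printf("Modulo-N:       %.0f%% of flows re-routed\n", 100.0 * moved_mod / total);
    return 0;
}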

The algorithms are described and analyzed in more detail in:

  • RFC 2991 - Multipath Issues in Unicast and Multicast Next-Hop Selection
  • RFC 2992 - Analysis of an Equal-Cost Multi-Path Algorithm

Weighted ECMP - promoting selected next-hops

Weighted Equal-Cost Multi-Path routing is an extension to the concept of multipath routing where, in addition, every next-hop has an associated "weight". The weight determines how big a fraction of all possible flows will be routed toward that next-hop.

Currently (as of Linux v4.12) only the ipv4 stack respects the next-hop weight and steers more flows toward a path with a higher weight. Digging into the code, we can see the weight (stored in nh_weight in struct fib_nh) being taken into account by the fib_rebalance() routine, which computes the per-next-hop upper bound of the hash range (nh_upper_bound) that we looked at earlier.
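The arithmetic is simple enough to redo by hand. Below is a stripped-down re-implementation of the idea behind fib_rebalance() (a sketch, not a copy of the kernel code): sum up the weights, then give each next-hop a slice of the {0..INT_MAX} hash range proportional to its weight, storing only the cumulative upper bound. With the 10:5 weights used in the example that follows, it reproduces the nh_upper_bound values we will shortly see in gdb.

/* Sketch of turning next-hop weights into adjacent hash ranges
 * (the idea behind fib_rebalance(), not the kernel code itself). */
#include <limits.h>
#include <stdio.h>

int main(void)
{
    int weight[] = { 10, 5 };    /* weights from the ip route example below */
    int nhops = 2;
    long long total = 0, running = 0;

    for (int i = 0; i < nhops; i++)
        total += weight[i];

    for (int i = 0; i < nhops; i++) {
        running += weight[i];
        /* Cumulative share of {0..INT_MAX}, rounded to the nearest value. */
        long long upper_bound =
            (running * ((long long)INT_MAX + 1) + total / 2) / total - 1;
        printf("next-hop %d: weight %2d, upper bound %lld\n",
               i, weight[i], upper_bound);
    }
    return 0;
}

Running it prints an upper bound of 1431655764 (about 2/3 of INT_MAX) for the first next-hop and 2147483647 (INT_MAX) for the second.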

To tell the kernel that we want the next-hops to have different weights, we need to pass the weight parameter to the ip route command. The specified value will be handed to the kernel in the rtnh_hops field of the next-hop data structure (struct rtnexthop).

For example, if we want 2/3 (or 10/15) of all possible flows to be routed toward the 192.168.1.1 next-hop and the remaining 1/3 (or 5/15) toward 192.168.2.1, we can set up a multipath route with:

# ip route add 2.2.2.2/32 \
> nexthop via 192.168.1.1 weight 10 \
> nexthop via 192.168.2.1 weight 5
# ip route show 2.2.2.2
2.2.2.2
        nexthop via 192.168.1.1 dev dum1 weight 10
        nexthop via 192.168.2.1 dev dum2 weight 5

Just as before, we can take a look at the fib_info structure to see that the hash value ranges mapped to the next-hops have been set up as expected:

192.168.1.1 → {0..INT_MAX*2/3}
192.168.2.1 → {INT_MAX*2/3+1..INT_MAX}

Digging in again with gdb we see:

(gdb) set $ip = $fib_info->fib_nh[0].nh_gw
(gdb) printf "%d.%d.%d.%d\n", ($ip & 0xff), ($ip >> 8) & 0xff, ($ip >> 16) & 0xff, ($ip >> 24)
192.168.1.1
(gdb) p $fib_info->fib_nh[0].nh_weight
$14 = 10
(gdb) p $fib_info->fib_nh[0].nh_upper_bound.counter
$15 = 1431655764 # <-- INT_MAX/3*2

(gdb) set $ip = $fib_info->fib_nh[1].nh_gw
(gdb) printf "%d.%d.%d.%d\n", ($ip & 0xff), ($ip >> 8) & 0xff, ($ip >> 16) & 0xff, ($ip >> 24)
192.168.2.1
(gdb) p $fib_info->fib_nh[1].nh_weight
$16 = 5
(gdb) p $fib_info->fib_nh[1].nh_upper_bound.counter
$17 = 2147483647 # <-- INT_MAX

The Linux ipv6 stack, however, ignores the next-hop weight passed from user-space in rtnh_hops. We can confirm this by looking at inet6_rtm_newroute(), which is invoked to process the RTM_NEWROUTE message sent to create a new routing entry.

What inet6_rtm_newroute() does is call a helper routine, rtm_to_fib6_config(), to extract the pointer to the first struct rtnexthop from the message and store it in the fc_mp field of struct fib6_config. The stored list of rtnexthop entries is later processed by ip6_route_multipath_add(), which never accesses the rtnh_hops field.
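For reference, struct rtnexthop is a small fixed header defined in the uapi (linux/rtnetlink.h), optionally followed by nested attributes such as the gateway address. The sketch below walks such a list from user space with the uapi macros, just to show that rtnh_hops sits right there in every entry; the ipv6 multipath code walks the same kind of list but never reads that field. The two entries are made up for illustration and carry no nested attributes.

/* Walking a list of struct rtnexthop entries with the uapi helpers.
 * rtnh_hops is the field that carries the next-hop weight to the kernel. */
#include <stdio.h>
#include <linux/rtnetlink.h>

static void dump_nexthops(struct rtnexthop *rtnh, int len)
{
    while (len >= (int)sizeof(*rtnh) && RTNH_OK(rtnh, len)) {
        printf("ifindex %d, flags %d, hops (weight) %d\n",
               rtnh->rtnh_ifindex, rtnh->rtnh_flags, rtnh->rtnh_hops);
        len -= RTNH_ALIGN(rtnh->rtnh_len);
        rtnh = RTNH_NEXT(rtnh);
    }
}

int main(void)
{
    /* Two bare entries with illustrative weights, mimicking the route above. */
    struct rtnexthop nh[2] = {
        { .rtnh_len = sizeof(struct rtnexthop), .rtnh_ifindex = 2, .rtnh_hops = 10 },
        { .rtnh_len = sizeof(struct rtnexthop), .rtnh_ifindex = 3, .rtnh_hops = 5 },
    };

    dump_nexthops(nh, (int)sizeof(nh));
    return 0;
}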

Wrap up

In this post we have explored how the Multipath Routing implementation has evolved over time in the Linux ipv4 and ipv6 stacks. We have also seen how the two stacks differ in which ECMP algorithm they use to map flows to next-hops. Lastly, we have looked at a weighted ECMP setup, where some next-hops are favored over others, and discovered that it can only be used with IPv4.

If you are running Fedora, you can quickly try out all the commands using the awesome virtme tool. Make sure you have the virtme and kernel-debuginfo packages installed, then simply run:

$ virtme-run --installed-kernel -a nokaslr --qemu-opts -m 1G
...
virtme-init: console is ttyS0
bash-4.3# for i in 1 2; do
>         ip link add dum$i type dummy
>         ip link set dev dum$i up
>         ip addr add dev dum$i 192.168.$i.2/24
> done
bash-4.3# ip route add 2.2.2.2/32 \
>    nexthop via 192.168.1.1 weight 10 \
>    nexthop via 192.168.2.1 weight 5
bash-4.3# ip route show 2.2.2.2
2.2.2.2
        nexthop via 192.168.1.1 dev dum1 weight 10
        nexthop via 192.168.2.1 dev dum2 weight 5
bash-4.3# gdb /usr/lib/debug/lib/modules/`uname -r`/vmlinux /proc/kcore
GNU gdb (GDB) Fedora 7.12.1-48.fc25
...
(gdb)

Footnotes

[1]From here on I will refer to the IPv4 network stack implementation in Linux that lives under net/ipv4/ as ipv4 (all lower case) to distinguish it from the IPv4 protocol. Same distinction applies to the IPv6 protocol and the ipv6 stack in Linux (net/ipv6/).
[2]I stumbled upon a description of the Multipath Routing cache in Understanding Linux Network Internals; see the "Cache Support for Multipath" section if you would like to know more about it.
[3]since commit 9920e48b830a ("ipv4: use l4 hash for locally generated multipath flows")
[4]see commit bf4e0a3db97e ("net: ipv4: add support for ECMP hash policy choice")
[5]If all values in gdb show up as zeros, you are likely running a kernel with KASLR enabled. Confirm by checking whether the CONFIG_RANDOMIZE_BASE build option was set. You can disable KASLR by passing the nokaslr boot option to the kernel.
[6]In this context, a flow is considered disrupted when the network path it takes changes.