Multipath Routing (ECMP) in Linux - part 3

In this post we will take a look at how ECMP routing can cause problems with TCP, and at some recent developments in multipath routing support in the Linux kernel.

This post concludes the three-part series on multipath routing (ECMP) in Linux. Before diving in, you might want to read part 1 and part 2.

Routing ICMP errors with the flow they belong to

ICMP messages are important for diagnosing and troubleshooting problems with network setups using tools like ping or traceroute. They also enable the Path MTU Discovery (PMTUD) mechanism, where ICMP Packet Too Big (PTB) [1] messages serve as a signal that an oversized IP datagram has been dropped. This feedback is used by TCP to adjust the Maximum Segment Size (MSS) so that the segments can get through without resorting to fragmentation.

In a setup where clients talk to servers that sit behind an ECMP router [2], it is important that an ICMP PTB error message triggered by an oversized reply from a server gets routed back to the server that sent the reply. Otherwise, the ECMP router creates a so-called PMTUD black hole, which causes TCP connections to hang: if the ICMP PTB error gets misrouted, the server that sent the reply never learns that delivery failed, and the client never gets a response.

Misrouting ICMPv6 errors creates a PMTU black hole.

This problem is outlined in the informational RFC 7690 and in this blog post from Cloudflare.

Starting with Linux v4.4 (and RHEL 7.3, kernel-3.10.0-514.el7), ECMP routing for IPv4 supports anycast setups. The forwarding logic in the IPv4 stack inspects the ICMP error message (see ip_multipath_l3_keys()) and bases its routing decision on the headers of the offending IP datagram that triggered the error, which is embedded in the ICMP message.
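To make the idea concrete, here is a minimal sketch of the key selection, not the kernel's actual code. The l3_keys helper, its arguments, and the documentation-range addresses are all made up for illustration, and treating the address pair as unordered (so that both directions of a flow yield the same key) is a simplifying assumption of this model.

```shell
# l3_keys KIND OUTER_SRC OUTER_DST [INNER_SRC INNER_DST]
# Hypothetical helper: prints the address pair that a simplified
# multipath hash would be computed over.
l3_keys() {
    local kind=$1 src=$2 dst=$3
    if [ "$kind" = icmp-error ]; then
        # For ICMP errors, use the addresses of the embedded (offending)
        # datagram instead of those of the ICMP packet itself.
        src=$4
        dst=$5
    fi
    # Assumption of this sketch: order the pair so that both directions
    # of a flow produce the same key.
    if [ "$src" \> "$dst" ]; then
        echo "$dst $src"
    else
        echo "$src $dst"
    fi
}
```

With a client at 192.0.2.10 talking to a service at 198.51.100.1, `l3_keys data 192.0.2.10 198.51.100.1` and `l3_keys icmp-error 203.0.113.5 198.51.100.1 198.51.100.1 192.0.2.10` yield the same pair, so an error generated by some intermediate router (203.0.113.5 here) would follow the same path as the client's packets. Keying on the outer header of the ICMP packet instead would give a different pair.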

Until recently, the Linux IPv6 stack lagged behind in this regard: ICMP errors could be misrouted when multipath routing was in use, creating the PMTU black hole described above.

This is no longer the case in Linux v4.14 and later, where ICMPv6 error messages get routed as expected. Like the IPv4 stack, the IPv6 routing code now bases its decision on the header fields of the packet carried as the ICMPv6 error message payload (see ip6_multipath_l3_keys()).

However, this alone is not enough in the case of IPv6. The endpoints (servers) behind the ECMP router must also be configured to use IPv6 Flow Label reflection, because the Flow Label affects path selection when ECMP is used. As proposed by RFC 6438, the Flow Label field is one of the header fields used as input to the multipath hash, which lets the flows between two hosts be spread across multiple paths without inspecting L4 headers. With reflection enabled, a server's replies carry the same Flow Label as the client's packets, so the hash computed from a reply embedded in an ICMPv6 error matches the hash of the flow it belongs to.

Under Linux, Flow Label reflection can be enabled per network namespace by setting a kernel parameter:

sysctl -q -w net.ipv6.flowlabel_reflect=1
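Because the setting is per network namespace, it has to be applied inside each namespace that hosts a server. A short sketch, assuming namespaces named server1 and server2 (as in the test setup later in this post), root privileges, and a v4.14 or newer kernel:

```shell
# Enable Flow Label reflection in each server namespace. The router and
# client namespaces are left untouched.
for ns in server1 server2; do
    ip netns exec "$ns" sysctl -q -w net.ipv6.flowlabel_reflect=1
done

# Read the value back to confirm that the setting took effect.
ip netns exec server1 sysctl -n net.ipv6.flowlabel_reflect
```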

Patches that enable ICMPv6 error routing with ECMP have been contributed by yours truly (Red Hat):

  1. 22b6722bfa59 ipv6: Add sysctl for per namespace flow label reflection
  2. 29825717123f net: Extend struct flowi6 with multipath hash
  3. 23aebdacb05d ipv6: Compute multipath hash for ICMP errors from offending packet
  4. 956b45318a27 ipv6: Fold rt6_info_hash_nhsfn() into its only caller
  5. b673d6cceae2 ipv6: Use multipath hash from flow info if available

Weighted ECMP with IPv6

With the release of the still-fresh v4.16 Linux kernel, the IPv6 stack has gained support for weighted ECMP routing (also known as non-equal-cost multipath routing). This is an upgrade since we last looked at it in part 2 of this series, and it puts multipath routing support for IPv6 on par with IPv4.

The next-hop mapping algorithm has been switched from Modulo-N to Hash-Threshold, which makes it easy to support next-hop weights: the range of hash values that maps to each next hop is scaled in proportion to its weight.

On the code side, the rt6_info structure (which corresponds to fib_info in the IPv4 stack) has gained an rt6i_nh_upper_bound field that tracks the upper bound of the hash value range assigned to each next hop, while rt6_multipath_select() has been updated to work with these ranges.

We can test it with a simple 4-namespace setup that simulates a client issuing requests to a couple of web servers. An ECMP router sitting in between spreads the requests across the servers in proportion to the next-hop weights.
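The Hash-Threshold idea can be sketched in a few lines of shell. This is an illustrative model rather than the kernel code: the upper_bounds and select_nexthop helpers are hypothetical names, and the rounding used to split the 31-bit hash space is an assumption of the sketch.

```shell
# upper_bounds WEIGHT...
# Prints, for each next hop, the inclusive upper bound of its slice of the
# 31-bit hash space. Slice sizes are proportional to the weights.
upper_bounds() {
    local total=0 cum=0 w
    for w in "$@"; do
        total=$((total + w))
    done
    for w in "$@"; do
        cum=$((cum + w))
        # Cumulative weight scaled to 2^31, rounded to nearest, minus one.
        echo $(( ((cum << 31) + total / 2) / total - 1 ))
    done
}

# select_nexthop HASH WEIGHT...
# Prints the index of the first next hop whose upper bound is >= HASH.
select_nexthop() {
    local hash=$1 i=0 bound
    shift
    for bound in $(upper_bounds "$@"); do
        if [ "$hash" -le "$bound" ]; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}
```

For weights 1 and 2, `upper_bounds 1 2` prints 715827882 and 2147483647, so the first next hop owns about a third of the hash space and the second owns the rest. A hash of 715827883 therefore maps to index 1. Changing a weight only moves the thresholds, which is what makes this scheme easier to extend than Modulo-N.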


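# On the router: one multipath route to the service address, with a
# weight attached to each next hop.
ip netns exec router ip -6 route add fc00::1/128 \
  nexthop via fd01::2 dev router-server1 weight "$SERVER1_WEIGHT" \
  nexthop via fd02::2 dev router-server2 weight "$SERVER2_WEIGHT"

# Each server answers every connection with its own name.
ip netns exec server1 \
  socat -6 TCP-LISTEN:80,crlf,reuseaddr,fork SYSTEM:"echo server1" &
ip netns exec server2 \
  socat -6 TCP-LISTEN:80,crlf,reuseaddr,fork SYSTEM:"echo server2" &

# Issue the requests from the client and count the replies per server.
for ((i = 0; i < NUM_REQUESTS; i++)); do
    ip netns exec client curl -s 'http://[fc00::1]'
done | sort | uniq -c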

See for a complete script that sets up and runs the test. The expected result here is that server2 responds to roughly twice as many requests as server1 on a v4.16 or newer kernel:

# uname -r
# ./
* Creating namespaces
* Linking namespaces
* Assigning addresses
* Configuring routes
* Bringing up HTTP servers
* Testing request load balancing
     36 server1
     64 server2
* Killing HTTP servers
* Destroying namespaces
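As a sanity check on those numbers, the ideal split can be computed from the weights. The sketch below assumes 100 requests and weights of 1 and 2, which are guesses consistent with the output above rather than values taken from the script. The expected_counts helper is a hypothetical name.

```shell
# expected_counts NUM_REQUESTS WEIGHT...
# Prints the ideal number of requests per next hop, rounded to nearest.
expected_counts() {
    local n=$1 total=0 w
    shift
    for w in "$@"; do
        total=$((total + w))
    done
    for w in "$@"; do
        echo $(( (n * w + total / 2) / total ))
    done
}
```

`expected_counts 100 1 2` prints 33 and 67. The observed 36 and 64 are close to that. Some deviation is expected, because each connection is placed according to its hash, not in strict rotation.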

Patches for weighted ECMPv6 support have been contributed by Ido Schimmel (Mellanox):

  1. d7dedee184e7 ipv6: Calculate hash thresholds for IPv6 nexthops
  2. 7696c06a189c ipv6: Use a 31-bit multipath hash
  3. 3d709f69a3e7 ipv6: Use hash-threshold instead of modulo-N
  4. 398958ae48f4 ipv6: Add support for non-equal-cost multipath


[1] Here, ICMP Packet Too Big refers collectively to the ICMPv4 Destination Unreachable, Fragmentation Required message as well as to the ICMPv6 Packet Too Big message.
[2] Such setups are sometimes referred to as anycast environments.