Multipath Routing in Linux - part 1

This blog post takes a look at the current state of the multipath routing mechanism, also known as Equal-Cost Multipath (ECMP) routing, in the Linux v4.11 network stack. If you're running a different kernel, don't worry: jump to the History dive digression at the end to find out when this functionality was introduced.

With multipath routing you can distribute traffic destined to a single network over several paths (routes). It extends the conventional routing table, where there is just one network→next hop association (or network→interface association, or both). Instead, as we will see next, you can specify multiple next hops for one destination [1].

What are the use cases? Multipath routing can be used to statelessly load-balance flows at Layer 3, as described in RFC 7690 and shown in the figure below.

Figure 1. Stateless load balancing of TCP/UDP flows with multipath routing (ECMP) using anycast scheme (RFC 7690)

Or it could provide improved resiliency to failure by having redundant routes to the end hosts as pictured below.

Figure 2. Redundant routes to the destination using multipath routing (ECMP) (commit a6db4494d218 "net: ipv4: Consider failed nexthops in multipath routes")

How the packet flows get steered when there is more than one path to choose from depends on the route configuration and the details of the implementation in the kernel. Let's take a look at both and highlight some existing differences between the IPv4 and IPv6 stacks in the Linux kernel.

Setting it up

The ip route command [2] lets you define a multipath route with a nexthop keyword. For example, to create an IPv6 route to 1001::/64 subnet with two equally-distant next hops, fc00::1 and fc01::1, run:

ip route add 1001::/64 \
   nexthop via fc00::1 \
   nexthop via fc01::1

For an IPv4 example, and a demonstration of how to set next hop preference with weights [3], refer to the excellent iproute2 cheat sheet.
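As a quick sketch (the addresses are purely illustrative, taken from documentation ranges), an IPv4 multipath route where the first next hop should receive roughly twice as many flows could look like this:

```shell
# Hypothetical IPv4 multipath route with a 2:1 weight split
ip route add 192.0.2.0/24 \
   nexthop via 198.51.100.1 weight 2 \
   nexthop via 203.0.113.1 weight 1
```

The weight value sets the relative share of flows steered through each next hop.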

Behind the scenes, when you create a multipath route, the ip command uses a rtnetlink [4] socket to send an RTM_NEWROUTE message to the kernel. This message contains an RTA_MULTIPATH attribute [5] which is an array of struct rtnexthop records, each one corresponding to one next hop:

struct rtnexthop {
      unsigned short          rtnh_len;
      unsigned char           rtnh_flags;
      unsigned char           rtnh_hops;
      int                     rtnh_ifindex;
};

Here is the layout of the complete RTM_NEWROUTE message from our example above with an accompanying hex dump as it gets generated by the ip route command [6]:

Figure 3. Layout of a sample RTM_NEWROUTE message for creating a multipath route specified with the RTA_MULTIPATH attribute.
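To make the layout concrete, here is a minimal Python sketch (illustrative only, not how ip itself builds the message) that packs the RTA_MULTIPATH attribute for the two next hops from our IPv6 example. The attribute-type constants mirror their values in linux/rtnetlink.h:

```python
import socket
import struct

# Attribute type values from linux/rtnetlink.h
RTA_GATEWAY = 5
RTA_MULTIPATH = 9

def rta(rta_type, payload):
    # struct rtattr header: rta_len (u16), rta_type (u16),
    # followed by the payload padded to a 4-byte boundary.
    pad = (-len(payload)) % 4
    return struct.pack("HH", 4 + len(payload), rta_type) + payload + b"\x00" * pad

def rtnexthop(gateway, ifindex=0, flags=0, hops=0):
    # struct rtnexthop header (8 bytes) followed by a nested
    # RTA_GATEWAY attribute carrying the next-hop address.
    nested = rta(RTA_GATEWAY, socket.inet_pton(socket.AF_INET6, gateway))
    # rtnh_len covers the fixed header plus the nested attributes.
    return struct.pack("HBBi", 8 + len(nested), flags, hops, ifindex) + nested

multipath = rta(RTA_MULTIPATH, rtnexthop("fc00::1") + rtnexthop("fc01::1"))
print(len(multipath))  # 60 bytes: 4 (rtattr) + 2 * (8 + 4 + 16)
```

Each struct rtnexthop record here occupies 28 bytes (the 8-byte fixed header plus a 20-byte nested RTA_GATEWAY attribute), so the whole RTA_MULTIPATH attribute comes to 60 bytes.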

This message is passed to the kernel-space handler for RTM_NEWROUTE messages (inet_rtm_newroute() or inet6_rtm_newroute()), where it is parsed, validated, and transformed into an intermediate form (struct fib_config or struct fib6_config). From it the actual routing table entry (or entries, in the case of the IPv6 stack) is built and inserted into the routing table. These tasks translate into the following data flow through the call chain on the kernel side for IPv4 routes:

RTM_NEWROUTE → struct fib_config → struct fib_info → struct fib_alias


And a slightly different one for IPv6 routes:

RTM_NEWROUTE → struct fib6_config → struct rt6_info,
                                    struct rt6_info,
                                    … (one entry per next hop)


History dive

Hash-based multipath routing (where we distribute flows of packets, in contrast to individual packets) has been available in the Linux kernel since v4.4 for IPv4 [7] and since v3.8 for IPv6 [8]. In this form multipath routing has also been backported and is available in Red Hat Enterprise Linux 7.3 (RHEL kernel 3.10.0-345.el7 or newer).

The first ever support for multipath routing (IPv4 only), where individual packets were distributed among the alternative paths in a random fashion, was added to the Linux kernel almost 20 years ago, in v2.1.68.

In this post we have looked at the potential use cases for multipath routing and also touched on the configuration process. Next time we will dive into the implementation and see how it differs between the IPv4 and IPv6 routing subsystems in Linux.

I would like to thank Phil Sutter for his detailed review and feedback on the post.

UPDATE 2017-06-27

  • added History dive section
  • fixed grammar & spelling mistakes
  • added "thank you" note



[1]Actually, each next hop (or set of next hops in the ECMP case) is associated with a (destination, cost/metric) pair. There has even been a recent fix in this area.

The route cost/metric may optionally be specified with the metric keyword passed to the ip route command. For an example, see the section on "Routes with different metric" in the iproute2 cheat sheet. The kernel stores the metric value in the fib_priority field of struct fib_info for IPv4 routes, and in the rt6i_metric field of struct rt6_info for IPv6 routes.
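For illustration (addresses hypothetical), two alternative routes to the same prefix with different metrics could be configured as:

```shell
# The route with the lower metric is preferred
ip route add 192.0.2.0/24 via 198.51.100.1 metric 100
ip route add 192.0.2.0/24 via 203.0.113.1 metric 200
```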

[2]See man page for ip-route(8).
[3]As we'll later discover setting next-hop weight only works with IPv4 routes.
[4]See man page for rtnetlink(7).
[5]rtnetlink attributes are encoded in Length-Type-Value format and can be nested. See the struct rtattr declaration.

[6]You can sniff and capture netlink packets using the netlink monitor device:

ip li add mon0 type nlmon
ip li set dev mon0 up
tcpdump -i mon0 -w netlink.pcap
[7]See commit 0e884c78ee19 "ipv4: L3 hash-based multipath"
[8]See commit 51ebd3181572 "ipv6: add support of equal cost multipath (ECMP)"