What is a FIB Nexthop Exception

Posted by Coreyjames25 on Mon, 07 Mar 2022 07:35:04 +0100

Reprinted from:


Kernel version 3.6 removed the routing cache that used to be consulted before the FIB lookup and replaced it with a cache on the next hop, as already covered in "Past and present life of routing cache". This article is about another concept introduced in the same version: the FIB nexthop exception, which records exceptions for a next hop.

What's the use of it?

By querying the forwarding information base (fib_lookup), the kernel obtains the next hop (fib_nh) and with it the details of the route, including the next hop device nh_dev, the next hop gateway nh_gw, and so on. These items are basically stable. However, while the kernel is actually sending packets, two kinds of exception can arise for a route: the path MTU (PMTU) of the route changes, or an ICMP redirect for the route is received. Since neither change is permanent, the kernel stores them in the next hop's fib_nh_exception entries.

  • A received ICMP REDIRECT message indicates that a previously sent packet took a detour, and that subsequent packets should use a different next hop.
  • A received ICMP FRAGNEEDED message indicates that a previously sent packet was too large for some device on the path, and that the source needs to fragment.
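The two cases in the list above can be sketched in userspace C. This is only an illustration, not kernel code: struct toy_exception and toy_handle_icmp are hypothetical names, and the only real values are the ICMP type and code numbers, which match <linux/icmp.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ICMP type and code values, as defined in <linux/icmp.h>. */
#define ICMP_DEST_UNREACH 3
#define ICMP_REDIRECT     5
#define ICMP_FRAG_NEEDED  4   /* a code of ICMP_DEST_UNREACH */

/* Hypothetical, simplified exception record (not the kernel structure). */
struct toy_exception {
    uint32_t daddr;  /* destination the exception applies to */
    uint32_t pmtu;   /* 0 means no PMTU has been learned */
    uint32_t gw;     /* 0 means no redirect has been learned */
};

/*
 * Apply one ICMP notification to the exception of a destination.
 * info is the new gateway for a redirect, or the next-hop MTU for a
 * "fragmentation needed" message.
 * Returns 1 if the exception was updated, 0 if the message was ignored.
 */
static int toy_handle_icmp(struct toy_exception *e, int type, int code,
                           uint32_t info)
{
    assert(e != NULL);
    if (type == ICMP_REDIRECT) {
        e->gw = info;        /* later packets use this gateway */
        return 1;
    }
    if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) {
        e->pmtu = info;      /* later packets are fragmented to this size */
        return 1;
    }
    return 0;                /* any other ICMP message leaves it untouched */
}
```

Note that both kinds of notification end up in the same record, which is also how fib_nh_exception carries a PMTU field and a gateway field side by side.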

Both cases apply to a single destination address. What does that mean? Take PMTU as an example. In a topology where host A reaches host C through an intermediate device B, I configured the following route on host A:

ip route add via

This means that packets sent to any host whose address matches the route's destination prefix are forwarded to the configured next hop.

When A sends a 1500-byte IP packet to C, an intermediate network device B considers the packet too large and sends an ICMP FRAGNEEDED message back to A, which says in effect: I can only forward packets no larger than 1300 bytes, please fragment. What should A do after receiving this message? It cannot treat every future packet that hits this route as if the MTU were 1300, because not every path covered by the route goes through B.

This is where the FIB nexthop exception comes in handy: it records the exception. When an outgoing packet hits this route, it is fragmented against an MTU of 1500 if the destination is not C, and against an MTU of 1300 if the destination is C.
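The per-destination decision described above can be written as a tiny userspace sketch. The names toy_route and toy_effective_mtu, and the addresses used in the usage note, are made up for illustration; only the values 1500 and 1300 come from the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical exception: one destination with its own learned PMTU. */
struct toy_pmtu_exception {
    uint32_t daddr;
    uint32_t pmtu;
};

/* Hypothetical route: a default MTU plus a few per-destination exceptions. */
struct toy_route {
    uint32_t mtu;                        /* used when no exception matches */
    struct toy_pmtu_exception exc[4];
    size_t n_exc;                        /* number of valid entries in exc */
};

/* Return the MTU to use for daddr: the exception's PMTU if one exists
 * for exactly this destination, otherwise the route's own MTU. */
static uint32_t toy_effective_mtu(const struct toy_route *rt, uint32_t daddr)
{
    for (size_t i = 0; i < rt->n_exc; i++)
        if (rt->exc[i].daddr == daddr)
            return rt->exc[i].pmtu;
    return rt->mtu;
}
```

With a route whose mtu is 1500 and a single exception { C, 1300 }, toy_effective_mtu returns 1300 for C and 1500 for every other destination covered by the route.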


The kernel uses struct fib_nh_exception to represent such an exception entry:


struct fib_nh_exception {
    struct fib_nh_exception __rcu    *fnhe_next;    /* Next fib_nh_exception in the collision chain */
    __be32                fnhe_daddr;               /* Destination address the exception applies to */
    u32                   fnhe_pmtu;                /* PMTU learned from a received ICMP FRAGNEEDED */
    __be32                fnhe_gw;                  /* Gateway learned from a received ICMP REDIRECT */
    unsigned long         fnhe_expires;             /* Expiration time of the exception entry */
    struct rtable __rcu   *fnhe_rth;                /* Associated route cache entry */
    unsigned long         fnhe_stamp;               /* Time of the last update, used to find the oldest entry */
};


Each next hop structure fib_nh holds a pointer to an fnhe_hash_bucket hash table:

struct fib_nh {
    /* code omitted */
    struct fnhe_hash_bucket    *nh_exceptions;
    /* code omitted */
};

The hash table is created in update_or_create_fnhe. It contains 2048 collision chains, and each chain can store 5 fib_nh_exception entries.
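To make the layout concrete, here is a minimal userspace sketch of such a table, assuming the sizes given above (2048 chains, 5 entries per chain). It is not the kernel implementation: all toy_ names are hypothetical, the hash function is a placeholder, there is no RCU or locking, and the logical clock stands in for jiffies.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_HASH_SIZE     2048  /* number of collision chains, as in the text */
#define TOY_RECLAIM_DEPTH 5     /* entries per chain, as in the text */

struct toy_fnhe {
    struct toy_fnhe *next;
    uint32_t daddr;
    uint32_t pmtu;
    unsigned long stamp;        /* logical time of the last update */
};

struct toy_table {
    struct toy_fnhe *chain[TOY_HASH_SIZE];
    unsigned long clock;        /* stands in for jiffies */
};

/* Placeholder hash; the kernel uses a different function. */
static unsigned toy_hash(uint32_t daddr)
{
    return daddr % TOY_HASH_SIZE;
}

static struct toy_fnhe *toy_find(struct toy_table *t, uint32_t daddr)
{
    for (struct toy_fnhe *e = t->chain[toy_hash(daddr)]; e; e = e->next)
        if (e->daddr == daddr)
            return e;
    return NULL;
}

/* Update the entry for daddr, creating it if needed. When the chain is
 * already full, the entry with the oldest stamp is recycled instead.
 * Entries are never freed in this sketch. */
static struct toy_fnhe *toy_update_or_create(struct toy_table *t,
                                             uint32_t daddr, uint32_t pmtu)
{
    struct toy_fnhe **head = &t->chain[toy_hash(daddr)];
    struct toy_fnhe *e, *oldest = NULL;
    int depth = 0;

    for (e = *head; e; e = e->next) {
        if (e->daddr == daddr)
            break;
        if (!oldest || e->stamp < oldest->stamp)
            oldest = e;
        depth++;
    }
    if (!e) {
        if (depth >= TOY_RECLAIM_DEPTH) {
            e = oldest;                 /* chain full: reuse the oldest entry */
        } else {
            e = calloc(1, sizeof(*e));
            if (!e)
                return NULL;
            e->next = *head;            /* link the new entry at the head */
            *head = e;
        }
        e->daddr = daddr;
    }
    e->pmtu = pmtu;
    e->stamp = ++t->clock;
    return e;
}
```

Capping the chain length keeps lookups cheap and bounds memory per next hop; the price is that a sixth colliding destination evicts the least recently updated one.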

Taking PMTU as an example: after the ICMP FRAGNEEDED message returned by the network device is received, the following function is called to record the advertised PMTU value in a fib_nh_exception (it is also recorded in the bound route cache entry, rtable):

static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
{
    /* code omitted */
    if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) {
        struct fib_nh *nh = &FIB_RES_NH(res);

        /* Record the new PMTU for fl4->daddr; the entry expires after ip_rt_mtu_expires */
        update_or_create_fnhe(nh, fl4->daddr, 0, mtu,
                      jiffies + ip_rt_mtu_expires);
    }
    /* code omitted */
}
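The expiry passed to update_or_create_fnhe is what makes the exception temporary. A rough userspace sketch of that idea follows; the toy_ names are hypothetical, time is a plain counter in seconds rather than jiffies, and the 600-second lifetime is an assumed value chosen for the sketch.

```c
#include <stdint.h>

#define TOY_MTU_EXPIRES 600UL   /* assumed lifetime in seconds, for illustration */

/* Hypothetical record of a learned PMTU with an absolute expiry time. */
struct toy_pmtu_entry {
    uint32_t pmtu;          /* 0 means nothing has been learned */
    unsigned long expires;  /* time from which the PMTU is no longer used */
};

/* Store a newly learned PMTU and restart its lifetime. */
static void toy_record_pmtu(struct toy_pmtu_entry *e, uint32_t mtu,
                            unsigned long now)
{
    e->pmtu = mtu;
    e->expires = now + TOY_MTU_EXPIRES;
}

/* Return the learned PMTU while it is still valid, else the route MTU. */
static uint32_t toy_current_mtu(const struct toy_pmtu_entry *e,
                                uint32_t route_mtu, unsigned long now)
{
    if (e->pmtu && now < e->expires)
        return e->pmtu;
    return route_mtu;
}
```

If 1300 is recorded at time 1000, lookups up to time 1599 return 1300, and from time 1600 on the route's own MTU is used again unless a new ICMP FRAGNEEDED arrives.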


After querying the FIB, the sending path first checks whether there is an exception entry keyed by the destination address. If so, it uses the route cache bound to that entry; if not, it uses the cache on the next hop:

static struct rtable *__mkroute_output(const struct fib_result *res,
                       const struct flowi4 *fl4, int orig_oif,
                       struct net_device *dev_out,
                       unsigned int flags)
{
    /* code omitted */
    if (fi) {
        struct rtable __rcu **prth;
        struct fib_nh *nh = &FIB_RES_NH(*res);

        fnhe = find_exception(nh, fl4->daddr);      /* Is there a fib_nh_exception for fl4->daddr? */
        if (fnhe)
            prth = &fnhe->fnhe_rth;                 /* Yes: use the route cache bound to it */
        else {
            if (unlikely(fl4->flowi4_flags &
                     FLOWI_FLAG_KNOWN_NH &&
                     !(nh->nh_gw &&
                       nh->nh_scope == RT_SCOPE_LINK))) {
                do_cache = false;
                goto add;
            }
            prth = __this_cpu_ptr(nh->nh_pcpu_rth_output);   /* No: use the route cache on the next hop */
        }
        rth = rcu_dereference(*prth);
        if (rt_cache_valid(rth)) {
            return rth;
        }
    }
    /* code omitted */
}
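Stripped of kernel details, the choice between the two cache slots looks roughly like the sketch below. Everything named toy_ is hypothetical, the exception lookup is a linear scan instead of a hash lookup, and the next hop has a single cache slot instead of a per-CPU one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached route. */
struct toy_rt {
    uint32_t mtu;
};

/* Hypothetical exception: a destination with its own cached route. */
struct toy_exc {
    uint32_t daddr;
    struct toy_rt *cached;
};

/* Hypothetical next hop: one shared cache slot plus its exceptions. */
struct toy_nh {
    struct toy_rt *cached;   /* used by destinations without an exception */
    struct toy_exc *exc;
    size_t n_exc;
};

/* Return the address of the cache slot to use for daddr: the slot of the
 * matching exception if there is one, otherwise the slot on the next hop. */
static struct toy_rt **toy_cache_slot(struct toy_nh *nh, uint32_t daddr)
{
    for (size_t i = 0; i < nh->n_exc; i++)
        if (nh->exc[i].daddr == daddr)
            return &nh->exc[i].cached;
    return &nh->cached;
}
```

Returning the slot rather than the route mirrors the prth pointer in the excerpt: the caller can read the cached route through it and, on a miss, store a freshly built route in the same place.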