Changes to TCP listen socket lookup in Linux

Posted by knucklehead on Sun, 29 Sep 2019 17:21:38 +0200

When the kernel's TCP stack receives a SYN segment, it uses the segment's destination IP address and port to find a local socket in the LISTEN state to perform the handshake with.

Listen socket lookup before version 4.17

The current listener hashtable is hashed by port only. When a process is listening at many IP addresses with the same port (e.g. [IP1]:443, [IP2]:443 ... [IPN]:443), the inet[6]_lookup_listener() performance is degraded to a linked list. It is prone to SYN attacks.

Before version 4.17, each TCP listen socket was hashed by port alone and then inserted into the corresponding collision chain. If many listen sockets use the same port, that chain grows long. The situation got even worse after SO_REUSEPORT was introduced in version 3.9, since it lets several sockets bind the same address and port.

For example, suppose six listeners are started on the host, all listening on port 21, so they all sit on the same chain (sk_B uses SO_REUSEPORT). If a SYN with destination 1.1.1.4:21 arrives, the kernel has to traverse the chain from beginning to end before it finds the matching listener, sk_D.
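To make the chain growth concrete, here is a minimal user-space sketch, not kernel code. The table size TOY_LHASH_SIZE and the helpers toy_port_hash() and toy_chain_len() are hypothetical names invented for this illustration, and the kernel's real hash function differs in detail. The only point is that a port-only key ignores the bound address, so every listener on port 21 lands on one chain.

```c
#include <stdint.h>

/* Hypothetical table size, chosen only for this illustration. */
#define TOY_LHASH_SIZE 32u

/* Toy port-only hash: the bound address plays no part in the bucket choice. */
static unsigned int toy_port_hash(uint16_t port)
{
    return port & (TOY_LHASH_SIZE - 1u);
}

/* Count how many of the n listeners (described only by their port)
 * share the bucket that 'port' hashes to. */
static unsigned int toy_chain_len(const uint16_t *ports, unsigned int n,
                                  uint16_t port)
{
    unsigned int len = 0;

    for (unsigned int i = 0; i < n; i++)
        if (toy_port_hash(ports[i]) == toy_port_hash(port))
            len++;
    return len;
}
```

With six listeners on port 21, toy_chain_len() reports a chain of six regardless of which addresses they are bound to.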

Version 4.17: lookup in two hashtables

Version 4.17 adds a second hashtable (lhash2) to organize listen sockets. lhash2 uses port+addr as the hash key, while the original port-only hashtable remains unchanged. In other words, the same listen socket is placed in both hashtables at the same time. A socket bound to the wildcard address 0.0.0.0 is keyed in lhash2 by INADDR_ANY rather than by a concrete address, which is why the lookup shown below probes the INADDR_ANY bucket separately.

By adding addr to the key, lhash2 spreads listeners across more buckets. In the example above, sk_A through sk_C may now hash to other chains, while an sk_E that sat on a different chain in the port-only table may hash to lhash2[0].
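The effect of the wider key can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel's ipv4_portaddr_hash(): the mixing function, the table size TOY_LHASH2_SIZE, and the TOY_IP() macro are all made up here, and addresses are kept in host byte order for readability.

```c
#include <stdint.h>

/* Hypothetical table size, chosen only for this illustration. */
#define TOY_LHASH2_SIZE 32u

/* Build a host-order IPv4 address from four octets (illustrative helper). */
#define TOY_IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Toy port+addr hash: fold the address and mix in the port, so listeners
 * on the same port but different addresses can land in different buckets.
 * The kernel uses a much stronger mixing function. */
static unsigned int toy_portaddr_hash(uint32_t addr, uint16_t port)
{
    uint32_t h = addr ^ (addr >> 16) ^ port;

    return h & (TOY_LHASH2_SIZE - 1u);
}
```

With this toy function, listeners on 1.1.1.1:21 through 1.1.1.4:21 fall into four different buckets, whereas a port-only key would put all of them into one.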

Therefore, when looking up a listen socket, the kernel uses the port and address in the SYN to work out which chain the socket would be on in each of the two hashtables, and then compares the lengths of the two chains. If the chain in the 1st hashtable is short (at most 10 sockets) or is shorter than the chain in the 2nd hashtable, the kernel searches the 1st chain as before. Otherwise it searches the chain in the 2nd hashtable.

                     struct inet_hashinfo *hashinfo,
                     struct sk_buff *skb, int doff,
@@ -217,10 +306,42 @@ struct sock *__inet_lookup_listener(struct net *net,
     unsigned int hash = inet_lhashfn(net, hnum);
     struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
     bool exact_dif = inet_exact_dif_match(net, skb);
+    struct inet_listen_hashbucket *ilb2;
     struct sock *sk, *result = NULL;
     int score, hiscore = 0;
+    unsigned int hash2;
     u32 phash = 0;
 
+    if (ilb->count <= 10 || !hashinfo->lhash2)
+        goto port_lookup;
+
+    /* Too many sk in the ilb bucket (which is hashed by port alone).
+     * Try lhash2 (which is hashed by port and addr) instead.
+     */
+
+    hash2 = ipv4_portaddr_hash(net, daddr, hnum);
+    ilb2 = inet_lhash2_bucket(hashinfo, hash2);
+    if (ilb2->count > ilb->count)
+        goto port_lookup;
+
+    result = inet_lhash2_lookup(net, ilb2, skb, doff,
+                    saddr, sport, daddr, hnum,
+                    dif, sdif);
+    if (result)
+        return result;
+
+    /* Lookup lhash2 with INADDR_ANY */
+
+    hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
+    ilb2 = inet_lhash2_bucket(hashinfo, hash2);
+    if (ilb2->count > ilb->count)
+        goto port_lookup;
+
+    return inet_lhash2_lookup(net, ilb2, skb, doff,
+                  saddr, sport, daddr, hnum,
+                  dif, sdif);
+
+port_lookup:
     sk_for_each_rcu(sk, &ilb->head) {
         score = compute_score(sk, net, hnum, daddr,
                       dif, sdif, exact_dif);
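The choice made in the hunk above can be restated as a small predicate. This is a user-space paraphrase for illustration only: toy_use_lhash2() and its parameters are hypothetical names, and the threshold of 10 is simply copied from the diff. In the real function the chain-length comparison is applied twice, once for the bucket of the destination address and once for the INADDR_ANY bucket.

```c
#include <stdbool.h>

/* Returns true when the 4.17 lookup would search the port+addr chain
 * instead of the port-only chain.
 *   port_chain_len     - sockets on the port-only chain (ilb->count)
 *   portaddr_chain_len - sockets on the port+addr chain (ilb2->count)
 *   have_lhash2        - whether the second hashtable exists at all */
static bool toy_use_lhash2(unsigned int port_chain_len,
                           unsigned int portaddr_chain_len,
                           bool have_lhash2)
{
    /* A short port-only chain is cheap to walk; keep the old path. */
    if (port_chain_len <= 10 || !have_lhash2)
        return false;

    /* The port+addr chain is longer, so it would not help. */
    if (portaddr_chain_len > port_chain_len)
        return false;

    return true;
}
```

For the six-listener example earlier, the port-only chain has length 6, so the lookup would stay on the old path.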

Version 5.0: lookup only in the 2nd hashtable

In version 5.0, the kernel changed the lookup to search only the 2nd hashtable. The reason is that searching the 1st hashtable can pick the wrong listener when a wildcard address (0.0.0.0) and a specific address (such as 1.1.1.1) both listen on the same port: the wildcard listener may be chosen even though a more specific one exists. This is not the fault of version 4.17. The problem has existed ever since SO_REUSEPORT was introduced in version 3.9!

Let's see what happened.

Suppose sk_A and sk_B both set SO_REUSEPORT and listen on port 21, with sk_A on the wildcard address and sk_B on a specific address. If sk_A is started later, it is added at the head of the list. When a segment for 1.1.1.2:21 arrives, the kernel finds that sk_A already matches and returns right away, without trying the more specific sk_B! This is clearly wrong, because before SO_REUSEPORT was merged the kernel traversed the entire list, scored how well each socket matched (compute_score), and picked the best one.
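The early exit can be reproduced with a small user-space model of the pre-5.0 loop. It is a sketch under assumptions: struct toy_sock, toy_score() and toy_lookup_pre50() are hypothetical, the scoring keeps only the address part of compute_score, and a reuseport group is treated as containing just the one socket, so the group selection simply returns that socket.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for a listen socket (illustrative only). */
struct toy_sock {
    uint32_t rcv_saddr;  /* bound address, 0 means wildcard */
    bool reuseport;      /* SO_REUSEPORT set */
};

/* Simplified scoring: a wildcard listener matches any destination,
 * a bound listener must match exactly and then scores higher. */
static int toy_score(const struct toy_sock *sk, uint32_t daddr)
{
    int score = 2;

    if (sk->rcv_saddr) {
        if (sk->rcv_saddr != daddr)
            return -1;
        score += 4;
    }
    return score;
}

/* Walk the chain from the head and return the index of the chosen socket,
 * or -1. A matching reuseport socket ends the walk immediately, which is
 * the behavior described in the text. */
static int toy_lookup_pre50(const struct toy_sock *chain, size_t n,
                            uint32_t daddr)
{
    int best = -1, hiscore = 0;

    for (size_t i = 0; i < n; i++) {
        int score = toy_score(&chain[i], daddr);

        if (score > hiscore) {
            if (chain[i].reuseport)
                return (int)i;  /* early exit: later sockets are never scored */
            best = (int)i;
            hiscore = score;
        }
    }
    return best;
}
```

With the wildcard sk_A at the head and sk_B bound to 1.1.1.2 behind it, a lookup for 1.1.1.2 returns index 0 (sk_A) when both sockets set reuseport, but index 1 (sk_B) when neither does.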

Version 5.0 searches only the 2nd hashtable and also changes compute_score: if the socket's listening address differs from the address being looked up, the match fails immediately. Previously, a listener on the wildcard address passed this check for any destination address. Wildcard listeners are still found, because the second probe passes INADDR_ANY as the address to compare against.

Changes to the lookup:

struct sock *__inet_lookup_listener(struct net *net,
                     const __be32 daddr, const unsigned short hnum,
                     const int dif, const int sdif)
 {
-    unsigned int hash = inet_lhashfn(net, hnum);
-    struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
-    bool exact_dif = inet_exact_dif_match(net, skb);
     struct inet_listen_hashbucket *ilb2;
-    struct sock *sk, *result = NULL;
-    int score, hiscore = 0;
+    struct sock *result = NULL;
     unsigned int hash2;
-    u32 phash = 0;
-
-    if (ilb->count <= 10 || !hashinfo->lhash2)
-        goto port_lookup;
-
-    /* Too many sk in the ilb bucket (which is hashed by port alone).
-     * Try lhash2 (which is hashed by port and addr) instead.
-     */
 
     hash2 = ipv4_portaddr_hash(net, daddr, hnum);
     ilb2 = inet_lhash2_bucket(hashinfo, hash2);
-    if (ilb2->count > ilb->count)
-        goto port_lookup;
 
     result = inet_lhash2_lookup(net, ilb2, skb, doff,
                     saddr, sport, daddr, hnum,
@@ -335,34 +313,12 @@ struct sock *__inet_lookup_listener(struct net *net,
         goto done;
 
     /* Lookup lhash2 with INADDR_ANY */
-
     hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
     ilb2 = inet_lhash2_bucket(hashinfo, hash2);
-    if (ilb2->count > ilb->count)
-        goto port_lookup;
 
     result = inet_lhash2_lookup(net, ilb2, skb, doff,
-                    saddr, sport, daddr, hnum,
+                    saddr, sport, htonl(INADDR_ANY), hnum,
                     dif, sdif);
-    goto done;
-
-port_lookup:
-    sk_for_each_rcu(sk, &ilb->head) {
-        score = compute_score(sk, net, hnum, daddr,
-                      dif, sdif, exact_dif);
-        if (score > hiscore) {
-            if (sk->sk_reuseport) {
-                phash = inet_ehashfn(net, daddr, hnum,
-                             saddr, sport);
-                result = reuseport_select_sock(sk, phash,
-                                   skb, doff);
-                if (result)
-                    goto done;
-            }
-            result = sk;
-            hiscore = score;
-        }
-    }

Changes to the scoring:

@@ -234,24 +234,16 @@ static inline int compute_score(struct sock *sk, struct net *net,
                 const int dif, const int sdif, bool exact_dif)
 {
     int score = -1;
-    struct inet_sock *inet = inet_sk(sk);
-    bool dev_match;
 
-    if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
+    if (net_eq(sock_net(sk), net) && sk->sk_num == hnum &&
             !ipv6_only_sock(sk)) {
-        __be32 rcv_saddr = inet->inet_rcv_saddr;
-        score = sk->sk_family == PF_INET ? 2 : 1;
-        if (rcv_saddr) {
-            if (rcv_saddr != daddr)
-                return -1;
-            score += 4;
-        }
-        dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
-                         dif, sdif);
-        if (!dev_match)
+        if (sk->sk_rcv_saddr != daddr)
+            return -1;
+
+        if (!inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
             return -1;
-        score += 4;
 
+        score = sk->sk_family == PF_INET ? 2 : 1;
         if (sk->sk_incoming_cpu == raw_smp_processor_id())
             score++;
     }
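Taken together, the two hunks give a lookup that tries the exact destination address first and falls back to the wildcard only if nothing matched. The following user-space sketch models that order. Every name in it (struct toy_listener, toy_score_v50(), toy_pass(), toy_lookup_v50()) is hypothetical, and it scans one flat array twice, whereas the kernel walks two lhash2 buckets.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_INADDR_ANY 0u

/* Minimal stand-in for a listen socket (illustrative only). */
struct toy_listener {
    uint32_t rcv_saddr;  /* bound address, TOY_INADDR_ANY for wildcard */
    uint16_t port;
};

/* Strict scoring in the spirit of the 5.0 compute_score: the bound
 * address must equal the address being looked up. */
static int toy_score_v50(const struct toy_listener *sk, uint32_t addr,
                         uint16_t port)
{
    if (sk->port != port || sk->rcv_saddr != addr)
        return -1;
    return 2;
}

/* One pass over the listeners for a given address key. */
static int toy_pass(const struct toy_listener *sks, size_t n,
                    uint32_t addr, uint16_t port)
{
    for (size_t i = 0; i < n; i++)
        if (toy_score_v50(&sks[i], addr, port) > 0)
            return (int)i;
    return -1;
}

/* Exact destination address first, then the wildcard as a fallback. */
static int toy_lookup_v50(const struct toy_listener *sks, size_t n,
                          uint32_t daddr, uint16_t port)
{
    int idx = toy_pass(sks, n, daddr, port);

    if (idx >= 0)
        return idx;
    return toy_pass(sks, n, TOY_INADDR_ANY, port);
}
```

With a wildcard listener at index 0 and a listener bound to 1.1.1.2 at index 1, a lookup for 1.1.1.2:21 now returns index 1 regardless of list order, while a lookup for any other local address on port 21 still falls back to index 0.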

Appendix: Complete patch

inet: Add a 2nd listener hashtable (port+addr) inet_connection_sock.h
inet: Add a 2nd listener hashtable (port+addr) inet_hashtables.h
inet: Add a 2nd listener hashtable (port+addr) inet_hashtables.c
net: tcp: prefer listeners bound to an address inet_hashtables.c
