Linux network protocol stack 3--neighbor subsystem

Posted by Spaceboy on Thu, 03 Feb 2022 12:58:20 +0100

A neighbor can be understood simply as a host that is one hop away at layer 3. The next hop of a route is not necessarily a directly connected hop (iterative, or recursive, routing), but by the time a packet reaches the neighbor subsystem the next hop is always one hop away.
Usage of Linux iterative routing: https://www.jianshu.com/p/070202b6d3ca

The neighbor subsystem maps layer 3 addresses to layer 2 addresses, caches layer 2 headers to speed up encapsulation, and builds the layer 2 header for outgoing packets.
Each neighbor table entry, as shown below, describes a next hop with IP address x.x.x.x whose MAC address is xx:xx:xx:xx:xx:xx and which is reachable through the outgoing interface ethX.

# ip neigh
172.16.10.34 dev eth1 lladdr 52:54:00:8f:77:cd STALE
172.16.100.2 dev eth1 lladdr 00:1e:08:0a:53:01 STALE
192.168.122.1 dev eth2 lladdr 52:54:00:7a:39:1c STALE
172.16.100.3 dev eth1 lladdr 00:1e:08:0a:b2:f7 STALE
172.16.0.2 dev eth1 lladdr 00:1e:08:15:18:65 STALE
172.16.0.1 dev eth1 lladdr 50:c5:8d:b4:3e:81 REACHABLE
1.1.1.1 dev eth0 lladdr 52:54:00:e4:f7:11 PERMANENT
192.168.121.1 dev eth0 lladdr 52:54:00:8a:20:74 STALE
20.1.1.10 dev eth2.100 lladdr 52:54:00:e4:f7:2a STALE
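
When this output is consumed by a tool, each line can be split into its four fields. The following is a minimal sketch in C, assuming the exact "IP dev IF lladdr MAC STATE" layout shown above. The names demo_neigh_line and demo_parse_neigh_line are hypothetical, and lines without an lladdr field (for example entries in FAILED state) are rejected rather than handled.

```c
#include <stdio.h>

/* Hypothetical container for one line of `ip neigh` output. */
struct demo_neigh_line {
	char ip[46];      /* large enough for a textual IPv6 address */
	char dev[16];     /* interface name, e.g. eth2.100 */
	char lladdr[18];  /* xx:xx:xx:xx:xx:xx plus terminator */
	char state[16];   /* STALE, REACHABLE, PERMANENT, ... */
};

/* Returns 0 if all four fields were found, -1 otherwise. */
static int demo_parse_neigh_line(const char *line, struct demo_neigh_line *out)
{
	int n = sscanf(line, "%45s dev %15s lladdr %17s %15s",
		       out->ip, out->dev, out->lladdr, out->state);

	return n == 4 ? 0 : -1;
}
```

The field widths in the format string keep sscanf from writing past the fixed-size buffers.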

Each entry also has a state. Understanding what the states mean matters both for troubleshooting real environments and for reading the code.

/*
 *	Neighbor Cache Entry States.
 */

#define NUD_INCOMPLETE	0x01
#define NUD_REACHABLE	0x02
#define NUD_STALE	0x04
#define NUD_DELAY	0x08
#define NUD_PROBE	0x10
#define NUD_FAILED	0x20

/* Dummy states */
#define NUD_NOARP	0x40
#define NUD_PERMANENT	0x80
#define NUD_NONE	0x00

/* NUD_NOARP & NUD_PERMANENT are pseudostates, they never change
   and make no address resolution or NUD.
   NUD_PERMANENT also cannot be deleted by garbage collectors.
 */

#define NUD_IN_TIMER	(NUD_INCOMPLETE|NUD_REACHABLE|NUD_DELAY|NUD_PROBE)
#define NUD_VALID	(NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE|NUD_PROBE|NUD_STALE|NUD_DELAY)
#define NUD_CONNECTED	(NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE)
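
Because the states are single bits, the three composite masks are tested with a bitwise AND. The following user-space sketch copies the values from the listing above and shows how a state is classified. The helper names (nud_state_name, nud_is_valid, nud_is_connected, nud_in_timer, nud_describe) are made up for illustration and are not kernel functions.

```c
#include <stdio.h>

/* State bits and masks, copied from the kernel listing above. */
#define NUD_INCOMPLETE	0x01
#define NUD_REACHABLE	0x02
#define NUD_STALE	0x04
#define NUD_DELAY	0x08
#define NUD_PROBE	0x10
#define NUD_FAILED	0x20
#define NUD_NOARP	0x40
#define NUD_PERMANENT	0x80
#define NUD_NONE	0x00

#define NUD_IN_TIMER	(NUD_INCOMPLETE|NUD_REACHABLE|NUD_DELAY|NUD_PROBE)
#define NUD_VALID	(NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE|NUD_PROBE|NUD_STALE|NUD_DELAY)
#define NUD_CONNECTED	(NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE)

/* Hypothetical helper: printable name of a single state value. */
static const char *nud_state_name(unsigned char state)
{
	switch (state) {
	case NUD_NONE:       return "NONE";
	case NUD_INCOMPLETE: return "INCOMPLETE";
	case NUD_REACHABLE:  return "REACHABLE";
	case NUD_STALE:      return "STALE";
	case NUD_DELAY:      return "DELAY";
	case NUD_PROBE:      return "PROBE";
	case NUD_FAILED:     return "FAILED";
	case NUD_NOARP:      return "NOARP";
	case NUD_PERMANENT:  return "PERMANENT";
	default:             return "UNKNOWN";
	}
}

/* The entry holds a usable link-layer address. */
static int nud_is_valid(unsigned char state)     { return (state & NUD_VALID) != 0; }
/* The address is usable without further verification. */
static int nud_is_connected(unsigned char state) { return (state & NUD_CONNECTED) != 0; }
/* A timer is running for the entry in this state. */
static int nud_in_timer(unsigned char state)     { return (state & NUD_IN_TIMER) != 0; }

/* Print one state together with its group memberships. */
static void nud_describe(unsigned char state)
{
	printf("%-10s valid=%d connected=%d in_timer=%d\n",
	       nud_state_name(state), nud_is_valid(state),
	       nud_is_connected(state), nud_in_timer(state));
}
```

For example, NUD_STALE is valid but neither connected nor in a timer, which is why a STALE entry can still be used for sending but has to be re-verified.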

The states form a state machine. Each state and its transitions are described below.

NUD_INCOMPLETE: a solicitation has been sent but no reply has been received yet. The hardware address is not resolved in this state, so no usable address exists. Packets destined for the neighbor are queued.
A timer is started in this state. If no reply has arrived when the timer expires, the solicitation is retransmitted. Once the number of solicitations reaches the upper limit, the entry enters NUD_FAILED.
NUD_REACHABLE: the neighbor's hardware address has been obtained and cached. On entering this state, the output function of the entry is set first (this state uses connected_output from the neigh_ops structure), and then any packets queued for the neighbor are sent. If the entry stays idle in this state beyond the upper limit, it enters NUD_STALE.
NUD_STALE: as soon as a packet needs to be sent to the neighbor, the entry enters NUD_DELAY and the packet is sent. If the idle time in this state reaches the upper limit and the reference count is 1, the entry is deleted by garbage collection. Packet output is not restricted in this state, and the slow output path is used.
NUD_DELAY: a packet has been sent to a neighbor that was in NUD_STALE, and the neighbor's reachability now needs to be confirmed. If a confirmation arrives before the delay timer expires, the entry enters NUD_REACHABLE. Otherwise it enters NUD_PROBE and solicitations are sent. Packet output is not restricted in this state, and the slow output path is used.
NUD_PROBE: a transitional state similar to NUD_INCOMPLETE. Until a reply or confirmation from the neighbor is received, the solicitation is retransmitted periodically, up to the upper limit. If a reply or confirmation is received, the entry enters NUD_REACHABLE. If the limit is reached first, it enters NUD_FAILED. Packet output is not restricted in this state, and the slow output path is used.
NUD_FAILED: the neighbor is unreachable because no reply was received.
NUD_NOARP: the neighbor needs no layer 3 to layer 2 address resolution, for example on some layer 3 overlay virtual interfaces or on loopback.
NUD_PERMANENT: the hardware address of the entry is configured statically.
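
The transitions can be condensed into a small transition function. This is a deliberately simplified user-space model, not kernel code: the enum names and demo_next_state are hypothetical, the four events are an assumption that folds timers, replies and upper-layer confirmations together, and the NUD_DELAY to NUD_PROBE step follows what neigh_timer_handler does when the delay timer expires without confirmation. NUD_NOARP and NUD_PERMANENT are left out because they never change.

```c
/* Simplified model of the dynamic neighbor states. */
enum demo_state {
	ST_NONE, ST_INCOMPLETE, ST_REACHABLE, ST_STALE,
	ST_DELAY, ST_PROBE, ST_FAILED
};

/* Assumed event set for this sketch. */
enum demo_event {
	EV_PACKET_OUT,        /* a data packet must be sent to the neighbor */
	EV_CONFIRMED,         /* reply or reachability confirmation received */
	EV_TIMEOUT,           /* the state's timer expired */
	EV_PROBES_EXHAUSTED   /* solicitation retry limit reached */
};

static enum demo_state demo_next_state(enum demo_state s, enum demo_event e)
{
	switch (s) {
	case ST_NONE:
	case ST_FAILED:
		/* The first packet starts address resolution. */
		return e == EV_PACKET_OUT ? ST_INCOMPLETE : s;
	case ST_INCOMPLETE:
	case ST_PROBE:
		if (e == EV_CONFIRMED)
			return ST_REACHABLE;
		if (e == EV_PROBES_EXHAUSTED)
			return ST_FAILED;
		return s;              /* EV_TIMEOUT: retransmit, same state */
	case ST_REACHABLE:
		return e == EV_TIMEOUT ? ST_STALE : s;
	case ST_STALE:
		return e == EV_PACKET_OUT ? ST_DELAY : s;
	case ST_DELAY:
		if (e == EV_CONFIRMED)
			return ST_REACHABLE;
		if (e == EV_TIMEOUT)
			return ST_PROBE;   /* no confirmation: start probing */
		return s;
	}
	return s;
}
```

Feeding the events packet out, confirmed, timeout, packet out, timeout, probes exhausted into this function starting from ST_NONE walks through INCOMPLETE, REACHABLE, STALE, DELAY, PROBE and ends in FAILED.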

Related data structure

struct neigh_table represents one neighbor protocol. At present two global instances are defined: arp_tbl for IPv4 ARP and nd_tbl for IPv6 neighbor discovery.

// IPv4: arp_tbl, IPv6: nd_tbl
struct neigh_table {
	int			family;           // Address family: AF_INET or AF_INET6
	int			entry_size;    // Size of a neighbor entry including its key. IPv4 entries are looked up by IPv4 address, so this is sizeof(struct neighbour) + 4
	int			key_len;       // Length of the lookup key mentioned above, i.e. the layer 3 address; for ARP this is an IPv4 address
	__be16			protocol;     // Layer 3 protocol type, ETH_P_IP or ETH_P_IPV6
	__u32			(*hash)(const void *pkey,
					const struct net_device *dev,
					__u32 *hash_rnd);          // Hash function for entries, e.g. arp_hash
	bool			(*key_eq)(const struct neighbour *, const void *pkey);
	int			(*constructor)(struct neighbour *);
	int			(*pconstructor)(struct pneigh_entry *);
	void			(*pdestructor)(struct pneigh_entry *);
	void			(*proxy_redo)(struct sk_buff *skb);
	char			*id;                   // Name of the table, used to name the cache that neighbor entries are allocated from; for arp_tbl it is "arp_cache"
	struct neigh_parms	parms;      // Tunable parameters of the protocol
	struct list_head	parms_list;   
	int			gc_interval;     // These four are the garbage collection parameters: the run interval and three entry-count thresholds
	int			gc_thresh1;
	int			gc_thresh2;
	int			gc_thresh3;
	unsigned long		last_flush;
	struct delayed_work	gc_work;        // Work queue for garbage collection
	struct timer_list 	proxy_timer;
	struct sk_buff_head	proxy_queue;
	atomic_t		entries;                          // Number of all neighbor entries
	rwlock_t		lock;
	unsigned long		last_rand;
	struct neigh_statistics	__percpu *stats;
	struct neigh_hash_table __rcu *nht;
	struct pneigh_entry	**phash_buckets;  // Hash table of proxy neighbor entries (regular neighbor entries are stored in nht)
};

struct neighbour defines a neighbor table entry: its state, layer 2 and layer 3 addresses, cached layer 2 header, outgoing interface, and several function pointers.

output is the packet output function used to send packets to the neighbor. The callback changes with the state: while the neighbor is reachable it is connected_output. When the state leaves NUD_CONNECTED for NUD_STALE or NUD_DELAY, neigh_suspect forces reachability confirmation by pointing neigh->output at neigh->ops->output, that is, neigh_resolve_output.
neigh_ops defines the function that sends address resolution requests and the packet output functions (a general one and one for the connected state).

struct neighbour {
	struct neighbour __rcu	*next;
	struct neigh_table	*tbl;                    // Back pointer to the owning table, e.g. arp_tbl
	struct neigh_parms	*parms;              // Parameters used to tune the neighbor protocol
	unsigned long		confirmed;         // Time the neighbor was last confirmed reachable; the transport layer updates it through neigh_confirm, the neighbor subsystem through neigh_update
	unsigned long		updated;           // Time of the last update by neigh_update
	rwlock_t		lock;
	atomic_t		refcnt;
	struct sk_buff_head	arp_queue;      // Packets waiting for address resolution. When the first packet is sent and a new entry is needed, the packet is queued here and solicit() is called to send the request.
	unsigned int		arp_queue_len_bytes;
	struct timer_list	timer;
	unsigned long		used;
	atomic_t		probes;
	__u8			flags;
	__u8			nud_state;
	__u8			type;
	__u8			dead;
	seqlock_t		ha_lock;
	unsigned char		ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];    // Layer 2 hardware address corresponding to the layer 3 address stored in primary_key
	struct hh_cache		hh;                                    // hh_cache structure caching the complete layer 2 header, not just the layer 2 address
	// output sends a packet to the neighbor. The callback changes with the state: while the neighbor is reachable it is connected_output;
	// when NUD_REACHABLE changes to NUD_STALE or NUD_DELAY, neigh_suspect forces reachability confirmation by pointing neigh->output at neigh->ops->output, i.e. neigh_resolve_output
	int			(*output)(struct neighbour *, struct sk_buff *);   
	const struct neigh_ops	*ops;          // Neighbor operations: the functions that take a packet from layer 3 down to dev_queue_xmit
	struct rcu_head		rcu;
	struct net_device	*dev;              // Network device through which this neighbor is reached, i.e. the outgoing interface toward the next hop
	u8			primary_key[0];     // Layer 3 address used as the hash key: an IPv4 or IPv6 address
};
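
The zero-length primary_key array at the end of the structure is why entry_size is the structure size plus key_len: the key is stored in the same allocation, directly after the fixed fields. A minimal user-space sketch of that layout follows. struct demo_neigh, demo_neigh_alloc and demo_key_eq are hypothetical stand-ins, and calloc replaces the kernel allocator.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for struct neighbour. */
struct demo_neigh {
	uint8_t       nud_state;
	unsigned char ha[6];          /* layer 2 address */
	unsigned char primary_key[];  /* layer 3 key, stored inline after the struct */
};

/* Allocate one entry with room for a key of key_len bytes and copy the key in. */
static struct demo_neigh *demo_neigh_alloc(const void *pkey, size_t key_len)
{
	struct demo_neigh *n = calloc(1, sizeof(*n) + key_len);

	if (!n)
		return NULL;
	memcpy(n->primary_key, pkey, key_len);
	return n;
}

/* Counterpart of the table's key_eq callback: compare the stored key. */
static int demo_key_eq(const struct demo_neigh *n, const void *pkey, size_t key_len)
{
	return memcmp(n->primary_key, pkey, key_len) == 0;
}
```

For IPv4 the key length is 4, so an entry for 172.16.0.1 occupies sizeof(struct demo_neigh) + 4 bytes in this model.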

When a neighbor entry is created through __neigh_create, the table's constructor is called. For ARP this is arp_constructor, which initializes the entry and installs the appropriate output and ops functions according to the device type and capabilities.

struct neigh_ops {
    int            family;
    void            (*solicit)(struct neighbour *, struct sk_buff *);// Sends a solicitation. When a packet is sent and the entry needs resolving, the packet is queued in arp_queue and solicit is called to send the request.
    void            (*error_report)(struct neighbour *, struct sk_buff *); // Called to report an error to layer 3 when the entry holds unsent queued packets and the neighbor is unreachable.
    int            (*output)(struct neighbour *, struct sk_buff *); // General output function. It checks the neighbor state first, so it is slower than connected_output
    int            (*connected_output)(struct neighbour *, struct sk_buff *);// Used while the neighbor is in NUD_CONNECTED. The neighbor is known to be usable, so the layer 2 header is built and the packet is sent directly.
};

// Devices that do not support header cache
static const struct neigh_ops arp_generic_ops = {
	.family =		AF_INET,
	.solicit =		arp_solicit,
	.error_report =		arp_error_report,
	.output =		neigh_resolve_output,
	.connected_output =	neigh_connected_output,
};

// Devices supporting header cache
static const struct neigh_ops arp_hh_ops = {
	.family =		AF_INET,
	.solicit =		arp_solicit,
	.error_report =		arp_error_report,
	.output =		neigh_resolve_output,
	.connected_output =	neigh_resolve_output,
};
// Devices without header_ops: packets are sent directly; neigh_direct_output wraps dev_queue_xmit
static const struct neigh_ops arp_direct_ops = {
	.family =		AF_INET,
	.output =		neigh_direct_output,
	.connected_output =	neigh_direct_output,
};

static int arp_constructor(struct neighbour *neigh)
{
	__be32 addr = *(__be32 *)neigh->primary_key;
	struct net_device *dev = neigh->dev;
	struct in_device *in_dev;
	struct neigh_parms *parms;

	rcu_read_lock();
	in_dev = __in_dev_get_rcu(dev);
	if (!in_dev) {
		rcu_read_unlock();
		return -EINVAL;
	}

	neigh->type = inet_addr_type_dev_table(dev_net(dev), dev, addr);

	parms = in_dev->arp_parms;
	__neigh_parms_put(neigh->parms);
	neigh->parms = neigh_parms_clone(parms);
	rcu_read_unlock();

    // Without header_ops no layer 2 encapsulation or ARP is needed; the output functions call dev_queue_xmit directly
	if (!dev->header_ops) {
		neigh->nud_state = NUD_NOARP;
		neigh->ops = &arp_direct_ops;
		neigh->output = neigh_direct_output;
	} else {
		/* Good devices (checked by reading texts, but only Ethernet is
		   tested)

		   ARPHRD_ETHER: (ethernet, apfddi)
		   ARPHRD_FDDI: (fddi)
		   ARPHRD_IEEE802: (tr)
		   ARPHRD_METRICOM: (strip)
		   ARPHRD_ARCNET:
		   etc. etc. etc.

		   ARPHRD_IPDDP will also work, if author repairs it.
		   I did not it, because this driver does not work even
		   in old paradigm.
		 */
        // Check the neighbor type: multicast, broadcast, point-to-point, loopback, and NOARP interfaces need no ARP
		if (neigh->type == RTN_MULTICAST) {
			neigh->nud_state = NUD_NOARP;
			arp_mc_map(addr, neigh->ha, dev, 1);
		} else if (dev->flags & (IFF_NOARP | IFF_LOOPBACK)) {
			neigh->nud_state = NUD_NOARP;
			memcpy(neigh->ha, dev->dev_addr, dev->addr_len);
		} else if (neigh->type == RTN_BROADCAST ||
			   (dev->flags & IFF_POINTOPOINT)) {
			neigh->nud_state = NUD_NOARP;
			memcpy(neigh->ha, dev->broadcast, dev->addr_len);
		}
        // The main difference between arp_generic_ops and arp_hh_ops is that the latter uses the layer 2 header cache, while the former builds the header from the hardware address for each packet;
        // only devices that support the layer 2 header cache get arp_hh_ops
		if (dev->header_ops->cache)
			neigh->ops = &arp_hh_ops;
		else
			neigh->ops = &arp_generic_ops;
        // In a valid state, install connected_output. The difference between ops->connected_output and ops->output is that the former
        // skips the neighbor state checks and is therefore faster
		if (neigh->nud_state & NUD_VALID)
			neigh->output = neigh->ops->connected_output;
		else
			neigh->output = neigh->ops->output;
	}
	return 0;
}
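
The decisions made by arp_constructor can be summarized in two small predicates. This is a sketch that restates the branches of the listing above in user space; struct demo_dev, its boolean fields and the function names are hypothetical simplifications of dev->header_ops, dev->header_ops->cache and dev->flags.

```c
#include <stdbool.h>

enum demo_ops_kind  { DEMO_DIRECT_OPS, DEMO_HH_OPS, DEMO_GENERIC_OPS };
enum demo_addr_type { DEMO_RTN_UNICAST, DEMO_RTN_MULTICAST, DEMO_RTN_BROADCAST };

/* Hypothetical reduction of the device properties the constructor looks at. */
struct demo_dev {
	bool has_header_ops;     /* dev->header_ops != NULL */
	bool has_header_cache;   /* dev->header_ops->cache != NULL */
	bool noarp_or_loopback;  /* dev->flags & (IFF_NOARP | IFF_LOOPBACK) */
	bool point_to_point;     /* dev->flags & IFF_POINTOPOINT */
};

/* Which ops table the entry gets. */
static enum demo_ops_kind demo_pick_ops(const struct demo_dev *dev)
{
	if (!dev->has_header_ops)
		return DEMO_DIRECT_OPS;
	return dev->has_header_cache ? DEMO_HH_OPS : DEMO_GENERIC_OPS;
}

/* Whether the entry starts in NUD_NOARP, i.e. never needs address resolution. */
static bool demo_starts_noarp(const struct demo_dev *dev, enum demo_addr_type type)
{
	if (!dev->has_header_ops)
		return true;
	return type == DEMO_RTN_MULTICAST ||
	       dev->noarp_or_loopback ||
	       type == DEMO_RTN_BROADCAST ||
	       dev->point_to_point;
}
```

An Ethernet-like device (header_ops with a cache function) therefore gets the hh ops, and only its unicast neighbors go through ARP.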

ip_finish_output2 is the last stage of IP transmission. It hands the packet to the network device through the neighbor subsystem.
1. ip_finish_output2 first looks up the neighbor entry. If none exists, it calls __neigh_create to create and initialize one;
2. It calls dst_neigh_output to send the packet.

  • If neigh is in NUD_CONNECTED and the layer 2 header is cached, the packet is sent by prepending the cached header directly. This is commonly called the fast path. neigh->hh is set at two points: first, after the entry enters NUD_CONNECTED for the first time, neigh_hh_init builds the layer 2 header when a packet is sent (normally a queued packet); second, when neigh_update changes the state and the layer 2 address has changed, neigh_update_hhs refreshes the cached header.
  • Otherwise neigh->output is called. When neigh enters NUD_CONNECTED, neigh_connect points neigh->output at neigh->ops->connected_output. The entry already holds the neighbor's layer 2 address at that point, so the function fills in the layer 2 header and then calls dev_queue_xmit to send the packet. When the state changes from NUD_REACHABLE to NUD_STALE or NUD_DELAY, neigh_suspect forces reachability confirmation by pointing neigh->output at neigh->ops->output, that is, neigh_resolve_output. Processing differs considerably depending on the state of neigh; see the detailed comments in the functions.
/*
 * This function outputs the data packet to the network device through the neighbor subsystem.
 */
static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct dst_entry *dst = skb_dst(skb);
	struct rtable *rt = (struct rtable *)dst;
	struct net_device *dev = dst->dev;
	unsigned int hh_len = LL_RESERVED_SPACE(dev);
	struct neighbour *neigh;
	u32 nexthop;

	if (rt->rt_type == RTN_MULTICAST) {
		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTMCAST, skb->len);
	} else if (rt->rt_type == RTN_BROADCAST)
		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);

	/* Be paranoid, rather than too clever. */
	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
		struct sk_buff *skb2;

		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
		if (!skb2) {
			kfree_skb(skb);
			return -ENOMEM;
		}
		if (skb->sk)
			skb_set_owner_w(skb2, skb->sk);
		consume_skb(skb);
		skb = skb2;
	}

	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
		int res = lwtunnel_xmit(skb);

		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
			return res;
	}

	rcu_read_lock_bh();
	// Get the next hop from the route. Two cases: if the route specifies a next hop, use rt_gateway; otherwise use the destination IP of the packet.
	// This is the difference between configuring a route with and without a nexthop. Without one, an ARP request for the MAC of the destination IP is built later
	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);

	// Enter the neighbor subsystem proper. Sending and routing both come down to finding the next hop, which the neighbor subsystem manages
	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
	if (unlikely(!neigh))
		neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
	if (!IS_ERR(neigh)) {
		int res = dst_neigh_output(dst, neigh, skb);

		rcu_read_unlock_bh();
		return res;
	}
	rcu_read_unlock_bh();

	net_dbg_ratelimited("%s: No header cache and no neighbour!\n",
			    __func__);
	kfree_skb(skb);
	return -EINVAL;
}
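
The two steps that matter for the neighbor subsystem in ip_finish_output2, choosing the next hop and then looking up or creating its entry, can be modeled in a few lines. This is a user-space sketch under simplifying assumptions: a fixed-size array stands in for the neighbor hash table, a gateway value of 0 means that the route has no gateway, and all demo_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_TABLE_SIZE 8

struct demo_entry { uint32_t key; int in_use; };
struct demo_table { struct demo_entry e[DEMO_TABLE_SIZE]; };

/* Use the gateway if the route has one, otherwise the packet's own
 * destination address (the directly connected case). */
static uint32_t demo_nexthop(uint32_t gateway, uint32_t daddr)
{
	return gateway ? gateway : daddr;
}

/* Linear search in place of the kernel's hash lookup. */
static struct demo_entry *demo_lookup(struct demo_table *t, uint32_t key)
{
	for (size_t i = 0; i < DEMO_TABLE_SIZE; i++)
		if (t->e[i].in_use && t->e[i].key == key)
			return &t->e[i];
	return NULL;
}

/* Return the existing entry, or claim a free slot for a new one. */
static struct demo_entry *demo_lookup_or_create(struct demo_table *t, uint32_t key)
{
	struct demo_entry *n = demo_lookup(t, key);

	if (n)
		return n;
	for (size_t i = 0; i < DEMO_TABLE_SIZE; i++) {
		if (!t->e[i].in_use) {
			t->e[i].in_use = 1;
			t->e[i].key = key;
			return &t->e[i];
		}
	}
	return NULL;   /* table full */
}
```

With a gateway route, every destination behind the gateway shares one neighbor entry. Without a gateway, each destination address gets its own entry.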


static inline int dst_neigh_output(struct dst_entry *dst, struct neighbour *n,
				   struct sk_buff *skb)
{
	const struct hh_cache *hh;

	if (dst->pending_confirm) {
		unsigned long now = jiffies;

		dst->pending_confirm = 0;
		/* avoid dirtying neighbour */
		if (n->confirmed != now)
			n->confirmed = now;
	}
	/*
	 * If neigh is in NUD_CONNECTED and the header is cached, prepend the cached header
	 * and send directly. This is commonly called the fast path.
	 * Otherwise call the function selected by the current nud_state:
	 * - When neigh enters NUD_CONNECTED, neigh_connect points neigh->output at
	 *   neigh->ops->connected_output, e.g. neigh_connected_output, which fills in the
	 *   L2 header before calling dev_queue_xmit to send the packet.
	 * - When NUD_REACHABLE changes to NUD_STALE, neigh_suspect forces reachability
	 *   confirmation by pointing neigh->output at neigh->ops->output, i.e. neigh_resolve_output.
	 */
	hh = &n->hh;
	if ((n->nud_state & NUD_CONNECTED) && hh->hh_len)
		return neigh_hh_output(hh, skb);
	else
		return n->output(n, skb);
}
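
The branch at the end of dst_neigh_output is the whole fast-path decision. Restated as a stand-alone predicate (a sketch only; the DEMO_ constants copy the NUD values shown earlier, and demo_choose_tx_path is a hypothetical name):

```c
/* Values copied from the NUD_* definitions. */
#define DEMO_NUD_REACHABLE	0x02
#define DEMO_NUD_STALE		0x04
#define DEMO_NUD_NOARP		0x40
#define DEMO_NUD_PERMANENT	0x80
#define DEMO_NUD_CONNECTED	(DEMO_NUD_PERMANENT | DEMO_NUD_NOARP | DEMO_NUD_REACHABLE)

enum demo_tx_path {
	DEMO_TX_HH_CACHE,         /* prepend the cached L2 header */
	DEMO_TX_OUTPUT_CALLBACK   /* go through neigh->output */
};

/* The fast path needs both a connected state and a populated header cache. */
static enum demo_tx_path demo_choose_tx_path(unsigned char nud_state, unsigned int hh_len)
{
	if ((nud_state & DEMO_NUD_CONNECTED) && hh_len)
		return DEMO_TX_HH_CACHE;
	return DEMO_TX_OUTPUT_CALLBACK;
}
```

A REACHABLE entry whose header cache is still empty (hh_len of 0) therefore takes the callback path once, which is where the cache gets built.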
int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb)
{
	int rc = 0;
    /* neigh_event_send checks the neighbor entry state when a data packet is to be sent. As part of the neighbor state machine, it is an event entry point that drives the state machine.
        For the first data packet it triggers NUD_NONE -> NUD_INCOMPLETE, queues the packet, sends an ARP request, and so on.
        See the comments in the function for the detailed process.
        The function returns 1 if the packet cannot be sent directly (dropped or queued) and 0 if it can be sent.
     */ 
	if (!neigh_event_send(neigh, skb)) {
		int err;
		struct net_device *dev = neigh->dev;
		unsigned int seq;
        // Build the layer 2 header cache if the device provides a cache function, e.g. eth_header_cache for Ethernet ports
		if (dev->header_ops->cache && !neigh->hh.hh_len)
			neigh_hh_init(neigh);

		do {
		    // Build and prepend the layer 2 header. The header cache built above is used in dst_neigh_output;
		    // this point is reached only when no cached header could be used
			__skb_pull(skb, skb_network_offset(skb));
			seq = read_seqbegin(&neigh->ha_lock);
			err = dev_hard_header(skb, dev, ntohs(skb->protocol),
					      neigh->ha, NULL, skb->len);
		} while (read_seqretry(&neigh->ha_lock, seq));

		if (err >= 0)
		    // Send data message
			rc = dev_queue_xmit(skb);
		else
			goto out_kfree_skb;
	}
out:
	return rc;
out_kfree_skb:
	rc = -EINVAL;
	kfree_skb(skb);
	goto out;
}
EXPORT_SYMBOL(neigh_resolve_output);


/* As fast as possible without hh cache */

int neigh_connected_output(struct neighbour *neigh, struct sk_buff *skb)
{
	struct net_device *dev = neigh->dev;
	unsigned int seq;
	int err;
    // In the connected state no state check is needed
	do {
	    // Build and prepend the layer 2 header. The cached layer 2 header is used in dst_neigh_output;
		// this point is reached only when no cached header is available
		__skb_pull(skb, skb_network_offset(skb));
		seq = read_seqbegin(&neigh->ha_lock);
		err = dev_hard_header(skb, dev, ntohs(skb->protocol),
				      neigh->ha, NULL, skb->len);
	} while (read_seqretry(&neigh->ha_lock, seq));

	if (err >= 0)
	    // Send data message
		err = dev_queue_xmit(skb);
	else {
		err = -EINVAL;
		kfree_skb(skb);
	}
	return err;
}
EXPORT_SYMBOL(neigh_connected_output);
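
Both output functions read neigh->ha inside a read_seqbegin / read_seqretry loop, so that a hardware address being rewritten concurrently is never used half-updated. The idea can be sketched in user space with C11 atomics. This is not the kernel's seqlock implementation: it assumes a single writer, the demo_ names are hypothetical, and the fence placement follows the commonly described C11 formulation of a sequence lock.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_ALEN 6

/* Sequence counter plus the protected hardware address. */
struct demo_seq_addr {
	atomic_uint           seq;            /* even: stable, odd: write in progress */
	_Atomic unsigned char ha[DEMO_ALEN];
};

static void demo_seq_init(struct demo_seq_addr *a)
{
	atomic_init(&a->seq, 0);
	for (size_t i = 0; i < DEMO_ALEN; i++)
		atomic_init(&a->ha[i], 0);
}

/* Single writer: make the counter odd, update the data, make it even again. */
static void demo_write_addr(struct demo_seq_addr *a, const unsigned char *ha)
{
	unsigned int seq = atomic_load_explicit(&a->seq, memory_order_relaxed);

	atomic_store_explicit(&a->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	for (size_t i = 0; i < DEMO_ALEN; i++)
		atomic_store_explicit(&a->ha[i], ha[i], memory_order_relaxed);
	atomic_store_explicit(&a->seq, seq + 2, memory_order_release);
}

/* Reader: copy the data, then retry if a write overlapped the copy. */
static void demo_read_addr(struct demo_seq_addr *a, unsigned char *out)
{
	unsigned int begin, end;

	do {
		begin = atomic_load_explicit(&a->seq, memory_order_acquire);
		for (size_t i = 0; i < DEMO_ALEN; i++)
			out[i] = atomic_load_explicit(&a->ha[i], memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		end = atomic_load_explicit(&a->seq, memory_order_relaxed);
	} while ((begin & 1) || begin != end);
}
```

Readers never block the writer. They only pay for a retry in the rare case that the address changes while it is being copied, which suits a hot transmit path.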

The neigh_event_send call in neigh_resolve_output above is important. It handles the state machine transitions for several unstable states. In particular, the address resolution triggered by a data packet happens there.
What is a state machine? Its elements are events, states, and actions: in a given state, receiving an event triggers an action and moves the machine to a new state. A state machine is best read from its event entry points. In the neighbor subsystem these include:
neigh_timer_handler: state updates caused by timer expiry
neigh_event_send: state updates caused by data packets to be sent
neigh_update: state updates caused by received protocol packets. Strictly speaking this is inaccurate, because the state is decided in its callers, such as ARP request/reply processing (arp_process) and static ARP entry configuration (neigh_add).

static inline int neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
{
	unsigned long now = jiffies;
	
	if (neigh->used != now)
		neigh->used = now;
	// For a data packet event these neighbor states are stable and nothing needs to be done here, especially for NUD_CONNECTED.
	// NUD_DELAY and NUD_PROBE wait for the address resolution reply or the timer expiry to decide the next state.
	if (!(neigh->nud_state&(NUD_CONNECTED|NUD_DELAY|NUD_PROBE)))
		return __neigh_event_send(neigh, skb);
	return 0;
}

int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
{
	int rc;
	bool immediate_probe = false;

	write_lock_bh(&neigh->lock);

	rc = 0;
	// The neighbor is usable: return 0 so the packet can be sent
	if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE))
		goto out_unlock_bh;
	// The entry is dead and unusable: free the packet
	if (neigh->dead)
		goto out_dead;
    // A simple part of the state machine is handled here; neigh_timer_handler handles the rest
	if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
	    // NUD_NONE (or NUD_FAILED) branch

	    // Corresponds to mcast_solicit and app_solicit under /proc/sys/net/ipv4/neigh/eth1/,
	    // which control how many solicitations are sent. If the sum is nonzero, probing can start; otherwise nud_state = NUD_FAILED and the packet is freed
		if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
		    NEIGH_VAR(neigh->parms, APP_PROBES)) {
			unsigned long next, now = jiffies;

			atomic_set(&neigh->probes,
				   NEIGH_VAR(neigh->parms, UCAST_PROBES));
		    // NUD_NONE -> NUD_INCOMPLETE
			neigh->nud_state     = NUD_INCOMPLETE;
			neigh->updated = now;
			next = now + max(NEIGH_VAR(neigh->parms, RETRANS_TIME),
					 HZ/2);
			neigh_add_timer(neigh, next);
			// For the first packet, send the ARP request immediately
			immediate_probe = true;
		} else {
		    // NUD_NONE -> NUD_FAILED
			neigh->nud_state = NUD_FAILED;
			neigh->updated = jiffies;
			write_unlock_bh(&neigh->lock);

			kfree_skb(skb);
			return 1;
		}
	} else if (neigh->nud_state & NUD_STALE) {
	    // When a packet needs to be sent in NUD_STALE, switch to NUD_DELAY immediately and arm the timer (handler: neigh_timer_handler),
	    // which later calls neigh_probe --> neigh->ops->solicit to build the ARP request
	    // In NUD_CONNECTED | NUD_DELAY | NUD_PROBE | NUD_STALE data packets are sent normally and need not be queued
		neigh_dbg(2, "neigh %p is delayed\n", neigh);
		neigh->nud_state = NUD_DELAY;
		neigh->updated = jiffies;
		neigh_add_timer(neigh, jiffies +
				NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME));
	}
    // In NUD_INCOMPLETE, queue the data packet. The ARP request is sent at this point; wait for the reply or the timer expiry
    // Return 1 so that the caller does nothing more
	if (neigh->nud_state == NUD_INCOMPLETE) {
		if (skb) {
			while (neigh->arp_queue_len_bytes + skb->truesize >
			       NEIGH_VAR(neigh->parms, QUEUE_LEN_BYTES)) {
				struct sk_buff *buff;

				buff = __skb_dequeue(&neigh->arp_queue);
				if (!buff)
					break;
				neigh->arp_queue_len_bytes -= buff->truesize;
				kfree_skb(buff);
				NEIGH_CACHE_STAT_INC(neigh->tbl, unres_discards);
			}
			skb_dst_force(skb);
			__skb_queue_tail(&neigh->arp_queue, skb);
			neigh->arp_queue_len_bytes += skb->truesize;
		}
		rc = 1;
	}
out_unlock_bh:
	if (immediate_probe)
	    // neigh_probe calls neigh->ops->solicit to send the address resolution request.
	    // When immediate_probe is false (e.g. after switching to NUD_DELAY above), neigh_probe
	    // is called later from the timer handler (neigh_timer_handler)
		neigh_probe(neigh);
	else
		write_unlock(&neigh->lock);
	local_bh_enable();
	return rc;

out_dead:
	if (neigh->nud_state & NUD_STALE)
		goto out_unlock_bh;
	write_unlock_bh(&neigh->lock);
	kfree_skb(skb);
	return 1;
}
EXPORT_SYMBOL(__neigh_event_send);
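
The queueing rule in the NUD_INCOMPLETE branch is worth isolating: the queue is limited in bytes, and when a new packet does not fit, the oldest packets are dropped until it does. A user-space sketch of the same policy follows, with a singly linked list in place of sk_buff_head. All demo_ names are hypothetical, and the limit plays the role of QUEUE_LEN_BYTES.

```c
#include <stddef.h>
#include <stdlib.h>

struct demo_pkt {
	struct demo_pkt *next;
	size_t truesize;          /* memory accounted to this packet */
};

struct demo_queue {
	struct demo_pkt *head, *tail;
	size_t len_bytes;         /* bytes currently queued */
	size_t limit_bytes;       /* upper bound for len_bytes */
	unsigned long discards;   /* packets dropped to make room */
};

static struct demo_pkt *demo_pkt_new(size_t truesize)
{
	struct demo_pkt *p = calloc(1, sizeof(*p));

	if (p)
		p->truesize = truesize;
	return p;
}

/* Remove and return the oldest packet, or NULL if the queue is empty. */
static struct demo_pkt *demo_dequeue(struct demo_queue *q)
{
	struct demo_pkt *p = q->head;

	if (!p)
		return NULL;
	q->head = p->next;
	if (!q->head)
		q->tail = NULL;
	p->next = NULL;
	return p;
}

/* Drop oldest packets until pkt fits, then append it. As in the kernel
 * loop, an oversized packet is still queued once the queue is empty. */
static void demo_enqueue_limited(struct demo_queue *q, struct demo_pkt *pkt)
{
	while (q->len_bytes + pkt->truesize > q->limit_bytes) {
		struct demo_pkt *old = demo_dequeue(q);

		if (!old)
			break;
		q->len_bytes -= old->truesize;
		free(old);
		q->discards++;
	}
	pkt->next = NULL;
	if (q->tail)
		q->tail->next = pkt;
	else
		q->head = pkt;
	q->tail = pkt;
	q->len_bytes += pkt->truesize;
}
```

Dropping from the head favors the most recent traffic, so the packets that are flushed once resolution completes are the newest ones.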

Taking ARP as an example, here is how the neighbor state changes after a protocol packet is received.
arp_process handles an ARP packet in the kernel. In brief (sip is the sender IP and tip the target IP of the ARP packet):
1. ARP request received:
   1) If tip is a local address, reply with the MAC of the receiving interface (not of the interface that owns tip) and learn the ARP entry for sip.
   2) If tip is not local (the route to tip is a forwarding route) and the receiving device has forwarding enabled, then with proxy ARP enabled the host acts as an ARP proxy: it replies with its own MAC address to draw the traffic to itself (typically a gateway), and learns the ARP entry for sip.
   3) If tip is not a local IP and the receiving device has no proxy ARP configured, or no route to tip is found locally, only a gratuitous ARP triggers learning the entry for sip. In other cases no entry is created, which avoids filling the table with entries that are never used.
   4) After an ARP request is received, the existing or newly created neighbor entry on this host is set to NUD_STALE.
2. ARP reply received: the entry is set to NUD_REACHABLE.
3. Actions that accompany a neighbor state update:
   1) Replace the neigh->output function: in NUD_CONNECTED it points to ops->connected_output, otherwise to ops->output;
   2) Update the entry's layer 2 address and layer 2 header cache with the new layer 2 address from the ARP packet (whether it is the smac of sip or the dmac of tip);
   3) Reset the state and the timeout timer;
   4) If the state changes from !NUD_VALID to NUD_VALID, the neighbor has gone from unusable to usable, and the packets queued on the entry are sent.
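
The handling of a received ARP request described above reduces to a three-way decision. The sketch below models only that decision. It is an assumption-laden simplification: the arp_ignore and arp_filter checks and the gratuitous ARP case are left out, and the structure, field and function names are hypothetical.

```c
#include <stdbool.h>

enum demo_arp_action {
	DEMO_ARP_REPLY_AND_LEARN,        /* tip is ours: answer and learn sip */
	DEMO_ARP_PROXY_REPLY_AND_LEARN,  /* answer on behalf of tip and learn sip */
	DEMO_ARP_IGNORE                  /* no reply, no new entry */
};

/* Hypothetical summary of what arp_process finds out about a request. */
struct demo_arp_req {
	bool tip_is_local;         /* route lookup for tip returned RTN_LOCAL */
	bool tip_is_unicast_route; /* route lookup for tip returned RTN_UNICAST */
	bool forwarding_enabled;   /* receiving device forwards packets */
	bool proxy_arp_enabled;    /* proxy ARP is configured for the device */
};

static enum demo_arp_action demo_classify_request(const struct demo_arp_req *r)
{
	if (r->tip_is_local)
		return DEMO_ARP_REPLY_AND_LEARN;
	if (r->forwarding_enabled && r->tip_is_unicast_route && r->proxy_arp_enabled)
		return DEMO_ARP_PROXY_REPLY_AND_LEARN;
	return DEMO_ARP_IGNORE;
}
```

In both replying cases the entry learned for sip starts in NUD_STALE, because a request proves only that the sender can transmit, not that it can be reached.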

static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;
	struct in_device *in_dev = __in_dev_get_rcu(dev);
	struct arphdr *arp;
	unsigned char *arp_ptr;
	struct rtable *rt;
	unsigned char *sha;
	__be32 sip, tip;
	u16 dev_type = dev->type;
	int addr_type;
	struct neighbour *n;
	struct dst_entry *reply_dst = NULL;
	bool is_garp = false;

	/* arp_rcv below verifies the ARP header and verifies the device
	 * is ARP'able.
	 */

	if (!in_dev)
		goto out_free_skb;

	arp = arp_hdr(skb);
	// Validate the ARP packet
	switch (dev_type) {
	default:
		if (arp->ar_pro != htons(ETH_P_IP) ||
		    htons(dev_type) != arp->ar_hrd)
			goto out_free_skb;
		break;
	case ARPHRD_ETHER:
	case ARPHRD_FDDI:
	case ARPHRD_IEEE802:
		/*
		 * ETHERNET, and Fibre Channel (which are IEEE 802
		 * devices, according to RFC 2625) devices will accept ARP
		 * hardware types of either 1 (Ethernet) or 6 (IEEE 802.2).
		 * This is the case also of FDDI, where the RFC 1390 says that
		 * FDDI devices should accept ARP hardware of (1) Ethernet,
		 * however, to be more robust, we'll accept both 1 (Ethernet)
		 * or 6 (IEEE 802.2)
		 */
		if ((arp->ar_hrd != htons(ARPHRD_ETHER) &&
		     arp->ar_hrd != htons(ARPHRD_IEEE802)) ||
		    arp->ar_pro != htons(ETH_P_IP))
			goto out_free_skb;
		break;
	case ARPHRD_AX25:
		if (arp->ar_pro != htons(AX25_P_IP) ||
		    arp->ar_hrd != htons(ARPHRD_AX25))
			goto out_free_skb;
		break;
	case ARPHRD_NETROM:
		if (arp->ar_pro != htons(AX25_P_IP) ||
		    arp->ar_hrd != htons(ARPHRD_NETROM))
			goto out_free_skb;
		break;
	}

	/* Understand only these message types */

	if (arp->ar_op != htons(ARPOP_REPLY) &&
	    arp->ar_op != htons(ARPOP_REQUEST))
		goto out_free_skb;

/*
 *	Extract fields
 */
	// Extract the ARP header fields
	arp_ptr = (unsigned char *)(arp + 1);
	sha	= arp_ptr;
	arp_ptr += dev->addr_len;
	memcpy(&sip, arp_ptr, 4);
	arp_ptr += 4;
	switch (dev_type) {
#if IS_ENABLED(CONFIG_FIREWIRE_NET)
	case ARPHRD_IEEE1394:
		break;
#endif
	default:
		arp_ptr += dev->addr_len;
	}
	memcpy(&tip, arp_ptr, 4);
/*
 *	Check for bad requests for 127.x.x.x and requests for multicast
 *	addresses.  If this is one such, delete it.
 */
	if (ipv4_is_multicast(tip) ||
	    (!IN_DEV_ROUTE_LOCALNET(in_dev) && ipv4_is_loopback(tip)))
		goto out_free_skb;

 /*
  *	For some 802.11 wireless deployments (and possibly other networks),
  *	there will be an ARP proxy and gratuitous ARP frames are attacks
  *	and thus should not be accepted.
  */
	if (sip == tip && IN_DEV_ORCONF(in_dev, DROP_GRATUITOUS_ARP))
		goto out_free_skb;

/*
 *     Special case: We must set Frame Relay source Q.922 address
 */
	if (dev_type == ARPHRD_DLCI)
		sha = dev->broadcast;


	if (arp->ar_op == htons(ARPOP_REQUEST) && skb_metadata_dst(skb))
		reply_dst = (struct dst_entry *)
			    iptunnel_metadata_reply(skb_metadata_dst(skb),
						    GFP_ATOMIC);

	/* Special case: IPv4 duplicate address detection packet (RFC2131) */
	// sip == 0: an ARP probe used for duplicate address detection of DHCP-assigned addresses
	if (sip == 0) {
		if (arp->ar_op == htons(ARPOP_REQUEST) &&
		    inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL &&
		    !arp_ignore(in_dev, sip, tip))
			arp_send_dst(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip,
				     sha, dev->dev_addr, sha, reply_dst);
		goto out_consume_skb;
	}
	// For an ARP request, a route to tip must be found. Normally tip is a local IP address
	if (arp->ar_op == htons(ARPOP_REQUEST) &&
	    ip_route_input_noref(skb, tip, sip, 0, dev) == 0) {

		rt = skb_rtable(skb);
		addr_type = rt->rt_type;
		// A local route means the sender is asking for the layer-2 address of one of our own IP addresses
		if (addr_type == RTN_LOCAL) {
			int dont_send;
			// Two ARP filters, each controlled by its own sysctl (arp_ignore and arp_filter)
			dont_send = arp_ignore(in_dev, sip, tip);
			if (!dont_send && IN_DEV_ARPFILTER(in_dev))
				dont_send = arp_filter(sip, tip, dev);
			if (!dont_send) {
				// neigh_event_ns() learns the sender: it creates or updates the neighbor entry for sip and sets it to NUD_STALE
				n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
				if (n) {
					// Send the ARP reply. Whichever device actually owns tip, the reply carries the MAC address of the receiving device (dev->dev_addr)
					arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
						     sip, dev, tip, sha,
						     dev->dev_addr, sha,
						     reply_dst);
					// Drop one reference (neigh->refcnt--). A newly created entry still holds at least one, because neigh_alloc and __neigh_create each take a reference
					neigh_release(n);
				}
			}
			goto out_consume_skb;
		} else if (IN_DEV_FORWARD(in_dev)) {
			/* tip is not a local address: its route is a forwarding (unicast) route and the
			   receiving device has forwarding enabled. If proxy ARP is enabled, act as an ARP proxy,
			   that is, answer with our own MAC address so that the traffic is drawn to this device
			   (usually a gateway).
			   net.ipv4.conf.xx.proxy_arp: enable proxy ARP
			   net.ipv4.conf.xx.proxy_arp_pvlan: the proxy ARP reply may be sent back out of the same
			   interface that received the request
			*/
			if (addr_type == RTN_UNICAST  &&
			    (arp_fwd_proxy(in_dev, dev, rt) ||
			     arp_fwd_pvlan(in_dev, dev, rt, sip, tip) ||
			     (rt->dst.dev != dev &&
			      pneigh_lookup(&arp_tbl, net, &tip, dev, 0)))) {
				// As above, learn the sender: create or update the neighbor entry for sip
				n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
				if (n)
					neigh_release(n);

				if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
				    skb->pkt_type == PACKET_HOST ||
				    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) {
					// Send the proxy ARP reply
					arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
						     sip, dev, tip, sha,
						     dev->dev_addr, sha,
						     reply_dst);
				} else {
					// Delayed proxy ARP: queue the skb on proxy_queue; it is processed later when the proxy timer fires
					pneigh_enqueue(&arp_tbl,
						       in_dev->arp_parms, skb);
					goto out_free_dst;
				}
				goto out_consume_skb;
			}
		}
	}

	/* Update our ARP tables */
	// Two kinds of packets reach this point:
	// 1. ARP replies, which update the neighbor state.
	// 2. ARP requests that need no reply, e.g. no route to tip was found, or tip is not local and proxy ARP is off. The sender's neighbor state can still be updated.
	n = __neigh_lookup(&arp_tbl, &sip, dev, 0);

	if (IN_DEV_ARP_ACCEPT(in_dev)) {
		unsigned int addr_type = inet_addr_type_dev_table(net, dev, sip);

		/* Unsolicited ARP is not accepted by default.
		   It is possible, that this option should be enabled for some
		   devices (strip is candidate)
		 */
		is_garp = arp->ar_op == htons(ARPOP_REQUEST) && tip == sip &&
			  addr_type == RTN_UNICAST;
		// If no neighbor entry exists yet, an ARP reply triggers the creation of one.
		// An ARP request that gets here either has no route to tip or has a non-local tip; only a gratuitous ARP request creates an entry.
		// All other requests are ignored, otherwise a large number of entries could be created that are never used.
		if (!n &&
		    ((arp->ar_op == htons(ARPOP_REPLY)  &&
				addr_type == RTN_UNICAST) || is_garp))
			n = __neigh_lookup(&arp_tbl, &sip, dev, 1);
	}

	if (n) {
		int state = NUD_REACHABLE;
		int override;

		/* If several different ARP replies follows back-to-back,
		   use the FIRST one. It is possible, if several proxy
		   agents are active. Taking the first reply prevents
		   arp trashing and chooses the fastest router.
		 */
		override = time_after(jiffies,
				      n->updated +
				      NEIGH_VAR(n->parms, LOCKTIME)) ||
			   is_garp;

		/* Broadcast replies and request packets
		   do not assert neighbour reachability.
		 */
		// A unicast reply moves the neighbor to NUD_REACHABLE; requests and broadcast replies only move it to NUD_STALE
		if (arp->ar_op != htons(ARPOP_REPLY) ||
		    skb->pkt_type != PACKET_HOST)
			state = NUD_STALE;
		neigh_update(n, sha, state,
			     override ? NEIGH_UPDATE_F_OVERRIDE : 0);
		neigh_release(n);
	}

out_consume_skb:
	consume_skb(skb);

out_free_dst:
	dst_release(reply_dst);
	return NET_RX_SUCCESS;

out_free_skb:
	kfree_skb(skb);
	return NET_RX_DROP;
}
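
The tail of arp_process() comes down to three small decisions: how the payload is walked to get sha/sip/tip, whether an existing address may be overridden (lock time elapsed, or gratuitous ARP), and which state is handed to neigh_update(). The user-space sketch below restates them in isolation. It is not kernel code: struct arp_fields, arp_extract(), tick_after(), arp_may_override() and arp_new_state() are names invented for this illustration, it assumes an Ethernet/IPv4 payload with 6-byte hardware addresses, and it takes the opcode in host byte order, whereas the kernel compares against htons(...).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN      6
#define ARPOP_REQUEST 1
#define ARPOP_REPLY   2
#define NUD_REACHABLE 0x02
#define NUD_STALE     0x04

/* Hypothetical holder for the variable part of an Ethernet/IPv4 ARP packet. */
struct arp_fields {
	uint8_t  sha[ETH_ALEN];	/* sender hardware address */
	uint32_t sip;		/* sender IP, network byte order */
	uint8_t  tha[ETH_ALEN];	/* target hardware address */
	uint32_t tip;		/* target IP, network byte order */
};

/* Walk the bytes after the fixed ARP header in the order sha, sip, tha, tip. */
static void arp_extract(const uint8_t *payload, size_t addr_len,
			struct arp_fields *f)
{
	const uint8_t *p = payload;

	assert(addr_len == ETH_ALEN);	/* the sketch only handles Ethernet */
	memcpy(f->sha, p, addr_len);
	p += addr_len;
	memcpy(&f->sip, p, 4);
	p += 4;
	memcpy(f->tha, p, addr_len);
	p += addr_len;
	memcpy(&f->tip, p, 4);
}

/* Wrap-safe "a is later than b" on a free-running tick counter. */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* An existing address may be replaced once the lock time has passed,
 * or immediately when the packet is a gratuitous ARP. */
static bool arp_may_override(unsigned long now, unsigned long updated,
			     unsigned long locktime, bool is_garp)
{
	return tick_after(now, updated + locktime) || is_garp;
}

/* Only a reply addressed to this host asserts reachability. */
static int arp_new_state(int op, bool unicast_to_us)
{
	if (op == ARPOP_REPLY && unicast_to_us)
		return NUD_REACHABLE;
	return NUD_STALE;
}
```

Under these assumptions, a broadcast request from an already known neighbor yields NUD_STALE, and a second reply that arrives inside the lock time window does not get the override flag, which is how the "use the FIRST reply" comment in the kernel code is enforced.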

/*
    neigh_update() does the following:
    - updates the neighbor state and resets the timer,
    - updates the L2 address and the cached L2 header,
    - switches the neighbor's output function,
    - sends the queued packets once the neighbor becomes usable
*/
int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
		 u32 flags)
{
	u8 old;
	int err;
	int notify = 0;
	struct net_device *dev;
	int update_isrouter = 0;

	write_lock_bh(&neigh->lock);

	dev    = neigh->dev;
	old    = neigh->nud_state;
	err    = -EPERM;

	if (!(flags & NEIGH_UPDATE_F_ADMIN) &&
	    (old & (NUD_NOARP | NUD_PERMANENT)))
		goto out;
	if (neigh->dead)
		goto out;
	// The new state is not a valid one (e.g. NUD_FAILED): release resources (timer, queued packets)
	if (!(new & NUD_VALID)) {
		neigh_del_timer(neigh);
		if (old & NUD_CONNECTED)
			neigh_suspect(neigh);
		neigh->nud_state = new;
		err = 0;
		notify = old & NUD_VALID;
		if ((old & (NUD_INCOMPLETE | NUD_PROBE)) &&
		    (new & NUD_FAILED)) {
			// In NUD_INCOMPLETE or NUD_PROBE, packets may be waiting on arp_queue; they are dropped here
			neigh_invalidate(neigh);
			notify = 1;
		}
		goto out;
	}

	/* Compare new lladdr with cached one */
	// Determine the new layer-2 address
	if (!dev->addr_len) {
		/* First case: device needs no address. */
		lladdr = neigh->ha;
	} else if (lladdr) {
		/* The second case: if something is already cached
		   and a new address is proposed:
		   - compare new & old
		   - if they are different, check override flag
		 */
		if ((old & NUD_VALID) &&
		    !memcmp(lladdr, neigh->ha, dev->addr_len))
			lladdr = neigh->ha;
	} else {
		/* No address is supplied; if we know something,
		   use it, otherwise discard the request.
		 */
		err = -EINVAL;
		if (!(old & NUD_VALID))
			goto out;
		lladdr = neigh->ha;
	}

	if (new & NUD_CONNECTED)
		neigh->confirmed = jiffies;
	neigh->updated = jiffies;

	/* If entry was valid and address is not changed,
	   do not change entry state, if new one is STALE.
	 */
	err = 0;
	update_isrouter = flags & NEIGH_UPDATE_F_OVERRIDE_ISROUTER;
	if (old & NUD_VALID) {
		if (lladdr != neigh->ha && !(flags & NEIGH_UPDATE_F_OVERRIDE)) {
			update_isrouter = 0;
			if ((flags & NEIGH_UPDATE_F_WEAK_OVERRIDE) &&
			    (old & NUD_CONNECTED)) {
				lladdr = neigh->ha;
				new = NUD_STALE;
			} else
				goto out;
		} else {
			if (lladdr == neigh->ha && new == NUD_STALE &&
			    !(flags & NEIGH_UPDATE_F_ADMIN))
				new = old;
		}
	}
	// Reset the timer and change the neighbor state
	if (new != old) {
		neigh_del_timer(neigh);
		if (new & NUD_PROBE)
			atomic_set(&neigh->probes, 0);
		if (new & NUD_IN_TIMER)
			neigh_add_timer(neigh, (jiffies +
						((new & NUD_REACHABLE) ?
						 neigh->parms->reachable_time :
						 0)));
		neigh->nud_state = new;
		notify = 1;
	}
    // Update L2 address and L2 header cache
	if (lladdr != neigh->ha) {
		write_seqlock(&neigh->ha_lock);
		memcpy(&neigh->ha, lladdr, dev->addr_len);
		write_sequnlock(&neigh->ha_lock);
		neigh_update_hhs(neigh);
		if (!(new & NUD_CONNECTED))
			neigh->confirmed = jiffies -
				      (NEIGH_VAR(neigh->parms, BASE_REACHABLE_TIME) << 1);
		notify = 1;
	}
	if (new == old)
		goto out;
	// Switch neigh->output: NUD_CONNECTED states use ops->connected_output, all other states use ops->output
	if (new & NUD_CONNECTED)
		neigh_connect(neigh);
	else
		neigh_suspect(neigh);
	if (!(old & NUD_VALID)) {
		// At this point new is NUD_VALID. If old was not NUD_VALID, the neighbor has just gone from unusable to usable, so the queued packets can be sent
		struct sk_buff *skb;

		/* Again: avoid dead loop if something went wrong */

		while (neigh->nud_state & NUD_VALID &&
		       (skb = __skb_dequeue(&neigh->arp_queue)) != NULL) {
			struct dst_entry *dst = skb_dst(skb);
			struct neighbour *n2, *n1 = neigh;
			write_unlock_bh(&neigh->lock);

			rcu_read_lock();

			/* Why not just use 'neigh' as-is?  The problem is that
			 * things such as shaper, eql, and sch_teql can end up
			 * using alternative, different, neigh objects to output
			 * the packet in the output path.  So what we need to do
			 * here is re-lookup the top-level neigh in the path so
			 * we can reinject the packet there.
			 */
			n2 = NULL;
			if (dst) {
				n2 = dst_neigh_lookup_skb(dst, skb);
				if (n2)
					n1 = n2;
			}
			n1->output(n1, skb);
			if (n2)
				neigh_release(n2);
			rcu_read_unlock();

			write_lock_bh(&neigh->lock);
		}
		__skb_queue_purge(&neigh->arp_queue);
		neigh->arp_queue_len_bytes = 0;
	}
out:
	if (update_isrouter) {
		neigh->flags = (flags & NEIGH_UPDATE_F_ISROUTER) ?
			(neigh->flags | NTF_ROUTER) :
			(neigh->flags & ~NTF_ROUTER);
	}
	write_unlock_bh(&neigh->lock);

	if (notify)
		neigh_update_notify(neigh);

	return err;
}
EXPORT_SYMBOL(neigh_update);
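
The address and state checks in the middle of neigh_update() are easy to misread, so here is a minimal sketch of just that decision as a pure function. resolve_state() and the F_* flag names and values are made up for this illustration and stand in for the NEIGH_UPDATE_F_* flags. The NUD_VALID and NUD_CONNECTED masks are written out as I understand the kernel's definitions in include/net/neighbour.h; treat them as an assumption, since the listing above does not show them. The sketch only models the case where the new state is valid and a link-layer address was supplied.

```c
#include <assert.h>
#include <stdbool.h>

#define NUD_INCOMPLETE	0x01
#define NUD_REACHABLE	0x02
#define NUD_STALE	0x04
#define NUD_DELAY	0x08
#define NUD_PROBE	0x10
#define NUD_NOARP	0x40
#define NUD_PERMANENT	0x80

/* Assumed to mirror the kernel's masks (not shown in the article). */
#define NUD_CONNECTED	(NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)
#define NUD_VALID	(NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
			 NUD_PROBE | NUD_STALE | NUD_DELAY)

/* Illustrative flag values, not necessarily the kernel's. */
#define F_OVERRIDE	0x1u
#define F_WEAK_OVERRIDE	0x2u
#define F_ADMIN		0x4u

/*
 * Returns the state the entry ends up in, or -1 when the update is
 * ignored (the "goto out" path that leaves state and address alone).
 */
static int resolve_state(int old_state, int new_state, bool addr_changed,
			 unsigned int flags)
{
	/* The !(new & NUD_VALID) case is handled earlier in neigh_update(). */
	assert(new_state & NUD_VALID);

	if (!(old_state & NUD_VALID))
		return new_state;	/* nothing usable cached: accept */

	if (addr_changed && !(flags & F_OVERRIDE)) {
		/* A different address without permission to override. */
		if ((flags & F_WEAK_OVERRIDE) && (old_state & NUD_CONNECTED))
			return NUD_STALE;	/* keep old address, demote */
		return -1;
	}

	/* Same address and "only" STALE requested: keep the old state. */
	if (!addr_changed && new_state == NUD_STALE && !(flags & F_ADMIN))
		return old_state;

	return new_state;
}
```

Read against arp_process(): an ARP request from a neighbor that is already NUD_REACHABLE with an unchanged MAC address asks for NUD_STALE and leaves the entry REACHABLE, while a reply carrying a different MAC address inside the lock time window (no override flag) is ignored.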


Topics: Linux network Network Protocol