Socket kernel data structure

Posted by drorgo on Wed, 03 Nov 2021 22:43:41 +0100

In the previous section, we walked through the sequence of Socket calls in the TCP and UDP scenarios. In this section, we follow that sequence into the kernel to see which data structures are created and what the kernel does at each step.

Parsing socket function

Let's start with the Socket system call.

SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
{
	int retval;
	struct socket *sock;
	int flags;
......
	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
 
	retval = sock_create(family, type, protocol, &sock);
......
	retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
......
	return retval;
}

The code here is easy to understand. The socket system call calls sock_create to create a struct socket, and then calls sock_map_fd to associate it with a file descriptor.
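The flag handling in the excerpt above can be observed from user space. The following is a minimal sketch, assuming a Linux system where SOCK_NONBLOCK is available; the helper name nonblocking_socket_has_flag is made up for this illustration. It creates a TCP socket with SOCK_NONBLOCK OR-ed into the type and checks that the resulting file descriptor carries O_NONBLOCK.

```c
#include <fcntl.h>
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <sys/socket.h>   /* socket, AF_INET, SOCK_STREAM, SOCK_NONBLOCK */
#include <unistd.h>       /* close */

/*
 * Hypothetical helper: returns 1 if a socket created with SOCK_NONBLOCK
 * reports O_NONBLOCK in its file status flags, 0 if it does not,
 * and -1 if the socket could not be created or queried.
 */
int nonblocking_socket_has_flag(void)
{
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
    if (fd < 0)
        return -1;

    int fl = fcntl(fd, F_GETFL);   /* flags of the file behind the fd */
    close(fd);
    if (fl < 0)
        return -1;

    return (fl & O_NONBLOCK) ? 1 : 0;
}
```

On a system where socket creation is permitted, the function is expected to return 1, which matches the way the system call passes O_NONBLOCK on to sock_map_fd.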

When creating a Socket, there are three parameters.

The first is family, which represents the address family. Not all sockets communicate over IP; there are other communication methods. For example, in the following definitions, Unix domain sockets communicate through local files and do not need an IP address. IP is by far the most commonly used mode, though, so we focus on it here.

#define AF_UNIX	1	/* Unix domain sockets */
#define AF_INET	2	/* Internet IP Protocol */

The second parameter is type, that is, the type of the Socket. There are relatively few types.

The third parameter is protocol. There are many more protocols than types, so several protocols can belong to the same type.

There are three commonly used Socket types: SOCK_STREAM, SOCK_DGRAM and SOCK_RAW.

enum sock_type {
	SOCK_STREAM = 1,
	SOCK_DGRAM = 2,
	SOCK_RAW = 3,
......
};

SOCK_STREAM is stream-oriented; IPPROTO_TCP is of this type. SOCK_DGRAM is datagram-oriented; IPPROTO_UDP is of this type. If you look inside the kernel, IPPROTO_ICMP is also registered under this type. SOCK_RAW gives access to raw IP packets; IPPROTO_IP is of this type.

In this section, we focus on SOCK_STREAM type and IPPROTO_TCP protocol.

In order to manage the three classification levels of family, type and protocol, the kernel will create corresponding data structures.

Next, let's open the sock_create function. It calls __sock_create.

int __sock_create(struct net *net, int family, int type, int protocol,
			 struct socket **res, int kern)
{
	int err;
	struct socket *sock;
	const struct net_proto_family *pf;
......
	sock = sock_alloc();
......
	sock->type = type;
......
	pf = rcu_dereference(net_families[family]);
......
	err = pf->create(net, sock, protocol, kern);
......
	*res = sock;
 
	return 0;
}

First, a struct socket is allocated. Next, the family parameter comes into play. There is a net_families array, and using the family parameter as the index we can find the corresponding struct net_proto_family.

/* Supported address families. */
#define AF_UNSPEC	0
#define AF_UNIX		1	/* Unix domain sockets 		*/
#define AF_LOCAL	1	/* POSIX name for AF_UNIX	*/
#define AF_INET		2	/* Internet IP Protocol 	*/
......
#define AF_INET6	10	/* IP version 6			*/
......
#define AF_MPLS		28	/* MPLS */
......
#define AF_MAX		44	/* For now.. */
#define NPROTO		AF_MAX
 
struct net_proto_family __rcu *net_families[NPROTO] __read_mostly;

Here we can see the definition of net_families. Each address family has an entry in this array, and each entry is a net_proto_family. Every address family has its own net_proto_family; the one for the IP address family is defined as follows. The most important point is that its create function pointer points to inet_create.

//net/ipv4/af_inet.c
static const struct net_proto_family inet_family_ops = {
	.family = PF_INET,
	.create = inet_create,// This is used for socket system call creation
......
};
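The lookup by address family can be modelled in user space as a table of function pointers indexed by family. This is only a sketch of the idea: every toy_ name below is invented for the illustration, the table has no RCU protection, and the only behaviour borrowed from the kernel is "index by family, then call create".

```c
#include <errno.h>
#include <stddef.h>

#define TOY_AF_UNIX 1
#define TOY_AF_INET 2
#define TOY_NPROTO  4            /* toy table size, not the kernel's AF_MAX */

struct toy_socket {
    int type;
    int family;
};

/* Toy counterpart of struct net_proto_family. */
struct toy_proto_family {
    int family;
    int (*create)(struct toy_socket *sock, int protocol);
};

static int toy_inet_create(struct toy_socket *sock, int protocol)
{
    (void)protocol;              /* the toy ignores the protocol */
    sock->family = TOY_AF_INET;
    return 0;
}

static const struct toy_proto_family toy_inet_family_ops = {
    .family = TOY_AF_INET,
    .create = toy_inet_create,
};

/* One slot per address family; only the INET slot is filled in. */
static const struct toy_proto_family *toy_families[TOY_NPROTO] = {
    [TOY_AF_INET] = &toy_inet_family_ops,
};

/* Toy counterpart of __sock_create: pick the family, delegate to create. */
int toy_sock_create(int family, int type, int protocol, struct toy_socket *sock)
{
    const struct toy_proto_family *pf;

    if (family < 0 || family >= TOY_NPROTO)
        return -EAFNOSUPPORT;
    pf = toy_families[family];
    if (pf == NULL)
        return -EAFNOSUPPORT;    /* nothing registered for this family */

    sock->type = type;
    return pf->create(sock, protocol);
}
```

Calling toy_sock_create with TOY_AF_INET reaches toy_inet_create through the table, while TOY_AF_UNIX fails because no entry was registered for it in this toy.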

Let's go back to __sock_create. The next thing it does is call this inet_create.

static int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
{
	struct sock *sk;
	struct inet_protosw *answer;
	struct inet_sock *inet;
	struct proto *answer_prot;
	unsigned char answer_flags;
	int try_loading_module = 0;
	int err;
 
	/* Look for the requested type/protocol pair. */
lookup_protocol:
	list_for_each_entry_rcu(answer, &inetsw[sock->type], list) {
		err = 0;
		/* Check the non-wild match. */
		if (protocol == answer->protocol) {
			if (protocol != IPPROTO_IP)
				break;
		} else {
			/* Check for the two wild cases. */
			if (IPPROTO_IP == protocol) {
				protocol = answer->protocol;
				break;
			}
			if (IPPROTO_IP == answer->protocol)
				break;
		}
		err = -EPROTONOSUPPORT;
	}
......
	sock->ops = answer->ops;
	answer_prot = answer->prot;
	answer_flags = answer->flags;
......
	sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot, kern);
......
	inet = inet_sk(sk);
	inet->nodefrag = 0;
	if (SOCK_RAW == sock->type) {
		inet->inet_num = protocol;
		if (IPPROTO_RAW == protocol)
			inet->hdrincl = 1;
	}
	inet->inet_id = 0;
	sock_init_data(sock, sk);
 
	sk->sk_destruct	   = inet_sock_destruct;
	sk->sk_protocol	   = protocol;
	sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv;
 
	inet->uc_ttl	= -1;
	inet->mc_loop	= 1;
	inet->mc_ttl	= 1;
	inet->mc_all	= 1;
	inet->mc_index	= 0;
	inet->mc_list	= NULL;
	inet->rcv_tos	= 0;
 
	if (inet->inet_num) {
		inet->inet_sport = htons(inet->inet_num);
		/* Add to protocol hash chains. */
		err = sk->sk_prot->hash(sk);
	}
 
	if (sk->sk_prot->init) {
		err = sk->sk_prot->init(sk);
	}
......
}

In inet_create, the first thing we see is a list_for_each_entry_rcu loop. This is where the second parameter, type, comes into play, because the loop walks inetsw[sock->type].

inetsw is also an array, indexed by type. Each entry holds struct inet_protosw items, one per protocol; in other words, the inetsw array has one entry for each type, containing the protocols of that type.

static struct list_head inetsw[SOCK_MAX];
 
static int __init inet_init(void)
{
......
	/* Register the socket-side information for inet_create. */
	for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
		INIT_LIST_HEAD(r);
	for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
		inet_register_protosw(q);
......
}

The inetsw array is initialized during system initialization, as the inet_init code above shows.

The first loop initializes each entry of the inetsw array as a linked list. As we said earlier, one type can contain multiple protocols, so a linked list is needed. The second loop registers each element of inetsw_array into the inetsw array. inetsw_array is defined as follows. The contents of this array are very important and will be used later.

static struct inet_protosw inetsw_array[] =
{
	{
		.type =       SOCK_STREAM,
		.protocol =   IPPROTO_TCP,
		.prot =       &tcp_prot,
		.ops =        &inet_stream_ops,
		.flags =      INET_PROTOSW_PERMANENT |
			      INET_PROTOSW_ICSK,
	},
	{
		.type =       SOCK_DGRAM,
		.protocol =   IPPROTO_UDP,
		.prot =       &udp_prot,
		.ops =        &inet_dgram_ops,
		.flags =      INET_PROTOSW_PERMANENT,
	},
	{
		.type =       SOCK_DGRAM,
		.protocol =   IPPROTO_ICMP,
		.prot =       &ping_prot,
		.ops =        &inet_sockraw_ops,
		.flags =      INET_PROTOSW_REUSE,
	},
	{
		.type =       SOCK_RAW,
		.protocol =   IPPROTO_IP,	/* wild card */
		.prot =       &raw_prot,
		.ops =        &inet_sockraw_ops,
		.flags =      INET_PROTOSW_REUSE,
	}
};

Let's go back to the list_for_each_entry_rcu loop in inet_create. Now it is easy to understand: it uses type to find the list for that type in the inetsw array, then compares the protocol of each struct inet_protosw in the list against the protocol the user specified. On a match, we get a struct inet_protosw *answer object that fits the family -> type -> protocol the user asked for.
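The registration and lookup steps can be tried out in a small user-space model. This is a sketch under stated assumptions: the toy_ names and the prot_name string are invented, plain singly linked lists replace the kernel's RCU lists, entries are appended in array order, and TOY_SOCK_MAX is an arbitrary bound. The matching rules inside the loop follow the inet_create excerpt, including the two wildcard cases.

```c
#include <netinet/in.h>   /* IPPROTO_IP, IPPROTO_TCP, IPPROTO_UDP, IPPROTO_ICMP */
#include <stddef.h>
#include <sys/socket.h>   /* SOCK_STREAM, SOCK_DGRAM, SOCK_RAW */

#define TOY_SOCK_MAX 11   /* assumed bound for the toy type table */

struct toy_protosw {
    int type;
    int protocol;
    const char *prot_name;        /* stands in for .prot and .ops */
    struct toy_protosw *next;
};

static struct toy_protosw toy_inetsw_array[] = {
    { SOCK_STREAM, IPPROTO_TCP,  "tcp_prot",  NULL },
    { SOCK_DGRAM,  IPPROTO_UDP,  "udp_prot",  NULL },
    { SOCK_DGRAM,  IPPROTO_ICMP, "ping_prot", NULL },
    { SOCK_RAW,    IPPROTO_IP,   "raw_prot",  NULL },   /* wild card */
};

static struct toy_protosw *toy_inetsw[TOY_SOCK_MAX];   /* one list per type */

/* Empty every per-type list, then append each array entry to its list. */
void toy_inet_init(void)
{
    size_t n = sizeof(toy_inetsw_array) / sizeof(toy_inetsw_array[0]);

    for (size_t i = 0; i < TOY_SOCK_MAX; i++)
        toy_inetsw[i] = NULL;
    for (size_t i = 0; i < n; i++) {
        struct toy_protosw *p = &toy_inetsw_array[i];
        struct toy_protosw **tail = &toy_inetsw[p->type];

        p->next = NULL;
        while (*tail != NULL)
            tail = &(*tail)->next;
        *tail = p;
    }
}

/*
 * Find the entry for (type, *protocol). A protocol of IPPROTO_IP (0) is
 * resolved to the first protocol registered for the type. Returns NULL
 * when nothing matches.
 */
const struct toy_protosw *toy_lookup(int type, int *protocol)
{
    if (type < 0 || type >= TOY_SOCK_MAX)
        return NULL;

    for (const struct toy_protosw *answer = toy_inetsw[type];
         answer != NULL; answer = answer->next) {
        if (*protocol == answer->protocol) {
            if (*protocol != IPPROTO_IP)
                return answer;               /* exact, non-wild match */
        } else {
            if (*protocol == IPPROTO_IP) {   /* caller passed the wildcard */
                *protocol = answer->protocol;
                return answer;
            }
            if (answer->protocol == IPPROTO_IP)
                return answer;               /* entry is the wildcard */
        }
    }
    return NULL;
}
```

With this model, asking for SOCK_STREAM with protocol 0 resolves to IPPROTO_TCP, and asking for SOCK_STREAM with IPPROTO_UDP finds nothing, which is the situation where the kernel loop ends with -EPROTONOSUPPORT.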

Next, the ops member of struct socket *sock is set to the ops of answer. For TCP, this is inet_stream_ops. From then on, every user operation on this socket goes through inet_stream_ops.

Next, we create a struct sock *sk object. This is a confusing point: socket and sock look almost the same and are easy to mix up. To clarify, struct socket provides the interface to the user and is tied to the file system, while struct sock connects downward to the kernel network protocol stack.

In the sk_alloc function, the tcp_prot from the struct inet_protosw *answer structure is assigned to the sk_prot member of struct sock *sk. tcp_prot is defined as follows. It defines many functions, which are the actions the kernel protocol stack performs beneath the sock.

struct proto tcp_prot = {
	.name			= "TCP",
	.owner			= THIS_MODULE,
	.close			= tcp_close,
	.connect		= tcp_v4_connect,
	.disconnect		= tcp_disconnect,
	.accept			= inet_csk_accept,
	.ioctl			= tcp_ioctl,
	.init			= tcp_v4_init_sock,
	.destroy		= tcp_v4_destroy_sock,
	.shutdown		= tcp_shutdown,
	.setsockopt		= tcp_setsockopt,
	.getsockopt		= tcp_getsockopt,
	.keepalive		= tcp_set_keepalive,
	.recvmsg		= tcp_recvmsg,
	.sendmsg		= tcp_sendmsg,
	.sendpage		= tcp_sendpage,
	.backlog_rcv		= tcp_v4_do_rcv,
	.release_cb		= tcp_release_cb,
	.hash			= inet_hash,
	.get_port		= inet_csk_get_port,
......
};

Next, inet_create obtains a struct inet_sock through inet_sk(sk). This structure begins with a struct sock and then extends it with additional fields; the rest of the code fills in these fields. We will see this pattern often: one structure is placed at the beginning of another, which adds more members, and those members are accessed by casting the pointer.
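The embedding pattern described above can be shown with two small structures. This is a sketch with invented demo_ names and fields, not the kernel definitions; it relies on the C rule that a pointer to a structure and a pointer to its first member can be converted into each other.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic part, playing the role of struct sock. */
struct demo_sock {
    int sk_state;
};

/* Extended part, playing the role of struct inet_sock. */
struct demo_inet_sock {
    struct demo_sock sk;      /* must stay the first member */
    uint16_t inet_sport;
    uint32_t inet_saddr;
};

/* The cast is only valid while the base struct sits at offset 0. */
_Static_assert(offsetof(struct demo_inet_sock, sk) == 0,
               "demo_sock must be the first member");

/* Counterpart of inet_sk(): reinterpret the base pointer as the extension. */
static inline struct demo_inet_sock *demo_inet_sk(struct demo_sock *sk)
{
    return (struct demo_inet_sock *)sk;
}

/* Generic code holds a demo_sock pointer; protocol code recovers the rest. */
uint16_t demo_local_port(struct demo_sock *sk)
{
    return demo_inet_sk(sk)->inet_sport;
}
```

The cast is only safe when the object really was allocated as the larger structure. In the kernel that is guaranteed by allocating according to the size recorded in the protocol; in this sketch it is simply the caller's responsibility.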

This completes the creation of socket.

Parse bind function

Next, let's look at bind.

SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
{
	struct socket *sock;
	struct sockaddr_storage address;
	int err, fput_needed;
 
	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (sock) {
		err = move_addr_to_kernel(umyaddr, addrlen, &address);
		if (err >= 0) {
			err = sock->ops->bind(sock,
						      (struct sockaddr *)
						      &address, addrlen);
		}
		fput_light(sock->file, fput_needed);
	}
	return err;
}

In bind, sockfd_lookup_light finds the struct socket from the fd file descriptor. The sockaddr is then copied from user space to kernel space, and the bind function of ops in the struct socket is called. Given what was set up when the socket was created, this is the bind function of inet_stream_ops, namely inet_bind.

int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
	struct sock *sk = sock->sk;
	struct inet_sock *inet = inet_sk(sk);
	struct net *net = sock_net(sk);
	unsigned short snum;
......
	snum = ntohs(addr->sin_port);
......
	inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr;
	/* Make sure we are allowed to bind here. */
	if ((snum || !inet->bind_address_no_port) &&
	    sk->sk_prot->get_port(sk, snum)) {
......
	}
	inet->inet_sport = htons(inet->inet_num);
	inet->inet_daddr = 0;
	inet->inet_dport = 0;
	sk_dst_reset(sk);
}

bind calls the get_port function of sk_prot, that is, inet_csk_get_port, to check whether the port conflicts and whether it can be bound. If binding is allowed, the local address inet_saddr and local port inet_sport of the struct inet_sock are set, and the peer address inet_daddr and peer port inet_dport are initialized to 0.
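The bookkeeping that bind performs can be sketched in user space. Everything named toy_ below is hypothetical: the port table is a plain byte array rather than the kernel's bind hash table, the scan range used for port 0 is an arbitrary choice, and address conflicts are ignored. Only the order of the steps and the byte-order conversions follow the inet_bind excerpt.

```c
#include <arpa/inet.h>    /* htons, ntohs, htonl */
#include <errno.h>
#include <netinet/in.h>   /* struct sockaddr_in, INADDR_LOOPBACK */
#include <stdint.h>

struct toy_inet_sock {
    uint32_t inet_rcv_saddr, inet_saddr, inet_daddr;  /* network byte order */
    uint16_t inet_sport, inet_dport;                  /* network byte order */
    uint16_t inet_num;                                /* host byte order */
};

static unsigned char toy_port_used[65536];

/* Like get_port in the excerpt: 0 on success, non-zero on conflict. */
static int toy_get_port(struct toy_inet_sock *inet, unsigned short snum)
{
    if (snum == 0) {
        /* Port 0 means "pick one"; the toy scans an assumed range. */
        for (unsigned int p = 32768; p < 61000; p++) {
            if (!toy_port_used[p]) {
                snum = (unsigned short)p;
                break;
            }
        }
        if (snum == 0)
            return 1;
    } else if (toy_port_used[snum]) {
        return 1;
    }
    toy_port_used[snum] = 1;
    inet->inet_num = snum;
    return 0;
}

int toy_inet_bind(struct toy_inet_sock *inet, const struct sockaddr_in *addr)
{
    unsigned short snum = ntohs(addr->sin_port);

    inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr;
    if (toy_get_port(inet, snum)) {
        inet->inet_rcv_saddr = inet->inet_saddr = 0;
        return -EADDRINUSE;
    }
    inet->inet_sport = htons(inet->inet_num);   /* local port, wire order */
    inet->inet_daddr = 0;                        /* no peer yet */
    inet->inet_dport = 0;
    return 0;
}
```

Binding two toy sockets to the same port shows the conflict path: the first call succeeds and records the port, and the second returns -EADDRINUSE.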

The logic of bind is relatively simple. That's all.

Parse listen function

Next, let's look at listen.

SYSCALL_DEFINE2(listen, int, fd, int, backlog)
{
	struct socket *sock;
	int err, fput_needed;
	int somaxconn;
 
	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (sock) {
		somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
		if ((unsigned int)backlog > somaxconn)
			backlog = somaxconn;
		err = sock->ops->listen(sock, backlog);
		fput_light(sock->file, fput_needed);
	}
	return err;
}

In listen, we again use sockfd_lookup_light to find the struct socket from the fd file descriptor. Next, we call the listen function of ops in the struct socket. Given what was set up when the socket was created, this is the listen function of inet_stream_ops, namely inet_listen.
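One detail of the listen system call above is easy to miss: the backlog is compared as an unsigned value before being clamped to somaxconn. The sketch below isolates that comparison in a hypothetical helper, clamp_backlog; the value 128 used in the note is only an example, not a statement about any particular kernel's default.

```c
/*
 * Same comparison as in the listen excerpt: the backlog is viewed as an
 * unsigned number, so a negative backlog looks very large and is clamped.
 */
int clamp_backlog(int backlog, int somaxconn)
{
    if ((unsigned int)backlog > (unsigned int)somaxconn)
        backlog = somaxconn;
    return backlog;
}
```

With a limit of 128, a backlog of 5 is kept as is, a backlog of 1000 becomes 128, and a backlog of -1 also becomes 128 because of the unsigned cast.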

int inet_listen(struct socket *sock, int backlog)
{
	struct sock *sk = sock->sk;
	unsigned char old_state;
	int err;
	old_state = sk->sk_state;
	/* Really, if the socket is already in listen state
	 * we can only allow the backlog to be adjusted.
	 */
	if (old_state != TCP_LISTEN) {
		err = inet_csk_listen_start(sk, backlog);
	}
	sk->sk_max_ack_backlog = backlog;
}

If the socket is not yet in the TCP_LISTEN state, inet_csk_listen_start is called to enter the listening state.

int inet_csk_listen_start(struct sock *sk, int backlog)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct inet_sock *inet = inet_sk(sk);
	int err = -EADDRINUSE;
 
	reqsk_queue_alloc(&icsk->icsk_accept_queue);
 
	sk->sk_max_ack_backlog = backlog;
	sk->sk_ack_backlog = 0;
	inet_csk_delack_init(sk);
 
	sk_state_store(sk, TCP_LISTEN);
	if (!sk->sk_prot->get_port(sk, inet->inet_num)) {
......
	}
......
}

Here we meet a new structure, struct inet_connection_sock. It begins with a struct inet_sock, and inet_csk is really just a pointer cast to the extended structure. It is the same trick again.

struct inet_connection_sock is complex. If you open it, you will see queues for various states, various timeouts, congestion control, and so on. We say TCP is connection-oriented, meaning that both the client and the server keep a structure that maintains the connection state; this is that structure. We will not go through its fields here because there are too many; we will analyze them one by one as they come up.

First, we encounter icsk_accept_queue. What does it do?

TCP has a LISTEN state, and calling the listen function puts the socket into it. When we write programs, we usually have the server call accept and wait there for a client to initiate a connection. In fact, once the server is in the LISTEN state, a client can initiate a connection even if accept has not been called. TCP has no state that records whether accept was called. So what is the role of the accept function?

In the kernel, two queues are maintained for each listening Socket. One is the queue of established connections: their three-way handshake has completed and they are in the ESTABLISHED state. The other is the queue of connections not yet fully established: their three-way handshake has not completed and they are in the SYN_RCVD state.

When the server calls accept, it actually takes one completed connection from the first queue to process. If none has completed yet, it blocks and waits. icsk_accept_queue here is that first queue.
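The two-queue picture can be modelled with two small arrays. This sketch follows the description in the text rather than the kernel's actual data structures: the names are invented, peers are identified by a plain integer, the queue length is an arbitrary constant, and toy_accept returns -1 where a real accept would block.

```c
#include <stddef.h>

#define TOY_QLEN 4   /* arbitrary capacity for both toy queues */

struct toy_listener {
    int syn_queue[TOY_QLEN];      /* peers still in SYN_RCVD */
    size_t syn_len;
    int accept_queue[TOY_QLEN];   /* peers in ESTABLISHED, oldest first */
    size_t accept_len;
};

/* A SYN arrived: remember the half-open connection. */
int toy_on_syn(struct toy_listener *l, int peer)
{
    if (l->syn_len == TOY_QLEN)
        return -1;                /* queue full, SYN dropped */
    l->syn_queue[l->syn_len++] = peer;
    return 0;
}

/* The final ACK arrived: move the peer to the accept queue. */
int toy_on_final_ack(struct toy_listener *l, int peer)
{
    for (size_t i = 0; i < l->syn_len; i++) {
        if (l->syn_queue[i] != peer)
            continue;
        if (l->accept_len == TOY_QLEN)
            return -1;            /* accept queue full */
        for (size_t j = i + 1; j < l->syn_len; j++)
            l->syn_queue[j - 1] = l->syn_queue[j];
        l->syn_len--;
        l->accept_queue[l->accept_len++] = peer;
        return 0;
    }
    return -1;                    /* no matching half-open connection */
}

/* accept(): take the oldest established peer, or -1 if none is ready. */
int toy_accept(struct toy_listener *l)
{
    if (l->accept_len == 0)
        return -1;
    int peer = l->accept_queue[0];
    for (size_t j = 1; j < l->accept_len; j++)
        l->accept_queue[j - 1] = l->accept_queue[j];
    l->accept_len--;
    return peer;
}
```

A peer that has only sent a SYN is invisible to toy_accept; it becomes acceptable only after toy_on_final_ack has moved it to the second array.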

After initialization, the TCP state is set to TCP_LISTEN, and get_port is called again to check whether the port conflicts.

So far, the logic of listen is over.

Parse accept function

Next, let's analyze the server's call to accept.

SYSCALL_DEFINE3(accept, int, fd, struct sockaddr __user *, upeer_sockaddr,
		int __user *, upeer_addrlen)
{
	return sys_accept4(fd, upeer_sockaddr, upeer_addrlen, 0);
}
 
SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
		int __user *, upeer_addrlen, int, flags)
{
	struct socket *sock, *newsock;
	struct file *newfile;
	int err, len, newfd, fput_needed;
	struct sockaddr_storage address;
......
	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	newsock = sock_alloc();
	newsock->type = sock->type;
	newsock->ops = sock->ops;
	newfd = get_unused_fd_flags(flags);
	newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name);
	err = sock->ops->accept(sock, newsock, sock->file->f_flags, false);
	if (upeer_sockaddr) {
		if (newsock->ops->getname(newsock, (struct sockaddr *)&address, &len, 2) < 0) {
		}
		err = move_addr_to_user(&address,
					len, upeer_sockaddr, upeer_addrlen);
	}
	fd_install(newfd, newfile);
......
}

The implementation of accept confirms how sockets work. The original socket is the listening socket. Here we find the original struct socket and create a new newsock based on it; this is the connected socket. In addition, a new struct file and fd are created and associated with the new socket.

The sock->ops->accept of struct socket is then called, that is, the accept function of inet_stream_ops, namely inet_accept.

int inet_accept(struct socket *sock, struct socket *newsock, int flags, bool kern)
{
	struct sock *sk1 = sock->sk;
	int err = -EINVAL;
	struct sock *sk2 = sk1->sk_prot->accept(sk1, flags, &err, kern);
	sock_rps_record_flow(sk2);
	sock_graft(sk2, newsock);
	newsock->state = SS_CONNECTED;
}

inet_accept calls sk1->sk_prot->accept of struct sock, that is, the accept function of tcp_prot, which is inet_csk_accept.

/*
 * This will accept the next outstanding connection.
 */
struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
	struct request_sock *req;
	struct sock *newsk;
	int error;
 
	if (sk->sk_state != TCP_LISTEN)
		goto out_err;
 
	/* Find already established connection */
	if (reqsk_queue_empty(queue)) {
		long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);
		error = inet_csk_wait_for_connect(sk, timeo);
	}
	req = reqsk_queue_remove(queue, sk);
	newsk = req->sk;
......
}
 
/*
 * Wait for an incoming connection, avoid race conditions. This must be called
 * with the socket locked.
 */
static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	DEFINE_WAIT(wait);
	int err;
	for (;;) {
		prepare_to_wait_exclusive(sk_sleep(sk), &wait,
					  TASK_INTERRUPTIBLE);
		release_sock(sk);
		if (reqsk_queue_empty(&icsk->icsk_accept_queue))
			timeo = schedule_timeout(timeo);
		sched_annotate_sleep();
		lock_sock(sk);
		err = 0;
		if (!reqsk_queue_empty(&icsk->icsk_accept_queue))
			break;
		err = -EINVAL;
		if (sk->sk_state != TCP_LISTEN)
			break;
		err = sock_intr_errno(timeo);
		if (signal_pending(current))
			break;
		err = -EAGAIN;
		if (!timeo)
			break;
	}
	finish_wait(sk_sleep(sk), &wait);
	return err;
}

The implementation of inet_csk_accept confirms the two-queue logic described above. If icsk_accept_queue is empty, inet_csk_wait_for_connect is called to wait. While waiting, it sets the process state to TASK_INTERRUPTIBLE and calls schedule_timeout to give up the CPU.

When the process is woken up and gets the CPU again, it checks whether icsk_accept_queue is still empty and also calls signal_pending to see whether there is a signal to handle. Once icsk_accept_queue is not empty, inet_csk_wait_for_connect returns, and a struct sock object is taken from the queue and assigned to newsk.

Parse connect function

When is icsk_accept_queue non-empty? Only after a three-way handshake has completed. Next, let's analyze the three-way handshake.

The three-way handshake is generally initiated by the client calling connect.

SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
		int, addrlen)
{
	struct socket *sock;
	struct sockaddr_storage address;
	int err, fput_needed;
	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	err = move_addr_to_kernel(uservaddr, addrlen, &address);
	err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen, sock->file->f_flags);
}

The beginning of connect should look familiar: it again uses sockfd_lookup_light to find the struct socket from the fd file descriptor. Next, it calls the connect function of ops in the struct socket. Given what was set up when the socket was created, this is the connect function of inet_stream_ops, namely inet_stream_connect, which in turn calls __inet_stream_connect.

/*
 *	Connect to a remote host. There is regrettably still a little
 *	TCP 'magic' in here.
 */
int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
			  int addr_len, int flags, int is_sendmsg)
{
	struct sock *sk = sock->sk;
	int err;
	long timeo;
 
	switch (sock->state) {
......
	case SS_UNCONNECTED:
		err = -EISCONN;
		if (sk->sk_state != TCP_CLOSE)
			goto out;
 
		err = sk->sk_prot->connect(sk, uaddr, addr_len);
		sock->state = SS_CONNECTING;
		break;
	}
 
	timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
 
	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
......
		if (!timeo || !inet_wait_for_connect(sk, timeo, writebias))
			goto out;
 
		err = sock_intr_errno(timeo);
		if (signal_pending(current))
			goto out;
	}
	sock->state = SS_CONNECTED;
}

If the socket is still in the SS_UNCONNECTED state, sk->sk_prot->connect of struct sock is called, which for tcp_prot is tcp_v4_connect. In tcp_v4_connect, ip_route_connect performs a route lookup. Why? Because the three-way handshake is about to send a SYN packet, which needs the source address, source port, destination address and destination port. The destination address and port belong to the server and are already known, and the source port is randomly assigned by the client. But which source address should be used? A route lookup determines which network interface the packet will leave from, and that interface's IP address is the one to fill in.

Next, before sending the SYN, the state of the client socket is set to TCP_SYN_SENT. Then the TCP sequence number, write_seq, is initialized, and tcp_connect is called to send the SYN.

/* Build a SYN and send it off. */
int tcp_connect(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *buff;
	int err;
......
	tcp_connect_init(sk);
......
	buff = sk_stream_alloc_skb(sk, 0, sk->sk_allocation, true);
......
	tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN);
	tcp_mstamp_refresh(tp);
	tp->retrans_stamp = tcp_time_stamp(tp);
	tcp_connect_queue_skb(sk, buff);
	tcp_ecn_send_syn(sk, buff);
 
	/* Send off SYN; include data in Fast Open. */
	err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) :
	      tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
......
	tp->snd_nxt = tp->write_seq;
	tp->pushed_seq = tp->write_seq;
	buff = tcp_send_head(sk);
	if (unlikely(buff)) {
		tp->snd_nxt	= TCP_SKB_CB(buff)->seq;
		tp->pushed_seq	= TCP_SKB_CB(buff)->seq;
	}
......
	/* Timer for repeating the SYN until an answer. */
	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
				  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
	return 0;
}

In tcp_connect there is a new structure, struct tcp_sock. If you open it, you will find that it is an extension of struct inet_connection_sock: struct inet_connection_sock sits at the beginning of struct tcp_sock and is accessed through a pointer cast. The same old trick once more.

struct tcp_sock maintains still more TCP state. We will analyze it as we encounter it.

Next, tcp_init_nondata_skb initializes a SYN packet, tcp_transmit_skb sends the SYN packet, and inet_csk_reset_xmit_timer sets a timer. If no answer to the SYN arrives before the timer fires, the SYN is sent again.
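The sequence number bookkeeping around the SYN can be sketched in a few lines. The seq_ names below are invented, and the structures keep only the fields needed for the illustration. The sketch mirrors the post-increment of write_seq in the tcp_connect excerpt and the fact that the SYN flag occupies one sequence number.

```c
#include <stdint.h>

/* Only the two fields the illustration needs. */
struct seq_tcp_sock {
    uint32_t write_seq;   /* next sequence number to assign */
    uint32_t snd_nxt;     /* next sequence number to send */
};

struct seq_segment {
    uint32_t seq;         /* first sequence number of the segment */
    uint32_t end_seq;     /* sequence number just past the segment */
    int syn;
};

/* Build a data-less SYN and advance the sender's counters. */
struct seq_segment seq_build_syn(struct seq_tcp_sock *tp)
{
    struct seq_segment syn;

    syn.seq = tp->write_seq++;    /* same post-increment as the excerpt */
    syn.end_seq = syn.seq + 1;    /* the SYN flag consumes one number */
    syn.syn = 1;
    tp->snd_nxt = tp->write_seq;  /* as in tp->snd_nxt = tp->write_seq */
    return syn;
}
```

Because the fields are 32-bit unsigned integers, the arithmetic wraps around naturally: an initial sequence number of 0xFFFFFFFF yields a SYN whose end_seq is 0.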

How network packets are sent will be explained in the next section. For now, we assume the SYN has been sent.

Back in __inet_stream_connect: after sk->sk_prot->connect is called, inet_wait_for_connect waits until the client receives the server's reply (a SYN-ACK). As we know, the server is also waiting after calling accept.

How are network packets received? The detailed analysis is left for the next section. Here, in order to analyze the three-way handshake, we only look briefly at what the TCP layer does when a network packet arrives.

static struct net_protocol tcp_protocol = {
	.early_demux	=	tcp_v4_early_demux,
	.early_demux_handler =  tcp_v4_early_demux,
	.handler	=	tcp_v4_rcv,
	.err_handler	=	tcp_v4_err,
	.no_policy	=	1,
	.netns_ok	=	1,
	.icmp_strict_tag_validation = 1,
};

Packets are received through the handler in struct net_protocol, and the function called is tcp_v4_rcv. The call chain that follows is tcp_v4_rcv -> tcp_v4_do_rcv -> tcp_rcv_state_process. tcp_rcv_state_process, as the name suggests, handles the state changes caused by receiving a network packet.

int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct inet_connection_sock *icsk = inet_csk(sk);
	const struct tcphdr *th = tcp_hdr(skb);
	struct request_sock *req;
	int queued = 0;
	bool acceptable;
 
	switch (sk->sk_state) {
......
	case TCP_LISTEN:
......
		if (th->syn) {
			acceptable = icsk->icsk_af_ops->conn_request(sk, skb) >= 0;
			if (!acceptable)
				return 1;
			consume_skb(skb);
			return 0;
		}
......
}

The server is currently in the TCP_LISTEN state and the packet that arrived is a SYN, so the code above calls icsk->icsk_af_ops->conn_request. The operations that correspond to struct inet_connection_sock are of type inet_connection_sock_af_ops; according to the following definition, the function actually called is tcp_v4_conn_request.

const struct inet_connection_sock_af_ops ipv4_specific = {
        .queue_xmit        = ip_queue_xmit,
        .send_check        = tcp_v4_send_check,
        .rebuild_header    = inet_sk_rebuild_header,
        .sk_rx_dst_set     = inet_sk_rx_dst_set,
        .conn_request      = tcp_v4_conn_request,
        .syn_recv_sock     = tcp_v4_syn_recv_sock,
        .net_header_len    = sizeof(struct iphdr),
        .setsockopt        = ip_setsockopt,
        .getsockopt        = ip_getsockopt,
        .addr2sockaddr     = inet_csk_addr2sockaddr,
        .sockaddr_len      = sizeof(struct sockaddr_in),
        .mtu_reduced       = tcp_v4_mtu_reduced,
};

tcp_v4_conn_request calls tcp_conn_request. This function is also fairly long; inside it, send_synack is called, which is actually tcp_v4_send_synack. We do not care about the details of sending here. From the comment we can tell that after a SYN is received, a SYN-ACK is sent in reply. After the reply, the server is in the TCP_SYN_RECV state.

int tcp_conn_request(struct request_sock_ops *rsk_ops,
		     const struct tcp_request_sock_ops *af_ops,
		     struct sock *sk, struct sk_buff *skb)
{
......
	af_ops->send_synack(sk, dst, &fl, req, &foc,
				    !want_cookie ? TCP_SYNACK_NORMAL :
						   TCP_SYNACK_COOKIE);
......
}
 
/*
 *	Send a SYN-ACK after having received a SYN.
 */
static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
			      struct flowi *fl,
			      struct request_sock *req,
			      struct tcp_fastopen_cookie *foc,
			      enum tcp_synack_type synack_type)
{......}

Now it is the client's turn to receive a network packet. Both sides run the same TCP protocol stack, so the path is much the same as on the server, and it again ends up in tcp_rcv_state_process. Because the client is currently in the TCP_SYN_SENT state, it takes the following branch.

int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct inet_connection_sock *icsk = inet_csk(sk);
	const struct tcphdr *th = tcp_hdr(skb);
	struct request_sock *req;
	int queued = 0;
	bool acceptable;
 
	switch (sk->sk_state) {
......
	case TCP_SYN_SENT:
		tp->rx_opt.saw_tstamp = 0;
		tcp_mstamp_refresh(tp);
		queued = tcp_rcv_synsent_state_process(sk, skb, th);
		if (queued >= 0)
			return queued;
		/* Do step6 onward by hand. */
		tcp_urg(sk, skb, th);
		__kfree_skb(skb);
		tcp_data_snd_check(sk);
		return 0;
	}
......
}

tcp_rcv_synsent_state_process calls tcp_send_ack to send an ACK, and after sending it the client is in the TCP_ESTABLISHED state.

It is the server's turn to receive a network packet again, and once more tcp_rcv_state_process handles it. Because the server is currently in the TCP_SYN_RECV state, it takes a different branch. After this packet is received, the server is also in the TCP_ESTABLISHED state, and the three-way handshake is complete.

int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct inet_connection_sock *icsk = inet_csk(sk);
	const struct tcphdr *th = tcp_hdr(skb);
	struct request_sock *req;
	int queued = 0;
	bool acceptable;
......
	switch (sk->sk_state) {
	case TCP_SYN_RECV:
		if (req) {
			inet_csk(sk)->icsk_retransmits = 0;
			reqsk_fastopen_remove(sk, req, false);
		} else {
			/* Make sure socket is routed, for correct metrics. */
			icsk->icsk_af_ops->rebuild_header(sk);
			tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
			tcp_init_congestion_control(sk);
 
			tcp_mtup_init(sk);
			tp->copied_seq = tp->rcv_nxt;
			tcp_init_buffer_space(sk);
		}
		smp_mb();
		tcp_set_state(sk, TCP_ESTABLISHED);
		sk->sk_state_change(sk);
		if (sk->sk_socket)
			sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
		tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
		tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale;
		tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
		break;
......
}
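The three branches of tcp_rcv_state_process shown above can be condensed into a toy state machine. This is a sketch that follows the simplified narrative of this section, with one state variable per side; the hs_ names are invented, segments carry no sequence numbers, and retransmission, error handling and the kernel's separate request objects are all left out.

```c
enum hs_state {
    HS_CLOSE,
    HS_LISTEN,
    HS_SYN_SENT,
    HS_SYN_RECV,
    HS_ESTABLISHED
};

enum hs_segment {
    HS_NONE,      /* nothing to send */
    HS_SYN,
    HS_SYNACK,
    HS_ACK
};

/* Client side of connect(): enter SYN_SENT and emit a SYN. */
enum hs_segment hs_connect(enum hs_state *st)
{
    *st = HS_SYN_SENT;
    return HS_SYN;
}

/* Handle one incoming segment; return the segment to send in reply. */
enum hs_segment hs_rcv_state_process(enum hs_state *st, enum hs_segment in)
{
    switch (*st) {
    case HS_LISTEN:               /* server: SYN in, SYN-ACK out */
        if (in == HS_SYN) {
            *st = HS_SYN_RECV;
            return HS_SYNACK;
        }
        break;
    case HS_SYN_SENT:             /* client: SYN-ACK in, ACK out */
        if (in == HS_SYNACK) {
            *st = HS_ESTABLISHED;
            return HS_ACK;
        }
        break;
    case HS_SYN_RECV:             /* server: final ACK in, nothing out */
        if (in == HS_ACK)
            *st = HS_ESTABLISHED;
        break;
    default:
        break;
    }
    return HS_NONE;
}
```

Feeding each side's output into the other side reproduces the handshake: SYN, SYN-ACK, ACK, after which both state variables hold HS_ESTABLISHED.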

Summary

In this section, we analyzed the system calls other than those for sending and receiving network packets. They share a uniform set of data structures and a uniform flow. See the following figure for details:

First, the Socket system call takes three levels of parameters: family, type and protocol. The kernel uses family to find the net_proto_family entry in net_families, uses type to find the list for that type in inetsw, and uses protocol to find the matching operations in that list. The operations come in two layers. For the TCP protocol, the first layer is inet_stream_ops and the second layer is tcp_prot.

The system calls that follow therefore all obey the same pattern:

  • bind: the first layer calls the inet_bind function of inet_stream_ops, and the second layer calls the inet_csk_get_port function of tcp_prot;
  • listen: the first layer calls the inet_listen function of inet_stream_ops, and the second layer calls the inet_csk_get_port function of tcp_prot;
  • accept: the first layer calls the inet_accept function of inet_stream_ops, and the second layer calls the inet_csk_accept function of tcp_prot;
  • connect: the first layer calls the inet_stream_connect function of inet_stream_ops, and the second layer calls the tcp_v4_connect function of tcp_prot.
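The two-layer dispatch in the list above can be reduced to two tables of function pointers. The following is a minimal sketch with invented mini_ names: only bind is modelled, the port check never fails, and the fd lookup is skipped. The point is the shape of the call path from the system call through the first layer to the second.

```c
#include <stddef.h>

struct mini_sock;
struct mini_socket;

/* Second layer: per-protocol operations, in the role of tcp_prot. */
struct mini_proto {
    const char *name;
    int (*get_port)(struct mini_sock *sk, unsigned short snum);
};

/* First layer: per-type operations, in the role of inet_stream_ops. */
struct mini_proto_ops {
    int (*bind)(struct mini_socket *sock, unsigned short port);
};

struct mini_sock {
    const struct mini_proto *sk_prot;
    unsigned short num;               /* bound local port */
};

struct mini_socket {
    const struct mini_proto_ops *ops;
    struct mini_sock *sk;
};

/* Second-layer implementation: the toy never reports a conflict. */
static int mini_tcp_get_port(struct mini_sock *sk, unsigned short snum)
{
    sk->num = snum;
    return 0;
}

/* First-layer implementation: delegates the port check to layer two. */
static int mini_inet_bind(struct mini_socket *sock, unsigned short port)
{
    return sock->sk->sk_prot->get_port(sock->sk, port) ? -1 : 0;
}

static const struct mini_proto mini_tcp_prot = { "TCP", mini_tcp_get_port };
static const struct mini_proto_ops mini_stream_ops = { mini_inet_bind };

/* What socket creation sets up: both layers get their table. */
void mini_socket_init(struct mini_socket *sock, struct mini_sock *sk)
{
    sk->sk_prot = &mini_tcp_prot;
    sk->num = 0;
    sock->ops = &mini_stream_ops;
    sock->sk = sk;
}

/* What the system call does once it has the socket: call layer one. */
int mini_sys_bind(struct mini_socket *sock, unsigned short port)
{
    return sock->ops->bind(sock, port);
}
```

Swapping in a different mini_proto for another protocol would change the second layer without touching mini_sys_bind, which is the reason for splitting the operations into two tables.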

Classroom practice

TCP's three-way handshake is very important. Be sure to read through it alongside the code. In addition, we focused on the TCP scenario here. As you read the code, also look at how UDP implements the functions at each layer.
