Picture this scene: I'm a busy boss with 100 mobile phones. Whenever a message arrives on one of them, my secretary tells me, "Boss, one of your phones has a message." This drives me crazy. That is all she ever does: each time a message arrives, she only tells me that something came in and leaves me to go and look. She never says which phone received it. I have 100 phones! So I have to check them one by one to find out which ones have messages. This is the drawback of the select model in I/O multiplexing. The boss thinks: if the secretary would simply bring the phones that have messages straight to my desk, my efficiency would certainly go up (this is the epoll model).
Let's first summarize the shortcomings of the select model:
- There is an upper limit on the number of file descriptors a single process can monitor, usually 1024. The limit can be changed, but the more file descriptors there are, the worse the performance, because select scans them by polling. (The Linux kernel headers define it as: #define __FD_SETSIZE 1024)
- Kernel/user-space memory copying: on every call, select has to copy the descriptor-set data structures between user space and the kernel, which becomes a large overhead when there are many handles.
- select returns the whole set of handles, and the application has to traverse all of it to discover which handles have events.
- select is level-triggered: if the application does not complete the I/O on a ready file descriptor, every subsequent select call will notify the process about that descriptor again.
Imagine a scenario in which 1 million clients hold TCP connections to a single server process at the same time. Usually only a few hundred or a few thousand of those connections are active at any given moment (this is in fact true of most scenarios). How can such high concurrency be achieved?
As a rough estimate, if one process can handle at most 1024 file descriptors, we would need about 1,000 processes to serve 1 million client connections. With the select model, only a few connections in each of those 1,000 processes receive data during any given period, yet each process still has to poll all 1024 file descriptors to find out which clients have data to read. Consider how much system resource is consumed when all 1,000 processes behave this way.
In view of these shortcomings of the select model, the epoll model was proposed.
Advantages of the epoll model
- Lets a single process monitor a large number of socket descriptors
- I/O efficiency does not decrease linearly as the number of fds grows
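- Less copying between the kernel and user space: a descriptor is registered once with epoll_ctl, and epoll_wait copies out only the ready events (this is often described as using mmap, but the Linux implementation uses ordinary kernel-to-user copies)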
Two working modes of epoll
LT (level-triggered mode) is the default and supports both blocking and non-blocking sockets. In this mode the kernel tells you whether a file descriptor is ready, and you can then perform I/O on the ready fd. If you do nothing, the kernel keeps notifying you, so programming errors are somewhat less likely in this mode. For example, the kernel tells you that one of the fds has data to read, so you should read it right away. If you are lazy and do not read it, the next time the kernel finds the data still unread it will tell you again. This mechanism helps ensure that all data gets handled by the user.
ET (edge-triggered mode) is the high-speed mode and only supports non-blocking sockets. In this mode the kernel notifies you through epoll when a descriptor changes from not ready to ready. It then assumes you know the descriptor is ready and sends no further readiness notifications for it until new data arrives. In short, the kernel does not repeat a notification it has already delivered; if data is missed and left unread, that is your responsibility. This mechanism is indeed faster, but the speed comes with risk.
epoll model API
#include <sys/epoll.h>

/* Create an epoll handle. size tells the kernel roughly how many descriptors
   will be monitored (since Linux 2.6.8 it is ignored, but it must be greater than 0).
   The epoll handle itself occupies an fd, so when you are done with epoll you
   must call close() on it, otherwise fds may be exhausted. */
int epoll_create(int size);

/* epoll event registration function */
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

/* Wait for events to arrive. When events are detected, the ready events
   (at most maxevents of them) are copied from the kernel event table into the
   array pointed to by the second parameter, events. */
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
For the event registration function epoll_ctl, the first parameter is the return value of epoll_create(), and the second parameter is the operation, expressed by one of the following three macros:
EPOLL_CTL_ADD  // Register a new fd with epfd
EPOLL_CTL_MOD  // Modify the monitored events of an already registered fd
EPOLL_CTL_DEL  // Remove an fd from epfd
The struct epoll_event structure is as follows:
typedef union epoll_data {
    void *ptr;
    int fd;
    __uint32_t u32;
    __uint64_t u64;
} epoll_data_t;

struct epoll_event {
    __uint32_t events; /* Epoll events */
    epoll_data_t data; /* User data variable */
};
The events field of the epoll_event structure can be a combination of the following macros:
EPOLLIN       // The corresponding file descriptor is readable (including an orderly shutdown by the peer socket)
EPOLLOUT      // The corresponding file descriptor is writable
EPOLLPRI      // The corresponding file descriptor has urgent data to read (that is, out-of-band data has arrived)
EPOLLERR      // An error occurred on the corresponding file descriptor
EPOLLHUP      // The corresponding file descriptor was hung up
EPOLLET       // Use edge-triggered mode for this descriptor, as opposed to level-triggered
EPOLLONESHOT  // Report only one event; to keep monitoring the socket after that event, re-register it with epoll_ctl (EPOLL_CTL_MOD)
A simple example of using epoll
#include <sys/socket.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 5
#define OPEN_MAX 100
#define LISTENQ 20
#define SERV_PORT 5000
#define INFTIM 1000

void setnonblocking(int sock)
{
    int opts;
    opts = fcntl(sock, F_GETFL);
    if (opts < 0) {
        perror("fcntl(sock,GETFL)");
        exit(1);
    }
    opts = opts | O_NONBLOCK;
    if (fcntl(sock, F_SETFL, opts) < 0) {
        perror("fcntl(sock,SETFL,opts)");
        exit(1);
    }
}

int main(int argc, char* argv[])
{
    int i, listenfd, connfd, sockfd, epfd, nfds, portnumber;
    ssize_t n = 0;
    char line[MAXLINE];
    socklen_t clilen;

    if (2 == argc) {
        if ((portnumber = atoi(argv[1])) < 0) {
            fprintf(stderr, "Usage:%s portnumber\a\n", argv[0]);
            return 1;
        }
    } else {
        fprintf(stderr, "Usage:%s portnumber\a\n", argv[0]);
        return 1;
    }

    // Declare the epoll_event variables: ev registers events, the array receives the events to process
    struct epoll_event ev, events[20];
    // Create the epoll file descriptor
    epfd = epoll_create(256);

    struct sockaddr_in clientaddr;
    struct sockaddr_in serveraddr;
    listenfd = socket(AF_INET, SOCK_STREAM, 0);
    // Set the socket to non-blocking
    //setnonblocking(listenfd);

    // Set the file descriptor associated with the event to be processed
    ev.data.fd = listenfd;
    // Set the type of event to process
    ev.events = EPOLLIN | EPOLLET;
    //ev.events = EPOLLIN;
    // Register the epoll event
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);

    memset(&serveraddr, 0, sizeof(serveraddr));
    serveraddr.sin_family = AF_INET;
    const char *local_addr = "127.0.0.1";
    inet_aton(local_addr, &(serveraddr.sin_addr));
    serveraddr.sin_port = htons(portnumber);
    if (bind(listenfd, (struct sockaddr *)&serveraddr, sizeof(serveraddr)) < 0) {
        perror("bind");
        exit(1);
    }
    if (listen(listenfd, LISTENQ) < 0) {
        perror("listen");
        exit(1);
    }

    for ( ; ; ) {
        // Wait for epoll events to occur
        nfds = epoll_wait(epfd, events, 20, 500);
        // Handle all events that occurred
        for (i = 0; i < nfds; ++i) {
            if (events[i].data.fd == listenfd) {
                // A new client has connected to the bound port: establish the connection
                clilen = sizeof(clientaddr); // accept() needs the buffer size on input
                connfd = accept(listenfd, (struct sockaddr *)&clientaddr, &clilen);
                if (connfd < 0) {
                    perror("connfd<0");
                    exit(1);
                }
                //setnonblocking(connfd);
                char *str = inet_ntoa(clientaddr.sin_addr);
                printf("accept a connection from %s\n", str);
                // Set the file descriptor for the read operation
                ev.data.fd = connfd;
                // Register for read events
                ev.events = EPOLLIN | EPOLLET;
                //ev.events = EPOLLIN;
                // Register ev
                epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
            } else if (events[i].events & EPOLLIN) {
                // An already connected client sent data: read it
                printf("EPOLLIN\n");
                if ((sockfd = events[i].data.fd) < 0)
                    continue;
                if ((n = read(sockfd, line, MAXLINE)) < 0) {
                    if (errno == ECONNRESET) {
                        close(sockfd);
                        events[i].data.fd = -1;
                    } else
                        printf("readline error\n");
                    continue; // nothing was read, so line[n] must not be touched
                } else if (n == 0) {
                    close(sockfd);
                    events[i].data.fd = -1;
                    continue;
                }
                if (n < MAXLINE - 2)
                    line[n] = '\0';
                // Set the file descriptor for the write operation
                ev.data.fd = sockfd;
                // Register for write events
                ev.events = EPOLLOUT | EPOLLET;
                // Change the event to be handled on sockfd to EPOLLOUT
                //epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, &ev);
            } else if (events[i].events & EPOLLOUT) {
                // There is data to send
                sockfd = events[i].data.fd;
                write(sockfd, line, n);
                // Set the file descriptor for the read operation
                ev.data.fd = sockfd;
                // Register for read events
                ev.events = EPOLLIN | EPOLLET;
                // Change the event to be handled on sockfd back to EPOLLIN
                epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, &ev);
            }
        }
    }
    return 0;
}
epoll server with ET and LT dual mode
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <pthread.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_EVENT_NUMBER 1024 // Maximum number of events
#define BUFFER_SIZE 10        // Buffer size
#define ENABLE_ET 1           // Enable ET mode

/* Set the file descriptor to non-blocking */
int SetNonblocking(int fd)
{
    int old_option = fcntl(fd, F_GETFL);
    int new_option = old_option | O_NONBLOCK;
    fcntl(fd, F_SETFL, new_option);
    return old_option;
}

/* Register EPOLLIN on file descriptor fd in the epoll kernel event table identified by epoll_fd.
   The parameter enable_et specifies whether ET mode is enabled for fd. */
void AddFd(int epoll_fd, int fd, bool enable_et)
{
    struct epoll_event event;
    event.data.fd = fd;
    event.events = EPOLLIN; // Register for readability of fd
    if (enable_et) {
        event.events |= EPOLLET;
    }
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event); // Register fd in the epoll kernel event table
    SetNonblocking(fd);
}

/* LT mode: robust but less efficient */
void lt_process(struct epoll_event* events, int number, int epoll_fd, int listen_fd)
{
    char buf[BUFFER_SIZE];
    int i;
    for (i = 0; i < number; i++) { // number: number of ready events
        int sockfd = events[i].data.fd;
        if (sockfd == listen_fd) { // The listening descriptor is ready: a new client is connecting
            struct sockaddr_in client_address;
            socklen_t client_addrlength = sizeof(client_address);
            int connfd = accept(listen_fd, (struct sockaddr*)&client_address, &client_addrlength);
            if (connfd < 0)
                continue;
            AddFd(epoll_fd, connfd, false); // Register the new client fd in the epoll event table, using LT mode
        } else if (events[i].events & EPOLLIN) { // Client data is readable
            // This code is triggered as long as the data in the buffer has not been fully read.
            // That is the essence of LT mode: notifications repeat until processing is complete.
            printf("lt mode: event trigger once!\n");
            memset(buf, 0, BUFFER_SIZE);
            int ret = recv(sockfd, buf, BUFFER_SIZE - 1, 0);
            if (ret <= 0) { // The peer closed or an error occurred: remember to close the fd
                close(sockfd);
                continue;
            }
            printf("get %d bytes of content: %s\n", ret, buf);
        } else {
            printf("something unexpected happened!\n");
        }
    }
}

/* ET mode: efficient but potentially dangerous */
void et_process(struct epoll_event* events, int number, int epoll_fd, int listen_fd)
{
    char buf[BUFFER_SIZE];
    int i;
    for (i = 0; i < number; i++) {
        int sockfd = events[i].data.fd;
        if (sockfd == listen_fd) {
            struct sockaddr_in client_address;
            socklen_t client_addrlength = sizeof(client_address);
            int connfd = accept(listen_fd, (struct sockaddr*)&client_address, &client_addrlength);
            if (connfd < 0)
                continue;
            AddFd(epoll_fd, connfd, true); // Use ET mode
        } else if (events[i].events & EPOLLIN) {
            /* This code will not be triggered repeatedly, so we read in a loop to make sure
               all data in the socket receive buffer is read out. This is how we eliminate
               the potential danger of ET mode. */
            printf("et mode: event trigger once!\n");
            while (1) {
                memset(buf, 0, BUFFER_SIZE);
                int ret = recv(sockfd, buf, BUFFER_SIZE - 1, 0);
                if (ret < 0) {
                    /* For non-blocking I/O, this condition means the data has been read completely.
                       After that, epoll can trigger EPOLLIN on sockfd again to drive the next read. */
                    if (errno == EAGAIN || errno == EWOULDBLOCK) {
                        printf("read later!\n");
                        break;
                    }
                    close(sockfd);
                    break;
                } else if (ret == 0) {
                    // The peer closed the connection: stop reading from the closed fd
                    close(sockfd);
                    break;
                } else { // Not finished yet, keep reading in the loop
                    printf("get %d bytes of content: %s\n", ret, buf);
                }
            }
        } else {
            printf("something unexpected happened!\n");
        }
    }
}

int main(int argc, char* argv[])
{
    if (argc <= 2) {
        printf("usage: ip_address + port_number\n");
        return -1;
    }

    const char* ip = argv[1];
    int port = atoi(argv[2]);

    int ret = -1;
    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    inet_pton(AF_INET, ip, &address.sin_addr);
    address.sin_port = htons(port);

    int listen_fd = socket(PF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0) {
        printf("fail to create socket!\n");
        return -1;
    }

    ret = bind(listen_fd, (struct sockaddr*)&address, sizeof(address));
    if (ret == -1) {
        printf("fail to bind socket!\n");
        return -1;
    }

    ret = listen(listen_fd, 5);
    if (ret == -1) {
        printf("fail to listen socket!\n");
        return -1;
    }

    struct epoll_event events[MAX_EVENT_NUMBER];
    int epoll_fd = epoll_create(5); // Event table size 5
    if (epoll_fd == -1) {
        printf("fail to create epoll!\n");
        return -1;
    }

    AddFd(epoll_fd, listen_fd, true); // Add the listening descriptor to the event table, using ET mode

    while (1) {
        int ret = epoll_wait(epoll_fd, events, MAX_EVENT_NUMBER, -1);
        if (ret < 0) {
            printf("epoll failure!\n");
            break;
        }

        if (ENABLE_ET) {
            et_process(events, ret, epoll_fd, listen_fd);
        } else {
            lt_process(events, ret, epoll_fd, listen_fd);
        }
    }

    close(listen_fd);
    return 0;
}
Then write a simple TCP client to test it:
// Client
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>

int main()
{
    int client_sockfd;
    int len;
    struct sockaddr_in address; // Server-side network address structure
    int result;
    char str1[] = "ABCDE";
    char str2[] = "ABCDEFGHIJK";

    client_sockfd = socket(AF_INET, SOCK_STREAM, 0); // Create the client socket
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = inet_addr("127.0.0.1");
    address.sin_port = htons(8888);
    len = sizeof(address);

    result = connect(client_sockfd, (struct sockaddr *)&address, len);
    if (result == -1) {
        perror("oops: client2");
        exit(1);
    }

    // First write (sizeof includes the terminating '\0')
    write(client_sockfd, str1, sizeof(str1));
    sleep(5);
    // Second write
    write(client_sockfd, str2, sizeof(str2));

    close(client_sockfd);
    return 0;
}
The TCP client behaves as follows: it first sends the string "ABCDE" to the server, then sends the string "ABCDEFGHIJK" 5 seconds later. Let's see how the ET-mode server and the LT-mode server read the data differently.
ET mode
Analysis of the ET-mode behavior: the server's read buffer size is set to 10. When the first string arrives, the buffer has enough room for it, so the server prints the content "ABCDE" and then prints "read later", which indicates that the data has been read completely. When the second string arrives, the buffer is not large enough to hold all the characters, so it is received in two reads. However, the event is triggered only twice in total.
LT mode
Analysis of the LT-mode behavior:
Likewise, the buffer has enough room for the first string, while the second string does not fit in the buffer and is received in two reads. Note also that the kernel keeps notifying you as long as the data has not been completely read. So the event is triggered three times in total.
EPOLLONESHOT Event
Even in ET mode, an event on a socket may still be triggered more than once, which can cause problems in concurrent programs. For example, one thread reads data from a socket and starts processing it; while it is processing, new data becomes readable on that socket (EPOLLIN is triggered again), and another thread is woken up to read the new data. Now two threads are operating on the same socket at the same time. This is certainly not what we want. What we want is for a socket connection to be handled by only one thread at any given time. This can be achieved with the EPOLLONESHOT event.
For a file descriptor registered with EPOLLONESHOT, the operating system triggers at most one of the readable, writable, or exception events registered on it, and triggers it only once, unless we reset the EPOLLONESHOT registration on the file descriptor with epoll_ctl. This way, while one thread is working on a socket, no other thread gets a chance to operate on it. On the other hand, once a thread has finished processing a socket registered with EPOLLONESHOT, it should immediately reset the EPOLLONESHOT event on that socket, so that the next time the socket becomes readable its EPOLLIN event can be triggered and other worker threads get the opportunity to continue processing the socket.
Here is an epoll server using EPOLLONESHOT
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_EVENT_NUMBER 1024
#define BUFFER_SIZE 10

struct fds
{
    int epollfd;
    int sockfd;
};

int SetNonblocking(int fd)
{
    int old_option = fcntl(fd, F_GETFL);
    int new_option = old_option | O_NONBLOCK;
    fcntl(fd, F_SETFL, new_option);
    return old_option;
}

void AddFd(int epollfd, int fd, bool oneshot)
{
    struct epoll_event event;
    event.data.fd = fd;
    event.events = EPOLLIN | EPOLLET;
    if (oneshot) {
        event.events |= EPOLLONESHOT;
    }
    epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &event);
    SetNonblocking(fd);
}

/* Reset the event on fd. After this call, even though EPOLLONESHOT is registered on fd,
   the operating system will trigger the EPOLLIN event on fd again, and only once. */
void reset_oneshot(int epollfd, int fd)
{
    struct epoll_event event;
    event.data.fd = fd;
    event.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
    epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &event);
}

/* Worker thread */
void* worker(void* arg)
{
    struct fds* f = (struct fds*)arg;
    int sockfd = f->sockfd;
    int epollfd = f->epollfd;
    free(f); // Allocated by main for this thread
    printf("start new thread to receive data on fd: %d\n", sockfd);
    char buf[BUFFER_SIZE];

    while (1) {
        memset(buf, 0, BUFFER_SIZE);
        int ret = recv(sockfd, buf, BUFFER_SIZE - 1, 0);
        if (ret == 0) {
            close(sockfd);
            printf("peer closed the connection\n");
            break;
        } else if (ret < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                reset_oneshot(epollfd, sockfd);
                printf("read later\n");
                break;
            }
            // Any other error: give up on this connection
            close(sockfd);
            break;
        } else {
            printf("get content: %s\n", buf);
            // Sleep for 5 seconds to simulate data processing
            printf("worker working...\n");
            sleep(5);
        }
    }
    printf("end thread receiving data on fd: %d\n", sockfd);
    return NULL;
}

int main(int argc, char* argv[])
{
    if (argc <= 2) {
        printf("usage: ip_address + port_number\n");
        return -1;
    }

    const char* ip = argv[1];
    int port = atoi(argv[2]);

    int ret = -1;
    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    inet_pton(AF_INET, ip, &address.sin_addr);
    address.sin_port = htons(port);

    int listenfd = socket(PF_INET, SOCK_STREAM, 0);
    if (listenfd < 0) {
        printf("fail to create socket!\n");
        return -1;
    }

    ret = bind(listenfd, (struct sockaddr*)&address, sizeof(address));
    if (ret == -1) {
        printf("fail to bind socket!\n");
        return -1;
    }

    ret = listen(listenfd, 5);
    if (ret == -1) {
        printf("fail to listen socket\n");
        return -1;
    }

    struct epoll_event events[MAX_EVENT_NUMBER];
    int epollfd = epoll_create(5);
    if (epollfd == -1) {
        printf("fail to create epoll\n");
        return -1;
    }

    // Note: EPOLLONESHOT must not be registered on the listening socket listenfd, otherwise the
    // application could handle only one client connection, because later connection requests
    // would no longer trigger the EPOLLIN event on listenfd.
    AddFd(epollfd, listenfd, false);

    while (1) {
        int ret = epoll_wait(epollfd, events, MAX_EVENT_NUMBER, -1); // Wait indefinitely
        if (ret < 0) {
            printf("epoll failure!\n");
            break;
        }

        int i;
        for (i = 0; i < ret; i++) {
            int sockfd = events[i].data.fd;
            if (sockfd == listenfd) {
                struct sockaddr_in client_address;
                socklen_t client_addrlength = sizeof(client_address);
                int connfd = accept(listenfd, (struct sockaddr*)&client_address, &client_addrlength);
                if (connfd < 0)
                    continue;
                // Register EPOLLONESHOT for every non-listening file descriptor
                AddFd(epollfd, connfd, true);
            } else if (events[i].events & EPOLLIN) {
                pthread_t thread;
                // Heap-allocate the arguments so they outlive this loop iteration
                struct fds* fds_for_new_worker = malloc(sizeof(struct fds));
                if (fds_for_new_worker == NULL)
                    continue;
                fds_for_new_worker->epollfd = epollfd;
                fds_for_new_worker->sockfd = events[i].data.fd;
                /* Start a new worker thread to serve sockfd */
                if (pthread_create(&thread, NULL, worker, fds_for_new_worker) != 0) {
                    free(fds_for_new_worker);
                    continue;
                }
                pthread_detach(thread);
            } else {
                printf("something unexpected happened!\n");
            }
        }
    }

    close(listenfd);
    return 0;
}
Analysis of the EPOLLONESHOT behavior: we keep using the TCP client above for testing, but change the client's sleep time to 3 seconds. The workflow is as follows: when the client sends data the first time, the server's receive buffer has enough room, and the server's worker thread then enters its 5-second data-processing phase. After 3 seconds the client sends new data, but the worker thread is still processing and cannot receive it immediately. Two seconds later the worker thread finishes processing and starts receiving the new data. We can observe that the server uses the same thread to handle requests from the same client, which is what we expected.