i/o timeout. I hope you don't step into the pit of this net/http package

Posted by roscor on Fri, 04 Mar 2022 04:27:33 +0100

 

The article is continuously updated. You can search "growth of golang Xiaobai" on wechat for the first time, and reply to [tutorial] to get a free video tutorial of golang. This article has been included in GitHub https://github.com/xiaobaiTech/golangFamily , there are complete test sites and growth routes for large factory interviews. Welcome Star.

 

problem

Let's look at a piece of daily code.

 1package main
 2
 3import (
 4    "bytes"
 5    "encoding/json"
 6    "fmt"
 7    "io/ioutil"
 8    "net"
 9    "net/http"
10    "time"
11)
12
13var tr *http.Transport
14
15func init() {
16    tr = &http.Transport{
17        MaxIdleConns: 100,
18        Dial: func(netw, addr string) (net.Conn, error) {
19            conn, err := net.DialTimeout(netw, addr, time.Second*2) //Set connection timeout
20            if err != nil {
21                return nil, err
22            }
23            err = conn.SetDeadline(time.Now().Add(time.Second * 3)) //Set sending and receiving data timeout
24            if err != nil {
25                return nil, err
26            }
27            return conn, nil
28        },
29    }
30}
31
32func main() {
33    for {
34        _, err := Get("http://www.baidu.com/")
35        if err != nil {
36            fmt.Println(err)
37            break
38        }
39    }
40}
41
42
43func Get(url string) ([]byte, error) {
44    m := make(map[string]interface{})
45    data, err := json.Marshal(m)
46    if err != nil {
47        return nil, err
48    }
49    body := bytes.NewReader(data)
50    req, _ := http.NewRequest("Get", url, body)
51    req.Header.Add("content-type", "application/json")
52
53    client := &http.Client{
54        Transport: tr,
55    }
56    res, err := client.Do(req)
57    if res != nil {
58        defer res.Body.Close()
59    }
60    if err != nil {
61        return nil, err
62    }
63    resBody, err := ioutil.ReadAll(res.Body)
64    if err != nil {
65        return nil, err
66    }
67    return resBody, nil
68}

The simple thing to do is to request circularly http://www.baidu.com/ , and wait for a response.

Looks like there's nothing wrong.

When the code runs, it can send and receive messages normally.

However, if this code runs for a period of time, an i/o timeout error will appear.

 

This is actually a problem that has been checked recently. It is found that this pit may be easier to step on. I simplified the code here.

The phenomenon in actual production is that when the golang service initiates an HTTP call, although http If transport sets 3s timeout, i/o timeout error will occur occasionally.

However, when viewing the downstream services, it is found that the downstream services have actually returned in 100ms.

 

Check

Analysis of message body change corresponding to five layer network protocol

It's strange that the display processing time of the server is only 100ms, and the timeout of the client is set to 3s. How can i/o timeout occur?

 

There are two possibilities here.

  • Because the log printed by the server is actually only the log printed by the application layer of the server. However, after the client application layer sends data, it also passes through the transport layer, network layer, data link layer and physical layer of the client, and then passes through the physical layer, data link layer, network layer and transport layer of the server to the application layer of the server. It takes 100ms at the application layer of the server, and then return to the original way. The remaining 3s-100ms may be consumed in all layers of the whole process. For example, when the network is bad, TCP at the transport layer tries hard to lose packets and retransmit.

  • There is no problem with the network. The whole sending and receiving process of the client-server link takes about 100ms. The client processing logic problem caused a timeout.

     

Generally, the problem will not be the problem of the underlying network in most cases. It's right to boldly doubt that it's your own problem. If you don't give up, grab a bag and have a look.

Packet capture results

After analysis, shake hands three times from the beginning (where the red box is drawn).

In the end, there is an i/o timeout error (where the blue box is drawn).

The time column is from 7 to 10, with an interval of 3s. And look at the blue box in the lower right corner. It is a Reset connection from port 51169 to port 80.

Port 80 is the port of the server. In other words, the client 3s times out and actively disconnects the link.

But take a closer look at the first line. From the third handshake to the end when the client times out and actively disconnects, there are actually many HTTP requests.

Go back to the way the code sets the timeout.

 1    tr = &http.Transport{
 2        MaxIdleConns: 100,
 3        Dial: func(netw, addr string) (net.Conn, error) {
 4            conn, err := net.DialTimeout(netw, addr, time.Second*2) //Set connection timeout
 5            if err != nil {
 6                return nil, err
 7            }
 8            err = conn.SetDeadline(time.Now().Add(time.Second * 3)) //Set sending and receiving data timeout
 9            if err != nil {
10                return nil, err
11            }
12            return conn, nil
13        },
14    }

In other words, the 3s timeout here is actually calculated after the connection is established, rather than the timeout calculated from a single call.

Look, the note says

SetDeadline sets the read and write deadlines associated with the connection.

 

 

Timeout reason

As we all know, HTTP is an application layer protocol and TCP is used in the transport layer.

Before 1.0, HTTP protocol used short connection by default, and TCP connection will be established every time a request is initiated. Send and receive data. Then disconnect.

TCP connections are handshaked three times at a time. Wave four times each time you disconnect.

In fact, it is not necessary to establish a new connection every time. Just keep the established connection open and reuse the data every time.

As a result, the HTTP protocol has used long connections by default since 1.1. See the previous for specific information This article.

Then the golang standard library is also compatible with this implementation.

Establish a TCP long connection for each domain name by establishing a connection pool, such as http://baidu.com and http://golang.com Just two different domain names.

First visit http://baidu.com A connection will be established when the domain name is used up, and then it will be put into the free connection pool for access next time http://baidu.com The connection will be retrieved from the connection pool for reuse.

Multiplex long connection

 

A digression: this also explains the previous This article In the last question, why should we emphasize the same domain name: a domain name will establish a connection, and a connection corresponds to a read goroutine and a write goroutine. It is precisely because it is the same domain name that three goroutines will be disclosed in the end. If different domain names are used, 1+2*N processes will be disclosed, and N is the number of domain names.

 

Assuming that the first request takes 100ms, each request is completed http://baidu.com After that, they are put into the connection pool. Next time, they continue to be reused for 29 times, taking 2900ms.

At the 30th request, the connection has taken 3000ms from the start of establishment to the return of the service, just reaching the set 3s timeout threshold, then the client will report timeout i/o timeout.

Although it only took 100ms for the server at this time, it took a long time to add up the previous 29 times.

That is, just through HTTP Transport has err = conn.setdeadline (time. Now()) Add (time. Second * 3)), and you use a long connection. Even if the server processes faster and the timeout set by the client is longer, your program will report a timeout error at one moment.

Correct posture

The original expectation was to set a timeout for each call, not for the entire connection.

In addition, the reason for the above problem is that the timeout is set for long connections, and long connections will be reused.

Based on these two points, change the code.

 1package main
 2
 3import (
 4    "bytes"
 5    "encoding/json"
 6    "fmt"
 7    "io/ioutil"
 8    "net/http"
 9    "time"
10)
11
12var tr *http.Transport
13
14func init() {
15    tr = &http.Transport{
16        MaxIdleConns: 100,
17        //The following code was killed
18        //Dial: func(netw, addr string) (net.Conn, error) {
19        //  conn, err := net.DialTimeout(netw, addr, time.Second*2) / / set the connection timeout
20        //  if err != nil {
21        //      return nil, err
22        //  }
23        //  err = conn.SetDeadline(time.Now().Add(time.Second * 3)) / / set the timeout for sending and receiving data
24        //  if err != nil {
25        //      return nil, err
26        //  }
27        //  return conn, nil
28        //},
29    }
30}
31
32
33func Get(url string) ([]byte, error) {
34    m := make(map[string]interface{})
35    data, err := json.Marshal(m)
36    if err != nil {
37        return nil, err
38    }
39    body := bytes.NewReader(data)
40    req, _ := http.NewRequest("Get", url, body)
41    req.Header.Add("content-type", "application/json")
42
43    client := &http.Client{
44        Transport: tr,
45        Timeout: 3*time.Second,  //The timeout added here is the timeout of each call
46    }
47    res, err := client.Do(req) 
48    if res != nil {
49        defer res.Body.Close()
50    }
51    if err != nil {
52        return nil, err
53    }
54    resBody, err := ioutil.ReadAll(res.Body)
55    if err != nil {
56        return nil, err
57    }
58    return resBody, nil
59}
60
61func main() {
62    for {
63        _, err := Get("http://www.baidu.com/")
64        if err != nil {
65            fmt.Println(err)
66            break
67        }
68    }
69}

Look at the notes and you will find that there are two points to change

  • http. Some timeout settings in transport for connection establishment have been killed.

  • When an HTTP request is initiated, the scenario http Client, add the timeout setting at this time. The timeout here can be understood as the timeout of a single request. You can also see the notes

Timeout specifies a time limit for requests made by this Client.

Here, the code will be changed and the problems in actual production will be solved.

In the example code, if you take it to run, you will actually make the following mistakes

1Get http://www.baidu.com/: EOF

This is because it's called too hard, http://www.baidu.com The active disconnection over there can be understood as a current limiting measure to protect the server. After all, everyone does it like this, and the server will explode...

It's easy to add a sleep interval to each HTTP call. It's a good solution.

 

Here, in fact, the problem has been solved. Next, we will analyze the causes of the problem at the source level. Friends who are not interested in reading the source code can directly pull to the lower right corner of the bottom of the article and do something positive energy (click twice) to support it. (crazy hint, please, this is really important to me!)

Source code analysis

The go version used is 1.12.7.

Start with a network request.

 1res, err := client.Do(req)
 2func (c *Client) Do(req *Request) (*Response, error) {
 3    return c.do(req)
 4}
 5
 6func (c *Client) do(req *Request) {
 7    // ...
 8    if resp, didTimeout, err = c.send(req, deadline); err != nil {
 9    // ...
10  }
11    // ...  
12}  
13func send(ireq *Request, rt RoundTripper, deadline time.Time) {
14    // ...    
15    resp, err = rt.RoundTrip(req)
16     // ...  
17} 
18
19//Enter the # RoundTrip # logic from here
20/src/net/http/roundtrip.go: 16
21func (t *Transport) RoundTrip(req *Request) (*Response, error) {
22    return t.roundTrip(req)
23}
24
25func (t *Transport) roundTrip(req *Request) (*Response, error) {
26    //Try to get an idle connection to initiate an http connection
27  pconn, err := t.getConn(treq, cm)
28  // ...
29}
30
31//Focus on this function. The return is a long connection
32func (t *Transport) getConn(treq *transportRequest, cm connectMethod) (*persistConn, error) {
33  //Omit a lot of logic and focus on the following two points
34    //Return when there is an idle connection
35    pc := <-t.getIdleConnCh(cm)
36
37  //No connection created
38  pc, err := t.dialConn(ctx, cm)
39
40}

Many of the above codes are actually just to show how this part of the code is tracked down, so that you can follow it when you look at the source code.

Finally, there is a getConn method in the above code. When initiating a network request, a network connection will be taken first, and the connection has two sources.

  • If there is an idle connection, take the idle connection

    1/src/net/http/tansport.go:810
    2func (t *Transport) getIdleConnCh(cm connectMethod) chan *persistConn {
    3 //Replay chan of idle connection
    4 ch, ok := t.idleConnCh[key]
    5   // ...
    6 return ch
    7}
    
  • If there is no free connection, create a long connection.

1/src/net/http/tansport.go:1357
2func (t *Transport) dialConn() {
3  //...
4  conn, err := t.dial(ctx, "tcp", cm.addr())
5  // ...
6  go pconn.readLoop()
7  go pconn.writeLoop()
8  // ...
9}

When an http request is initiated for the first time, there must be no idle connection at this time, and a new connection will be established. A read goroutine and a write goroutine will be created at the same time.

Reading and writing process

Note t.dial(ctx, "tcp", cm.addr()) in the above code. If HTTP is set as at the beginning of the article Transport

 1Dial: func(netw, addr string) (net.Conn, error) {
 2   conn, err := net.DialTimeout(netw, addr, time.Second*2) //Set connection timeout
 3   if err != nil {
 4      return nil, err
 5   }
 6   err = conn.SetDeadline(time.Now().Add(time.Second * 3)) //Set sending and receiving data timeout
 7   if err != nil {
 8      return nil, err
 9   }
10   return conn, nil
11},

Then this will be executed in the following dial

1func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
2   // ...
3  c, err := t.Dial(network, addr)
4  // ...
5}

The settings called here are timed out and will be executed until

 1/src/net/net.go
 2func (c *conn) SetDeadline(t time.Time) error {
 3    //...
 4    c.fd.SetDeadline(t)
 5    //...
 6}
 7
 8//...
 9
10func setDeadlineImpl(fd *FD, t time.Time, mode int) error {
11    // ...
12    runtime_pollSetDeadline(fd.pd.runtimeCtx, d, mode)
13    return nil
14}
15
16
17//go:linkname poll_runtime_pollSetDeadline internal/poll.runtime_pollSetDeadline
18func poll_runtime_pollSetDeadline(pd *pollDesc, d int64, mode int) {
19    // ...
20  //Set a timer event
21  rtf = netpollDeadline
22    //And register the event in the timer
23  modtimer(&pd.rt, pd.rd, 0, rtf, pd, pd.rseq)
24}  

The above source code, in short, is that when the request is called for the first time, a connection will be established, and a timer event will be registered at this time. Assuming that the time is set to 3s, this event will occur after 3s, and then the logic of registering the event will be executed. The registration event is netpollDeadline. Note the netpollDeadline, which will be mentioned later.

Read / write coroutine timer event

The timeout event is set and occurs after 3s. Send and receive data normally in the next period. Business as usual.

Until 3s later, when you read goroutine, you will wait for the network data to return.

1/src/net/http/tansport.go:1642
2func (pc *persistConn) readLoop() {
3    //...
4    for alive {
5        _, err := pc.br.Peek(1)  //Blocking reading the data returned by the server
6    //...
7}

Then it's always with the code.

 1src/bufio/bufio.go: 129
 2func (b *Reader) Peek(n int) ([]byte, error) {
 3   // ...
 4   b.fill() 
 5   // ...   
 6}
 7
 8func (b *Reader) fill() {
 9    // ...
10    n, err := b.rd.Read(b.buf[b.w:])
11    // ...
12}
13
14/src/net/http/transport.go: 1517
15func (pc *persistConn) Read(p []byte) (n int, err error) {
16    // ...
17    n, err = pc.conn.Read(p)
18    // ...
19}
20
21// /src/net/net.go: 173
22func (c *conn) Read(b []byte) (int, error) {
23    // ...
24    n, err := c.fd.Read(b)
25    // ...
26}
27
28func (fd *netFD) Read(p []byte) (n int, err error) {
29    n, err = fd.pfd.Read(p)
30    // ...
31}
32
33/src/internal/poll/fd_unix.go: 
34func (fd *FD) Read(p []byte) (int, error) {
35    //...
36  if err = fd.pd.waitRead(fd.isFile); err == nil {
37    continue
38  }
39    // ...
40}
41
42func (pd *pollDesc) waitRead(isFile bool) error {
43    return pd.wait('r', isFile)
44}
45
46func (pd *pollDesc) wait(mode int, isFile bool) error {
47    // ...
48  res := runtime_pollWait(pd.runtimeCtx, mode)
49    return convertErr(res, isFile)
50}

Until it reaches the runtime_pollWait, which can be simply regarded as waiting for the return of data from the server.

 1//go:linkname poll_runtime_pollWait internal/poll.runtime_pollWait
 2func poll_runtime_pollWait(pd *pollDesc, mode int) int {
 3
 4    // 1. If the network returns data normally, it will jump out
 5  for !netpollblock(pd, int32(mode), false) {
 6    // 2. If there is an error, it will also jump out
 7        err = netpollcheckerr(pd, int32(mode))
 8        if err != 0 {
 9            return err
10        }
11    }
12    return 0
13}

When the whole link follows, it will always wait for data. There are only two waiting results

  • There is data that can be read
  • Error reporting

There are two kinds of error reporting

  • Connection closed
  • overtime
1func netpollcheckerr(pd *pollDesc, mode int32) int {
2    if pd.closing {
3        return 1 // errClosing
4    }
5    if (mode == 'r' && pd.rd < 0) || (mode == 'w' && pd.wd < 0) {
6        return 2 // errTimeout
7    }
8    return 0
9}

The timeout mentioned here refers to the number 2 returned here, which will be converted into ErrTimeout through the following function, and ErrTimeout Error() is actually i/o timeout.

 1func convertErr(res int, isFile bool) error {
 2    switch res {
 3    case 0:
 4        return nil
 5    case 1:
 6        return errClosing(isFile)
 7    case 2:
 8        return ErrTimeout // ErrTimeout.Error() is "i/o} timeout"
 9    }
10    println("unreachable: ", res)
11    panic("unreachable")
12}

So here comes the question.... How does the timeout error returned above, that is, how does the condition meet when returning 2?

1    if (mode == 'r' && pd.rd < 0) || (mode == 'w' && pd.wd < 0) {
2        return 2 // errTimeout
3    }

Remember netpoll deadline just mentioned?

This contains the logic executed when the timer 3s reaches the point.

 1func timerproc(tb *timersBucket) {
 2    //The timer has reached the set time point. Register the function before triggering
 3    f(arg, seq) //netpollDeadline was previously registered
 4}
 5
 6func netpollDeadline(arg interface{}, seq uintptr) {
 7    netpolldeadlineimpl(arg.(*pollDesc), seq, true, true)
 8}
 9
10/src/runtime/netpoll.go: 428
11func netpolldeadlineimpl(pd *pollDesc, seq uintptr, read, write bool) {
12    //...
13    if read {
14        pd.rd = -1
15        rg = netpollunblock(pd, 'r', false)
16    }
17    //...
18}

PD. Is set here Rd = - 1 refers to poller descriptor Read deadline means the read timeout time of the file descriptor of the network poller. We know that everything is a file in linux. The file here actually refers to the socket used in this network communication.

At this time, go back and see that the timeout condition is if (mode = = 'R' & & PD. Rd < 0).

So far. We received an error message of io timeout in our code.

summary

  • Don't use HTTP Timeout is set in transport, which is the timeout of connection, not the timeout of request. Otherwise, an inexplicable io timeout error may occur.

  • The timeout of the request is set in the create client.

If the article is helpful to you, look at the bottom right corner of the article and do something positive energy (click twice) to support it. (crazy hint, please, this is really important to me!)

I'm Xiaobai. I'll see you next time.

 

Topics: Go Back-end http net