Talk about Goroutine scheduling strategy~

Posted by Pavel_Nedved on Mon, 25 Oct 2021 04:09:50 +0200

Original address: Talk about Goroutine scheduling strategy~

As mentioned in <Talk about Goroutine thread model and scheduler (I)>, goroutine scheduling is essentially the process by which runtime code, at certain moments and according to a certain algorithm, selects an appropriate goroutine and runs it on the CPU. That sentence points at the three core questions of any scheduling system:

  1. Scheduling timing: when will scheduling occur?

  2. Scheduling strategy: what strategy is used to select the next goroutine to run on the CPU?

  3. Switching mechanism: how to run the selected goroutine on the CPU?

This article focuses on the scheduling strategy.

Look at line 2467 of the runtime/proc.go file and analyze the schedule function:

// One round of scheduler: find a runnable goroutine and execute it.
// Never returns.
func schedule() {
    _g_ := getg() // _g_ = m.g0

    ......

    var gp *g

    ......

    if gp == nil {
        // Check the global runnable queue once in a while to ensure fairness.
        // Otherwise two goroutines can completely occupy the local runqueue
        // by constantly respawning each other.
        // To ensure scheduling fairness, every 61 scheduling rounds the worker thread
        // first takes a goroutine from the global run queue; if only the local run
        // queue were used, goroutines in the global run queue might never be run.
        if _g_.m.p.ptr().schedtick%61 == 0 && sched.runqsize > 0 {
            lock(&sched.lock) // the global run queue is shared by all worker threads, so it must be locked
            gp = globrunqget(_g_.m.p.ptr(), 1) // take 1 goroutine from the global run queue
            unlock(&sched.lock)
        }
    }
    if gp == nil {
        // Get a goroutine gp from the local run queue of the p bound to the current m.
        gp, inheritTime = runqget(_g_.m.p.ptr())
        if gp != nil && _g_.m.spinning {
            throw("schedule: spinning with local work")
        }
    }
    if gp == nil {
        // If no runnable goroutine was found in either the local or the global run queue,
        // call findrunnable to steal one from another worker thread's run queue.
        // If that fails too, the current worker thread goes to sleep;
        // findrunnable does not return until a runnable goroutine has been obtained.
        gp, inheritTime = findrunnable() // blocks until work is available
    }

    ......

    // The code running here belongs to the runtime and its call stack uses g0's stack space.
    // Call execute to switch to gp's code and stack space and run it.
    execute(gp, inheritTime)
}

schedule searches the run queues for a runnable goroutine in three steps, as follows:

  1. Find a goroutine in the global run queue. To keep scheduling fair, every 61st scheduling round the worker thread looks in the global run queue first, so that goroutines in the global run queue still get a chance to run. The global run queue is shared by all worker threads, so it must be locked before it is accessed.

  2. Find a goroutine in the worker thread's local run queue. If this round does not need to check the global run queue, or nothing was obtained from it, a goroutine is taken from the local run queue.

  3. Steal goroutines from the run queues of other worker threads. If the second step does not yield a goroutine, findrunnable is called to steal from the run queues of other worker threads. Before stealing, findrunnable first tries once more to find a runnable goroutine in the global run queue and in the local run queue of the current worker thread.

Finding a runnable goroutine in the global run queue is done by globrunqget. The function's first parameter is the p bound to the current worker thread, and the second parameter, max, indicates at most how many g's may be moved from the global run queue into the current worker thread's local run queue.

Look at line 4663 of the runtime/proc.go file:

// Try get a batch of G's from the global runnable queue.
// sched must be locked.
func globrunqget(_p_ *p, max int32) *g {
    if sched.runqsize == 0 { // the global run queue is empty
        return nil
    }

    // Divide the goroutines in the global run queue evenly among the Ps.
    n := sched.runqsize/gomaxprocs + 1
    if n > sched.runqsize { // the calculation above may make n larger than the number of goroutines in the global run queue
        n = sched.runqsize
    }
    if max > 0 && n > max {
        n = max // take at most max goroutines
    }
    if n > int32(len(_p_.runq))/2 {
        n = int32(len(_p_.runq)) / 2 // take at most half of the local queue capacity
    }

    sched.runqsize -= n

    // gp is returned directly to the caller; the other goroutines are put
    // into the local run queue through runqput.
    gp := sched.runq.pop() // pop one goroutine from the global run queue
    n--
    for ; n > 0; n-- {
        gp1 := sched.runq.pop()  // pop a goroutine from the global run queue
        runqput(_p_, gp1, false) // put it into the local run queue
    }
    return gp
}

globrunqget first calculates how many goroutines to take, based on the number of goroutines in the global run queue, the parameter max, and the capacity of _p_'s local run queue. It returns the first g to the caller through the return value; the others are put into the current worker thread's local run queue through runqput. This calculation of how many goroutines to take from the global run queue is what load-balances the work across the Ps (gomaxprocs).
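To make the batch-size calculation concrete, here is a minimal, self-contained sketch of the same arithmetic with made-up numbers (the helper batchSize and the values are only for illustration; they are not taken from the runtime):

package main

import "fmt"

// batchSize mirrors the calculation in globrunqget (illustration only):
// split the global run queue evenly across all Ps, then cap the result
// by the caller-supplied max and by half of the local queue capacity.
func batchSize(globalRunqSize, gomaxprocs, max, localRunqCap int32) int32 {
    n := globalRunqSize/gomaxprocs + 1
    if n > globalRunqSize {
        n = globalRunqSize
    }
    if max > 0 && n > max {
        n = max
    }
    if n > localRunqCap/2 {
        n = localRunqCap / 2
    }
    return n
}

func main() {
    // 100 goroutines in the global queue, 4 Ps, no max cap, 256-slot local queue:
    // each P would take 100/4+1 = 26 goroutines.
    fmt.Println(batchSize(100, 4, 0, 256)) // 26
    // schedule calls globrunqget with max = 1, so only one goroutine is taken.
    fmt.Println(batchSize(100, 4, 1, 256)) // 1
}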

If no runnable goroutine is found in the global run queue, the scheduler next looks for one in the local run queue of the current worker thread.

As can be seen from the code above, a worker thread's local run queue has two parts. One part is a lock-free circular queue made up of three members of p: runq, runqhead and runqtail, which can hold at most 256 goroutines. The other part is p's runnext, a pointer to a g structure object that holds at most one goroutine.
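For reference, the relevant fields of the runtime's p struct look roughly like this (a lightly paraphrased excerpt; see runtime/runtime2.go for the real definition):

type p struct {
    // ......
    // Queue of runnable goroutines. Accessed without lock.
    runqhead uint32
    runqtail uint32
    runq     [256]guintptr // lock-free circular queue, at most 256 goroutines
    // runnext, if non-nil, is a runnable G that was readied by the current G
    // and should be run next instead of what is in runq, to reduce latency.
    runnext guintptr
    // ......
}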

Searching the current worker thread's local run queue for a runnable goroutine is done by runqget. The code first checks whether the runnext member is empty; if it is not, it returns the goroutine pointed to by runnext and clears runnext. Otherwise it looks for a goroutine in the circular queue.

Look at line 4825 of the runtime/proc.go file and analyze runqget:

// Get g from local runnable queue.
// If inheritTime is true, gp should inherit the remaining time in the
// current time slice. Otherwise, it should start a new time slice.
// Executed only by the owner P.
func runqget(_p_ *p) (gp *g, inheritTime bool) {
    // If there's a runnext, it's the next G to run.
    // Try to take the goroutine from the runnext member first.
    for {
        // Check whether runnext is empty; if it is not, return that goroutine.
        next := _p_.runnext
        if next == 0 {
            break
        }
        if _p_.runnext.cas(next, 0) {
            return next.ptr(), true
        }
    }

    // Take a goroutine from the circular queue.
    for {
        h := atomic.LoadAcq(&_p_.runqhead) // load-acquire, synchronize with other consumers
        t := _p_.runqtail
        if t == h {
            return nil, false
        }
        gp := _p_.runq[h%uint32(len(_p_.runq))].ptr()
        if atomic.CasRel(&_p_.runqhead, h, h+1) { // cas-release, commits consume
            return gp, false
        }
    }
}

Note that a CAS operation is used both when taking a goroutine from runnext and when taking one from the circular queue. The CAS is necessary because other worker threads may also access these two members in order to steal runnable goroutines from the current thread.

The code uses atomic.LoadAcq and atomic.CasRel for the runqhead operation, providing load acquire and CAS release semantics respectively.

The semantics of atomic.LoadAcq are as follows:

  1. The load is atomic: no matter which platform the code runs on, no other thread can be observed writing to the variable in the middle of the read.

  2. Memory reads and writes that come after atomic.LoadAcq in program order may only be performed after the load itself has completed; neither the compiler nor the CPU is allowed to reorder them ahead of it.

  3. When the current thread executes atomic.LoadAcq, it reads the latest value written to the same variable by other threads through atomic.CasRel. At the same time, the code after atomic.LoadAcq can see the memory writes performed by the code before the matching atomic.CasRel (on the same variable) in those other threads.

The semantics of atomic.CasRel are as follows:

  1. The compare-and-swap operation itself is atomic.

  2. Memory reads and writes that come before atomic.CasRel in program order must complete before atomic.CasRel writes to memory; neither the compiler nor the CPU is allowed to reorder them after it.

  3. After a thread executes atomic.CasRel, other threads reading the same variable through atomic.LoadAcq will see the latest value. At the same time, the memory writes performed by the code before atomic.CasRel become visible to the code after the matching atomic.LoadAcq (on the same variable) in those other threads.

Because multiple threads may read and write runqhead concurrently, and because the value of runqhead is used to index into the runq array, atomic.LoadAcq and atomic.CasRel must be used to guarantee the semantics above.
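To make the acquire/release pairing concrete outside the runtime, here is a minimal single-producer, multi-consumer ring-buffer sketch built on the public sync/atomic package (whose operations give at least the ordering described above; the runtime itself uses its internal runtime/internal/atomic package). The ring type and its push/pop methods are made up for illustration and simply mirror the runq/runqget pattern:

package main

import (
    "fmt"
    "sync/atomic"
)

// ring is a tiny single-producer / multi-consumer queue modeled on p.runq.
type ring struct {
    head uint32 // advanced by consumers (owner or stealers) with CAS
    tail uint32 // written only by the owning "worker thread"
    buf  [8]int
}

// push is called only by the owner, so tail needs no CAS:
// storing tail publishes the element written into buf just before it.
func (r *ring) push(v int) bool {
    h := atomic.LoadUint32(&r.head)
    t := r.tail
    if t-h >= uint32(len(r.buf)) {
        return false // full
    }
    r.buf[t%uint32(len(r.buf))] = v
    atomic.StoreUint32(&r.tail, t+1) // release: element is visible before the new tail
    return true
}

// pop may race with other consumers, so head is advanced with CAS,
// mirroring the LoadAcq/CasRel pair used by runqget.
func (r *ring) pop() (int, bool) {
    for {
        h := atomic.LoadUint32(&r.head) // acquire: see the producer's writes
        t := atomic.LoadUint32(&r.tail)
        if t == h {
            return 0, false // empty
        }
        v := r.buf[h%uint32(len(r.buf))]
        if atomic.CompareAndSwapUint32(&r.head, h, h+1) { // commit the consume
            return v, true
        }
    }
}

func main() {
    var r ring
    r.push(42)
    fmt.Println(r.pop()) // 42 true
}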

Then why are atomic.LoadAcq and atomic.CasRel not used to read p's runqtail?

That is because runqtail is never modified by other threads; only the current worker thread writes it. Since nobody else touches it, the owner does not need atomic operations to read its own tail.

CAS operations must be careful about the ABA problem. Do the two CAS sites in runqget shown above suffer from it?

The answer is no. The analysis is as follows:

  1. First, the CAS on runnext. Only the worker thread bound to _p_ can set runnext to a non-zero value; other worker threads can only change runnext from a non-zero value to zero. Since the worker thread bound to _p_ is the one executing this code, once it has read a value A from runnext, no other thread can change the value to B (zero) and then change it back to A.

  2. Second, the CAS on runq. The current worker thread is operating on _p_'s local run queue. Only the worker thread bound to _p_ adds goroutines to this queue and therefore modifies runqtail; other worker threads never add goroutines here and never modify runqtail, only runqhead. So after the worker thread reads a value A from runqhead, it is impossible for other worker threads to change runqhead to B and then back to A: since runqtail cannot change during this time, runqhead cannot advance past runqtail and wrap around to the value A again. In other words, the code logically rules out the conditions that produce ABA.
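For readers unfamiliar with the ABA problem, here is a contrived, single-goroutine sketch of the hazard in general. It is not runtime code, and the "concurrent" updates are simulated with plain sequential stores:

package main

import (
    "fmt"
    "sync/atomic"
)

func main() {
    var head uint32 = 5 // thread 1 reads A = 5

    observed := atomic.LoadUint32(&head)

    // Meanwhile another thread changes the value to B and then back to A.
    atomic.StoreUint32(&head, 7) // A -> B
    atomic.StoreUint32(&head, 5) // B -> A

    // The CAS still succeeds even though the state changed in between:
    // this is the ABA problem. In runqget it cannot happen, because runqhead
    // only moves forward and cannot pass runqtail and come back to the same
    // value while the owner keeps tail fixed.
    ok := atomic.CompareAndSwapUint32(&head, observed, observed+1)
    fmt.Println(ok) // true, although the intermediate updates were missed
}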

So far we have covered how a worker thread obtains goroutines from the global run queue and from its local run queue. In the next article we will talk about how goroutines are stolen from other worker threads.

The above is only a personal point of view and is not necessarily accurate; I hope it helps you.

Well, that's the end of this article. If you liked it, please like, comment and share.


 

Topics: Go Back-end