Lock implementation and concurrent data structure

Posted by jmoreno on Fri, 01 Oct 2021 02:13:47 +0200


Lock: a lock is placed around a critical section so that the critical section executes as if it were a single atomic instruction. The lock turns the otherwise uncontrollable interleaving produced by OS scheduling into something controllable.

Implementation of lock

Before implementing anything, we should agree on goals: how do we evaluate a lock implementation? First, the basic task: mutual exclusion, preventing multiple threads from entering the critical section at the same time. Second, fairness: every thread contending for the lock should have a fair chance of acquiring it. Third, performance, considered separately for the uncontended and the contended case.

Controlling interrupts

Turn off interrupts around the critical section (an approach designed for single processors), as in the sketch below.
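A minimal sketch; DisableInterrupts() and EnableInterrupts() are placeholders for the privileged hardware instructions that mask and unmask interrupts:

void lock()
{
    DisableInterrupts();  // nothing can preempt us inside the critical section
}

void unlock()
{
    EnableInterrupts();   // interrupts can be delivered again
}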

It is simple but has many disadvantages. It requires trusting applications: a greedy or malicious program can take the lock and never release it, and with interrupts off the operating system never regains control, so only a reboot helps. It does not work on multiprocessors: even with interrupts disabled on one CPU, threads on other CPUs can still enter the critical section. Interrupts can be lost while they are masked (for example, a disk-completion interrupt), which the system may never recover from. Finally, masking and unmasking interrupts is slow on modern CPUs.

The kernel itself can use this method in some cases.

Hardware support

The hardware provides the following primitives, each of which executes atomically.

Atomic exchange: the simplest hardware primitive is the test-and-set instruction, i.e. an atomic exchange. Test-and-set takes a variable, reads the old value, writes a new value, and returns the old value (see the sketch after this list).

Compare-and-swap: as the name suggests, it compares the value at an address with an expected value and, only if they match, writes a new value; it returns the value it found.

Load-linked and store-conditional: the store-conditional succeeds only if the address has not been written since the most recent load-linked.

Fetch-and-add: atomically returns the old value at an address and increments the value stored there by one.
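The C below only describes the semantics of the first two primitives; real hardware executes each of them as a single atomic instruction. The names TestAndSet and CompareAndSwap match the ones used in the lock code later.

// Semantics only: hardware performs this atomically
int TestAndSet(int *old_ptr, int new)
{
    int old = *old_ptr; // fetch the old value
    *old_ptr = new;     // store the new value
    return old;         // return the old value
}

// Semantics only: hardware performs this atomically
int CompareAndSwap(int *ptr, int expected, int new)
{
    int actual = *ptr;
    if (actual == expected)
        *ptr = new;     // update only if the current value matches the expectation
    return actual;
}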

Implementation of locks with hardware primitives

The basic structure of a spin lock is as follows:

typedef struct lock_t
{
    int flag;
} lock_t;

void init(lock_t *mutex)
{
    // 0 -> lock is available, 1 -> held
    mutex->flag = 0;
}

void lock(lock_t *mutex)
{
    while (mutex->flag == 1) // TEST the flag
        ;                    // spin-wait (do nothing)
    mutex->flag = 1;         // now SET it!
}

void unlock(lock_t *mutex)
{
    mutex->flag = 0;
}

The problem is obvious: software alone is not enough. A thread can be interrupted between the test (the while check) and the set (flag = 1), so two threads can both observe flag == 0 and both enter the critical section. Hardware support is needed.

The previous lock can be fixed using the atomic exchange (test-and-set) instruction:

void lock(lock_t *lock)
{
    // Old value 1: another thread holds the lock, so keep spinning.
    // Old value 0: the flag is atomically set to 1 and we now hold the lock.
    while (TestAndSet(&lock->flag, 1) == 1)
        ;
}

On a single processor, a spinning thread never gives up the CPU on its own; only the timer interrupt lets other threads (including the lock holder) run.

Evaluation of the spin lock: it is correct, but fairness is not addressed at all; a spinning thread may spin forever while others repeatedly take the lock. On a single CPU performance is poor: a waiting thread spins away its entire time slice until a clock interrupt schedules another thread, and only after the holder releases the lock does the spinner make progress. On multiple CPUs performance is reasonable, because critical sections are usually short and the lock soon becomes available again.

Load-linked and store-conditional implement a spin lock as follows:

void lock(lock_t *lock)
{
    while (1)
    {
        while (LoadLinked(&lock->flag) == 1)
            ; // spin until the flag looks free
        // Only one store-conditional can succeed; on failure, loop and retry
        if (StoreConditional(&lock->flag, 1) == 1)
            return; // success: flag set to 1, we hold the lock
    }
}

Similarly, a ticket lock can be built with fetch-and-add.

// Shows the logic only; real hardware must perform this atomically
int FetchAndAdd(int *ptr)
{
    int old = *ptr;
    *ptr = old + 1;
    return old;
}
typedef struct lock_t
{
    int ticket;
    int turn;
} lock_t;

void lock_init(lock_t *lock)
{
    lock->ticket = 0;
    lock->turn = 0;
}

void lock(lock_t *lock)
{
    // To acquire the lock, take a ticket: its number is the turn we wait for
    int myturn = FetchAndAdd(&lock->ticket);
    while (lock->turn != myturn)
        ; // spin
}

void unlock(lock_t *lock)
{
    // Release: advance turn to call the next ticket number
    FetchAndAdd(&lock->turn);
}

Fetch-and-add guarantees that every thread eventually acquires the lock: each thread takes a ticket and is served in ticket order, so no thread starves.

The remaining problem is the spinning itself: a spinning thread wastes CPU, and on a single processor it can burn an entire time slice, because it never gives up the CPU voluntarily.

The first solution is to give up the CPU voluntarily. Original code:

while (mutex->flag == 1) // TEST the flag
    ;                    // spin-wait (do nothing)

Amend to read:

while (mutex->flag == 1) // TEST the flag
    yield();             // Discard CPU

yield() is an operating-system primitive: the calling thread voluntarily gives up the CPU and moves from the running state to the ready state (threads have three basic states: running, ready, and blocked; some systems distinguish more).

With two threads, yield() works well. With many threads contending for one lock, however, threads keep getting scheduled only to find the lock held and yield again, bouncing between running and ready, and the cost of all those context switches is high.

The corresponding solution is not to go back to the ready state, but to block.

The lock above can be improved by adding a queue of waiting threads: park() puts the calling thread to sleep and unpark() wakes a chosen thread, as in the sketch below.
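This sketch follows the OSTEP example and uses the Solaris-style park()/unpark()/gettid() primitives; the wait_queue_t type and its wq_add()/wq_remove()/wq_empty() helpers are assumptions, not a real API:

typedef struct
{
    int flag;         // is the lock held?
    int guard;        // spin lock protecting flag and the wait queue
    wait_queue_t *q;  // threads waiting for the lock
} qlock_t;

void qlock(qlock_t *m)
{
    while (TestAndSet(&m->guard, 1) == 1)
        ; // acquire the guard by spinning (it is held only briefly)
    if (m->flag == 0)
    {
        m->flag = 1;        // lock acquired
        m->guard = 0;
    }
    else
    {
        wq_add(m->q, gettid()); // enqueue ourselves...
        m->guard = 0;
        park();                 // ...and sleep until unpark()
    }
}

void qunlock(qlock_t *m)
{
    while (TestAndSet(&m->guard, 1) == 1)
        ;
    if (wq_empty(m->q))
        m->flag = 0;             // nobody waiting: release the lock
    else
        unpark(wq_remove(m->q)); // hand the lock directly to the next waiter
    m->guard = 0;
}

Note the race between releasing the guard and calling park(): if the unpark() arrives in that window, the thread may sleep forever; Solaris adds setpark() to close this gap.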

Two-phase lock: spin for a while first, and go to sleep only if the lock still cannot be acquired; a rough sketch follows.
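A rough sketch of the idea, reusing the qlock_t sketch above; SPIN_BUDGET is an arbitrary threshold that would be tuned in practice:

#define SPIN_BUDGET 1000

void twophase_lock(qlock_t *m)
{
    // Phase 1: spin briefly, hoping the holder releases the lock very soon
    for (int i = 0; i < SPIN_BUDGET; i++)
    {
        while (TestAndSet(&m->guard, 1) == 1)
            ;
        if (m->flag == 0)
        {
            m->flag = 1;   // got the lock while still spinning
            m->guard = 0;
            return;
        }
        m->guard = 0;
    }
    // Phase 2: stop burning CPU and fall back to the sleeping path
    qlock(m);
}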

Lock-based concurrent data structures

Concurrent counter

Common practice:

typedef struct {
    int value;
    pthread_mutex_t lock;
}counter;

void init(counter *c)
{
    pthread_mutex_init(&c->lock,NULL);
    c->value = 0;
}

void inc(counter *c,int n)
{
    pthread_mutex_lock(&c->lock);
    c->value += n;
    pthread_mutex_unlock(&c->lock);   
}

int get(counter *c)
{
    pthread_mutex_lock(&c->lock);
    int ret = c->value;
    pthread_mutex_unlock(&c->lock);  
    return ret; 
}

The test code is as follows:

void *inc10000000(void *p)
{
    counter *c = (counter *)p;
    for(int i=0;i<10000000;i++)
        inc(c,1);
    return NULL;
}

int main()
{
    counter c;
    init(&c);
    pthread_t pid1;
    pthread_t pid2;
    pthread_create(&pid1,NULL,inc10000000,(void *)&c);
    pthread_create(&pid2,NULL,inc10000000,(void *)&c);
    pthread_join(pid1,NULL);
    pthread_join(pid2,NULL);
    printf("done, c->value: %d\n",c.value);
    return 0;
}

Comparison of results:

# Unlocked
l@vm:~/ostep$ time ./a.out 
done, c->value: 10804806  # without the lock, updates are lost and the result is wrong

real	0m0.105s    
user	0m0.199s
sys	0m0.004s

# Lock
l@vm:~/ostep$ time ./a.out 
done, c->value: 20000000

real	0m0.301s  # correct, but slower because of the lock
user	0m0.295s
sys	0m0.008s

An improved mechanism: the approximate (lazy) counter with per-CPU local locks. Each thread updates a local counter under its own lock, and the global lock is taken only when the local value reaches a threshold and is flushed into the global counter.

#define NCPU 4

typedef struct {
    int value;
    pthread_mutex_t lock;

    int localv[NCPU];
    pthread_mutex_t locallock[NCPU];
}counter;

typedef struct {
    counter *c;
    int who;
}counterinfo;

void init(counter *c)
{
    pthread_mutex_init(&c->lock,NULL);
    c->value = 0;
    for(int i=0;i<NCPU;i++)
    {
        c->localv[i] = 0;
        pthread_mutex_init(&c->locallock[i],NULL);
    }
}

void inc(counter *c,int n)
{
    pthread_mutex_lock(&c->lock);
    c->value += n;
    pthread_mutex_unlock(&c->lock); 
}

void inc2(counter *c,int n,int who)
{
    pthread_mutex_lock(&c->locallock[who]);
    c->localv[who] += n;
    // Once the local value passes the threshold, flush it into the global counter
    if(c->localv[who]>50)
    {
        inc(c,c->localv[who]);
        c->localv[who] = 0;
    }
    pthread_mutex_unlock(&c->locallock[who]);  
}

Test code:

void *inc10000000(void *p)
{
    counterinfo *c1 = (counterinfo *)p;
    counter *c = c1->c;
    int who = c1->who;
    for(int i=0;i<10000000;i++)
        inc2(c,1,who);
    inc(c,c->localv[who]); // flush whatever remains in the local counter
    return NULL;
}

int main()
{
    counter c;
    init(&c);
    pthread_t pids[NCPU];
    counterinfo cinfo[NCPU];
    for(int i=0;i<NCPU;i++)
    {
        cinfo[i].c = &c;
        cinfo[i].who = i;
        pthread_create(&pids[i],NULL,inc10000000,(void *)&cinfo[i]);
    }
    for(int i=0;i<NCPU;i++)
        pthread_join(pids[i],NULL);
    printf("done, c->value: %d\n",c.value);
    return 0;
}

Tested on a Raspberry Pi (4-core ARM).

# With per-thread local locks
pi@raspberrypi:~/ostep $ time ./a.out 
done, c->value: 40000000

real	0m5.126s
user	0m16.283s
sys	0m0.001s

# Per-thread local counters left unlocked
pi@raspberrypi:~/ostep $ time ./a.out 
done, c->value: 40000000

real	0m0.264s
user	0m0.837s
sys	0m0.031s

Within limits, in this example the per-thread local counters could even be left unlocked, since each thread only touches its own slot. In my test the lazy method with local locks actually performed worse than the single global lock: the critical section is so short that the cost of the extra lock/unlock operations exceeds the time spent waiting for the global lock. After adding a delay to the increment work, the advantage of the lazy method became obvious.

Concurrent linked list

Protecting the whole list with one global lock is straightforward and needs no further discussion.

Hand-over-hand locking, also known as lock coupling: each node has its own lock, and during traversal the next node's lock is taken before the current node's lock is released (see the sketch below). The defect is the same as before: the extra locking and unlocking is expensive, and in practice it is usually slower than a single list lock.
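A sketch of a hand-over-hand lookup, assuming a singly linked list whose nodes each carry a pthread mutex:

#include <pthread.h>

typedef struct node
{
    int key;
    struct node *next;
    pthread_mutex_t lock;   // one lock per node
} node_t;

// Returns 1 if key is in the list, 0 otherwise
int list_lookup(node_t *head, int key)
{
    if (head == NULL)
        return 0;
    pthread_mutex_lock(&head->lock);
    node_t *cur = head;
    while (cur)
    {
        if (cur->key == key)
        {
            pthread_mutex_unlock(&cur->lock);
            return 1;
        }
        node_t *next = cur->next;
        if (next)
            pthread_mutex_lock(&next->lock); // grab the next node's lock first...
        pthread_mutex_unlock(&cur->lock);    // ...then release the current one
        cur = next;
    }
    return 0;
}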

Concurrent queue

You can also use global locks.

Because a queue is only accessed at its two ends, separate head and tail locks let an enqueue and a dequeue proceed concurrently (see the sketch below).
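A sketch of such a two-lock queue (the design by Michael and Scott, as presented in OSTEP): a dummy node separates the dequeue path (head lock) from the enqueue path (tail lock):

#include <stdlib.h>
#include <pthread.h>

typedef struct qnode
{
    int value;
    struct qnode *next;
} qnode_t;

typedef struct
{
    qnode_t *head;               // dequeue side
    qnode_t *tail;               // enqueue side
    pthread_mutex_t head_lock;
    pthread_mutex_t tail_lock;
} queue_t;

void queue_init(queue_t *q)
{
    qnode_t *dummy = malloc(sizeof(qnode_t)); // dummy node keeps head and tail apart
    dummy->next = NULL;
    q->head = q->tail = dummy;
    pthread_mutex_init(&q->head_lock, NULL);
    pthread_mutex_init(&q->tail_lock, NULL);
}

void queue_enqueue(queue_t *q, int value)
{
    qnode_t *n = malloc(sizeof(qnode_t));
    n->value = value;
    n->next = NULL;
    pthread_mutex_lock(&q->tail_lock);   // enqueue only touches the tail
    q->tail->next = n;
    q->tail = n;
    pthread_mutex_unlock(&q->tail_lock);
}

// Returns 0 and stores the value on success, -1 if the queue is empty
int queue_dequeue(queue_t *q, int *value)
{
    pthread_mutex_lock(&q->head_lock);   // dequeue only touches the head
    qnode_t *old = q->head;
    qnode_t *first = old->next;
    if (first == NULL)
    {
        pthread_mutex_unlock(&q->head_lock);
        return -1;                       // queue is empty
    }
    *value = first->value;
    q->head = first;                     // first becomes the new dummy node
    pthread_mutex_unlock(&q->head_lock);
    free(old);
    return 0;
}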

References

Operating Systems: Three Easy Pieces

This article was published on orzlinux.cn

Topics: Linux, data structures, operating systems