Implementation of locks and atomic operations

Posted by Cheeseweasel on Thu, 30 Dec 2021 05:31:27 +0100

Atomicity of operations

Ten threads increment the shared count variable with ++ at the same time, and each thread does so 100,000 times. The expected result is 1,000,000, but the actual result is smaller.

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

#define THREAD_COUNT   10


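void *func(void *arg) {
    volatile int *pcount = (int *)arg;
    int i = 0;

    for (i = 0; i < 100000; i++) {
        (*pcount)++;    // unsynchronized read-modify-write: increments can be lost
        usleep(1);
    }
    return NULL;        // a thread function must return a void *
}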


int main() {

    pthread_t tid[THREAD_COUNT] = {0};

    int count = 0;

    int i = 0;
    for (i = 0; i < THREAD_COUNT; i++) {
        pthread_create(&tid[i], NULL, func, &count);
    }

    for (i = 0; i < 100; i++) {
        printf("count --> %d\n", count);
        sleep(1);
    }

    for (i = 0; i < THREAD_COUNT; i++) {
        pthread_join(tid[i], NULL);
    }

}

Why is the result less than 1,000,000? Because ++ is not an atomic operation. Compiling the following minimal program to assembly shows why:

#include <stdio.h>

int i = 0;
// gcc -S 1_test_i++.c
int main(int argc, char **argv)
{
    i++;
    return 0;
}

i++ is not an atomic operation

Assembly code for i++

	movl	i(%rip), %eax //Load i from memory into register
	addl	$1, %eax //Add 1 to the value of the register
	movl	%eax, i(%rip) //Write the value of the register back to memory

When several threads execute count++ at the same time, most of the time thread 1 finishes all three instructions before thread 2 runs its own.

Sometimes, however, thread 1 executes only its first or first two instructions, thread 2 then performs its whole increment, and only afterwards does thread 1 execute its third instruction. Thread 1 then writes back a value computed from the stale count and overwrites thread 2's increment, which is why the final result is less than 1,000,000.
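The interleaving described above can be replayed deterministically in a single thread. The sketch below is only a model: `cpu_t`, `load`, `add1`, `store`, `run_sequential`, and `run_interleaved` are hypothetical names, and each helper stands for one of the three instructions shown earlier.

```c
/* Hypothetical model of one thread's private copy of the eax register. */
typedef struct { int eax; } cpu_t;

static void load(cpu_t *t, const int *mem)  { t->eax = *mem; }  /* movl i(%rip), %eax */
static void add1(cpu_t *t)                  { t->eax += 1; }    /* addl $1, %eax      */
static void store(const cpu_t *t, int *mem) { *mem = t->eax; }  /* movl %eax, i(%rip) */

/* Thread 1 runs all three steps, then thread 2 does: both increments survive. */
int run_sequential(void) {
    int i = 0;
    cpu_t t1, t2;
    load(&t1, &i); add1(&t1); store(&t1, &i);
    load(&t2, &i); add1(&t2); store(&t2, &i);
    return i;
}

/* Thread 2 runs between thread 1's load and store: one increment is lost. */
int run_interleaved(void) {
    int i = 0;
    cpu_t t1, t2;
    load(&t1, &i);                              /* thread 1 reads 0            */
    load(&t2, &i); add1(&t2); store(&t2, &i);   /* thread 2 writes 1           */
    add1(&t1); store(&t1, &i);                  /* thread 1 also writes 1      */
    return i;
}
```

Calling `run_sequential()` yields 2, while `run_interleaved()` yields 1 even though two increments were executed.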

This makes the code thread-unsafe. To solve the problem, protect the increment with a lock or use an atomic operation.

Mutex

If the lock cannot be acquired, the thread gives up the CPU and is added to a wait queue until the lock is released.

A mutex is appropriate when the work done while holding the lock takes longer than a context switch.


#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

#define THREAD_COUNT   10

pthread_mutex_t mutex;


void *func(void *arg) {
    int *pcount = (int *)arg;
    int i = 0;

    for (i = 0; i < 100000; i++) {
        pthread_mutex_lock(&mutex);
        (*pcount)++;
        pthread_mutex_unlock(&mutex);
        usleep(1);
    }
    return NULL;
}


int main() {

    pthread_t tid[THREAD_COUNT] = {0};

    int count = 0;

    pthread_mutex_init(&mutex, NULL);

    int i = 0;
    for (i = 0; i < THREAD_COUNT; i++) {
        pthread_create(&tid[i], NULL, func, &count);
    }

    for (i = 0; i < 100; i++) {
        printf("count --> %d\n", count);
        sleep(1);
    }

    for (i = 0; i < THREAD_COUNT; i++) {
        pthread_join(tid[i], NULL);
    }

}

 

Spin lock

If the lock cannot be acquired, the thread keeps checking the lock state in a busy loop. While the lock is held, it keeps looping; once the lock is free, it takes the lock and leaves the loop.

Use a spinlock when (1) the task must not block; (2) the critical section is short, only a few instructions.
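The busy loop described above can be sketched with a C11 `atomic_flag`. This is an illustration of the idea, not how `pthread_spinlock_t` is actually implemented; `my_spin_lock`, `my_spin_unlock`, `spin_worker`, `run_spin_demo`, and `SPIN_ITERATIONS` are hypothetical names.

```c
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_ITERATIONS 50000

/* Hypothetical minimal spinlock: the flag is set while the lock is held. */
static void my_spin_lock(atomic_flag *lock) {
    /* test-and-set returns the previous state: true means another thread holds the lock */
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        /* keep looping; the thread never sleeps while it waits */
    }
}

static void my_spin_unlock(atomic_flag *lock) {
    atomic_flag_clear_explicit(lock, memory_order_release);
}

static atomic_flag demo_lock = ATOMIC_FLAG_INIT;
static int demo_count;

static void *spin_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < SPIN_ITERATIONS; i++) {
        my_spin_lock(&demo_lock);
        demo_count++;               /* critical section: a few instructions */
        my_spin_unlock(&demo_lock);
    }
    return NULL;
}

/* Starts nthreads workers (capped at 16) and returns the final counter.
   Error checking of pthread_create is omitted to keep the sketch short. */
int run_spin_demo(int nthreads) {
    pthread_t tid[16];
    if (nthreads > 16) nthreads = 16;
    demo_count = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, spin_worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return demo_count;
}
```

With the lock in place, `run_spin_demo(4)` returns exactly `4 * SPIN_ITERATIONS`. In real code `pthread_spin_lock` is the better choice.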

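PTHREAD_PROCESS_SHARED means the spinlock can be shared between processes, for example with child processes created by fork (the lock must then live in memory that those processes share). The example below uses PTHREAD_PROCESS_PRIVATE, which restricts the lock to the threads of one process.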

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

#define THREAD_COUNT   10

pthread_spinlock_t spinlock;


void *func(void *arg) {
    int *pcount = (int *)arg;
    int i = 0;

    for (i = 0; i < 100000; i++) {
        pthread_spin_lock(&spinlock);
        (*pcount)++;
        pthread_spin_unlock(&spinlock);
        usleep(1);
    }
    return NULL;
}


int main() {

    pthread_t tid[THREAD_COUNT] = {0};

    int count = 0;

    pthread_spin_init(&spinlock, PTHREAD_PROCESS_PRIVATE);

    int i = 0;
    for (i = 0; i < THREAD_COUNT; i++) {
        pthread_create(&tid[i], NULL, func, &count);
    }

    for (i = 0; i < 100; i++) {
        printf("count --> %d\n", count);
        sleep(1);
    }

    for (i = 0; i < THREAD_COUNT; i++) {
        pthread_join(tid[i], NULL);
    }

}

Comparison of usage scenarios of mutex and spinlock

  1. If the operation on the shared resource is simple and makes no system calls, choose spinlock.
  2. If the operation is complex or makes system calls, choose mutex.

It mainly depends on whether the operation costs less or more than a thread switch.

Read-write lock

It is suited to scenarios with many reads and few writes. In general, using a read-write lock is not recommended.
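For reference, a minimal sketch of the POSIX read-write lock API follows. The names `table_lock`, `table_value`, `read_value`, `write_value`, and `readers_share_writers_wait` are hypothetical; the sketch only shows that readers can hold the lock together while a writer is refused.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes pthread_rwlock_t under strict -std=c11 */
#include <errno.h>
#include <pthread.h>

static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
static int table_value;

/* Readers take the lock in shared mode. */
int read_value(void) {
    pthread_rwlock_rdlock(&table_lock);
    int v = table_value;
    pthread_rwlock_unlock(&table_lock);
    return v;
}

/* Writers take the lock in exclusive mode. */
void write_value(int v) {
    pthread_rwlock_wrlock(&table_lock);
    table_value = v;
    pthread_rwlock_unlock(&table_lock);
}

/* Returns 1 if a second read lock succeeds while a write lock is refused. */
int readers_share_writers_wait(void) {
    pthread_rwlock_rdlock(&table_lock);
    int second_reader = pthread_rwlock_tryrdlock(&table_lock); /* 0: shared access granted */
    int writer = pthread_rwlock_trywrlock(&table_lock);        /* EBUSY: readers are active */
    if (writer == 0)
        pthread_rwlock_unlock(&table_lock);   /* not expected; release if it happened */
    if (second_reader == 0)
        pthread_rwlock_unlock(&table_lock);   /* release the second read lock */
    pthread_rwlock_unlock(&table_lock);       /* release the first read lock */
    return second_reader == 0 && writer == EBUSY;
}
```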

Atomic operation

Implement ++i with an assembly instruction, so that the CPU performs the whole increment in a single instruction.


#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

#define THREAD_COUNT   10

// Atomically adds `add` to *value and returns the value *value held before the addition.
int inc(int *value, int add) {
    int old;
/*
xaddl
 Exchanges the first operand (destination operand) with the second operand (source operand), then loads the sum of the two values into the destination operand. The destination operand can be a register or a memory location; the source operand is a register.

This instruction can be used with the LOCK prefix.

TEMP <- SRC + DEST
SRC  <- DEST
DEST <- TEMP
*/
    __asm__ volatile (
        "lock; xaddl %2, %1;" // lock makes the read-modify-write atomic; %2 is add (in eax), %1 is *value
        : "=a" (old), "+m" (*value) // outputs: old = eax after the exchange; *value is both read and written
        : "a" (add) // input: put the value of add into eax
        : "cc", "memory"
    );
    return old;
}


void *func(void *arg) {
    int *pcount = (int *)arg;
    int i = 0;

    for (i = 0; i < 100000; i++) {
        inc(pcount, 1);
        usleep(1);
    }
    return NULL;
}


int main() {

    pthread_t tid[THREAD_COUNT] = {0};

    int count = 0;

    int i = 0;
    for (i = 0; i < THREAD_COUNT; i++) {
        pthread_create(&tid[i], NULL, func, &count);
    }

    for (i = 0; i < 100; i++) {
        printf("count --> %d\n", count);
        sleep(1);
    }

    for (i = 0; i < THREAD_COUNT; i++) {
        pthread_join(tid[i], NULL);
    }

}

 

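The lock prefix locks the bus between the CPU and memory for the duration of the instruction (modern CPUs usually lock only the affected cache line), so no other core can access that memory location in the middle of the operation.

Atomic operations require support from the CPU instruction set.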

CAS (Compare and Swap)

CAS, compare and swap, is an atomic operation that compares first and then assigns: the new value is written only if the current value equals the expected one.

Compare And Swap
if (a == b) {
    a = c;
}

cmpxchg(a, b, c)

bool CAS( int * pAddr, int nExpected, int nNew )
atomically {
    if ( *pAddr == nExpected ) {
        *pAddr = nNew ;
        return true ;
    }
    return false ;
}

For a concrete assembly implementation, see the CAS implementation in the lock-free queue of zmq (ZeroMQ).
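Without dropping to assembly, the same semantics can be sketched with C11 `<stdatomic.h>`. The functions `cas` and `cas_increment` are hypothetical names; `cas` mirrors the pseudocode above, and `cas_increment` shows the usual retry loop built on top of it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Writes desired to *addr only if *addr still equals expected. */
bool cas(atomic_int *addr, int expected, int desired) {
    return atomic_compare_exchange_strong(addr, &expected, desired);
}

/* Increment built from CAS: retry until no other thread changed the value in between.
   Returns the value seen before the increment. */
int cas_increment(atomic_int *addr) {
    int old = atomic_load(addr);
    while (!atomic_compare_exchange_weak(addr, &old, old + 1)) {
        /* on failure, old has been refreshed with the current value; try again */
    }
    return old;
}
```

The weak variant may fail spuriously, which is acceptable here because it sits inside a retry loop.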

Implementation of atomic operation

  1. The gcc and g++ compilers provide a set of atomic operation APIs;
  2. C++11 also provides a set of atomic operation APIs;
  3. They can also be implemented in assembly.
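As a rough illustration of the compiler-provided route, the sketch below wraps the GCC `__atomic_fetch_add` and older `__sync_fetch_and_add` builtins, plus the C11 `atomic_fetch_add` function, which is used here as the C counterpart of the C++11 API. The wrapper names are hypothetical, and the builtins assume GCC or a compatible compiler such as Clang.

```c
#include <stdatomic.h>

/* GCC __atomic builtin: adds n to *p and returns the previous value. */
int builtin_fetch_add(int *p, int n) {
    return __atomic_fetch_add(p, n, __ATOMIC_SEQ_CST);
}

/* Older GCC __sync builtin with the same effect. */
int legacy_fetch_add(int *p, int n) {
    return __sync_fetch_and_add(p, n);
}

/* Standard C11 version; the variable must be declared atomic_int. */
int c11_fetch_add(atomic_int *p, int n) {
    return atomic_fetch_add(p, n);
}
```

Any of these could replace the hand-written `inc` in the thread function, for example `builtin_fetch_add(pcount, 1)`.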

Summary

For complex operations, including file reads, socket operations and other system calls, use a mutex;

For relatively simple operations that no atomic operation covers, use a spinlock;

Use atomic operations if the CPU instruction set supports them.

Thread switching consumes resources and is expensive. Where does the cost come from?

If a switch only meant moving register values, the cost would be very small. Besides saving and restoring registers, a thread switch also goes through the kernel and may involve switching file system and virtual memory state, so the cost is relatively high.

The lock granularity of mutex, mutex_trylock and spinlock decreases in that order.

mutex suits a long stretch of code between lock and unlock and has the coarsest granularity, for example locking an entire B+ tree or red-black tree. If the lock cannot be acquired, the thread is switched out;

mutex_trylock (pthread_mutex_trylock) tries to acquire the lock and returns immediately, without switching threads. In an application it is usually driven by a while loop, but because the loop runs at the application layer, the thread may still be switched out.
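A while loop around `pthread_mutex_trylock` might look like the sketch below. The function name `lock_by_polling`, its `attempts` counter, and the choice to call `sched_yield` between attempts are assumptions made for illustration.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Polls the mutex until it is acquired.
   Returns 0 once the lock is held, or an error number other than EBUSY.
   *attempts receives the number of trylock calls that were made. */
int lock_by_polling(pthread_mutex_t *mtx, long *attempts) {
    int rc;
    *attempts = 0;
    do {
        rc = pthread_mutex_trylock(mtx);  /* returns at once: 0 or EBUSY */
        (*attempts)++;
        if (rc == EBUSY)
            sched_yield();                /* assumption: let other threads run between attempts */
    } while (rc == EBUSY);
    return rc;
}
```

On an uncontended mutex the function succeeds on the first attempt; the caller releases the lock with `pthread_mutex_unlock` as usual.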

spinlock keeps trying to acquire the lock without switching threads and has the finest granularity. It occupies the CPU while it waits: the underlying implementation is a loop that repeatedly checks whether the lock is free and returns only once it is.

How to choose the lock granularity?

If the code under the lock takes longer to execute than a thread switch, a mutex can be used, even though its granularity is coarse; if the code under the lock runs very briefly, such as the few lines that add an element to a queue, a spinlock can be used, even though its granularity is fine.

If a deadlock can occur, neither mutex nor spinlock can be used.

Atomic operations have an even finer granularity; whether they can be used depends on whether the operation is supported as a single assembly instruction.

rwlock, the read-write lock, suits scenarios with many reads and few writes.

 
