Hand-written ring queue, Part II: going lock-free for high concurrency

Posted by audiodef on Thu, 30 Dec 2021 05:38:46 +0100

This is the second article in the hand-written ring queue series. The previous article is linked below:
Hand tearing ring queue

The previous article introduced a basic ring queue that can be used from multiple threads, but only on one condition:
At any time, there can be at most one producer and one consumer.

In other words, if multiple producers want to write to the queue concurrently, they need a lock or some other external concurrency control to ensure that at most one producer actually writes to the ring queue at any time. Similarly, if multiple consumers want to read from the queue, they also need a lock or some other external concurrency control to ensure that at most one consumer reads from the ring queue at any time.

This article shows how to support the multithreaded scenario of multiple producers writing concurrently and multiple consumers reading concurrently, handled entirely inside the ring queue without any additional external control. Moreover, it uses lock-free techniques to avoid the performance cost of heavyweight lock and unlock operations.

In lock-free data structures, the main implementation technique is to use the CPU's atomic instructions. Before introducing atomic instructions, let's look at what happens without them.

Usually, when program source code is compiled to binary, a single source statement becomes several assembly instructions, so that statement is not atomic when the CPU executes it. A sequence of several statements is even less so. When multiple threads execute these statements concurrently, the corresponding instructions run simultaneously on multiple CPU cores, and nothing guarantees the order in which they execute relative to one another. When multiple threads read and write the same shared data at the same time, they can act on stale values and produce wrong results.

Take the ring queue as an example to illustrate this problem:
The ring queue is in its initial state and empty. Two producer threads both call ring_queue_push() to write to the queue. Inside the function, the producer 1 thread reads tail as 0, and the producer 2 thread also reads tail as 0. Producer 1 then writes its data to position 0 and increases tail by 1, so tail becomes 1.
Producer 2 also writes its data to position 0 and then increases tail by 1. The statement that adds 1 to tail is:
tail = tail + 1;
Since producer 2 read tail as 0 at the start, its CPU core may not notice that another thread has modified tail in the meantime. It still believes tail is 0, so what it finally computes is:
tail = 0 + 1 = 1;

In the end, producer 2 has overwritten the data of producer 1 (that data is lost), yet both ring_queue_push() calls return success. This is a serious bug!

In a real multithreaded environment, the timing of code execution differs from one CPU to another. Without any protection, concurrent writes to the same memory location, or concurrent reads and writes of the same variable, therefore produce serious bugs.
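
The interleaving described above can be replayed by hand in a single thread, which makes the lost update easy to see. The following is a minimal sketch and not part of the queue's source: the names demo_queue_t, unsafe_read_tail, unsafe_write_and_advance and replay_lost_update are made up for this illustration, and the two "producers" are simply two sequences of calls executed in the problematic order.

```c
#include <string.h>

/* Hypothetical, stripped-down queue used only for this illustration. */
typedef struct demo_queue_t {
    int slots[4];
    int tail;
} demo_queue_t;

/* Step 1 of an unprotected push: read the current tail. */
int unsafe_read_tail(const demo_queue_t* q) {
    return q->tail;
}

/* Step 2 of an unprotected push: write the item at the tail value that was
 * read earlier, then store "that value + 1" back into tail. */
void unsafe_write_and_advance(demo_queue_t* q, int seen_tail, int item) {
    q->slots[seen_tail] = item;
    q->tail = seen_tail + 1;
}

/* Replays the interleaving from the text: both producers read tail == 0
 * before either of them writes. Returns the final tail. */
int replay_lost_update(demo_queue_t* q) {
    memset(q, 0, sizeof(*q));

    int seen_by_p1 = unsafe_read_tail(q);   /* producer 1 reads 0 */
    int seen_by_p2 = unsafe_read_tail(q);   /* producer 2 also reads 0 */

    unsafe_write_and_advance(q, seen_by_p1, 111);   /* slot 0 = 111, tail = 1 */
    unsafe_write_and_advance(q, seen_by_p2, 222);   /* slot 0 = 222, tail = 1 */

    return q->tail;
}
```

After the replay, tail is 1 and slot 0 holds 222: two pushes completed, yet only one item is in the queue.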

To solve these problems, atomic instructions come on stage!

With these instructions, operations on data remain atomic even with multiple CPUs. Atomicity means that the operation is the smallest unit of execution and cannot be divided further: a CPU core has either executed the instruction completely or not executed it at all. It can never happen that one CPU core is halfway through the instruction when another CPU core starts executing it on the same data.

By using the CPU's atomic instructions correctly, the various problems of multithreaded concurrency can be solved effectively.
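
As a small illustration of what an atomic instruction buys, the sketch below claims a slot index with GCC's __sync_fetch_and_add builtin, which reads the old value and adds to it as one indivisible step. This is not how the queue in this article is implemented; the helper name claim_slot is hypothetical, and wrap-around and the full check are deliberately left out.

```c
/* Hypothetical helper: atomically returns the current value of *tail and
 * increments it, so two callers can never receive the same index. */
int claim_slot(volatile int* tail) {
    return __sync_fetch_and_add(tail, 1);
}
```

Replaying the earlier example with this helper, producer 1 would receive index 0 and producer 2 would receive index 1, so neither overwrites the other.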

The conventional way to solve multithreaded concurrency problems is to use a mutex, semaphore, condition variable, and so on. These can be understood as coarse-grained locks: they are simple to use and widely applicable, but their performance is comparatively poor.
An atomic instruction can be thought of as a fine-grained lock at the level of a single CPU instruction. Its performance is very high, but designing with it is more complex.

Operating systems and programming languages provide wrapper functions for the CPU's atomic instructions, so we don't need to write assembly by hand.
Taking GCC as an example, it provides a series of built-in atomic functions, such as the one we will use today:

bool __sync_bool_compare_and_swap(type *ptr, type oldval, type newval);

This function compares the value in the memory that ptr points to with oldval. If they are equal, it changes the value in that memory to newval. The whole compare-and-modify sequence is performed atomically. It returns true if the comparison was equal and the modification succeeded, and false otherwise.
This function is also called CAS, the acronym of compare and swap.
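
To make the return value concrete, here is a minimal sketch that exercises the builtin directly. The function names cas_outcomes and cas_add are hypothetical; cas_add shows the common retry-loop pattern, which is an assumption about typical usage rather than something the queue below relies on.

```c
#include <stdbool.h>

/* Shows the two possible outcomes of a single CAS call. */
bool cas_outcomes(void) {
    volatile int x = 5;
    bool first  = __sync_bool_compare_and_swap(&x, 5, 7);  /* x == 5: swapped, x becomes 7 */
    bool second = __sync_bool_compare_and_swap(&x, 5, 9);  /* x == 7: no match, x stays 7 */
    return first && !second && x == 7;
}

/* Atomically adds delta to *value with a compare-and-swap retry loop
 * and returns the new value. */
int cas_add(volatile int* value, int delta) {
    int seen;
    do {
        seen = *value;   /* snapshot the current value */
        /* retry if another thread changed *value after the snapshot */
    } while (!__sync_bool_compare_and_swap(value, seen, seen + delta));
    return seen + delta;
}
```

A failed CAS leaves the memory untouched, which is what lets the caller simply re-read and try again.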

We use atomic instructions to enhance the ring queue so that it supports concurrent reads and writes from multiple producers and multiple consumers. The idea is as follows:
For writing, each producer must first obtain the write lock. Once it holds the write lock, it writes the data, moves tail to the next position, and finally releases the write lock.

For reading, each consumer must first obtain the read lock. Once it holds the read lock, it reads the data, moves head to the next position, and finally releases the read lock.

The overall idea is exactly the same as the traditional approach of guarding reads and writes of shared data with a mutex, but the locks themselves are implemented with atomic instructions. This style of implementation is referred to here as a lock-free data structure.
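
The write lock and read lock described above can each be a single integer flag that is flipped with CAS. The sketch below shows the idea in isolation; the helper names flag_try_acquire, flag_acquire and flag_release are hypothetical and are not part of the queue's API.

```c
/* Hypothetical helpers: the flag is 0 when free and 1 when held. */

/* Tries once to take the flag; returns 1 on success, 0 if it is already held. */
int flag_try_acquire(volatile int* flag) {
    return __sync_bool_compare_and_swap(flag, 0, 1) ? 1 : 0;
}

/* Spins until the flag has been taken. */
void flag_acquire(volatile int* flag) {
    while (!flag_try_acquire(flag)) {
        /* busy-wait: another thread currently holds the flag */
    }
}

/* Releases a flag that the caller holds. */
void flag_release(volatile int* flag) {
    __sync_bool_compare_and_swap(flag, 1, 0);
}
```

Because a thread spins while the flag is held, this behaves like a small spinlock built from one atomic instruction. The thread never sleeps in the kernel, but it does burn CPU while it waits, so the protected section should stay short.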

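In addition, note the following:
Variables such as head and tail are read and written by multiple threads concurrently, so we declare them volatile. This stops the compiler from keeping a copy in a register and forces every access to go to memory, which avoids reading stale data. volatile by itself does not make an operation atomic; atomicity comes from the CAS calls.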

The lock-free ring queue supports concurrent reads and writes by multiple producers and multiple consumers. The source code, implemented in C, is as follows:

// ring_queue.h
#ifndef RING_QUEUE_H
#define RING_QUEUE_H

typedef struct ring_queue_t {
    char* pbuf;
    int item_size;
    int capacity;

    volatile int write_flag;
    volatile int read_flag;

    volatile int head;
    volatile int tail;
    volatile int same_cycle;
} ring_queue_t;

int ring_queue_init(ring_queue_t* pqueue, int item_size, int capacity);
void ring_queue_destroy(ring_queue_t* pqueue);
int ring_queue_push(ring_queue_t* pqueue, void* pitem);
int ring_queue_pop(ring_queue_t* pqueue, void* pitem);
int ring_queue_is_empty(ring_queue_t* pqueue);
int ring_queue_is_full(ring_queue_t* pqueue);

#endif

// ring_queue.c
#include "ring_queue.h"

#include <stdlib.h>
#include <string.h>

#define CAS(ptr, old, new) __sync_bool_compare_and_swap(ptr, old, new)

int ring_queue_init(ring_queue_t* pqueue, int item_size, int capacity) {
    memset(pqueue, 0, sizeof(*pqueue));
    pqueue->pbuf = (char*)malloc(item_size * capacity);
    if (!pqueue->pbuf) {
        return -1;
    }

    pqueue->item_size = item_size;
    pqueue->capacity = capacity;
    pqueue->same_cycle = 1;
    return 0;
}

void ring_queue_destroy(ring_queue_t* pqueue) {
    free(pqueue->pbuf);
    memset(pqueue, 0, sizeof(*pqueue));
}


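int ring_queue_push(ring_queue_t* pqueue, void* pitem) {
    int next;

    // try to set write flag
    while (1) {
        if (ring_queue_is_full(pqueue)) {
            return -1;
        }

        if (CAS(&pqueue->write_flag, 0, 1)) {   // set write flag successfully
            break;
        }
    }

    // re-check under the write flag: another producer may have filled the
    // queue between the check above and the successful CAS
    if (ring_queue_is_full(pqueue)) {
        CAS(&pqueue->write_flag, 1, 0);
        return -1;
    }

    // push data
    memcpy(pqueue->pbuf + pqueue->tail * pqueue->item_size, pitem, pqueue->item_size);
    __sync_synchronize();   // finish copying the item before tail is published
    next = (pqueue->tail + 1) % pqueue->capacity;
    if (0 == next) {    // a new cycle
        // set same_cycle before publishing the wrapped tail; if tail were
        // published first, a consumer could pop this item and set
        // same_cycle = 1, only to have it overwritten by a late same_cycle = 0
        pqueue->same_cycle = 0;     // tail is not the same cycle with head
    }
    pqueue->tail = next;

    // reset write flag
    CAS(&pqueue->write_flag, 1, 0);

    return 0;
}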

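int ring_queue_pop(ring_queue_t* pqueue, void* pitem) {
    int next;

    // try to set read flag
    while (1) {
        if (ring_queue_is_empty(pqueue)) {
            return -1;
        }

        if (CAS(&pqueue->read_flag, 0, 1)) {    // set read flag successfully
            break;
        }
    }

    // re-check under the read flag: another consumer may have emptied the
    // queue between the check above and the successful CAS
    if (ring_queue_is_empty(pqueue)) {
        CAS(&pqueue->read_flag, 1, 0);
        return -1;
    }

    // read data
    memcpy(pitem, pqueue->pbuf + pqueue->head * pqueue->item_size, pqueue->item_size);
    __sync_synchronize();   // finish copying the item out before head is published
    next = (pqueue->head + 1) % pqueue->capacity;
    if (0 == next) {
        // set same_cycle before publishing the wrapped head; if head were
        // published first, producers could refill the queue and set
        // same_cycle = 0, only to have it overwritten by a late same_cycle = 1
        pqueue->same_cycle = 1;     // head is now the same cycle with tail
    }
    pqueue->head = next;

    // reset read flag
    CAS(&pqueue->read_flag, 1, 0);

    return 0;
}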

int ring_queue_is_empty(ring_queue_t* pqueue) {
    return (pqueue->head == pqueue->tail) && pqueue->same_cycle;
}

int ring_queue_is_full(ring_queue_t* pqueue) {
    return (pqueue->head == pqueue->tail) && !pqueue->same_cycle;
}
