[concurrency 1] multiprocessor programming: from getting started to giving up

Posted by amal.barman on Sat, 15 Jan 2022 00:25:49 +0100

These are course notes for the 2020 Nanjing University course "Operating System: Design and Implementation" (Jiang Yanyan), available on bilibili.

Summary of this lecture:
What concurrency is, why we need it, and a new understanding of concurrent programming
Giving up the atomicity, ordering and visibility of programs

Concurrency and parallelism

Suppose the system has only one CPU

The operating system can load multiple programs (processes) at the same time

  • Each process has an independent address space and will not interfere with each other
  • Even a process with root privileges cannot directly access the memory of the operating system kernel
  • The OS switches to another process at regular intervals

Concurrency in multitasking operating systems:

Source of concurrency: processes call the operating system's APIs

  • write(fd, buf, 1 TiB)

  • The implementation of write is part of the operating system

    • After an x86-64 application executes syscall, execution enters the operating system (invisible to the application)
    • Similar to an interrupt handler
    • Runs at the processor's high privilege level: it can access hardware devices (otherwise it couldn't write the data)
    • It cannot occupy the processor indefinitely (otherwise the system would freeze)
  • Therefore, another process must be allowed to execute while write is halfway done

    • Another process calls read(fd, buf, 512 MiB) to read the same file
    • The operating system API needs to consider concurrency

Concurrency: multiple execution flows that may not execute in any specific order

Parallelism: multiple execution flows are allowed to execute simultaneously (on multiple processors)

Number of processors | Shared memory | Typical concurrent system | Concurrent / parallel
Single processor | Shared memory | OS kernel / multithreaded program | Concurrent, not parallel
Multiprocessor | Shared memory | OS kernel / multithreaded program / GPU kernel | Concurrent and parallel
Multiprocessor | No shared memory | Distributed system (message passing) | Concurrent and parallel

Multiprocessor programming: getting started

Threads

Threads: multiple execution flows that run concurrently / in parallel, sharing memory

  • The execution flows share the code and all global data (data segment, heap)
  • The execution order between threads is non-deterministic
int x = 0, y = 0;

void thread_1(){
    x  = 1; // [1]
    printf("y = %d\n",y); // [2]
}

void thread_2(){
    y = 1; // [3]
    printf("x = %d\n",x); // [4]
}

1 - 2 - 3 - 4 (y=0,x=1)

1 - 3 - 2 - 4 (y=1, x=1)

...

Threads: what should be shared and what should not be shared?

extern int x;
void foo(){
    int volatile t = x;
    t += 1;
    x = t;
}

Consider: which resources are shared if two execution flows call foo at the same time?

  • The code of foo (addresses 1140-115f)
    • The function can be called by every thread, so the code is shared
  • Registers: rip / rsp / rax (each thread has its own copy)
  • The variable x (at address 0x2eb5)
    • Global data is shared between threads, so x is shared

Apart from the code and global data, each thread's stack and registers are private to that thread

POSIX Threads

  • Using pthread_create creates and runs a thread
    • Get several threads that share the current address space
  • Using pthread_join waits for a thread to end

You can use man 7 pthreads to view the help documentation for pthreads

Man page sections:

man 1: user commands (executable commands and shell programs)

man 2: system calls (kernel routines invoked from user space)

man 3: library functions (provided by program libraries)

man 4: special files (such as device files)

man 5: file formats (for many configuration files and structures)

man 6: games (historically, the section for amusing programs)

man 7: conventions, standards and miscellanea (protocols, file systems)

man 8: system administration and privileged commands (maintenance tasks)

man 9: Linux kernel API (kernel calls)

Whether the system has a single processor or multiple processors, we get several threads that share the current process's address space

  • Shared code: the code of all threads comes from the code of the current process
  • Shared data: global data / heap can be freely referenced
  • Independent stack: each thread has an independent stack

threads.h: Simplified Thread APIs

create(fn)

  • Create and run a thread that immediately starts executing the function fn
  • Function prototype: void fn(int tid) {}
  • tid is numbered starting from 1

join(fn)

  • Waits for the execution of all threads to end
  • Then executes the function fn
  • join can only be called once
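
A minimal usage sketch of this API (the function names hi and done are mine, for illustration):

#include "threads.h"

void hi(int tid) { printf("Hello from T%d\n", tid); }
void done()      { printf("all threads finished\n"); }

int main() {
  create(hi);   // starts T1
  create(hi);   // starts T2
  join(done);   // done() runs after both threads have exited
}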

threads.h implementation

Data structure:

struct thread {
  int id; // Thread number
  pthread_t thread; // Thread handle in the pthread API
  void (*entry)(int); // Entry address
  struct thread *next; // Linked list
};

struct thread *threads; // Head of the linked list
void (*join_fn)(); // Callback function run after all threads finish

Thread creation implementation:

static inline void *entry_all(void *arg) {
  struct thread *thread = (struct thread *)arg;
  thread->entry(thread->id);
  return NULL;
}

static inline void create(void *fn) {
  struct thread *cur = (struct thread *)malloc(sizeof(struct thread)); // Allocate memory for the thread record
  assert(cur); // Assume the allocation succeeded
  cur->id    = threads ? threads->id + 1 : 1; // Assign the thread number
  cur->next  = threads;
  cur->entry = (void (*)(int))fn;
  threads    = cur;
  pthread_create(&cur->thread, NULL, entry_all, cur); // Call the POSIX API
}

Thread join implementation

static inline void join(void (*fn)()) {
  join_fn = fn;
}

__attribute__((destructor)) static void join_all() {
  for (struct thread *next; threads; threads = next) {
      // Wait for all threads to end
    pthread_join(threads->thread, NULL);
    next = threads->next;
    free(threads);
  }
  join_fn ? join_fn() : (void)0; // Call callback function
}

threads.h (complete listing)

// threads.h

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <pthread.h>

struct thread {
  int id;
  pthread_t thread;
  void (*entry)(int);
  struct thread *next;
};

struct thread *threads;
void (*join_fn)();

// ========== Basics ==========

__attribute__((destructor)) static void join_all() {
  for (struct thread *next; threads; threads = next) {
    pthread_join(threads->thread, NULL);
    next = threads->next;
    free(threads);
  }
  join_fn ? join_fn() : (void)0;
}

static inline void *entry_all(void *arg) {
  struct thread *thread = (struct thread *)arg;
  thread->entry(thread->id);
  return NULL;
}

static inline void create(void *fn) {
  struct thread *cur = (struct thread *)malloc(sizeof(struct thread));
  assert(cur);
  cur->id    = threads ? threads->id + 1 : 1;
  cur->next  = threads;
  cur->entry = (void (*)(int))fn;
  threads    = cur;
  pthread_create(&cur->thread, NULL, entry_all, cur);
}

static inline void join(void (*fn)()) {
  join_fn = fn;
}

// ========== Synchronization ==========

#include <stdint.h>

intptr_t atomic_xchg(volatile intptr_t *addr,
                               intptr_t newval) {
  // swap(*addr, newval);
  intptr_t result;
  asm volatile ("lock xchg %0, %1":
    "+m"(*addr), "=a"(result) : "1"(newval) : "cc");
  return result;
}

intptr_t locked = 0;

static inline void lock() {
  while (1) {
    intptr_t value = atomic_xchg(&locked, 1);
    if (value == 0) {
      break;
    }
  }
}

static inline void unlock() {
  atomic_xchg(&locked, 0);
}

#include <semaphore.h>

#define P sem_wait
#define V sem_post
#define SEM_INIT(sem, val) sem_init(&(sem), 0, val)
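
The synchronization half of the header also wraps POSIX semaphores as P/V. A minimal usage sketch (the names fill, producer and consumer are mine, not from the course):

// sem-demo.c
#include "threads.h"

sem_t fill;

void producer(int tid) { for (int i = 0; i < 3; i++) V(&fill); }
void consumer(int tid) { for (int i = 0; i < 3; i++) { P(&fill); printf("consumed\n"); } }

int main() {
  SEM_INIT(fill, 0); // the semaphore starts at 0
  create(producer);
  create(consumer);
  join(NULL);
}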

Getting started with multithreading

Write a program that includes threads.h

There are two ways to include it:

  1. Put threads.h in the same folder as the source file and #include "threads.h"

  2. Use the -I option to add an include path: gcc -I. a.c

    ❯ gcc -I. a.c
    /usr/bin/ld: /tmp/ccGTJbHj.o: in function `join_all':
    a.c:(.text+0x22): undefined reference to `pthread_join'
    /usr/bin/ld: /tmp/ccGTJbHj.o: in function `create':
    a.c:(.text+0x150): undefined reference to `pthread_create'
    collect2: error: ld returned 1 exit status
    

The header is now found, but the linker still reports errors.

This is because our code calls pthread functions, but we did not link against the pthread library.

Use the -l option to link the pthread library:

❯ gcc a.c -l pthread
❯ ls
a.c  a.out 

Compilation succeeded

Using ldd on a.out, you can see that the pthread library is required:

❯ ldd a.out
	linux-vdso.so.1 (0x00007ffc25dcc000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe681014000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe680e2a000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fe68104d000)

Run a.out and you can see a and b printed alternately.
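
The notes don't show a.c itself; a minimal program consistent with that output might look like this (a sketch, with thread function names Ta and Tb of my choosing):

#include "threads.h"

void Ta() { while (1) { printf("a"); } }
void Tb() { while (1) { printf("b"); } }

int main() {
  create(Ta);
  create(Tb);
  join(NULL);
}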

More examples

  1. How can you convince yourself that multiple threads were actually started?
// hello-mt.c
#include "threads.h"

void f() {
  static int x = 0;
  printf("Hello from thread #%d\n", x++);
  while (1);
}

int main() {
  for (int i = 0; i < 10; i++) {
    create(f);
  }
  join(NULL);
}

Create 10 threads; each executes the function f and prints the current value of the shared counter x.

❯ gcc hello-mt.c -l pthread
❯ ./a.out
Hello from thread #1
Hello from thread #2
Hello from thread #3
Hello from thread #6
Hello from thread #0
Hello from thread #4
Hello from thread #7
Hello from thread #8
Hello from thread #5
Hello from thread #9
^C

We can see that 10 threads were created, and they all increment the same static variable x, which shows that threads share memory.

However, the numbers are not printed in creation order. We will explain this in a later lecture.

  2. Determine the stack size of each thread
// stack-probe.c
#include "threads.h"

__thread char *base, *now; // One variable per thread
__thread int id;
/*
base is the approximate base address of the thread's stack
now is the current address on the thread's stack
id is the thread number
*/

// objdump to see how thread local variables are implemented
void set_base(char *ptr) { base = ptr; }
void set_now(char *ptr)  { now = ptr; }
void *get_base()         { return &base; }
void *get_now()          { return &now; }

void stackoverflow(int n) {
  // infinite recursion 
  char x;
  if (n == 0) set_base(&x);
  set_now(&x);
  if (n % 1024 == 0) {
    // Every 1024 recursions, print the thread id, recursion depth, base address, and the stack growth so far
    printf("[T%d] Stack size @ n = %d: %p +%ld KiB\n",
      id, n, base, (base - now) / 1024);
  }
  stackoverflow(n + 1); 
}

void probe(int tid) {
  id = tid;
  printf("[%d] thread local address %p\n", id, &base);
  stackoverflow(0);
}

int main() {
  setbuf(stdout, NULL);
  for (int i = 0; i < 4; i++) { // Four threads are created, and each thread executes probe
    create(probe);
  }
  join(NULL);
}
❯ gcc stack-probe.c -l pthread
❯ ./a.out
[1] thread local address 0x7f52bb69d628
[T1] Stack size @ n = 0: 0x7f52bb69cdf7 +0 KiB
[2] thread local address 0x7f52bae9c628
[T1] Stack size @ n = 1024: 0x7f52bb69cdf7 +48 KiB
[T2] Stack size @ n = 0: 0x7f52bae9bdf7 +0 KiB
[T1] Stack size @ n = 2048: 0x7f52bb69cdf7 +96 KiB
[4] thread local address 0x7f52b9e9a628
[T4] Stack size @ n = 0: 0x7f52b9e99df7 +0 KiB
[T2] Stack size @ n = 1024: 0x7f52bae9bdf7 +48 KiB
[T1] Stack size @ n = 3072: 0x7f52bb69cdf7 +144 KiB
[T4] Stack size @ n = 1024: 0x7f52b9e99df7 +48 KiB
[T2] Stack size @ n = 2048: 0x7f52bae9bdf7 +96 KiB
...
...
...
[T2] Stack size @ n = 151552: 0x7f52bae9bdf7 +7104 KiB
[T1] Stack size @ n = 173056: 0x7f52bb69cdf7 +8112 KiB
[T3] Stack size @ n = 148480: 0x7f52ba69adf7 +6960 KiB
[T4] Stack size @ n = 174080: 0x7f52b9e99df7 +8160 KiB
[T1] Stack size @ n = 174080: 0x7f52bb69cdf7 +8160 KiB
[T2] Stack size @ n = 152576: 0x7f52bae9bdf7 +7152 KiB
[1]    37437 segmentation fault (core dumped)  ./a.out

When the stack grows to about 8 MiB, a segmentation fault occurs: the thread has run out of stack space.

Using pmap, you can see each thread's 8192 KiB stack mapping plus a 4 KiB (one page) guard area.

Command free

  • -b displays memory usage in bytes
  • -k displays memory usage in kilobytes
  • -m displays memory usage in megabytes
  • -h displays memory usage in human-readable units (B, Ki, Mi, Gi, Ti), automatically choosing the appropriate one
  • -o omits the buffer adjustment column
  • -s <seconds> keeps observing memory usage at the given interval
  • -t displays a row with the totals
  • -V displays version information
❯ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.6Gi       2.3Gi       2.6Gi       452Mi       2.8Gi       4.6Gi
Swap:         9.8Gi          0B       9.8Gi

My machine has less than 8 GiB of memory in total, and each thread needs an 8 MiB stack. Then why didn't creating 1000 threads run out of memory?

When you request memory, the operating system does not immediately hand you physical memory; it only gives you virtual addresses. Only when you actually access an address does the operating system map the virtual page to a physical page (demand paging).
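
A quick way to observe this lazy allocation (a sketch I added; the exact behavior depends on the allocator and the kernel's overcommit settings):

// lazy-alloc.c: reserve 1 GiB of virtual memory, then touch it
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
  size_t size = 1UL << 30;    // 1 GiB of virtual address space
  char *p = malloc(size);
  printf("allocated at %p; check RSS with pmap now\n", (void *)p);
  getchar();                  // pause: resident memory is still small
  memset(p, 1, size);         // touching the pages forces physical mapping
  printf("all pages touched; RSS is now about 1 GiB\n");
  getchar();
  free(p);
}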

Modify hello-mt.c to create 3 threads, run it in the background, and observe its memory mappings with pmap:

#include "threads.h"

void f() {
  static int x = 0;
  printf("Hello from thread #%d\n", x++);
  while (1); // to make sure we're not calling f() for ten times
}

int main() {
  for (int i = 0; i < 3; i++) {
    create(f);
  }
  join(NULL);
}
❯ gcc hello-mt.c -l pthread
❯ ./a.out &
[1] 38856
Hello from thread #0                                                                 
Hello from thread #1
Hello from thread #2
❯ pmap 38856
38856:   ./a.out
0000558e10c44000      4K r---- a.out
0000558e10c45000      4K r-x-- a.out
0000558e10c46000      4K r---- a.out
0000558e10c47000      4K r---- a.out
0000558e10c48000      4K rw--- a.out
0000558e1287c000    132K rw---   [ anon ]
00007fd1a8000000    132K rw---   [ anon ]
00007fd1a8021000  65404K -----   [ anon ]
00007fd1ae049000      4K -----   [ anon ]
00007fd1ae04a000   8192K rw---   [ anon ]
00007fd1ae84a000      4K -----   [ anon ]
00007fd1ae84b000   8192K rw---   [ anon ]
00007fd1af04b000      4K -----   [ anon ]
00007fd1af04c000   8204K rw---   [ anon ]
00007fd1af84f000    152K r---- libc-2.32.so
00007fd1af875000   1460K r-x-- libc-2.32.so
...

You can see a 4 KiB region with no permissions (-----) above and below each 8192 KiB stack: a guard page that triggers a fault if the stack overflows into it.

Command pmap

The Linux pmap command reports the memory mappings of a process. It is a handy tool for Linux debugging and operations work.

-x: display in extended format
-d: display in device format
-q: do not display header and footer lines
-V: display version information

Multiprocessor programming: giving up

Giving up atomicity

As the number of threads grows, concurrency control becomes harder and harder

Case 1: transfer

int pay(int money){
    if (deposit >= money){
        deposit -= money;
        return SUCCESS;
    } else {
        return FAIL;
    }
}

At the machine level, deposit -= money executes in several steps:

int tmp = deposit;
// other processors may run concurrently here
tmp -= money;
// other processors may run concurrently here
deposit = tmp;

For example:

Thread 1:

[1] if (deposit >= money)

[3] deposit -= money;

Thread 2:

[2] if (deposit >= money)

[4] deposit -= money;

If deposit == 100 and money == 100, both checks [1] and [2] pass. After [3] and [4], deposit becomes -100. Clearly wrong.
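
One way to restore atomicity is the spinlock defined in threads.h above. A minimal sketch (deposit, SUCCESS and FAIL as in the snippet above):

int pay(int money){
    lock();                   // serialize the check-and-update
    int ret = FAIL;
    if (deposit >= money){
        deposit -= money;
        ret = SUCCESS;
    }
    unlock();
    return ret;
}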

Case 2: multithreaded summation

Two threads each add 1 to sum n times, summing 2n ones:

#include "threads.h"

#define n 10000000
long sum = 0;

void do_sum(){ for(int i = 0; i < n; i++) sum++; }
void print() {printf("sum = %ld\n", sum);}

int main(){
	create(do_sum);
	create(do_sum);
	join(print);
}
❯ gcc sum.c -l pthread
❯ ./a.out
sum = 10261926

The answer should be 20000000, but the result was unexpected.

One possibility is that sum++ is compiled into three steps: t = sum; t++; sum = t

In this case:

Thread 1: [1]t=sum; [5]t++; [6]sum=t

Thread 2: [2]t=sum; [3]t++; [4]sum=t

t is a register, and registers are private to each thread.

Starting from sum = 0: after steps 1, 2, 3, 4, 5, 6, sum++ has executed twice, but sum is incremented only once, leaving sum == 1.

This happens not only on multiprocessors but also on a single processor, because the program can be interrupted between these steps.

Even the simplest x++ is not guaranteed to be atomic.
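
One way to get the increment back as a single indivisible operation is C11's <stdatomic.h> (not what the course header uses; an assumption on my part):

#include <stdatomic.h>

atomic_long sum = 0; // replaces "long sum = 0"

void do_sum() {
  for (int i = 0; i < n; i++)
    atomic_fetch_add(&sum, 1); // one indivisible read-modify-write
}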

Giving up ordering

Let's compile the summation program with different optimization levels

❯ gcc -O0 sum.c -o sum-0.out -l pthread
❯ gcc -O1 sum.c -o sum-1.out -l pthread
❯ gcc -O2 sum.c -o sum-2.out -l pthread
❯ ./sum-0.out
sum = 10129839
❯ ./sum-1.out
sum = 10000000
❯ ./sum-2.out
sum = 20000000

The output results of the three compilation optimization levels are different. Why?

Inspect the code compiled into the three binaries, e.g. objdump -d sum-0.out | less

# Optimization level -O0
0000000000001380 <do_sum>:
    1380:       f3 0f 1e fa             endbr64 
    1384:       55                      push   %rbp
    1385:       48 89 e5                mov    %rsp,%rbp
    1388:       c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
    138f:       eb 16                   jmp    13a7 <do_sum+0x27>
    1391:       48 8b 05 98 2c 00 00    mov    0x2c98(%rip),%rax        # 4030 <sum>
    1398:       48 83 c0 01             add    $0x1,%rax
    139c:       48 89 05 8d 2c 00 00    mov    %rax,0x2c8d(%rip)        # 4030 <sum>
    13a3:       83 45 fc 01             addl   $0x1,-0x4(%rbp)
    13a7:       81 7d fc 7f 96 98 00    cmpl   $0x98967f,-0x4(%rbp)
    13ae:       7e e1                   jle    1391 <do_sum+0x11>
    13b0:       90                      nop
    13b1:       90                      nop
    13b2:       5d                      pop    %rbp
    13b3:       c3                      retq   


# Optimization level -O1
0000000000001203 <do_sum>:
    1203:       f3 0f 1e fa             endbr64 
    1207:       48 8b 15 0a 2e 00 00    mov    0x2e0a(%rip),%rdx        # 4018 <sum>
    120e:       48 8d 42 01             lea    0x1(%rdx),%rax
    1212:       48 81 c2 81 96 98 00    add    $0x989681,%rdx
    1219:       48 89 c1                mov    %rax,%rcx
    121c:       48 83 c0 01             add    $0x1,%rax
    1220:       48 39 d0                cmp    %rdx,%rax
    1223:       75 f4                   jne    1219 <do_sum+0x16>
    1225:       48 89 0d ec 2d 00 00    mov    %rcx,0x2dec(%rip)        # 4018 <sum>
    122c:       c3                      retq 

# Optimization level -O2
00000000000012a0 <do_sum>:
    12a0:       f3 0f 1e fa             endbr64 
    12a4:       48 81 05 69 2d 00 00    addq   $0x989680,0x2d69(%rip)        # 4018 <sum>
    12ab:       80 96 98 00 
    12af:       c3                      retq 

The compiler transforms the program we write. At -O1, sum is loaded into a register once, the loop runs entirely in registers, and the result is stored back once at the end, so each thread's final store overwrites the other's and the total is n. At -O2, the whole loop collapses into a single addq $0x989680 (10000000); in this run the two single instructions did not interleave, giving 2n, although an addq without a lock prefix is still not atomic. These transformations are perfectly fine for sequential execution, but they produce bad results under concurrency.

Giving up visibility

The short program from earlier:

int x = 0, y = 0;

void thread_1(){
    x  = 1; // [1]
    printf("y = %d\n",y); // [2]
}

void thread_2(){
    y = 1; // [3]
    printf("x = %d\n",x); // [4]
}

Visibility is also lost: both threads may print 0, because one thread's store may not yet be visible to the other.

Reason:

To make the CPU run faster, it can execute instructions out of order

movl	$1, (x)    # x = 1, a cache miss
				   # waiting for this store to complete would waste a lot of time
movl	(y), %eax  # so, as long as x and y are different variables, the CPU issues this load immediately
				   # and it may read y = 0

Modern processors:

  • If two instructions have no data dependency, they are allowed to execute in parallel
  • Out-of-order execution
    • On multiple processors, the results may not be equivalent to executing the instructions in any single sequential order
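
A store-buffering litmus test can observe this on x86 (a sketch using raw pthreads; the empty asm is only a compiler barrier, so the hardware is still free to reorder):

// sb.c: both loads can see 0 (compile with: gcc sb.c -lpthread)
#include <pthread.h>
#include <stdio.h>

int x, y, r1, r2;
pthread_barrier_t start;

void *t1(void *arg) {
  pthread_barrier_wait(&start);
  x = 1;
  asm volatile("" ::: "memory"); // no mfence: the store may still sit in the store buffer
  r1 = y;
  return NULL;
}

void *t2(void *arg) {
  pthread_barrier_wait(&start);
  y = 1;
  asm volatile("" ::: "memory");
  r2 = x;
  return NULL;
}

int main() {
  for (long i = 0; ; i++) {
    x = y = 0;
    pthread_barrier_init(&start, NULL, 2);
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_barrier_destroy(&start);
    if (r1 == 0 && r2 == 0) { // neither thread saw the other's store
      printf("reordering observed at iteration %ld\n", i);
      return 0;
    }
  }
}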

The execution of the code is more complex than we think

In modern computer systems, even a simple x = 1 will experience:

  • C code
    • Compiler optimizations -> loss of ordering
  • Binary file
  • Processor execution
    • Interrupts / concurrency -> loss of atomicity
    • Out-of-order execution -> loss of visibility

Shared memory concurrency becomes a real challenge:

  • Memory accesses are not guaranteed to happen in the order written in the code
  • The atomicity of the code can be broken at any moment
  • Stores that have executed may not yet be visible to other processors

Guaranteeing ordering:

Use the volatile keyword to prevent the compiler from optimizing the memory accesses away:

void delay(){
    for (volatile int i = 0; i < DELAY_COUNT; i++);
}
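
Without volatile, the compiler is free to delete the side-effect-free loop entirely (commonly observed at gcc -O2; a sketch):

void delay_broken(){
    // no observable side effects: at -O2 this whole loop is typically removed
    for (int i = 0; i < DELAY_COUNT; i++);
}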

Ensure the ordering of memory accesses with a compiler barrier:

extern int x, y;

#define barrier() asm volatile("" ::: "memory")

void foo(){
    x++;
    barrier(); // prevents the two accesses to x from being merged
    x++;       // accesses to y cannot be moved before the barrier
    y++;
}

Guaranteeing atomicity:

After the stop_the_world() function is executed, all other threads in the entire system are suspended.

The resume_the_world() function resumes the other threads afterwards.

Topics: C C++ Linux Operating System