Course notes for "[End] 2020 Nanjing University 'Operating System: Design and Implementation' (Jiang Yanyan)" on bilibili.
Summary of this lecture: what concurrency is, why we need it, and a new way of looking at concurrent programs — abandoning atomicity, ordering, and visibility.
Concurrency and parallelism
Suppose the system has only one CPU.
The operating system can load multiple programs (processes) at the same time:
- Each process has an independent address space; processes do not interfere with each other
- Even a process with root privileges cannot directly access the memory of the operating system kernel
- The OS switches to another process at regular intervals
Concurrency in multitasking operating systems:
Source of concurrency: processes call the operating system's APIs.
- write(fd, buf, 1 TiB)
- The implementation of write is part of the operating system
    - After an x86-64 application executes syscall, execution enters the operating system (transparently to the application)
    - Similar to an interrupt handler
    - It runs at the processor's high privilege level, with access to hardware devices (otherwise it could not write the data)
    - It must not hold on to the processor indefinitely (otherwise the system would hang)
- Therefore, another process must be allowed to run while write is only halfway done
    - That process may call read(fd, buf, 512 MiB) to read the same file
- So the operating system's APIs must take concurrency into account
Concurrency: multiple execution flows that may not execute in any specific order.
Parallelism: multiple execution flows are allowed to execute simultaneously (requires a multiprocessor).
| Number of processors | Shared memory? | Typical concurrent system | Concurrent? Parallel? |
|---|---|---|---|
| Single processor | Shared memory | OS kernel / multithreaded programs | Concurrent, not parallel |
| Multiprocessor | Shared memory | OS kernel / multithreaded programs / GPU kernels | Concurrent and parallel |
| Multiprocessor | No shared memory | Distributed systems (message passing) | Concurrent and parallel |
Multiprocessor programming: getting started
Threads
Threads: multiple execution flows that execute concurrently (or in parallel) and share memory.
- The execution flows share the code and all global data (data segment, heap)
- The execution order between threads is non-deterministic
```c
int x = 0, y = 0;

void thread_1() {
    x = 1;                  // [1]
    printf("y = %d\n", y);  // [2]
}

void thread_2() {
    y = 1;                  // [3]
    printf("x = %d\n", x);  // [4]
}
```
1 - 2 - 3 - 4 (y=0, x=1)
1 - 3 - 2 - 4 (y=1, x=1)
...
Threads: what should be shared and what should not be shared?
```c
extern int x;

int foo() {
    int volatile t = x;
    t += 1;
    x = t;
}
```
Consider: which resources are shared if two execution flows call foo at the same time?
- The code of foo (addresses 1140-115f)
    - The function can be called by every thread, so the code is shared
- Registers: rip / rsp / rax
    - Each thread has its own copy of the registers, so they are not shared
- The variable x (at 0x2eb5)
    - Global data is shared between threads, so x is shared

Apart from the shared code and global data, each thread's stack and registers are exclusively its own.
POSIX Threads
- pthread_create creates and runs a thread
    - You get several threads sharing the current address space
- pthread_join waits for a thread to finish
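A minimal sketch (my own, not from the notes; the file name xy.c is made up) that runs the earlier thread_1/thread_2 snippet on exactly these two calls:

```c
// xy.c — a minimal sketch using POSIX threads directly
#include <stdio.h>
#include <pthread.h>

int x = 0, y = 0;

void *thread_1(void *arg) { x = 1; printf("y = %d\n", y); return NULL; }
void *thread_2(void *arg) { y = 1; printf("x = %d\n", x); return NULL; }

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_1, NULL);  // create and start both threads
    pthread_create(&t2, NULL, thread_2, NULL);
    pthread_join(t1, NULL);                     // wait for both to finish
    pthread_join(t2, NULL);
}
```

Compile with gcc xy.c -lpthread; different runs can print different interleavings.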
You can use man 7 pthreads to view the help documentation for pthreads
Manual sections:
man 1: user commands (executable commands and shell programs)
man 2: system calls (kernel routines invoked from user space)
man 3: library functions (provided by program libraries)
man 4: special files (such as device files)
man 5: file formats (for many configuration files and structures)
man 6: games (historically, the chapter for amusing programs)
man 7: conventions, standards, and miscellany (protocols, file systems)
man 8: system administration and privileged commands (maintenance tasks)
man 9: Linux kernel API (kernel routines)
Whether the system has one processor or many, you get several threads sharing the current process's address space:
- Shared code: the code of all threads comes from the code of the current process
- Shared data: global data / heap can be freely referenced
- Independent stack: each thread has an independent stack
threads.h: Simplified Thread APIs
create(fn)
- Creates and runs a thread that immediately starts executing the function fn
- Function prototype: void fn(int tid) {}
- tid numbering starts from 1
join(fn)
- Waits for all threads to finish execution
- Then executes the function fn
- join can only be called once
threads.h implementation
Data structure:
```c
struct thread {
    int id;               // thread number
    pthread_t thread;     // thread handle in the pthread API
    void (*entry)(int);   // entry function
    struct thread *next;  // linked list
};

struct thread *threads;   // list head
void (*join_fn)();        // callback invoked after all threads finish
```
Thread creation implementation:
```c
static inline void *entry_all(void *arg) {
    struct thread *thread = (struct thread *)arg;
    thread->entry(thread->id);
    return NULL;
}

static inline void create(void *fn) {
    struct thread *cur = (struct thread *)malloc(sizeof(struct thread));  // allocate the thread record
    assert(cur);                              // assume the allocation succeeds
    cur->id = threads ? threads->id + 1 : 1;  // assign the thread number
    cur->next = threads;
    cur->entry = (void (*)(int))fn;
    threads = cur;
    pthread_create(&cur->thread, NULL, entry_all, cur);  // call the POSIX API
}
```
Thread join implementation
```c
static inline void join(void (*fn)()) {
    join_fn = fn;
}

__attribute__((destructor)) static void join_all() {
    for (struct thread *next; threads; threads = next) {  // wait for all threads to finish
        pthread_join(threads->thread, NULL);
        next = threads->next;
        free(threads);
    }
    join_fn ? join_fn() : (void)0;  // invoke the callback
}
```
threads.h (the complete file)
```c
// threads.h
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <pthread.h>

struct thread {
    int id;
    pthread_t thread;
    void (*entry)(int);
    struct thread *next;
};

struct thread *threads;
void (*join_fn)();

// ========== Basics ==========

__attribute__((destructor)) static void join_all() {
    for (struct thread *next; threads; threads = next) {
        pthread_join(threads->thread, NULL);
        next = threads->next;
        free(threads);
    }
    join_fn ? join_fn() : (void)0;
}

static inline void *entry_all(void *arg) {
    struct thread *thread = (struct thread *)arg;
    thread->entry(thread->id);
    return NULL;
}

static inline void create(void *fn) {
    struct thread *cur = (struct thread *)malloc(sizeof(struct thread));
    assert(cur);
    cur->id = threads ? threads->id + 1 : 1;
    cur->next = threads;
    cur->entry = (void (*)(int))fn;
    threads = cur;
    pthread_create(&cur->thread, NULL, entry_all, cur);
}

static inline void join(void (*fn)()) {
    join_fn = fn;
}

// ========== Synchronization ==========

#include <stdint.h>

intptr_t atomic_xchg(volatile intptr_t *addr, intptr_t newval) {
    // swap(*addr, newval);
    intptr_t result;
    asm volatile("lock xchg %0, %1"
                 : "+m"(*addr), "=a"(result)
                 : "1"(newval)
                 : "cc");
    return result;
}

intptr_t locked = 0;

static inline void lock() {
    while (1) {
        intptr_t value = atomic_xchg(&locked, 1);
        if (value == 0) {
            break;
        }
    }
}

static inline void unlock() {
    atomic_xchg(&locked, 0);
}

#include <semaphore.h>
#define P sem_wait
#define V sem_post
#define SEM_INIT(sem, val) sem_init(&(sem), 0, val)
```
Getting started with multithreading
Write a program that includes threads.h.
There are two ways to reference the header:
- Place threads.h in the same directory as the source file and use #include "threads.h"
- Use the -I option to add an include path: gcc -I. a.c
```
❯ gcc -I. a.c
/usr/bin/ld: /tmp/ccGTJbHj.o: in function `join_all':
a.c:(.text+0x22): undefined reference to `pthread_join'
/usr/bin/ld: /tmp/ccGTJbHj.o: in function `create':
a.c:(.text+0x150): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
```
The include path is now correct, but linking fails: our code calls pthread functions, yet the pthread library is not linked.
Use the -l option to link the pthread library:
```
❯ gcc a.c -l pthread
❯ ls
a.c  a.out
```
Compilation succeeded
Running ldd on a.out, you can see that the pthread library is required:
```
❯ ldd a.out
        linux-vdso.so.1 (0x00007ffc25dcc000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe681014000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe680e2a000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fe68104d000)
```
Running a.out, the letters 'a' and 'b' are printed in an interleaved fashion.
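a.c itself is not reproduced in the notes; a plausible sketch, assuming it is a small threads.h test in which two threads each print one letter forever:

```c
// a.c — hypothetical reconstruction; the original source is not shown in the notes
#include "threads.h"

void print_letter(int tid) {
    while (1) {
        putchar(tid == 1 ? 'a' : 'b');  // thread 1 prints 'a', thread 2 prints 'b'
    }
}

int main() {
    create(print_letter);
    create(print_letter);
    join(NULL);
}
```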
More examples
- How can we convince ourselves that multiple threads are really started?
```c
// hello-mt.c
#include "threads.h"

void f() {
    static int x = 0;
    printf("Hello from thread #%d\n", x++);
    while (1);
}

int main() {
    for (int i = 0; i < 10; i++) {
        create(f);
    }
    join(NULL);
}
```
Ten threads are created; each executes the function f and prints the current value of the shared counter x.
```
❯ gcc hello-mt.c -l pthread
❯ ./a.out
Hello from thread #1
Hello from thread #2
Hello from thread #3
Hello from thread #6
Hello from thread #0
Hello from thread #4
Hello from thread #7
Hello from thread #8
Hello from thread #5
Hello from thread #9
^C
```
All ten numbers appear, so ten threads really ran — and they all share memory: the static variable x is shared between them.
The output order, however, is not sequential; we will explain why in a later lecture.
- Determining the size of each thread's stack
```c
// stack-probe.c
#include "threads.h"

__thread char *base, *now;  // one copy per thread
__thread int id;
// base: approximate base address of the thread's stack
// now : current stack address of the thread
// id  : thread number
// use objdump to see how thread-local variables are implemented

void set_base(char *ptr) { base = ptr; }
void set_now(char *ptr)  { now = ptr; }
void *get_base() { return &base; }
void *get_now()  { return &now; }

void stackoverflow(int n) {  // infinite recursion
    char x;
    if (n == 0) set_base(&x);
    set_now(&x);
    if (n % 1024 == 0) {
        // every 1024 recursions, print the thread id, recursion depth,
        // stack base address, and the distance grown so far
        printf("[T%d] Stack size @ n = %d: %p +%ld KiB\n",
               id, n, base, (base - now) / 1024);
    }
    stackoverflow(n + 1);
}

void probe(int tid) {
    id = tid;
    printf("[%d] thread local address %p\n", id, &base);
    stackoverflow(0);
}

int main() {
    setbuf(stdout, NULL);
    for (int i = 0; i < 4; i++) {
        // create four threads, each executing probe
        create(probe);
    }
    join(NULL);
}
```
```
❯ gcc stack-probe.c -l pthread
❯ ./a.out
[1] thread local address 0x7f52bb69d628
[T1] Stack size @ n = 0: 0x7f52bb69cdf7 +0 KiB
[2] thread local address 0x7f52bae9c628
[T1] Stack size @ n = 1024: 0x7f52bb69cdf7 +48 KiB
[T2] Stack size @ n = 0: 0x7f52bae9bdf7 +0 KiB
[T1] Stack size @ n = 2048: 0x7f52bb69cdf7 +96 KiB
[4] thread local address 0x7f52b9e9a628
[T4] Stack size @ n = 0: 0x7f52b9e99df7 +0 KiB
[T2] Stack size @ n = 1024: 0x7f52bae9bdf7 +48 KiB
[T1] Stack size @ n = 3072: 0x7f52bb69cdf7 +144 KiB
[T4] Stack size @ n = 1024: 0x7f52b9e99df7 +48 KiB
[T2] Stack size @ n = 2048: 0x7f52bae9bdf7 +96 KiB
...
[T2] Stack size @ n = 151552: 0x7f52bae9bdf7 +7104 KiB
[T1] Stack size @ n = 173056: 0x7f52bb69cdf7 +8112 KiB
[T3] Stack size @ n = 148480: 0x7f52ba69adf7 +6960 KiB
[T4] Stack size @ n = 174080: 0x7f52b9e99df7 +8160 KiB
[T1] Stack size @ n = 174080: 0x7f52bb69cdf7 +8160 KiB
[T2] Stack size @ n = 152576: 0x7f52bae9bdf7 +7152 KiB
[1] 37437 segmentation fault (core dumped)  ./a.out
```
When the stack grows to about 8 MiB, a segmentation fault occurs: the thread has run out of stack space.
Using pmap, you can see each thread's 8192 KiB stack mapping plus a 4 KiB (one page) guard area.
The free command:
- -b display memory usage in bytes
- -k display memory usage in kilobytes
- -m display memory usage in megabytes
- -h display memory usage in human-readable units, automatically choosing the appropriate unit
- -o do not display the buffer adjustment line
- -s <seconds> observe memory usage continuously at the given interval
- -t display a totals line
- -V display version information
```
❯ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.6Gi       2.3Gi       2.6Gi       452Mi       2.8Gi       4.6Gi
Swap:         9.8Gi          0B       9.8Gi
```
My total memory is less than 8 GiB, and each thread's stack needs 8 MiB. Why, then, didn't creating 1000 threads run out of memory?
When you request memory, the operating system does not immediately hand over physical memory; it only gives you virtual (logical) addresses. Only when you actually access an address does the OS map the virtual page to a physical one.
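A small sketch (hypothetical, not from the notes) that makes this visible: allocating virtual memory is nearly free; only touching the pages consumes physical memory. Pause at each getchar() and inspect the process with free or pmap:

```c
// lazy-alloc.c — a sketch of demand paging in action
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    size_t size = 1UL << 30;   // request 1 GiB of virtual memory
    char *buf = malloc(size);  // returns immediately; no physical pages allocated yet
    printf("allocated at %p; check RSS now (still small)\n", (void *)buf);
    getchar();
    memset(buf, 1, size);      // touching every page forces physical allocation
    printf("pages touched; check RSS again (grows by ~1 GiB)\n");
    getchar();
    free(buf);
}
```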
Modify hello-mt.c to create 3 threads, run it in the background, and observe its memory mappings with pmap:
#include "threads.h" void f() { static int x = 0; printf("Hello from thread #%d\n", x++); while (1); // to make sure we're not calling f() for ten times } int main() { for (int i = 0; i < 3; i++) { create(f); } join(NULL); }
```
❯ gcc hello-mt.c -l pthread
❯ ./a.out &
[1] 38856
Hello from thread #0
Hello from thread #1
Hello from thread #2
❯ pmap 38856
38856:   ./a.out
0000558e10c44000      4K r----  a.out
0000558e10c45000      4K r-x--  a.out
0000558e10c46000      4K r----  a.out
0000558e10c47000      4K r----  a.out
0000558e10c48000      4K rw---  a.out
0000558e1287c000    132K rw---    [ anon ]
00007fd1a8000000    132K rw---    [ anon ]
00007fd1a8021000  65404K -----    [ anon ]
00007fd1ae049000      4K -----    [ anon ]
00007fd1ae04a000   8192K rw---    [ anon ]
00007fd1ae84a000      4K -----    [ anon ]
00007fd1ae84b000   8192K rw---    [ anon ]
00007fd1af04b000      4K -----    [ anon ]
00007fd1af04c000   8204K rw---    [ anon ]
00007fd1af84f000    152K r----  libc-2.32.so
00007fd1af875000   1460K r-x--  libc-2.32.so
...
```
Next to each 8192 KiB stack there is a 4 KiB region with no read or write permission (-----); when the stack overflows or underflows into it, a fault is raised.
The pmap command:
The Linux pmap command reports the memory mappings of a process; it is a handy tool for debugging and operations.
- -x display in extended format
- -d display in device format
- -q do not display header and footer lines
- -V display version information
Multiprocessor programming: what we must abandon
Abandoning atomicity
As the number of threads grows, concurrency control becomes harder and harder.
Case 1: transfer
```c
int pay(int money) {
    if (deposit > money) {
        deposit -= money;
        return SUCCESS;
    } else {
        return FAIL;
    }
}
```
At the machine level, deposit -= money is implemented as a load, a subtract, and a store:

```c
int tmp = deposit;  // may interleave with other processors
tmp -= money;       // may interleave with other processors
deposit = tmp;
```
For example:

Thread 1:
[1] if (deposit > money)
[3] deposit -= money;

Thread 2:
[2] if (deposit > money)
[4] deposit -= money;

If deposit == 100 and money == 100, both checks [1] and [2] succeed; after [3] and [4], deposit becomes -100, which is clearly wrong.
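One possible fix, sketched with the lock()/unlock() spinlock already defined in threads.h above: the check and the deduction must sit in the same critical section, so the two threads can no longer interleave between them.

```c
// a sketch, assuming the lock()/unlock() from threads.h above
int pay_locked(int money) {
    int result;
    lock();                   // at most one thread runs the check-and-deduct at a time
    if (deposit > money) {
        deposit -= money;
        result = SUCCESS;
    } else {
        result = FAIL;
    }
    unlock();
    return result;
}
```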
Case 2: multithreaded summation
Compute 2n by letting two threads each add 1 to a shared counter n times:
#include"threads.h" #define n 10000000 long sum = 0; void do_sum(){ for(int i = 0; i < n; i++) sum++; } void print() {printf("sum = %ld\n", sum);} int main(){ create(do_sum); create(do_sum); join(print); }
```
❯ gcc sum.c -l pthread
❯ ./a.out
sum = 10261926
```
The answer should be 20000000, but the result was unexpected.
In one scenario, sum++ is compiled into three steps: t = sum; t++; sum = t.
Consider this interleaving:
Thread 1: [1] t = sum; [5] t++; [6] sum = t
Thread 2: [2] t = sum; [3] t++; [4] sum = t
t is a register, private to each thread.
Starting from sum = 0, after steps 1-6 both threads have executed sum++, yet sum ends up as 1: one increment is lost.
This happens not only on multiprocessors but also on a single processor, because the thread can be interrupted between the steps.
Even the simplest x++ is not guaranteed to be atomic.
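One standard-C fix, sketched with C11 <stdatomic.h> (an assumption of mine; the course code above uses its own lock xchg wrapper instead): make the increment itself indivisible.

```c
// a sketch using C11 atomics; on x86-64 this compiles to a lock-prefixed add
#include <stdatomic.h>

atomic_long asum = 0;

void do_sum_atomic() {
    for (int i = 0; i < 10000000; i++) {
        atomic_fetch_add(&asum, 1);  // indivisible load-add-store
    }
}
```

With both threads running do_sum_atomic, the result is reliably 20000000.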
Abandoning ordering
Compile sum.c with different optimization levels:
```
❯ gcc -O0 sum.c -o sum-0.out -l pthread
❯ gcc -O1 sum.c -o sum-1.out -l pthread
❯ gcc -O2 sum.c -o sum-2.out -l pthread
❯ ./sum-0.out
sum = 10129839
❯ ./sum-1.out
sum = 10000000
❯ ./sum-2.out
sum = 20000000
```
The three optimization levels produce different results. Why?
Inspect the compiled code of the three binaries, e.g. objdump -d sum-0.out | less:
```asm
# Optimization level -O0
0000000000001380 <do_sum>:
    1380: f3 0f 1e fa             endbr64
    1384: 55                      push   %rbp
    1385: 48 89 e5                mov    %rsp,%rbp
    1388: c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
    138f: eb 16                   jmp    13a7 <do_sum+0x27>
    1391: 48 8b 05 98 2c 00 00    mov    0x2c98(%rip),%rax   # 4030 <sum>
    1398: 48 83 c0 01             add    $0x1,%rax
    139c: 48 89 05 8d 2c 00 00    mov    %rax,0x2c8d(%rip)   # 4030 <sum>
    13a3: 83 45 fc 01             addl   $0x1,-0x4(%rbp)
    13a7: 81 7d fc 7f 96 98 00    cmpl   $0x98967f,-0x4(%rbp)
    13ae: 7e e1                   jle    1391 <do_sum+0x11>
    13b0: 90                      nop
    13b1: 90                      nop
    13b2: 5d                      pop    %rbp
    13b3: c3                      retq

# Optimization level -O1
0000000000001203 <do_sum>:
    1203: f3 0f 1e fa             endbr64
    1207: 48 8b 15 0a 2e 00 00    mov    0x2e0a(%rip),%rdx   # 4018 <sum>
    120e: 48 8d 42 01             lea    0x1(%rdx),%rax
    1212: 48 81 c2 81 96 98 00    add    $0x989681,%rdx
    1219: 48 89 c1                mov    %rax,%rcx
    121c: 48 83 c0 01             add    $0x1,%rax
    1220: 48 39 d0                cmp    %rdx,%rax
    1223: 75 f4                   jne    1219 <do_sum+0x16>
    1225: 48 89 0d ec 2d 00 00    mov    %rcx,0x2dec(%rip)   # 4018 <sum>
    122c: c3                      retq

# Optimization level -O2
00000000000012a0 <do_sum>:
    12a0: f3 0f 1e fa             endbr64
    12a4: 48 81 05 69 2d 00 00    addq   $0x989680,0x2d69(%rip)   # 4018 <sum>
    12ab: 80 96 98 00
    12af: c3                      retq
```
The compiler transforms the program we write. These transformations are correct for sequential execution, but under concurrency they produce surprising results: at -O1 the loop keeps the counter in a register and stores to sum only once at the end, so one thread's final store overwrites the other's (sum = 10000000); at -O2 the whole loop collapses into a single addq $0x989680 instruction, so each thread performs just one memory update and this run happens to print the expected 20000000.
Abandoning visibility
Recall the short piece of code from earlier:
```c
int x = 0, y = 0;

void thread_1() {
    x = 1;                  // [1]
    printf("y = %d\n", y);  // [2]
}

void thread_2() {
    y = 1;                  // [3]
    printf("x = %d\n", x);  // [4]
}
```
Visibility is lost too: both threads may print 0 (y = 0 and x = 0), an outcome that no interleaving of [1]-[4] can produce.
The reason: to run faster, the CPU may execute instructions out of order:
```asm
movl $1, (x)     # x = 1: suppose this store misses the cache;
                 # stalling until it completes would waste a lot of time
movl (y), %eax   # so, as long as x and y are different variables,
                 # the CPU issues this load immediately -- and reads y = 0
```
Modern processors:
- If two instructions have no data dependency, they may be executed in parallel
- Out-of-order execution
- On multiple processors, the observed results may not be equivalent to any sequential ordering of the instructions
Code executes in ways more complex than we imagine.
In a modern computer system, even a simple x = 1 goes through:
- C code
    - Compiler optimization → loss of ordering
- Binary file
- Processor execution
    - Interrupts / context switches → loss of atomicity
    - Out-of-order execution → loss of visibility
Shared-memory concurrency thus becomes a real challenge:
- Memory accesses are not guaranteed to happen in the order the code was written
- The atomicity of code can be broken at any time
- Executed instructions may not yet be visible to other processors
Ensuring ordering:
Use the volatile keyword so the compiler cannot optimize the memory accesses away:
```c
void delay() {
    for (volatile int i = 0; i < DELAY_COUNT; i++);
}
```
Use a compiler barrier to enforce the order of memory accesses:
```c
extern int x, y;  // note: y must be declared too (the original snippet omitted it)
#define barrier() asm volatile("" ::: "memory")

void foo() {
    x++;
    barrier();  // prevents the compiler from merging the two accesses to x,
    x++;        // and from moving the access to y before the barrier
    y++;
}
```
Ensuring atomicity:
After stop_the_world() executes, all other threads in the entire system are suspended.
resume_the_world() resumes the other threads afterwards.
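As a rough approximation (my sketch, not the course's implementation), the lock()/unlock() spinlock from threads.h above plays a similar role for threads that follow the same locking protocol:

```c
// a sketch: atomicity via mutual exclusion, using the threads.h spinlock above.
// Unlike a true stop_the_world(), this only excludes threads that also take the lock.
void critical_section() {
    lock();      // "stop the world" for cooperating threads
    // ... operations here appear atomic to every other lock user ...
    unlock();    // "resume the world"
}
```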