(Day 14 of the daily check-in, November 25, 2021)

This section explains how Linux implements atomic variables, interrupt control, spinlocks, semaphores, and read-write locks.
1. How to implement atomic variables in Linux
```c
typedef struct { int counter; } atomic_t; //Common 32-bit atomic variable type
#ifdef CONFIG_64BIT
typedef struct { s64 counter; } atomic64_t; //64-bit atomic variable type
#endif

//Atomically read the variable's value
static __always_inline int arch_atomic_read(const atomic_t *v)
{
    return __READ_ONCE((v)->counter);
}
//Atomically write a specific value
static __always_inline void arch_atomic_set(atomic_t *v, int i)
{
    __WRITE_ONCE(v->counter, i);
}
//Atomically add a specific value
static __always_inline void arch_atomic_add(int i, atomic_t *v)
{
    asm volatile(LOCK_PREFIX "addl %1,%0"
                 : "+m" (v->counter)
                 : "ir" (i) : "memory");
}
//Atomically subtract a specific value
static __always_inline void arch_atomic_sub(int i, atomic_t *v)
{
    asm volatile(LOCK_PREFIX "subl %1,%0"
                 : "+m" (v->counter)
                 : "ir" (i) : "memory");
}
//Atomically add 1
static __always_inline void arch_atomic_inc(atomic_t *v)
{
    asm volatile(LOCK_PREFIX "incl %0"
                 : "+m" (v->counter) :: "memory");
}
//Atomically subtract 1
static __always_inline void arch_atomic_dec(atomic_t *v)
{
    asm volatile(LOCK_PREFIX "decl %0"
                 : "+m" (v->counter) :: "memory");
}
```
Let's see what the __READ_ONCE and __WRITE_ONCE macros do, shown below.
```c
#define __READ_ONCE(x) \
    (*(const volatile __unqual_scalar_typeof(x) *)&(x))
#define __WRITE_ONCE(x, val) \
    do { *(volatile typeof(x) *)&(x) = (val); } while (0)
//__unqual_scalar_typeof yields the unqualified scalar type of x (non-scalar
//types are left unchanged); in plain words, it returns the type of the x
//variable. This is a GCC extension; typeof simply returns the type of x.
//If x is of type int, the macros expand to:
#define __READ_ONCE(x) \
    (*(const volatile int *)&(x))
#define __WRITE_ONCE(x, val) \
    do { *(volatile int *)&(x) = (val); } while (0)
```
Let me interpret this in combination with the code above. Linux wraps these accesses in the __READ_ONCE and __WRITE_ONCE macros, using these GCC features to type-check the code so that errors surface at compile time. The "volatile int *" cast reminds the compiler that it is reading and writing a memory address, so it must not optimize the access away: every access is forced to actually read from or write to memory.
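To make the interface concrete, here is a minimal usage sketch. The counter name and the two functions are hypothetical; atomic_inc and atomic_read are the standard kernel wrappers over the arch_ functions shown above.

```c
#include <linux/atomic.h>

static atomic_t pkt_count = ATOMIC_INIT(0); /* hypothetical shared counter */

/* May be called concurrently on many CPUs: the LOCK-prefixed incl
 * behind atomic_inc keeps the count exact. */
void on_packet(void)
{
    atomic_inc(&pkt_count);
}

int packet_count(void)
{
    /* A plain volatile read of counter, never optimized away. */
    return atomic_read(&pkt_count);
}
```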
2. How to implement interrupt control
```c
//Actually save the EFLAGS register
extern __always_inline unsigned long native_save_fl(void)
{
    unsigned long flags;
    asm volatile("# __raw_save_flags\n\t"
                 "pushf ; pop %0"
                 : "=rm" (flags) :: "memory");
    return flags;
}
//Actually restore the EFLAGS register
extern inline void native_restore_fl(unsigned long flags)
{
    asm volatile("push %0 ; popf"
                 :: "g" (flags) : "memory", "cc");
}
//Actually disable interrupts
static __always_inline void native_irq_disable(void)
{
    asm volatile("cli" ::: "memory");
}
//Actually enable interrupts
static __always_inline void native_irq_enable(void)
{
    asm volatile("sti" ::: "memory");
}
//arch layer: disable interrupts
static __always_inline void arch_local_irq_disable(void)
{
    native_irq_disable();
}
//arch layer: enable interrupts
static __always_inline void arch_local_irq_enable(void)
{
    native_irq_enable();
}
//arch layer: save the EFLAGS register
static __always_inline unsigned long arch_local_save_flags(void)
{
    return native_save_fl();
}
//arch layer: restore the EFLAGS register
static __always_inline void arch_local_irq_restore(unsigned long flags)
{
    native_restore_fl(flags);
}
//Actually save the EFLAGS register and disable interrupts
static __always_inline unsigned long arch_local_irq_save(void)
{
    unsigned long flags = arch_local_save_flags();
    arch_local_irq_disable();
    return flags;
}
//raw layer: disable/enable interrupt macros
#define raw_local_irq_disable() arch_local_irq_disable()
#define raw_local_irq_enable()  arch_local_irq_enable()
//raw layer: save/restore EFLAGS register macros
#define raw_local_irq_save(flags)           \
    do {                                    \
        typecheck(unsigned long, flags);    \
        flags = arch_local_irq_save();      \
    } while (0)
#define raw_local_irq_restore(flags)        \
    do {                                    \
        typecheck(unsigned long, flags);    \
        arch_local_irq_restore(flags);      \
    } while (0)
#define raw_local_save_flags(flags)         \
    do {                                    \
        typecheck(unsigned long, flags);    \
        flags = arch_local_save_flags();    \
    } while (0)
//Common layer interface macros
#define local_irq_enable()       do { raw_local_irq_enable(); } while (0)
#define local_irq_disable()      do { raw_local_irq_disable(); } while (0)
#define local_irq_save(flags)    do { raw_local_irq_save(flags); } while (0)
#define local_irq_restore(flags) do { raw_local_irq_restore(flags); } while (0)
```
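Here is a minimal usage sketch of the common-layer interface: protecting a short critical section against interrupts on the local CPU. The data being touched is hypothetical.

```c
unsigned long flags;

local_irq_save(flags);     /* save EFLAGS, then execute cli */
/* ... touch data that a local interrupt handler also touches ... */
local_irq_restore(flags);  /* restore the saved EFLAGS; interrupts are
                              re-enabled only if they were enabled before */
```

Note the layering: the common-layer macros delegate to the raw layer, which delegates to the arch layer, which finally calls the native_ functions that execute the cli/sti and pushf/popf instructions.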
3. Working principle and implementation of Linux spin lock
- Linux queued spin lock
```c
//raw-layer spinlock data structure
typedef struct raw_spinlock {
    unsigned int slock; //The real lock-value variable
} raw_spinlock_t;
//Top-level spinlock data structure
typedef struct spinlock {
    struct raw_spinlock rlock;
} spinlock_t;
//**Linux does not define this structure; it is shown only to aid the description**
typedef struct raw_spinlock {
    union {
        unsigned int slock; //The real lock-value variable
        struct {
            u16 owner; //Illustrative field: low 16 bits of slock
            u16 next;  //Illustrative field: high 16 bits of slock
        };
    };
} raw_spinlock_t;
```
The slock field is divided into two 16-bit halves that store, respectively, the sequence number of the current lock holder (owner) and the sequence number to hand to the next lock applicant (next), as the illustrative structure above shows.

Note that, to keep behavior uniform, Linux wraps raw_spinlock_t inside the spinlock_t structure, and the next and owner fields do not actually exist in raw_spinlock_t: the code operates directly on the upper 16 bits and lower 16 bits of slock.

Only when next equals owner is the spinlock unused (no process is applying for the lock). When a queued spinlock is initialized, slock is set to 0, i.e. next and owner are both 0. When a Linux process applies for the spinlock, it atomically adds 1 to the next field and takes the original value as its own ticket number.

If the returned ticket equals the owner value at the time of application, the spinlock was unused and the process acquires the lock directly. Otherwise, the process spins, repeatedly checking whether owner has become equal to its ticket; as soon as they are equal, it is this process's turn to take the lock.

When a process releases the spinlock, it atomically adds 1 to the owner field. The next waiting process observes the change and leaves its spin loop. Processes therefore acquire a queued spinlock strictly in application order, which eliminates the disorderly competition of the naive spinlock.
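Before reading the kernel's assembly, it may help to see the same ticket discipline as a plain C model. This is an illustrative sketch built on GCC's __atomic builtins, not kernel code; the kernel's real implementation is the assembly that follows.

```c
typedef struct {
    unsigned short owner; /* ticket currently being served */
    unsigned short next;  /* next ticket to hand out */
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l)
{
    /* atomically take a ticket: fetch-and-add on next */
    unsigned short ticket =
        __atomic_fetch_add(&l->next, 1, __ATOMIC_ACQUIRE);
    /* spin until it is our turn */
    while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != ticket)
        ; /* busy-wait */
}

void ticket_unlock(ticket_lock_t *l)
{
    /* serve the next ticket in line */
    __atomic_fetch_add(&l->owner, 1, __ATOMIC_RELEASE);
}
```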
```c
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
    int inc = 0x00010000;
    int tmp;
    __asm__ __volatile__(
        "lock ; xaddl %0, %1\n"  //Atomically: inc = old slock, slock += inc;
                                 //i.e. read next/owner and do next+1
        "movzwl %w0, %2\n\t"     //Zero-extend the low 16 bits of inc into tmp:
                                 //tmp = (u16)inc, i.e. tmp = owner
        "shrl $16, %0\n\t"       //inc >>= 16, i.e. inc = our ticket (old next)
        "1:\t"
        "cmpl %0, %2\n\t"        //Compare inc with tmp: our ticket vs owner
        "je 2f\n\t"              //If equal, jump to label 2 and return
        "rep ; nop\n\t"          //pause instruction
        "movzwl %1, %2\n\t"      //Re-read the low 16 bits of slock into tmp,
                                 //i.e. tmp = current owner
        "jmp 1b\n"               //Jump back to label 1 and compare again
        "2:"
        : "+Q" (inc), "+m" (lock->slock), "=r" (tmp)
        :: "memory", "cc");
}

#define UNLOCK_LOCK_PREFIX LOCK_PREFIX
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
    __asm__ __volatile__(
        UNLOCK_LOCK_PREFIX "incw %0" //Add 1 to the low 16 bits of slock,
                                     //i.e. owner + 1
        : "+m" (lock->slock)
        :: "memory", "cc");
}
```
Each instruction in the code above is annotated; the comments walk through exactly what it does.
The xaddl %0, %1 instruction operates in this order:

- exchange the values of %0 and %1;
- %1 = %0 + %1 (the sum of the two original values).

The net result is that inc holds the original value of slock, and slock = slock + 0x00010000, i.e. the next field is incremented by 1.
- Process 1 applies:
  next = 1, owner = 0 (at application time the old value of next is 0, which equals owner, so locking succeeds);
  xaddl 0x00010000, 0x00000000
  => slock = 0x00010000, inc = 0x00000000
- Process 2 applies:
  next = 2, owner = 0 (at application time owner is 0 and the old value of next is 1, so locking fails and process 2 spins);
  xaddl 0x00010000, 0x00010000
  => slock = 0x00020000, inc = 0x00010000 (note: after xaddl, the high 16 bits of inc hold next's value from before the increment, i.e. process 2's ticket 0x0001)
  Spin loop: (1) compare the ticket 0x0001 in the high 16 bits of inc with owner; (2) if they are not equal, re-read owner from slock and jump back to (1) to keep spinning.
- Process 3 applies:
  next = 3, owner = 0 (at application time owner is 0 and the old value of next is 2, so locking fails and process 3 spins);
  xaddl 0x00010000, 0x00020000
  => slock = 0x00030000, inc = 0x00020000 (ticket = 0x0002)
  Spin loop: compare ticket 0x0002 with owner 0x0000; not equal, so keep spinning.
- Process 1 releases:
  owner = 1 (incw adds 1 to the low 16 bits; release succeeds);
  slock = 0x00030001
- Process 2 (spinning):
  next = 3, owner = 1; process 2 re-reads owner = 0x0001 and compares it with its ticket 0x0001. They are equal, so process 2 acquires the lock and leaves the loop.
- Process 3 (spinning):
  next = 3, owner = 1; process 3 compares its ticket 0x0002 with owner 0x0001. They are not equal, so it keeps spinning until process 2 releases the lock and owner becomes 2.
Sometimes a process that finds its requested spinlock already held by another process would rather give up immediately and do other work than sit there wasting CPU time on spinning.

For this case, Linux provides a corresponding trylock interface, shown below.
```c
static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{
    int tmp;
    int new;
    asm volatile(
        "movl %2,%0\n\t"           //tmp = slock
        "movl %0,%1\n\t"           //new = tmp
        "roll $16, %0\n\t"         //Rotate tmp left by 16 bits, i.e. swap
                                   //next and owner
        "cmpl %0,%1\n\t"           //Compare tmp with new, i.e. test whether
                                   //next == owner (lock free)
        "jne 1f\n\t"               //If not equal (lock busy), jump to label 1
        "addl $0x00010000, %1\n\t" //Equivalent to next + 1
        "lock ; cmpxchgl %1,%2\n\t"//If slock still equals eax (unchanged),
                                   //atomically write new into slock
        "1:"
        "sete %b1\n\t"             //new = EFLAGS.ZF; ZF reflects whether the
                                   //preceding comparison found equality
        "movzbl %b1,%0\n\t"        //tmp = new
        : "=&a" (tmp), "=Q" (new), "+m" (lock->slock)
        :: "memory", "cc");
    return tmp;
}

int __lockfunc _spin_trylock(spinlock_t *lock)
{
    preempt_disable();
    if (_raw_spin_trylock(lock)) {
        spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
        return 1;
    }
    preempt_enable();
    return 0;
}
#define spin_trylock(lock) __cond_lock(lock, _spin_trylock(lock))
```
- Process 1 applies:
  tmp = slock = 0x00000000
  new = tmp = 0x00000000
  rotating tmp left by 16 bits leaves tmp = 0x00000000
  tmp and new are equal, so no jump
  "addl $0x00010000, new" => new = 0x00010000
  cmpxchgl new, slock => eax holds tmp; eax and slock are equal, so slock = new = 0x00010000, ZF = 1
  Locking succeeds, 1 is returned
- Process 2 applies:
  tmp = slock = 0x00010000
  new = tmp = 0x00010000
  rotating tmp left by 16 bits gives tmp = 0x00000001
  tmp and new are not equal, ZF = 0, jump to label 1
  tmp = new = ZF = 0
  Locking fails, 0 is returned
- Process 3 applies:
  tmp = slock = 0x00010000
  new = tmp = 0x00010000
  rotating tmp left by 16 bits gives tmp = 0x00000001
  tmp and new are not equal, ZF = 0, jump to label 1
  tmp = new = ZF = 0
  Locking fails, 0 is returned
- Process 1 releases:
  slock = 0x00010000, and __raw_spin_unlock adds 1 to the low 16 bits, so afterwards slock = 0x00010001
- Process 2 applies:
  tmp = slock = 0x00010001
  new = tmp = 0x00010001
  rotating tmp left by 16 bits gives tmp = 0x00010001 (next and owner are equal, so swapping them changes nothing)
  tmp and new are equal, so no jump
  "addl $0x00010000, new" => new = 0x00020001
  cmpxchgl new, slock => eax holds tmp; eax and slock are equal, so slock = new = 0x00020001, ZF = 1
  Locking succeeds, 1 is returned
- Process 3 applies:
  tmp = slock = 0x00020001
  new = tmp = 0x00020001
  rotating tmp left by 16 bits gives tmp = 0x00010002
  tmp and new are not equal, ZF = 0, jump to label 1
  tmp = new = ZF = 0
  Locking fails, 0 is returned
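As a usage illustration, here is a minimal hedged sketch of the spin_trylock interface defined above: try for the lock once and fall back to other work instead of spinning. The lock name and the statistics update are hypothetical.

```c
static DEFINE_SPINLOCK(stat_lock); /* hypothetical lock for shared stats */

void try_update_stats(void)
{
    if (spin_trylock(&stat_lock)) {
        /* lock acquired: update the shared statistics */
        spin_unlock(&stat_lock);
    } else {
        /* lock busy: skip this round and do something else */
    }
}
```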
4. Data structure and implementation of Linux semaphore
When the value of a semaphore is positive, a requesting process can acquire it and proceed. When it is 0, the semaphore is occupied by other processes, and the requesting process enters the sleep queue and waits to be woken. The biggest advantage of the semaphore is therefore twofold: a process whose request fails sleeps instead of spinning, and the semaphore can also serve as a resource counter.
Let's take a look at the data structure used by Linux to implement semaphores, as follows:
```c
struct semaphore {
    raw_spinlock_t lock;        //Spinlock protecting the semaphore itself
    unsigned int count;         //Semaphore value
    struct list_head wait_list; //List on which sleeping waiter processes hang
};
```
Next, we follow the Linux semaphore interface functions to explore, step by step, how Linux semaphores work and how they affect process state. First, let's look at a use case, shown below.
```c
#define down_console_sem() do { \
    down(&console_sem);          \
} while (0)

static void __up_console_sem(unsigned long ip)
{
    up(&console_sem);
}
#define up_console_sem() __up_console_sem(_RET_IP_)

//Lock the console
void console_lock(void)
{
    might_sleep();
    down_console_sem(); //Acquire the semaphore console_sem
    if (console_suspended)
        return;
    console_locked = 1;
    console_may_schedule = 1;
}

//Unlock the console
void console_unlock(void)
{
    static char ext_text[CONSOLE_EXT_LOG_MAX];
    static char text[LOG_LINE_MAX + PREFIX_MAX];
    //... a lot of code deleted
    up_console_sem(); //Release the semaphore console_sem
    raw_spin_lock(&logbuf_lock);
    //... a lot of code deleted
}
```
To keep the explanation simple I deleted a lot of code; the code above uses the console driver as an example of how semaphores are used.
In kernel/printk.c of the Linux source code, the macro DEFINE_SEMAPHORE declares a single-value semaphore (effectively a mutex lock) console_sem, which protects the console driver list console_drivers and synchronizes access to the whole console driver layer.

The macro down_console_sem() is defined to acquire the semaphore console_sem, and up_console_sem() to release it. The console_lock and console_unlock functions implement mutually exclusive access to the console drivers; their core operation is simply to invoke these two macros.

In the scenario above, the heart of the down_console_sem() and up_console_sem() macros is calling the semaphore interface functions down and up, which perform the real acquire and release operations. The code is shown below.
```c
static inline int __sched __down_common(struct semaphore *sem, long state,
                                        long timeout)
{
    struct semaphore_waiter waiter;
    //Add the waiter to the tail of sem->wait_list
    list_add_tail(&waiter.list, &sem->wait_list);
    waiter.task = current; //current is the current process, i.e. the caller
    waiter.up = false;
    for (;;) {
        if (signal_pending_state(state, current))
            goto interrupted;
        if (unlikely(timeout <= 0))
            goto timed_out;
        //Set the current process's state, i.e. put it to sleep, using the
        //state passed in from __down below: TASK_UNINTERRUPTIBLE. In this
        //state the process sleeps until explicitly woken and does not
        //respond to signals.
        __set_current_state(state);
        raw_spin_unlock_irq(&sem->lock); //Release the lock taken in down()
        timeout = schedule_timeout(timeout); //Really go to sleep here
        raw_spin_lock_irq(&sem->lock); //The next time this process runs it
                                       //returns here, so re-take the lock
        if (waiter.up)
            return 0;
    }
 timed_out:
    list_del(&waiter.list);
    return -ETIME;
 interrupted:
    list_del(&waiter.list);
    return -EINTR;
    //For simplicity, the detailed signal- and timeout-handling code
    //has been removed
}

//Go to sleep and wait
static noinline void __sched __down(struct semaphore *sem)
{
    __down_common(sem, TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}

//Acquire the semaphore
void down(struct semaphore *sem)
{
    unsigned long flags;
    //Lock the semaphore itself and disable interrupts: another piece of
    //code may be operating on this semaphore concurrently
    raw_spin_lock_irqsave(&sem->lock, flags);
    if (likely(sem->count > 0))
        sem->count--; //If the semaphore value is greater than 0, decrement it
    else
        __down(sem); //Otherwise put the current process to sleep
    raw_spin_unlock_irqrestore(&sem->lock, flags);
}

//Actually wake up a process
static noinline void __sched __up(struct semaphore *sem)
{
    //Take the first semaphore_waiter on the semaphore's wait list; it
    //holds the pointer to the sleeping process
    struct semaphore_waiter *waiter =
        list_first_entry(&sem->wait_list, struct semaphore_waiter, list);
    list_del(&waiter->list);
    waiter->up = true;
    wake_up_process(waiter->task); //Wake the process: it rejoins the run queue
}

//Release the semaphore
void up(struct semaphore *sem)
{
    unsigned long flags;
    //Lock the semaphore itself and disable interrupts: another piece of
    //code may be operating on this semaphore concurrently
    raw_spin_lock_irqsave(&sem->lock, flags);
    if (likely(list_empty(&sem->wait_list)))
        sem->count++; //If the wait list is empty, increment the value
    else
        __up(sem); //Otherwise wake up a waiting process
    raw_spin_unlock_irqrestore(&sem->lock, flags);
}
```
The logic in the code above already describes how semaphores work. Note that once a process has entered __down_common, it sets itself to an uninterruptible wait state and then calls schedule_timeout, which invokes the scheduler and switches directly to another runnable process.

From this point, the process does not return until it is next woken by up. After wake_up_process runs and the process is rescheduled, it resumes at the line after schedule_timeout, returns along the call path, and finally returns from __down_common; the process is now awake.
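To round this off, a minimal hedged usage sketch in the style of the console_sem example: a single-value semaphore serializing access to a device. The names are hypothetical, and it assumes the one-argument DEFINE_SEMAPHORE form used in the kernel version discussed here.

```c
#include <linux/semaphore.h>

static DEFINE_SEMAPHORE(my_dev_sem); /* hypothetical single-value semaphore */

void my_dev_write(void)
{
    down(&my_dev_sem);   /* count 1 -> 0, or sleep if already held */
    /* ... exclusive access to the device ... */
    up(&my_dev_sem);     /* wake the first waiter, or count 0 -> 1 */
}
```

Unlike a spinlock, the loser here sleeps instead of burning CPU, which is why down() may only be called from contexts that are allowed to sleep.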
5. Data structure and implementation of Linux read-write lock
A read-write lock is also called a shared-exclusive lock: when it is locked in read mode it is held in shared mode, and when it is locked in write (modify) mode it is held in exclusive (mutually exclusive) mode.

Read-write locks are ideal for scenarios where data is read far more often than it is modified. Read operations from multiple processes can then run concurrently at any time, which gives the system a higher degree of concurrency.

How does a read-write lock work? Reads and writes are mutually exclusive: no writing while reading, and no reading while writing. In addition, when readers and writers compete for the lock, the writer acquires it first. The rules are as follows.
- When the shared data is unlocked, both read-lock and write-lock requests can be satisfied.
- When the shared data holds a read lock, further read-lock requests succeed, but write-lock requests fail: reads and writes are mutually exclusive.
- When the shared data holds a write lock, neither read-lock nor write-lock requests can be satisfied: reads and writes are mutually exclusive, and so are writes with writes.
```c
//Initial lock value of the read-write lock
#define RW_LOCK_BIAS 0x01000000
//Underlying read-write lock data structure
typedef struct {
    unsigned int lock;
} arch_rwlock_t;

//Release a read lock
static inline void arch_read_unlock(arch_rwlock_t *rw)
{
    asm volatile(LOCK_PREFIX "incl %0" //Atomically add 1 to lock
                 : "+m" (rw->lock) :: "memory");
}
//Release a write lock
static inline void arch_write_unlock(arch_rwlock_t *rw)
{
    asm volatile(LOCK_PREFIX "addl %1, %0" //Atomically add RW_LOCK_BIAS to lock
                 : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory");
}

//Called when acquiring the write lock fails
ENTRY(__write_lock_failed)
    //(%eax) is the memory pointed to by eax, passed in by the caller
2:  LOCK_PREFIX addl $RW_LOCK_BIAS, (%eax) //Undo the failed subtraction
1:  rep; nop                        //pause instruction
    cmpl $RW_LOCK_BIAS, (%eax)      //Loop while lock differs from the initial
                                    //value; equality means all locks released
    jne 1b
    LOCK_PREFIX subl $RW_LOCK_BIAS, (%eax) //Retry taking the write lock
    jnz 2b                          //Non-zero: contended, undo and retest;
                                    //zero: write lock acquired
    ret                             //return
ENDPROC(__write_lock_failed)

//Called when acquiring the read lock fails
ENTRY(__read_lock_failed)
    //(%eax) is the memory pointed to by eax, passed in by the caller
2:  LOCK_PREFIX incl (%eax)         //Atomically add 1, undoing the failed
                                    //decrement
1:  rep; nop                        //pause instruction
    cmpl $1, (%eax)                 //Compare lock with 1
    js 1b                           //If lock - 1 is negative (a writer holds
                                    //the lock), keep waiting
    LOCK_PREFIX decl (%eax)         //Retry taking the read lock
    js 2b                           //If negative, undo and wait again;
                                    //otherwise return with the read lock held
    ret                             //return
ENDPROC(__read_lock_failed)

//Acquire a read lock
static inline void arch_read_lock(arch_rwlock_t *rw)
{
    asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t" //Atomically subtract 1 from lock
        "jns 1f\n"                    //Not negative: read lock acquired,
                                      //jump to label 1
        "call __read_lock_failed\n\t" //Otherwise call __read_lock_failed
        "1:\n"
        :: LOCK_PTR_REG (rw) : "memory");
}
//Acquire a write lock
static inline void arch_write_lock(arch_rwlock_t *rw)
{
    asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t" //Atomically subtract
                                                //RW_LOCK_BIAS from lock
        "jz 1f\n"                      //Zero: write lock acquired, jump to 1
        "call __write_lock_failed\n\t" //Otherwise call __write_lock_failed
        "1:\n"
        :: LOCK_PTR_REG (rw), "i" (RW_LOCK_BIAS) : "memory");
}
```
The essence of the Linux read-write lock is a counter, with initial value 0x01000000. Acquiring the read lock decrements it by 1; if the result is not negative, the read lock is acquired. Acquiring the write lock subtracts the whole 0x01000000.

At this point you may ask: why subtract the entire initial value? Because only when the lock still equals the initial value does subtracting it yield 0, and that is the only state in which no process holds any lock; this is what makes the write lock exclusive. For example, with one reader active the lock value is 0x00FFFFFF, and subtracting 0x01000000 gives a negative result, so a writer cannot get in.
Let's walk through acquiring and releasing read-write locks once more, as shown below.
- When acquiring a read lock, the lock value counter is decremented by 1 and the sign bit of the result is checked. If the sign bit is 0, the result is non-negative and the read lock is acquired.
- When acquiring a read lock, if the sign bit of the result is instead 1, acquisition fails: the lock is occupied by a process modifying the data. The __read_lock_failed handler is called; it atomically undoes the decrement (lock + 1), then loops until the lock value is at least 1 before retrying the decrement.
- When acquiring a write lock, RW_LOCK_BIAS, i.e. 0x01000000, is subtracted from the lock value counter and the result is compared with 0. If the result is 0, the write lock is acquired.
- When acquiring a write lock, if the result is instead not 0, acquisition fails: either reading processes hold the read lock or a modifying process holds the write lock. The __write_lock_failed handler is called; it atomically adds RW_LOCK_BIAS back, then loops until the lock value equals RW_LOCK_BIAS again before retrying the subtraction. A usage sketch follows this list.
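Finally, a minimal hedged usage sketch at the common kernel API level, which ultimately calls down into the arch_ functions above. The configuration variable and functions are hypothetical.

```c
#include <linux/spinlock.h>  /* pulls in the rwlock API */

static DEFINE_RWLOCK(cfg_lock); /* hypothetical lock protecting cfg_value */
static int cfg_value;

/* Many readers may run concurrently: each read_lock is lock -= 1. */
int cfg_read(void)
{
    int v;
    read_lock(&cfg_lock);
    v = cfg_value;
    read_unlock(&cfg_lock);   /* lock += 1 */
    return v;
}

/* Writers are exclusive: write_lock is lock -= RW_LOCK_BIAS. */
void cfg_update(int v)
{
    write_lock(&cfg_lock);
    cfg_value = v;
    write_unlock(&cfg_lock);  /* lock += RW_LOCK_BIAS */
}
```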