Python quick start multithreading and multiprocessing

Posted by hthighway on Wed, 09 Feb 2022 00:58:57 +0100

 

Multithreading

Meaning of multithreading

A process can be understood as a program unit that runs independently. For example, opening a browser starts a browser process, and opening a text editor starts a text editor process. A single process can handle many things at once: in a browser we can open multiple pages in multiple tabs, where some pages play music, some play video, and some show animations, all running at the same time without interfering with each other. Why can one process run so many tasks at once? This is where the concept of a thread comes in: each of those tasks corresponds to the execution of a thread.

So what is a process, then? A process is a collection of one or more threads. A thread is the smallest unit the operating system schedules, and the smallest running unit inside a process. In the browser process above, playing music is one thread and playing video is another, with many other threads running alongside them. The concurrent or parallel execution of these threads is what lets the whole browser run so many tasks at once.

With the concept of a thread in place, multithreading is easy to understand: it is the simultaneous execution of multiple threads within one process. The browser scenario above is a typical example of multithreaded execution.

 

Concurrency and parallelism

When a program runs on a computer, the underlying processor executes it one instruction at a time.

Concurrency means that only one instruction executes at any given instant, but the instructions of multiple threads are executed in rapid rotation: thread A runs for a short slice of time, then the processor switches to thread B for a slice, then back again.

Because the processor executes and switches between instructions extremely fast, we cannot perceive these context switches, so it looks as if multiple threads are running at the same time. At the micro level, however, the processor is simply switching back and forth among the threads; each thread occupies a time slice of the processor, and only one thread actually executes at any instant.

 

Parallelism means that multiple instructions really are executed at the same time, on multiple processors; it necessarily depends on having multiple processors (or cores). With parallelism, multiple threads execute together both macroscopically and microscopically.

Parallelism can therefore exist only on multiprocessor systems: if the computer's processor has only one core, parallelism is impossible. Concurrency, in contrast, exists on both single-processor and multiprocessor systems, because a single core is enough to interleave threads.

 

For example, suppose the system needs to run several threads at once. If the processor has only one core, it can only run them concurrently. If it has multiple cores, then while one core executes one thread, another core can execute a second thread, and those two threads run in parallel; other threads may still share a core with each other and run concurrently. How the threads are placed is up to the operating system's scheduler.

 

Multithreading scenario

In a process, some operations are time-consuming or involve waiting, such as waiting for a database query to return or for a web page to respond. With a single thread, the processor must wait for these operations to complete before doing anything else, even though it could clearly be doing other work during the wait. With multithreading, the processor can execute other threads while one thread waits, improving overall execution efficiency.

As in the scenario above, threads often need to wait during execution. A web crawler is a very typical example: after sending a request to a server, a crawler must wait some time for the response to come back. Such tasks are IO-intensive (IO-bound). For tasks like these, enabling multithreading lets the processor handle other tasks while one thread waits, improving overall crawling efficiency.

However, not all tasks are IO-intensive. Another kind, called computation-intensive (or CPU-bound) tasks, always needs the processor's participation. If we enable multithreading here, the processor simply switches from one computation-intensive task to another while staying busy the whole time, so no time is saved overall: the total amount of computation is unchanged. Worse, if the number of threads is too large, extra time is spent switching between threads, and overall efficiency drops.

Therefore, when tasks are not mostly computation-intensive, multithreading can improve a program's overall execution efficiency. For IO-intensive tasks such as web crawlers in particular, multithreading greatly improves overall crawling efficiency.
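For IO-bound work the benefit is easy to demonstrate with a minimal timing sketch. The fake_io_task helper below simulates an IO wait with time.sleep; it is an illustrative assumption, not part of the original example:

```python
import threading
import time

def fake_io_task(seconds):
    # Stand-in for an IO wait, e.g. waiting for a server response
    time.sleep(seconds)

# Sequential: total time is roughly the sum of all waits (~1 second)
start = time.time()
for _ in range(5):
    fake_io_task(0.2)
sequential = time.time() - start

# Threaded: the five waits overlap, so total time is roughly one wait (~0.2s)
start = time.time()
threads = [threading.Thread(target=fake_io_task, args=(0.2,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start

print(f'sequential: {sequential:.2f}s, threaded: {threaded:.2f}s')
```

The threaded run finishes in roughly the time of a single wait, because the processor is free to switch to another thread while one thread sleeps.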

 

Python implements multithreading

In Python, multithreading is provided by the threading module, which is part of the standard library.

 

Thread directly creates a child thread

First, we can use threading.Thread to create a thread. When creating one, specify the target parameter as the method to run; if the method needs extra arguments, pass them through the thread's args parameter. For example:

import threading
import time


def target(second):
    print(f'Threading {threading.current_thread().name} is running')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} is ended')


print(f'Threading {threading.current_thread().name} is running')

for i in [1, 5]:
    thread = threading.Thread(target=target, args=[i])
    thread.start()

print(f'Threading {threading.current_thread().name} is ended')


##Output result:
Threading MainThread is running
Threading Thread-1 is running
Threading Thread-1 sleep 1s
Threading Thread-2 is running Threading MainThread is ended

Threading Thread-2 sleep 5s
Threading Thread-1 is ended
Threading Thread-2 is ended

Here we first declare a method called target, which receives a parameter second. The method simply performs a time.sleep operation, sleeping for second seconds, with some content printed before and after it. The thread name is obtained via threading.current_thread().name: for the main thread its value is MainThread, and for child threads it is Thread-1, Thread-2, and so on.

Then we create two new threads with the Thread class, passing the method we just defined as target and its argument via args as a list. In the loop, i takes the values 1 and 5, so the two threads sleep for 1 second and 5 seconds respectively. After declaring each thread, we call its start method to start it.

From the output we can see three threads in total: the main thread and the two child threads Thread-1 and Thread-2. The main thread finishes first, and Thread-1 and Thread-2 end afterwards, 1 second and 5 seconds later respectively. This shows that the main thread did not wait for the child threads to finish but exited directly, which is a bit unreasonable.

 

If we want the main thread to wait for the child threads to finish before exiting, we can have each child thread object call the join method, as follows:

import threading
import time


def target(second):
    print(f'Threading {threading.current_thread().name} is running')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} is ended')


print(f'Threading {threading.current_thread().name} is running')

threads = []
for i in [1, 5]:
    thread = threading.Thread(target=target, args=[i])
    threads.append(thread) ##Add thread to thread list
    thread.start()

for thread in threads:
    thread.join()  # Block the calling thread until this thread terminates (normal exit or unhandled exception) or until the optional timeout expires

print(f'Threading {threading.current_thread().name} is ended')


##Output result:
Threading MainThread is running
Threading Thread-1 is running
Threading Thread-1 sleep 1s
Threading Thread-2 is running
Threading Thread-2 sleep 5s
Threading Thread-1 is ended
Threading Thread-2 is ended
Threading MainThread is ended

Now the main thread waits for all the child threads to finish before it continues and exits.

 

Inherit the Thread class to create a child Thread

Alternatively, we can create a thread by inheriting the Thread class and writing the code the thread should execute in the class's run method. The earlier example can be rewritten equivalently as:

import threading
import time

class MyThread(threading.Thread):
    def __init__(self, second):
        threading.Thread.__init__(self)
        self.second = second

    def run(self):
        print(f'Threading {threading.current_thread().name} is running')
        print(f'Threading {threading.current_thread().name} sleep {self.second}s')
        time.sleep(self.second)
        print(f'Threading {threading.current_thread().name} is ended')


print(f'Threading {threading.current_thread().name} is running')
threads = []

for i in [1, 5]:
    thread = MyThread(i)
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

print(f'Threading {threading.current_thread().name} is ended')


##Output result:
Threading MainThread is running
Threading Thread-1 is running
Threading Thread-1 sleep 1s
Threading Thread-2 is running
Threading Thread-2 sleep 5s
Threading Thread-1 is ended
Threading Thread-2 is ended
Threading MainThread is ended

 

Daemon thread

Threads have a concept called the daemon thread. A daemon thread is "unimportant": if the main thread ends while a daemon thread is still running, the daemon thread is forcibly ended. In Python we can mark a thread as a daemon via the setDaemon method (in newer Python versions, setting the daemon attribute directly is preferred, as setDaemon is deprecated since Python 3.10).

import threading
import time


def target(second):
    print(f'Threading {threading.current_thread().name} is running')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} is ended')


print(f'Threading {threading.current_thread().name} is running')

t1 = threading.Thread(target=target, args=[2])
t1.start()

t2 = threading.Thread(target=target, args=[5])
t2.setDaemon(True)
t2.start()

print(f'Threading {threading.current_thread().name} is ended')


##Output result:
Threading MainThread is running
Threading Thread-1 is running
Threading Thread-1 sleep 2s
Threading Thread-2 is running Threading MainThread is ended

Threading Thread-2 sleep 5s
Threading Thread-1 is ended

Here we set t2 as a daemon thread via setDaemon. When the main thread finishes its code and the non-daemon thread t1 ends, the program exits and t2 is terminated without ever completing its 5-second sleep.

Note that join is not called here. If we have both t1 and t2 call join, the main thread will still wait for each child thread to finish before exiting, whether or not it is a daemon.

import threading
import time

def target(second):
    print(f'Threading {threading.current_thread().name} is running')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} is ended')


print(f'Threading {threading.current_thread().name} is running')

t1 = threading.Thread(target=target, args=[2])
t1.start()

t2 = threading.Thread(target=target, args=[5])
t2.setDaemon(True)
t2.start()

t1.join()        ##Execute the join() method for both t1 and t2 threads
t2.join()

print(f'Threading {threading.current_thread().name} is ended')

##Output result:
Threading MainThread is running
Threading Thread-1 is running
Threading Thread-1 sleep 2s
Threading Thread-2 is running
Threading Thread-2 sleep 5s
Threading Thread-1 is ended
Threading Thread-2 is ended
Threading MainThread is ended

 

Mutex

Multiple threads in a process share resources. For example, suppose a process has a global variable count used for counting; now we declare multiple threads, each of which adds 1 to count when it runs. Let's see the effect. The code is as follows:

import threading
import time

count = 0

class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        global count
        temp = count + 1
        time.sleep(0.001)
        count = temp

threads = []
for _ in range(10000):
    thread = MyThread()
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
print(f'Final count: {count}')

##Output result:
Final count: 65

Note: this example originally declared 1000 threads, but the lost updates were not obvious enough with that number, so it has been changed to 10000 threads here.

Here we declare 10000 threads. Each thread reads the current value of the global variable count, sleeps for a short time, and then writes the new value back to count.

 

By common sense, the final value of count should be 10000. But the actual result is only 65, and it differs between runs and environments. Why?

Because the value of count is shared, each thread reads the current count when it executes temp = count + 1. But since these threads run concurrently or in parallel, different threads may read the same value of count, so some threads' plus-1 operations are lost and the final result comes out too small.

 

So if multiple threads read and modify the same data at the same time, unexpected results occur. To avoid this, we need to synchronize the threads, which we can do by protecting the data with a lock: threading.Lock.

What does lock protection mean? A thread must acquire the lock before operating on the data; while it holds the lock, any other thread that tries to acquire it blocks and waits for the lock to be released. Only when the holding thread releases the lock can another thread acquire it, modify the data, and release the lock afterwards. This guarantees that only one thread operates on the data at a time, so multiple threads never read and modify the same data simultaneously, and the final result is correct.

We can modify the code as follows:

import threading
import time

count = 0

class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        global count
        lock.acquire()   ##Lock
        temp = count + 1
        time.sleep(0.001)
        count = temp
        lock.release()   ##Release lock

lock = threading.Lock() ##Declare a lock object
threads = []
for _ in range(10000):
    thread = MyThread()
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
print(f'Final count: {count}')


##Output result:
Final count: 10000

Here we declare a lock object, which is an instance of threading.Lock. In the run method we acquire the lock before reading count and release it after modifying count, so multiple threads can no longer read and modify the value of count at the same time.

That is all for now on Python multithreading. For more threading features, such as semaphores and queues, refer to the official documentation: https://docs.python.org/zh-cn/3.7/library/threading.html#module-threading.
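As a small taste of those features, here is a minimal producer/consumer sketch using the standard library's thread-safe queue.Queue, which handles all its locking internally. The producer/consumer function names and the None sentinel are illustrative choices, not from the original:

```python
import queue
import threading

q = queue.Queue()   # thread-safe FIFO queue; no manual locking needed
results = []

def producer():
    for i in range(5):
        q.put(i)      # hand items to the consumer
    q.put(None)       # sentinel value: tells the consumer to stop

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * 2)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
print(f'Results: {results}')
```

Because queue.Queue synchronizes access itself, the two threads can exchange data safely without an explicit threading.Lock.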

The problem with Python multithreading

Due to the GIL in Python (specifically, in the CPython interpreter), only one thread can run at any moment, whether on a single core or multiple cores, so Python multithreading cannot exploit the advantage of multi-core parallelism. GIL stands for Global Interpreter Lock; it was originally designed for data safety.

In Python multithreading, the execution mode of each thread is as follows:

  • Get GIL
  • Execute the code of the corresponding thread
  • Release GIL

As you can see, a thread must acquire the GIL before it can execute. We can think of the GIL as a pass, and there is only one per Python process; a thread without the pass is not allowed to execute. As a result, even on a multi-core machine, only one thread in a Python process executes at any moment.

However, for IO-intensive tasks such as crawlers, this has little impact. For computation-intensive tasks, because of the GIL, multithreading may run even slower overall than a single thread.
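One rough way to see this effect is to time the same amount of pure computation done by one thread versus split across two threads. The count_down workload below is an illustrative example, not from the original, and exact timings vary by machine:

```python
import threading
import time

def count_down(n):
    # Pure Python computation: the running thread holds the GIL
    while n > 0:
        n -= 1

N = 2_000_000

# All the work on a single thread
start = time.time()
count_down(N * 2)
single = time.time() - start

# The same total work split across two threads: because of the GIL,
# only one thread executes Python bytecode at any moment, so this is
# usually no faster (and often slower, due to switching overhead)
start = time.time()
t1 = threading.Thread(target=count_down, args=(N,))
t2 = threading.Thread(target=count_down, args=(N,))
t1.start()
t2.start()
t1.join()
t2.join()
double = time.time() - start

print(f'single thread: {single:.2f}s, two threads: {double:.2f}s')
```

On a typical CPython build the two-thread version takes about as long as, or longer than, the single-thread one, despite having two cores available.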

 

Multiprocessing

Meaning of multiprocessing

A process is a running instance of a program operating on some data set; it is the operating system's independent unit of resource allocation and scheduling.

As the name suggests, multiprocessing means running multiple processes at the same time. Since a process is a collection of one or more threads, running multiple processes means at least that many threads are running.

 

Advantages of Python multiprocessing

Because of the GIL, multithreading in Python cannot exploit multiple cores: only one thread in a process runs at any moment. With multiprocessing, however, each process has its own GIL, so on a multi-core processor the processes are not limited by one another's GIL, and multiprocessing can make much better use of multiple cores.

Of course, for IO-intensive tasks such as crawlers, the difference between multithreading and multiprocessing is small. For computation-intensive tasks, multiprocessing on a multi-core machine can be several times more efficient than multithreading.

In general, multiprocessing in Python has broader advantages than multithreading, so if conditions permit, prefer multiple processes.

It is worth noting that processes do not share data with one another: each process has its own independent memory space, so data must be exchanged explicitly (for example through a queue or pipe).

 

Implementation of multiprocessing

Python also has a built-in library for implementing multiprocessing: multiprocessing.

multiprocessing provides a series of components, such as Process, Queue, Semaphore, Pipe, Lock, Pool, etc. Let's learn how to use them.

 

Use the Process class directly

In multiprocessing, each process is represented by the Process class. Its API is as follows:

Process([group [, target [, name [, args [, kwargs]]]]])
  • target is the callable to invoke; pass in the method to run.
  • args is the tuple of positional arguments for the callable. For example, if target is a function func with two parameters m and n, args can be passed as (m, n).
  • kwargs is the dictionary of keyword arguments for the callable.
  • name is an alias, i.e. a name for the process.
  • group should always be None; it is reserved for future use.
import multiprocessing

def process(index):
    print(f'Process: {index}')

if __name__ == '__main__':
    for i in range(5):
        p = multiprocessing.Process(target=process, args=(i,))  # here either (i,) or [i] works
        p.start()


##Output result:
Process: 0
Process: 2
Process: 1
Process: 3
Process: 4

This is the most basic way to implement multiprocessing: create a new child process with Process, passing the method to run as the target parameter and its arguments as args, a tuple whose elements correspond one-to-one to the parameters of the called process method.

Note: args here should be a tuple. If there is only one argument, add a comma after the element; without the comma, the parentheses are mere grouping and no tuple is formed, so argument passing fails.
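A quick illustration of the comma's effect:

```python
# (1) is just the int 1 in parentheses; (1,) is a one-element tuple,
# which is what args expects.
a = (1)
b = (1,)
print(type(a))   # <class 'int'>
print(type(b))   # <class 'tuple'>
```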

After creating the process, we can start the process by calling the start method.

As you can see, we ran five child processes, each of which called the process method with its sequence number (0 through 4) passed in through args, and each printed that number.

 

Processes are the operating system's smallest unit of resource allocation, which makes them different from threads: data is not shared between processes, and each newly started process is allocated its own independent resources. Moreover, when enough CPU cores are available, different processes are scheduled onto different cores and achieve truly parallel execution.

multiprocessing also provides several useful methods. For example, the cpu_count method returns the number of CPU cores on the current machine, and the active_children method returns all currently running child processes.

import multiprocessing
import time

def process(index):
    time.sleep(index)
    print(f'Process: {index}')

if __name__ == '__main__':
    for i in range(5):
        p = multiprocessing.Process(target=process, args=[i])  # note that either [i] or (i,) works here
        p.start()

    print(f'CPU number: {multiprocessing.cpu_count()}')

    for p in multiprocessing.active_children():
        print(f'Child process name: {p.name} id: {p.pid}')

    print('Process Ended')

##Output result:
CPU number: 16
Child process name: Process-5 id: 24888
Child process name: Process-4 id: 24868
Child process name: Process-2 id: 24828
Child process name: Process-3 id: 24848
Child process name: Process-1 id: 24820
Process Ended
Process: 0
Process: 1
Process: 2
Process: 3
Process: 4

In the example above, cpu_count successfully obtained the number of CPU cores: 16. Of course, the result may differ on other machines.

We also used active_children to get the list of currently alive child processes, then iterated over them and printed each one's name and process ID, which are available directly through the name and pid attributes.

 

Inherit Process class

In the example above we created processes directly with the Process class, but that is not the only way. Just as with Thread, we can also subclass Process and implement the process's work in the subclass's run method.

from multiprocessing import Process
import time

class MyProcess(Process):
    def __init__(self, loop):
        Process.__init__(self)
        self.loop = loop

    def run(self):
        for count in range(self.loop):
            time.sleep(1)
            print(f'Pid: {self.pid}, LoopCount: {count}')

if __name__ == '__main__':
    for i in range(2, 5):
        p = MyProcess(i)
        p.start()


##Output result:
Pid: 29268, LoopCount: 0Pid: 29276, LoopCount: 0
Pid: 29296, LoopCount: 0

Pid: 29268, LoopCount: 1Pid: 29296, LoopCount: 1
Pid: 29276, LoopCount: 1

Pid: 29296, LoopCount: 2Pid: 29276, LoopCount: 2
Pid: 29296, LoopCount: 3

We first declare a constructor that receives a loop parameter, the number of iterations, and stores it as an instance attribute. The run method loops that many times, printing the current process ID and loop count on each iteration.

When invoking, we obtain the numbers 2, 3 and 4 from range, initialize a MyProcess with each, and call start to launch the processes.

Note: the process's execution logic goes in the run method. To start the process, call start; run is then executed automatically.

 

As you can see, the three processes printed 2, 3 and 4 results respectively: process 29268 printed 2 results, process 29276 printed 3 results and process 29296 printed 4 results.

Note that the pid here is the process ID, and the results may differ across machines and runs.

With this approach we again easily define a process. For the sake of reuse, we can encapsulate useful methods in the process class and simply instantiate and run it when needed.

 

Daemon

The concept of a daemon also exists for processes. If a process is set as a daemon, the child process is automatically terminated when its parent process ends. We control whether a process is a daemon through its daemon attribute.

from multiprocessing import Process
import time

class MyProcess(Process):
    def __init__(self, loop):
        Process.__init__(self)
        self.loop = loop

    def run(self):
        for count in range(self.loop):
            time.sleep(1)
            print(f'Pid: {self.pid} LoopCount: {count}')

if __name__ == '__main__':
    for i in range(2, 5):
        p = MyProcess(i)
        p.daemon = True
        p.start()

    print('Main Process ended')

##Output result:
Main Process ended

The result is very simple: since the main process does nothing but print one line, all the daemon child processes are terminated at that point. This effectively prevents child processes from running uncontrolled: we do not have to worry about closing child processes after the main process finishes, and we avoid orphaned child processes running on their own.

 

Process waiting

In fact, the effect above is not quite what we want: when the main process finishes, the daemon child processes also exit before they have had time to do anything.

Can we let all the child processes finish first, and only then exit? Of course: just add the join method. We can rewrite the code as follows:

from multiprocessing import Process
import time

class MyProcess(Process):
    def __init__(self, loop):
        Process.__init__(self)
        self.loop = loop

    def run(self):
        for count in range(self.loop):
            time.sleep(1)
            print(f'Pid: {self.pid} LoopCount: {count}')

if __name__ == '__main__':
    processes = []

    for i in range(2, 5):
        p = MyProcess(i)
        processes.append(p)
        p.daemon = True
        p.start()

    for p in processes:
        p.join()

    print('Main Process ended')


##Output result:
Pid: 35900 LoopCount: 0Pid: 35908 LoopCount: 0Pid: 13964 LoopCount: 0


Pid: 35908 LoopCount: 1Pid: 13964 LoopCount: 1
Pid: 35900 LoopCount: 1

Pid: 35900 LoopCount: 2Pid: 13964 LoopCount: 2

Pid: 13964 LoopCount: 3
Main Process ended

After calling the start and join methods, the parent process waits for all child processes to finish executing before printing the final line.

By default, join waits indefinitely: if a child process never finishes, the main process waits forever, for example when the child is stuck in an endless loop. To solve this, pass a timeout argument to join, representing the maximum number of seconds to wait. If the child process has not finished within that time, join returns and the main process stops waiting for it. In other words, this parameter sets the maximum time the main process waits for the child.

For example, here we pass in 1, which means the longest waiting time is 1 second. The code is rewritten as follows:

from multiprocessing import Process
import time

class MyProcess(Process):
    def __init__(self, loop):
        Process.__init__(self)
        self.loop = loop

    def run(self):
        for count in range(self.loop):
            time.sleep(1)
            print(f'Pid: {self.pid} LoopCount: {count}')

if __name__ == '__main__':
    processes = []

    for i in range(2, 5):
        p = MyProcess(i)
        processes.append(p)
        p.daemon = True
        p.start()

    for p in processes:
        p.join(1)

    print('Main Process ended')

##Output result:
Pid: 35156 LoopCount: 0
Pid: 35184 LoopCount: 0Pid: 35148 LoopCount: 0

Pid: 35148 LoopCount: 1Pid: 35184 LoopCount: 1
Pid: 35156 LoopCount: 1

Main Process ended

As you can see, some child processes were supposed to run for up to 3 more seconds, but join returned after 1 second of waiting; and because they were daemons, the child processes were terminated when the main process ended.

 

Terminate process

Of course, a daemon is not the only way to end a process: we can also terminate a child process with the terminate method, and we can check whether a process is still running with the is_alive method.

import multiprocessing
import time

def process():
    print('Starting')
    time.sleep(5)
    print('Finished')

if __name__ == '__main__':
    p = multiprocessing.Process(target=process)
    print('Before:', p, p.is_alive())

    p.start()
    print('During:', p, p.is_alive())

    p.terminate()
    print('Terminate:', p, p.is_alive())

    p.join()
    print('Joined:', p, p.is_alive())

##Output result:
Before: <Process(Process-1, initial)> False
During: <Process(Process-1, started)> True
Terminate: <Process(Process-1, started)> True
Joined: <Process(Process-1, stopped[SIGTERM])> False

In the example above, we create a process with Process, start it with start, terminate it with terminate, and finally call join. At each of these stages we also use is_alive to check whether the process is still running.

Note: immediately after calling terminate, is_alive still reports the process as running; only after calling join does is_alive report the process as terminated.

So after calling terminate, remember to call join as well: it gives the parent time to update the process object's state so that it reflects the final termination.

 

Process mutex

In some of the examples above (for instance the process-waiting example), we may see output like the following:

##Output result:
Pid: 35156 LoopCount: 0
Pid: 35184 LoopCount: 0Pid: 35148 LoopCount: 0

Pid: 35148 LoopCount: 1Pid: 35184 LoopCount: 1
Pid: 35156 LoopCount: 1

Main Process ended

We can see that some lines of output are missing their line breaks. What causes this?

The processes print in parallel: before one process has written its trailing line break, another process has already started writing its own output, so the two lines run together.

 

So how to avoid this problem?

If we can guarantee that at any moment only one process prints while the others wait, and that the next process prints only after the current one has finished, then the missing line breaks disappear.

This solution implements mutual exclusion between processes: it prevents multiple processes from competing for the critical-section resource (here, the output) at the same time. We can implement it with Lock in multiprocessing. When one process is printing, it acquires the lock and the others wait; when it has finished, it releases the lock so that another process can print.

 

We first implement an example without locking. The code is as follows:

from multiprocessing import Process, Lock
import time

class MyProcess(Process):
    def __init__(self, loop, lock):
        Process.__init__(self)
        self.loop = loop
        self.lock = lock

    def run(self):
        for count in range(self.loop):
            time.sleep(0.1)
            # self.lock.acquire()
            print(f'Pid: {self.pid} LoopCount: {count}')
            # self.lock.release()

if __name__ == '__main__':
    lock = Lock()
    for i in range(10, 15):
        p = MyProcess(i, lock)
        p.start()

##Output result:
Pid: 51428 LoopCount: 0Pid: 51448 LoopCount: 0

Pid: 51392 LoopCount: 0Pid: 51388 LoopCount: 0
Pid: 51460 LoopCount: 0

Pid: 51448 LoopCount: 1
Pid: 51428 LoopCount: 1
Pid: 51392 LoopCount: 1
Pid: 51460 LoopCount: 1Pid: 51388 LoopCount: 1

Pid: 51448 LoopCount: 2Pid: 51428 LoopCount: 2

Pid: 51392 LoopCount: 2
Pid: 51460 LoopCount: 2Pid: 51388 LoopCount: 2

Pid: 51428 LoopCount: 3Pid: 51448 LoopCount: 3

Pid: 51388 LoopCount: 3Pid: 51460 LoopCount: 3Pid: 51392 LoopCount: 3


Pid: 51448 LoopCount: 4Pid: 51428 LoopCount: 4

Pid: 51388 LoopCount: 4
Pid: 51460 LoopCount: 4Pid: 51392 LoopCount: 4

Pid: 51428 LoopCount: 5Pid: 51448 LoopCount: 5

Pid: 51460 LoopCount: 5
Pid: 51388 LoopCount: 5Pid: 51392 LoopCount: 5

Pid: 51428 LoopCount: 6Pid: 51448 LoopCount: 6

Pid: 51388 LoopCount: 6
Pid: 51392 LoopCount: 6Pid: 51460 LoopCount: 6

Pid: 51428 LoopCount: 7Pid: 51448 LoopCount: 7

Pid: 51460 LoopCount: 7Pid: 51388 LoopCount: 7

Pid: 51392 LoopCount: 7
Pid: 51448 LoopCount: 8
Pid: 51428 LoopCount: 8
Pid: 51388 LoopCount: 8Pid: 51460 LoopCount: 8

Pid: 51392 LoopCount: 8
Pid: 51448 LoopCount: 9Pid: 51428 LoopCount: 9

Pid: 51460 LoopCount: 9Pid: 51388 LoopCount: 9
Pid: 51392 LoopCount: 9

Pid: 51428 LoopCount: 10
Pid: 51460 LoopCount: 10
Pid: 51392 LoopCount: 10Pid: 51388 LoopCount: 10

Pid: 51428 LoopCount: 11
Pid: 51388 LoopCount: 11
Pid: 51392 LoopCount: 11
Pid: 51392 LoopCount: 12
Pid: 51388 LoopCount: 12
Pid: 51392 LoopCount: 13

You can see that some lines in the output are again missing their line breaks. Now uncomment the acquire and release calls in the code above and run it again. The output is as follows:

Pid: 50820 LoopCount: 0
Pid: 50752 LoopCount: 0
Pid: 50776 LoopCount: 0
Pid: 50832 LoopCount: 0
Pid: 50792 LoopCount: 0
Pid: 50792 LoopCount: 1
Pid: 50832 LoopCount: 1
Pid: 50752 LoopCount: 1
Pid: 50820 LoopCount: 1
Pid: 50776 LoopCount: 1
Pid: 50752 LoopCount: 2
Pid: 50792 LoopCount: 2
Pid: 50820 LoopCount: 2
Pid: 50832 LoopCount: 2
Pid: 50776 LoopCount: 2
Pid: 50792 LoopCount: 3
Pid: 50832 LoopCount: 3
Pid: 50776 LoopCount: 3
Pid: 50820 LoopCount: 3
Pid: 50752 LoopCount: 3
Pid: 50832 LoopCount: 4
Pid: 50752 LoopCount: 4
Pid: 50792 LoopCount: 4
Pid: 50776 LoopCount: 4
Pid: 50820 LoopCount: 4
Pid: 50792 LoopCount: 5
Pid: 50752 LoopCount: 5
Pid: 50776 LoopCount: 5
Pid: 50820 LoopCount: 5
Pid: 50832 LoopCount: 5
Pid: 50832 LoopCount: 6
Pid: 50792 LoopCount: 6
Pid: 50820 LoopCount: 6
Pid: 50776 LoopCount: 6
Pid: 50752 LoopCount: 6
Pid: 50792 LoopCount: 7
Pid: 50820 LoopCount: 7
Pid: 50776 LoopCount: 7
Pid: 50752 LoopCount: 7
Pid: 50832 LoopCount: 7
Pid: 50820 LoopCount: 8
Pid: 50792 LoopCount: 8
Pid: 50832 LoopCount: 8
Pid: 50752 LoopCount: 8
Pid: 50776 LoopCount: 8
Pid: 50832 LoopCount: 9
Pid: 50776 LoopCount: 9
Pid: 50752 LoopCount: 9
Pid: 50792 LoopCount: 9
Pid: 50820 LoopCount: 9
Pid: 50832 LoopCount: 10
Pid: 50820 LoopCount: 10
Pid: 50776 LoopCount: 10
Pid: 50752 LoopCount: 10
Pid: 50752 LoopCount: 11
Pid: 50776 LoopCount: 11
Pid: 50820 LoopCount: 11
Pid: 50752 LoopCount: 12
Pid: 50820 LoopCount: 12
Pid: 50752 LoopCount: 13

This time the output is normal. By guarding the critical section with a Lock, we avoid the problem of multiple processes using the same resource at the same time.

Semaphore

A process mutex lets only one process access a shared resource at a time; in the example above, only one process could print at a time. Sometimes, however, we need several processes to access a shared resource simultaneously, while still limiting how many may do so at once.

How can we achieve that? With semaphores. Semaphores play an important role in process synchronization: they keep a count of the critical resources, allowing multiple processes to access a shared resource concurrently while capping the overall concurrency. In Python, they are provided by the Semaphore class in the multiprocessing library.

 

Next, let's demonstrate with an example how Semaphore lets multiple processes share a resource while limiting how many can access it at the same time. The code is as follows (note: on Windows this example does not produce the expected results, because the global objects are not shared with the child processes):

from multiprocessing import Process, Semaphore, Lock, Queue
import time

buffer = Queue(10)
empty = Semaphore(2)
full = Semaphore(0)
lock = Lock()

class Consumer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            full.acquire()
            lock.acquire()
            buffer.get()
            print('Consumer pop an element')
            time.sleep(1)
            lock.release()
            empty.release()

class Producer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            empty.acquire()
            lock.acquire()
            buffer.put(1)
            print('Producer append an element')
            time.sleep(1)
            lock.release()
            full.release()

if __name__ == '__main__':
    p = Producer()
    c = Consumer()
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()
    c.join()
    print('Main Process Ended')

##Output result:
    Producer append an element
    Producer append an element
    Consumer pop an element
    Consumer pop an element
    Producer append an element
    Producer append an element
    Consumer pop an element
    Consumer pop an element
    Producer append an element
    Producer append an element
    Consumer pop an element
    Consumer pop an element
    Producer append an element
    Producer append an element

The code above implements the classic producer-consumer problem. It defines two process classes: one for the consumer, one for the producer.

It also uses the Queue class from multiprocessing to define a shared buffer, along with two semaphores: one counting the free buffer slots and the other counting the occupied slots.

The producer first calls acquire on the empty semaphore to claim a free slot, decreasing the free count by 1. It then locks the buffer, operates on it, and releases the lock. Finally, it releases the full semaphore, increasing the occupied count by 1. The consumer does the opposite.

We can see that the two processes run alternately: the producer puts items into the buffer and the consumer takes them out, in a continuous cycle. This example should make the usage of Semaphore clear: it gives us precise control over how many processes access a resource concurrently.

Queue

In the above example, we use Queue as the shared Queue for process communication.

If we replaced the Queue in the program above with an ordinary list, it would not work at all, because processes do not share resources: even if the list is changed in one process, the change is invisible to another process. Declaring a global variable is therefore useless across multiple processes.

So how do processes share data? By using a queue, specifically the Queue class. Of course, the Queue here refers to the one in multiprocessing.

 

Still based on the example above, one process puts random numbers into the queue and another process takes them out (again, on Windows this example does not produce the expected results):

from multiprocessing import Process, Semaphore, Lock, Queue
import time
from random import random

buffer = Queue(10)
empty = Semaphore(2)
full = Semaphore(0)
lock = Lock()


class Consumer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            full.acquire()
            lock.acquire()
            print(f'Consumer get {buffer.get()}')
            time.sleep(1)
            lock.release()
            empty.release()


class Producer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            empty.acquire()
            lock.acquire()
            num = random()
            print(f'Producer put {num}')
            buffer.put(num)
            time.sleep(1)
            lock.release()
            full.release()


if __name__ == '__main__':
    p = Producer()
    c = Consumer()
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()
    c.join()
    print('Main Process Ended')

##Output result:
    Producer put 0.719213647437
    Producer put 0.44287326683
    Consumer get 0.719213647437
    Consumer get 0.44287326683
    Producer put 0.722859424381
    Producer put 0.525321338921
    Consumer get 0.722859424381
    Consumer get 0.525321338921

In the above example, we declare two processes, one is the Producer and the other is the Consumer. The Producer keeps adding random numbers to the Queue and the Consumer keeps taking random numbers from the Queue.

The producer calls the put method of the Queue when putting the data, and the consumer uses the get method when fetching the data. In this way, we can share the data between the two processes through the Queue.

 

Pipe

We have just used Queue to share data between processes. What about direct communication between processes, such as sending and receiving messages back and forth? For that we can use a pipe: the Pipe class.

A pipe can be understood as a communication channel between two processes. It can be one-way (half-duplex), with one process sending messages and the other receiving them, or two-way (full-duplex), with both processes sending and receiving.

By default, Pipe creates a two-way pipe. To create a one-way pipe, pass duplex=False when constructing it.

from multiprocessing import Process, Pipe


class Consumer(Process):
    def __init__(self, pipe):
        Process.__init__(self)
        self.pipe = pipe

    def run(self):
        self.pipe.send('Consumer Words')
        print(f'Consumer Received: {self.pipe.recv()}')

class Producer(Process):
    def __init__(self, pipe):
        Process.__init__(self)
        self.pipe = pipe

    def run(self):
        print(f'Producer Received: {self.pipe.recv()}')
        self.pipe.send('Producer Words')

if __name__ == '__main__':
    pipe = Pipe()
    p = Producer(pipe[0])
    c = Consumer(pipe[1])
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()
    c.join()
    print('Main Process Ended')

##Output result:
Producer Received: Consumer Words
Consumer Received: Producer Words
Main Process Ended

In this example, we declare a default two-way pipe and pass its two ends to the two processes, which then send messages to and receive messages from each other.

Pipe is like a bridge built between processes. Using it, we can easily realize inter process communication.

 

Process pool

Earlier, we talked about how to use Process to create processes and how to use Semaphore to control the number of concurrent executions of processes.

Now suppose we face this problem: we have 10,000 tasks, each of which needs its own process, and as soon as one process finishes we need to start the next one immediately. At the same time, we must cap the number of concurrent processes so the CPU is not overwhelmed (if the number of simultaneously running processes is kept at a constant maximum, utilization is of course highest).

 

So how can we achieve this demand?

It can be implemented with Process and Semaphore, but that is cumbersome, and this kind of requirement is very common. For it, we can use a process pool: the Pool class in multiprocessing.

Pool provides a specified number of processes for the user. When a new request is submitted to the pool and the pool is not yet full, a new process is created to execute it; if the pool has already reached its specified maximum, the request waits until some process in the pool finishes, and then a process is made available to execute it.

 

Let's use an example to realize it. The code is as follows:

from multiprocessing import Pool
import time

def function(index):
    print(f'Start process: {index}')
    time.sleep(3)
    print(f'End process {index}', )

if __name__ == '__main__':
    pool = Pool(processes=3)
    for i in range(4):
        pool.apply_async(function, args=(i,))

    print('Main Process started')
    pool.close()
    pool.join()
    print('Main Process ended')

##Output result:
Main Process started
Start process: 0
Start process: 1
Start process: 2
End process 2End process 1End process 0


Start process: 3
End process 3
Main Process ended

In this example, we declare a process pool of size 3, specified by the processes parameter; if it is not specified, the number of processes defaults to the number of CPU cores. We then use the apply_async method to submit tasks, passing arguments through args.

 

The pool size is 3, so initially three processes execute at the same time while the fourth waits. As soon as one of them finishes, the fourth starts immediately, which produces the output above.

Finally, remember to call the close method to close the pool so that it accepts no new tasks, then call the join method so the main process waits for the child processes to exit. Once the child processes finish, the main process resumes and ends.

 

However, the style above is still somewhat verbose. The process pool's map method can simplify it considerably.

How does map work? Its first parameter is the function each process should execute, and its second parameter is an iterable whose elements are passed to that function one by one.

For example, suppose we have a list of many URLs, plus a function that fetches and parses the content of each one. We can pass the function as map's first argument and the URL list as its second.

 

Let's use an example to experience:

from multiprocessing import Pool
import urllib.request
import urllib.error

def scrape(url):
    try:
        urllib.request.urlopen(url)
        print(f'URL {url} Scraped')
    except (urllib.error.HTTPError, urllib.error.URLError):
        print(f'URL {url} not Scraped')

if __name__ == '__main__':
    pool = Pool(processes=3)
    urls = [
        'https://www.baidu.com',
        'http://www.meituan.com/',
        'http://blog.csdn.net/',
        'http://xxxyxxx.net'
    ]
    pool.map(scrape, urls)
    pool.close()

##Output result:
URL https://www.baidu.com Scraped
URL http://xxxyxxx.net not Scraped
URL http://www.meituan.com/ Scraped
URL http://blog.csdn.net/ Scraped

In this example, we first define a scrape function that takes a url parameter, requests the link, and prints a success message; if an error occurs, it prints a failure message instead.

We initialize a Pool with 3 processes, declare a urls list, and call the map method: the first argument is the function to execute, the second is the urls list. map passes each element of urls in turn as an argument to scrape, starting new processes in the pool to execute them.

In this way, three processes run in parallel, each printing its crawl result independently of the others. As you can see, the Pool's map method makes multi-process execution very easy.

 

 

This blog post is mainly based on Lectures 5 and 6 of Lagou Education's course "52 Lectures on Easily Handling Web Crawlers". It is very helpful for a first understanding of multithreading and multiprocessing, especially from a Python perspective.

reference resources:

Lesson 05: multi-channel acceleration, understand the basic principle of multithreading

Lesson 06: multi-channel acceleration, understand the basic principle of multi process

Topics: Python