Should Python use multithreading?

Posted by Studio381 on Fri, 31 Dec 2021 09:27:23 +0100

Before summarizing the concurrent.futures library, let's work through three questions: (1) Is Python multithreading useful? (2) How does the Python virtual machine mechanism control code execution? (3) What is the principle behind multiprocessing in Python?

1. Let's look at two examples first

(1) Example 1

The greatest common divisor is computed in three ways: single-threaded, multithreaded, and multiprocess.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time

def gcd(pair):
    # Deliberately naive GCD so the work is CPU-bound
    a, b = pair
    low = min(a, b)
    for i in range(low, 0, -1):
        if a % i == 0 and b % i == 0:
            return i

numbers = [
    (1963309, 2265973), (1879675, 2493670), (2030677, 3814172),
    (1551645, 2229620), (1988912, 4736670), (2198964, 7876293),
]

if __name__ == '__main__':
    # Do not use multithreading or multiprocessing
    start = time.time()
    results = list(map(gcd, numbers))
    end = time.time()
    print('not used--timestamp:{:.3f} second'.format(end - start))

    # Using multithreading
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=3)
    results = list(pool.map(gcd, numbers))
    end = time.time()
    print('Using multithreading--timestamp:{:.3f} second'.format(end - start))

    # Using multiprocessing
    start = time.time()
    pool = ProcessPoolExecutor(max_workers=3)
    results = list(pool.map(gcd, numbers))
    end = time.time()
    print('Using multiprocessing--timestamp:{:.3f} second'.format(end - start))

Output:

Previously, the number of threads and processes was set to 3; now change it to 4 and test again.

To better illustrate the point, increase the number of threads and processes further, to 5.

As for why the differences are small: both the test conditions (the computation is too simple) and the test environment affect the results.

(2) Example 2

Similarly, fetch web pages using a single thread, multithreading, and multiprocessing, but simply return the status_code.

from concurrent.futures import ThreadPoolExecutor,ProcessPoolExecutor
import time 
import requests

def download(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
                'Connection':'keep-alive',
                'Host':'example.webscraping.com'}
    response = requests.get(url, headers=headers)
    return response.status_code
    
if __name__ == '__main__':
    urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
               'http://example.webscraping.com/places/default/view/Aland-Islands-2',
               'http://example.webscraping.com/places/default/view/Albania-3',
               'http://example.webscraping.com/places/default/view/Algeria-4',
               'http://example.webscraping.com/places/default/view/American-Samoa-5']
               
    start = time.time()           
    result = list(map(download, urllist))
    end = time.time()
    print('status_code:',result)
    print('not used--timestamp:{:.3f}'.format(end-start))
    
    pool = ThreadPoolExecutor(max_workers = 3)
    start = time.time()           
    result = list(pool.map(download, urllist))
    end = time.time()
    print('status_code:',result)
    print('Using multithreading--timestamp:{:.3f}'.format(end-start))
    
    pool = ProcessPoolExecutor(max_workers = 3)
    start = time.time()           
    result = list(pool.map(download, urllist))
    end = time.time()
    print('status_code:',result)
    print('Using multiprocessing--timestamp:{:.3f}'.format(end-start))

Output:

You can see the difference at a glance

2. How does the Python virtual machine control code execution?

As an interpreted language, Python needs an interpreter that is both safe and efficient. We all know the problems that arise in multithreaded programming: the interpreter must prevent different threads from operating on internally shared data, while still giving user threads as much computing resource as possible. Python protects data safety with a global interpreter lock.

Execution of Python code is controlled by the Python virtual machine. Python first compiles source code (.py files) into bytecode (inside the virtual machine, bytecode corresponds to a PyCodeObject; a .pyc file is the on-disk representation of that bytecode) and hands it to the bytecode virtual machine, which then executes the instructions one by one to run the program. By design, only one thread can execute in the virtual machine at a time: although multiple threads can run in the Python interpreter, only one of them is executing at any given moment. Access to the virtual machine is controlled by the global interpreter lock (GIL), which guarantees that only one thread runs at a time.
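The compile-to-bytecode step described above can be observed directly with the standard-library dis module. A minimal sketch (the function add is just an illustration):

```python
import dis

def add(a, b):
    return a + b

# Disassemble the function's compiled bytecode; the interpreter's
# evaluation loop executes these instructions one by one.
dis.dis(add)

# The compiled PyCodeObject is reachable via __code__;
# co_code holds the raw bytecode as a bytes object.
print(type(add.__code__.co_code))
```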

In a multithreaded environment, the python virtual machine executes as follows:

(1) set GIL(global interpreter lock)

(2) switch to a thread for execution

(3) run: either a specified number of bytecode instructions is executed, or the thread voluntarily gives up control (for example, by calling time.sleep(0))

(4) set the thread to sleep

(5) unlock GIL

(6) repeat the above steps.
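One caveat on step (3): since CPython 3.2, the GIL is released on a time-based switch interval (default 5 ms) rather than after a fixed count of bytecode instructions. The interval can be inspected and tuned through sys, a small sketch:

```python
import sys

# Default switch interval is 0.005 seconds (5 ms) in CPython 3.2+.
print(sys.getswitchinterval())

# A shorter interval means more frequent thread switches:
# more responsiveness, but more switching overhead.
sys.setswitchinterval(0.001)
print(sys.getswitchinterval())

sys.setswitchinterval(0.005)  # restore the default
```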

Because of the GIL, Python cannot make full use of multi-core CPUs. For I/O-bound programs that call into the operating system's C code, the GIL is released before the I/O call, allowing other threads to run while the calling thread waits for I/O. A thread that performs little I/O, by contrast, holds the processor and the GIL for its entire time slice.
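This release-during-I/O behavior is easy to observe with time.sleep, which releases the GIL while waiting. A minimal sketch: four 0.5-second waits run in roughly 0.5 seconds total, not 2 seconds, because each thread drops the GIL while blocked.

```python
import threading
import time

def wait(seconds):
    # time.sleep releases the GIL, so sleeping threads overlap.
    time.sleep(seconds)

start = time.time()
threads = [threading.Thread(target=wait, args=(0.5,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Wall-clock time is close to one wait, not four waits added up.
print('elapsed: {:.2f} s'.format(time.time() - start))
```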

3. Is Python multithreading useful?

From the earlier examples and the discussion of the Python virtual machine, it should be clear that I/O-bound Python programs benefit from multithreading far more than compute-bound ones do. In short: avoid Python multithreading for compute-bound programs and use multiprocessing for concurrency instead; then the GIL is no longer a problem and you can make full use of multi-core CPUs.

(1) the GIL is not a bug, nor something Guido left in because of limited skill. Guido has said that attempts to achieve thread safety by other means instead of the GIL roughly halved Python's overall performance. Weighing the pros and cons, the GIL was the best choice: it was not an oversight, it was kept on purpose

(2) if you want to make Python faster and don't want to write C, use PyPy. That is the real killer

(3) coroutines and gevent can be used to improve CPU utilization
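The cooperative idea behind point (3) can be sketched with the standard-library asyncio instead of the third-party gevent (a deliberate substitution; gevent uses greenlets and monkey-patching, while asyncio uses explicit await points). Five simulated I/O waits run concurrently on a single thread:

```python
import asyncio

async def fake_io(i):
    # await yields control to the event loop, so other coroutines
    # run while this one is waiting: cooperative multitasking on
    # one thread, no GIL contention.
    await asyncio.sleep(0.2)
    return i

async def main():
    # gather runs the coroutines concurrently and preserves order.
    results = await asyncio.gather(*(fake_io(i) for i in range(5)))
    print(results)  # [0, 1, 2, 3, 4]

asyncio.run(main())
```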

4. The execution principle of Python multiprocessing

The ProcessPoolExecutor class uses the low-level mechanisms provided by the multiprocessing module. Taking example 2 as an illustration, the multiprocessing execution flow is as follows:

(1) each input item in urllist is passed to map

(2) the data is serialized into binary form with the pickle module

(3) the serialized data is sent from the main interpreter's process to a child interpreter's process over a local socket

(4) in the child process, pickle deserializes the binary data back into Python objects

(5) the Python module containing the download function is imported

(6) each child process computes on its own input data in parallel

(7) the results are serialized and converted into bytes

(8) those bytes are copied back to the main process over the socket

(9) the main process deserializes the bytes back into Python objects

(10) finally, the results from each child process are merged into a single list and returned to the caller.

multiprocessing is expensive because all communication between the main process and the child processes must be serialized and deserialized
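Steps (2)-(4) and (7)-(9) above amount to a pickle round trip. A minimal sketch of what crosses the process boundary (the task tuple here is only an illustration of the data being shipped):

```python
import pickle

task = ('download',
        'http://example.webscraping.com/places/default/view/Albania-3')

# Step (2): the parent serializes the input into bytes...
payload = pickle.dumps(task)
print(type(payload))  # <class 'bytes'>

# Step (4): ...and the child deserializes it back into a Python object.
func_name, url = pickle.loads(payload)
assert (func_name, url) == task

# The result travels back the same way (steps 7-9), which is why
# arguments and return values must be picklable, and why this round
# trip is the main overhead of multiprocessing.
result = pickle.loads(pickle.dumps(200))
print(result)  # 200
```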

Publisher: Full Stack Programmer (Stack Master). Please indicate the source when reprinting: https://javaforall.cn/120046.html Original link: https://javaforall.cn