Python multi-process learning

Posted by dominicd on Sat, 08 Jun 2019 23:20:03 +0200

When we run into a CPU-intensive scenario, we can consider solving it with multiple processes.
As an example, I wrote a txt file myself: it stores the numbers 1-999999 in sequence, written out three times in a row, and I then count how many times the number 999999 appears. This task is computation-heavy, i.e. a CPU-intensive scenario, so we can try multiple processes.
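
For reference, test files like these can be produced with a short script such as the one below. This is only a sketch of the setup as I understand it: the file names and the one-number-per-line layout are assumptions based on the description above, not taken from the original data.

# Sketch: generate the sample files (assumed layout: one number per line,
# with the sequence 1-999999 written three times into each file, so the
# string '999999' appears three times per file).
file_names = ['00001.txt', '00002.txt', '00003.txt',
              '00004.txt', '00005.txt', '00006.txt']

for name in file_names:
    with open(name, 'w') as f:
        for _ in range(3):                 # three passes over the sequence
            for n in range(1, 1000000):    # 1 .. 999999
                f.write('%d\n' % n)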

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import time
import multiprocessing
import os


def get_count_number(file_path):
    # Count how many lines of the file contain the string '999999'.
    count_num = 0
    print 'the process pid is %s and the parent pid is %s : ' % (os.getpid(), os.getppid())
    with open(file_path) as f:
        lines = f.readlines()   # renamed from str to avoid shadowing the built-in
        for one_line in lines:
            if '999999' in one_line:
                count_num += 1
    print count_num


# 1. Read the three files sequentially and time it.
start_time = time.time()
txt_list = ['00001.txt', '00002.txt', '00003.txt']

for file_path in txt_list:
    get_count_number(file_path)

end_time = time.time()
print '##########'
print 'the total time to run is: ', end_time - start_time

# 2. Read the same files with one process per file and time it.
multi_start_time = time.time()
process_list = []
for each_file in txt_list:
    each_process = multiprocessing.Process(target=get_count_number, args=(each_file,))
    process_list.append(each_process)

# Start every process first, then join them all, so the three reads overlap.
for each_p in process_list:
    each_p.start()

for each_p in process_list:
    each_p.join()

multi_end_time = time.time()
print '##########'
print 'multi processing time is: ', multi_end_time - multi_start_time

We start three processes, one per file; the results are as follows:

the process pid is 79873 and the parent pid is 78757 : 
3
the process pid is 79873 and the parent pid is 78757 : 
3
the process pid is 79873 and the parent pid is 78757 : 
3
##########
the total time to run is:  1.18405485153
the process pid is 79874 and the parent pid is 79873 : 
the process pid is 79875 and the parent pid is 79873 : 
the process pid is 79876 and the parent pid is 79873 : 
3
3
3
##########
multi processing time is:  0.489647865295

We can see that reading the three files sequentially takes 1.184 seconds, while in multi-process mode it takes 0.49 seconds.
With more files, however, more concurrent processes is not automatically better. There is an optimal degree of concurrency, and it depends on the configuration of the machine running the program. This is where the process pool comes in: only a fixed number of processes run concurrently at any given time.
In the next example I read six files with the process pool set to three.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import time
import multiprocessing
import os


def get_count_number(file_path):
    # Count how many lines of the file contain the string '999999'.
    count_num = 0
    print 'the process pid is %s and the parent pid is %s : ' % (os.getpid(), os.getppid())
    with open(file_path) as f:
        lines = f.readlines()   # renamed from str to avoid shadowing the built-in
        for one_line in lines:
            if '999999' in one_line:
                count_num += 1
    print count_num


# 1. Read the six files sequentially and time it.
start_time = time.time()
txt_list = ['00001.txt', '00002.txt', '00003.txt', '00004.txt', '00005.txt', '00006.txt']

for file_path in txt_list:
    get_count_number(file_path)

end_time = time.time()
print '##########'
print 'the total time to run is: ', end_time - start_time


# 2. Read the same files through a pool of 3 worker processes and time it.
multi_start_time = time.time()
pool = multiprocessing.Pool(processes=3)
for each_file in txt_list:
    pool.apply_async(func=get_count_number, args=(each_file,))

pool.close()   # no more tasks will be submitted to the pool
pool.join()    # wait for all worker processes to finish
multi_end_time = time.time()
print '##########'
print 'multi processing time is: ', multi_end_time - multi_start_time

The results are as follows:

the process pid is 79882 and the parent pid is 78757 : 
3
the process pid is 79882 and the parent pid is 78757 : 
3
the process pid is 79882 and the parent pid is 78757 : 
3
the process pid is 79882 and the parent pid is 78757 : 
3
the process pid is 79882 and the parent pid is 78757 : 
3
the process pid is 79882 and the parent pid is 78757 : 
3
##########
the total time to run is:  2.45317697525
the process pid is 79883 and the parent pid is 79882 : 
the process pid is 79884 and the parent pid is 79882 : 
the process pid is 79885 and the parent pid is 79882 : 
3
3
3
the process pid is 79885 and the parent pid is 79882 : 
the process pid is 79884 and the parent pid is 79882 : 
the process pid is 79883 and the parent pid is 79882 : 
3
3
3
##########
multi processing time is:  1.25623202324

As you can see, it takes 2.45 seconds to read six files sequentially, and 1.26 seconds to execute them concurrently in a process pool.
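
A side note on the pool example above: apply_async() returns an AsyncResult object, so the counts can also be collected in the parent process instead of only being printed inside the workers. Below is a minimal sketch; it assumes a variant of get_count_number() that returns count_num rather than printing it.

# Sketch: collect the per-file counts in the parent process.
import multiprocessing

def get_count_number(file_path):
    # Same counting logic as above, but the count is returned instead of printed.
    count_num = 0
    with open(file_path) as f:
        for one_line in f:
            if '999999' in one_line:
                count_num += 1
    return count_num

if __name__ == '__main__':
    txt_list = ['00001.txt', '00002.txt', '00003.txt',
                '00004.txt', '00005.txt', '00006.txt']
    pool = multiprocessing.Pool(processes=3)
    results = [pool.apply_async(get_count_number, (each_file,)) for each_file in txt_list]
    pool.close()
    pool.join()
    counts = [r.get() for r in results]   # AsyncResult.get() returns the worker's return value
    print counts                          # e.g. [3, 3, 3, 3, 3, 3]
    print sum(counts)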

With the number of pool processes set to 2, i.e. pool = multiprocessing.Pool(processes=2), the results are as follows:

the process pid is 79893 and the parent pid is 78757 : 
3
the process pid is 79893 and the parent pid is 78757 : 
3
the process pid is 79893 and the parent pid is 78757 : 
3
the process pid is 79893 and the parent pid is 78757 : 
3
the process pid is 79893 and the parent pid is 78757 : 
3
the process pid is 79893 and the parent pid is 78757 : 
3
##########
the total time to run is:  2.41508388519
the process pid is 79894 and the parent pid is 79893 : 
the process pid is 79895 and the parent pid is 79893 : 
3
3
the process pid is 79894 and the parent pid is 79893 : 
the process pid is 79895 and the parent pid is 79893 : 
3
3
the process pid is 79895 and the parent pid is 79893 : 
the process pid is 79894 and the parent pid is 79893 : 
3
3
##########
multi processing time is:  1.55285310745

As you can see, with a pool of two processes it takes 1.55 seconds to read the six files, longer than with a pool of three. That is easy to understand: three workers running at the same time finish sooner than two. However, a larger pool is not always faster. When reading 500 files, for example, a pool of 100 processes is not necessarily better than a pool of 50; the best pool size depends on the machine running the program.
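
A reasonable starting point for choosing that size (an assumption of mine, not something measured in this post) is to derive it from the number of CPU cores instead of hard-coding it; multiprocessing.Pool() in fact defaults to multiprocessing.cpu_count() worker processes when no size is given. A minimal sketch:

# Sketch: size the pool from the machine's CPU count instead of hard-coding it.
import multiprocessing

def get_count_number(file_path):
    # Same counting logic as in the examples above, repeated so the sketch is self-contained.
    count_num = 0
    with open(file_path) as f:
        for one_line in f:
            if '999999' in one_line:
                count_num += 1
    print count_num

if __name__ == '__main__':
    num_workers = multiprocessing.cpu_count()           # number of CPUs the OS reports
    pool = multiprocessing.Pool(processes=num_workers)  # Pool() with no argument uses the same default
    for each_file in ['00001.txt', '00002.txt', '00003.txt',
                      '00004.txt', '00005.txt', '00006.txt']:
        pool.apply_async(func=get_count_number, args=(each_file,))
    pool.close()
    pool.join()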
Summary: when we encounter a CPU-intensive (computation-heavy) scenario, multi-process execution is worth considering.

Topics: Python