Multithreading and multiprocessing
1. What are processes and threads?
Process: a running program. Every time we execute a program, the operating system automatically prepares the necessary resources for it (for example, allocating memory and creating an executable thread).
Thread: the flow of execution within a program that the CPU can schedule directly. It is the smallest unit the operating system can schedule, it is contained within a process, and it is the actual unit of execution inside the process.
Relationship between process and thread:
A process is a unit of resources; a thread is a unit of execution. It's like a company: the company's resources are its desks, chairs, benches, computers, and water dispensers. But for a company to actually be running, there must be people doing the work. The same is true in a program: the process holds all the resources the program needs to run, but for the program to actually run, a thread must be scheduled and executed by the CPU.
Every program we run has one thread by default. Even a hello-world level program spawns a thread when it executes.
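We can verify this with a minimal sketch (not part of the original example): the threading module reports the single default thread as MainThread.
import threading

# a bare script still runs inside exactly one thread
print(threading.current_thread().name)  # MainThread
print(threading.active_count())         # 1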
2. Multithreading
As the name suggests, multithreading means letting the program create multiple threads that run together. Take the company as an example: if a company has only one employee, efficiency won't be very high. How do you improve efficiency? Hire more people.
How do we implement multithreading? In Python there are two schemes.
1. Create a Thread directly with Thread
Let's first look at the effect of a single thread:
def func():
    for i in range(1000):
        print("func", i)


if __name__ == '__main__':
    func()
    for i in range(1000):
        print("main", i)
Now look at multithreading:
from threading import Thread


def func():
    for i in range(1000):
        print("func", i)


if __name__ == '__main__':
    t = Thread(target=func)
    t.start()
    for i in range(1000):
        print("main", i)
2. Inherit the Thread class
from threading import Thread


class MyThread(Thread):
    def run(self):  # override run(); start() invokes it
        for i in range(1000):
            print("func", i)


if __name__ == '__main__':
    t = MyThread()
    t.start()
    for i in range(1000):
        print("main", i)
The above two are the most basic schemes for creating threads in Python. Python also provides thread pools.
3. Thread pool
Python also provides a thread pool facility: you can create multiple threads at once, and we programmers don't need to maintain them manually; everything is managed automatically by the pool.
# Thread pool
from concurrent.futures import ThreadPoolExecutor


def fn(name):
    for i in range(1000):
        print(name, i)


if __name__ == '__main__':
    with ThreadPoolExecutor(10) as t:
        for i in range(100):
            t.submit(fn, name=f"thread {i}")
What if the task has a return value?
import time
from concurrent.futures import ThreadPoolExecutor


def func(name):
    time.sleep(2)
    return name


def do_callback(res):
    print(res.result())


if __name__ == '__main__':
    # scheme 1: add a callback to each submitted task
    with ThreadPoolExecutor(10) as t:
        names = ["thread 1", "thread 2", "thread 3"]
        for name in names:
            t.submit(func, name).add_done_callback(do_callback)

    # scheme 2: use map directly for task distribution;
    # the results are returned together at the end
    start = time.time()
    with ThreadPoolExecutor(10) as t:
        names = [5, 2, 3]
        # map returns results in the order the tasks were submitted.
        # The price: if the first task hasn't finished, later results must wait.
        results = t.map(func, names)
        for r in results:
            print("result", r)
    print(time.time() - start)
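If you want each result as soon as its task finishes instead of in submission order, the standard library also provides concurrent.futures.as_completed; a minimal sketch (an alternative, not part of the original lesson):
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def func(name):
    time.sleep(2)
    return name


if __name__ == '__main__':
    with ThreadPoolExecutor(10) as t:
        futures = [t.submit(func, n) for n in ["thread 1", "thread 2", "thread 3"]]
        # as_completed yields each future the moment it finishes,
        # regardless of the order the tasks were submitted in
        for fut in as_completed(futures):
            print("result", fut.result())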
4. Application of multithreading in crawlers
We'll keep using the Xinfadi case:
http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor


def get_page_source(url):
    resp = requests.get(url)
    return resp.text


def get_total_count():
    url = "http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml"
    source = get_page_source(url)
    tree = etree.HTML(source)
    # the last pagination link holds the total number of pages
    last_href = tree.xpath("//div[@class='manu']/a[last()]/@href")[0]
    total = last_href.split("/")[-1].split(".")[0]
    return int(total)


def download_content(url):
    source = get_page_source(url)
    tree = etree.HTML(source)
    # skip the header row of the quotation table
    trs = tree.xpath("//table[@class='hq_table']/tr[position() > 1]")
    result = []
    for tr in trs:
        tds = tr.xpath("./td/text()")
        result.append((tds[0], tds[1], tds[2], tds[3], tds[4], tds[5], tds[6]))
    return result


def main():
    f = open("data.csv", mode="w", encoding="utf-8")
    total = get_total_count()
    url_tpl = "http://www.xinfadi.com.cn/marketanalysis/0/list/{}.shtml"
    with ThreadPoolExecutor(50) as t:
        data = t.map(download_content, (url_tpl.format(i) for i in range(1, total + 1)))
        # get the return value of every task
        for item in data:
            # each task's data yields the rows one by one
            for detail in item:
                # write to file
                content = ",".join(detail) + "\n"
                print(content)
                f.write(content)
    f.close()


if __name__ == '__main__':
    main()
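One caveat with joining fields by hand: a field that itself contains a comma would corrupt the CSV. Python's csv module handles quoting automatically. A minimal sketch of an alternative write step (write_rows is our own hypothetical helper, not part of the lesson):
import csv


def write_rows(rows, path="data.csv"):
    # rows: an iterable of tuples like the ones download_content() returns
    with open(path, mode="w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes fields that contain commas
        writer.writerows(rows)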
3. Multiprocessing
After all, the value one company can create is limited. What to do? Open a branch office. That is multiprocessing. The scheme for implementing multiprocessing in Python is almost identical to that of multithreading. Very simple.
1. Create a Process directly with Process
from multiprocessing import Process


def func():
    for i in range(1000):
        print("func", i)


if __name__ == '__main__':
    p = Process(target=func)
    p.start()
    for i in range(1000):
        print("main", i)
2. Inherit the Process class
from multiprocessing import Process


class MyProcess(Process):
    def run(self):
        for i in range(1000):
            print("MyProcess", i)


if __name__ == '__main__':
    p = MyProcess()
    p.start()
    for i in range(1000):
        print("main", i)
3. Application of multiprocessing in crawlers
We seldom use multiprocessing directly. The most suitable situation for multiple processes is when several tasks need to run together, their data may flow between them, but the functions are relatively independent. For example, if we build a proxy IP pool ourselves, the IPs need to be grabbed from the network, and a captured IP can only be used after verification. Here the capture task and the verification task are two essentially independent functions, so we can start a separate process for each. Another example is image scraping: images usually sit in a web page's img tags, with the download address stored in the src attribute. Here we can adopt a multi-process solution: one process is responsible for crazily scanning pages for image download addresses, while another process is only responsible for downloading the images.
To sum up: when multiple tasks need to run in parallel but are relatively independent of each other (not necessarily completely independent), consider using multiple processes.
import requests
from urllib import parse
from lxml import etree
from multiprocessing import Process, Queue
from concurrent.futures import ThreadPoolExecutor


# Process 1: extract the download paths of the pictures from the picture site
def get_pic_src(q):
    print("start main page spider")
    url = "http://www.591mm.com/mntt/"
    resp = requests.get(url)
    tree = etree.HTML(resp.text)
    child_hrefs = tree.xpath("//div[@class='MeinvTuPianBox']/ul/li/a/@href")
    print("get hrefs from main page", child_hrefs)
    for href in child_hrefs:
        href = parse.urljoin(url, href)
        print("handle href", href)
        resp_child = requests.get(href)
        tree = etree.HTML(resp_child.text)
        pic_src = tree.xpath("//div[@id='picBody']//img/@src")[0]
        print(f"put {pic_src} to the queue")
        q.put(pic_src)
        # optional: paginated image capture
        # print("ready to another!")
        # others = tree.xpath('//ul[@class="articleV2Page"]')
        # if others:


# Process 2: download the pictures taken from the queue
def download(url):
    print("start download", url)
    name = url.split("/")[-1]
    resp = requests.get(url)
    with open(name, mode="wb") as f:
        f.write(resp.content)
    resp.close()
    print("downloaded", url)


def start_download(q):
    with ThreadPoolExecutor(20) as t:
        while True:
            t.submit(download, q.get())  # submit a download task


def main():
    q = Queue()
    p1 = Process(target=start_download, args=(q,))
    p2 = Process(target=get_pic_src, args=(q,))
    p1.start()
    p2.start()


if __name__ == '__main__':
    main()
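As written, start_download blocks on q.get() forever once the spider is done, so the program never exits. A common fix is a sentinel value: the producer puts a marker on the queue when it has nothing more to send, and the consumer breaks when it sees it. A minimal sketch (SENTINEL is our own marker; any unambiguous value works):
SENTINEL = "DONE"


def get_pic_src(q):
    # ... put every pic_src on the queue as before, then:
    q.put(SENTINEL)  # tell the consumer nothing more is coming


def start_download(q):
    with ThreadPoolExecutor(20) as t:
        while True:
            src = q.get()
            if src == SENTINEL:
                break  # leaving the with-block waits for running downloads
            t.submit(download, src)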