Multithreading and multiprocessing

Posted by jedney on Fri, 28 Jan 2022 17:09:53 +0100


1. What are processes and threads?

Process: a running program. Every time we execute a program, the operating system automatically prepares the necessary resources for it (for example, it allocates memory and creates a thread to execute the code).

Thread: an execution flow inside a program that the CPU can schedule directly. It is the smallest unit the operating system can schedule; it is contained within a process and is the actual unit of work in the process.

 

Relationship between process and thread:

A process is a unit of resources; a thread is a unit of execution. Think of a company: its resources are the desks, chairs, benches, computers, and water dispensers. But for the company to actually operate, someone has to do the work. Programs are the same: the process holds all the resources the program needs to run, but for the program to actually run, a thread must be scheduled and executed by the CPU.

Every program we run has one thread by default. Even a hello-world level program gets a thread created for it when it executes.
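
You can see this default thread directly (a minimal sketch using the standard threading module):

import threading

if __name__ == '__main__':
    # even a trivial program runs inside a thread: the main thread
    print(threading.current_thread().name)  # prints MainThread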

 

2. Multithreading

As the name suggests, multithreading means having the program spawn multiple threads that execute together. Take the company analogy: if a company has only one employee, efficiency will be limited. How do you improve it? Hire more people.

How do we implement multithreading? In Python there are two schemes.

 

1. Create a Thread directly with Thread

Let's first look at the single-threaded version:

def func():
    for i in range(1000):
        print("func", i)


if __name__ == '__main__':
    func()
    for i in range(1000):
        print("main", i)

Now the multithreaded version:

from threading import Thread


def func():
    for i in range(1000):
        print("func", i)


if __name__ == '__main__':
    t = Thread(target=func)
    t.start()
    for i in range(1000):
        print("main", i)

 

2. Inherit the Thread class

from threading import Thread


class MyThread(Thread):
    def run(self):
        for i in range(1000):
            print("func", i)


if __name__ == '__main__':
    t = MyThread()
    t.start()
    for i in range(1000):
        print("main", i)

The above are the two most basic ways to create threads in Python. Python also provides thread pools.

3. Thread pool

Python also provides thread pool functionality. You can create multiple threads at once, and you don't have to maintain them manually; everything is handed to the thread pool to manage automatically.

# Thread pool
from concurrent.futures import ThreadPoolExecutor


def fn(name):
    for i in range(1000):
        print(name, i)


if __name__ == '__main__':
    with ThreadPoolExecutor(10) as t:
        for i in range(100):
            t.submit(fn, name=f"thread {i}")

What if the task has a return value?


import time
from concurrent.futures import ThreadPoolExecutor


def func(name):
    time.sleep(2)
    return name


def do_callback(res):
    print(res.result())


if __name__ == '__main__':
    with ThreadPoolExecutor(10) as t:
        names = ["thread 1", "thread 2", "thread 3"]
        for name in names:
            # scheme 1: add a callback
            t.submit(func, name).add_done_callback(do_callback)

           
if __name__ == '__main__':
    start = time.time()
    with ThreadPoolExecutor(10) as t:
        names = [5, 2, 3]
        # scheme 2: use map to distribute the tasks; the results come back together at the end
        # map yields results in the order the tasks were passed in. The price: if the
        # first task has not finished yet, none of the later results are available either
        results = t.map(func, names)
        for r in results:
            print("result", r)
    print(time.time() - start)
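
If you would rather consume each result as soon as its task finishes, regardless of submission order, the standard library also provides as_completed (a minimal sketch):

import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def func(n):
    time.sleep(n)
    return n


if __name__ == '__main__':
    with ThreadPoolExecutor(10) as t:
        futures = [t.submit(func, n) for n in (5, 2, 3)]
        for fut in as_completed(futures):  # yields each future as it finishes
            print("result", fut.result())  # prints 2, 3, 5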

 

4. Applying multithreading in a crawler

http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml

We will use the Xinfadi case again.

import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor


def get_page_source(url):
    resp = requests.get(url)
    return resp.text


def get_total_count():
    url = "http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml"
    source = get_page_source(url)
    tree = etree.HTML(source)
    last_href = tree.xpath("//div[@class='manu']/a[last()]/@href")[0]
    total = last_href.split("/")[-1].split(".")[0]
    return int(total)


def download_content(url):
    source = get_page_source(url)
    tree = etree.HTML(source)
    trs = tree.xpath("//table[@class='hq_table']/tr[position() > 1]")
    result = []
    for tr in trs:
        tds = tr.xpath("./td/text()")
        result.append((tds[0], tds[1], tds[2], tds[3], tds[4], tds[5], tds[6]))
    return result


def main():
    f = open("data.csv", mode="w", encoding="utf-8")
    total = get_total_count()
    url_tpl = "http://www.xinfadi.com.cn/marketanalysis/0/list/{}.shtml"

    with ThreadPoolExecutor(50) as t:
        data = t.map(download_content, (url_tpl.format(i) for i in range(1, total + 1)))
        # collect the return value of every task
        for item in data:
            # each task returns the rows of one page
            for detail in item:
                # write one csv line per row
                content = ",".join(detail) + "\n"
                print(content)
                f.write(content)
    f.close()


if __name__ == '__main__':
    main()

 

3. Multiprocessing

After all, the value one company can create is limited. What then? Open a branch office. That is multiprocessing. The scheme for implementing multiprocessing in Python is almost identical to multithreading, and just as simple.

1. Create a Process directly with Process

from multiprocessing import Process


def func():
    for i in range(1000):
        print("func", i)


if __name__ == '__main__':
    p = Process(target=func)
    p.start()

    for i in range(1000):
        print("main", i)

2. Inherit the Process class

from multiprocessing import Process


class MyProcess(Process):
    def run(self):
        for i in range(1000):
            print("MyProcess", i)


if __name__ == '__main__':
    t = MyProcess()
    t.start()
    for i in range(1000):
        print("main", i)

 

3. Applying multiprocessing in a crawler

We seldom use multiprocessing directly. The best fit for multiprocessing is when several tasks need to run together and their data may flow between them, but their functions are relatively independent. For example, if we build our own proxy IP pool, we need to grab IPs from the web, and a captured IP can only be used after verification; the capture task and the verification task are two essentially independent functions, so we can start one process for each. Another example is image scraping: images usually sit in the img tags of a page, with the src attribute holding the download address. Here we can use a multi-process solution: one process is responsible for furiously scanning for image download addresses, while another process is only responsible for downloading the images.

To sum up: when multiple tasks need to execute in parallel but are relatively independent (not necessarily completely independent), consider using multiple processes.

# Process 1: extract picture download addresses from the picture site
import requests
from urllib import parse
from lxml import etree
from multiprocessing import Process, Queue
from concurrent.futures import ThreadPoolExecutor


def get_pic_src(q):
    print("start main page spider")
    url = "http://www.591mm.com/mntt/"
    resp = requests.get(url)
    tree = etree.HTML(resp.text)
    child_hrefs = tree.xpath("//div[@class='MeinvTuPianBox']/ul/li/a/@href")
    print("get hrefs from main page", child_hrefs)
    for href in child_hrefs:
        href = parse.urljoin(url, href)
        print("handle href", href)
        resp_child = requests.get(href)
        tree = etree.HTML(resp_child.text)
        pic_src = tree.xpath("//div[@id='picBody']//img/@src")[0]
        print(f"put {pic_src} to the queue")
        q.put(pic_src)
        # optional: paginated image capture
        # print("ready to another!")
        # others = tree.xpath('//ul[@class="articleV2Page"]')
        # if others:


# Process 2: download the pictures
def download(url):
    print("start download", url)
    name = url.split("/")[-1]
    resp = requests.get(url)
    with open(name, mode="wb") as f:
        f.write(resp.content)
    resp.close()
    print("downloaded", url)


def start_download(q):
    with ThreadPoolExecutor(20) as t:
        while True:
            t.submit(download, q.get())  # note: this loop never exits; the fuller version at the end of this post stops on a sentinel value

           
def main():
    q = Queue()
    p1 = Process(target=start_download, args=(q,))
    p2 = Process(target=get_pic_src, args=(q,))
    p1.start()
    p2.start()


if __name__ == '__main__':
    main()
#######################################
#######################################  thread 
#######################################
from threading import Thread  # thread 

# # 1. Define what task the thread will do
# def func():
#     for i in range(1000):
#         print("Child thread", i)
#
#
# # 2. In main, create the child thread
# if __name__ == '__main__':  # this guard is required
#     # t = Thread(target=func)  # Create a child thread that has not been executed
#     # # Start a thread
#     # t.start()
#     # # The main thread continues to execute
#     # for i in range(1000):
#     #     print("main thread", i)
#     t1 = Thread(target=func)
#     t2 = Thread(target=func)
#     t1.start()
#     t2.start()


# def func(url):
#     # do the crawler work here
#     print("crawler working on", url)
#
# if __name__ == '__main__':
#     urls = ["first", "the second", "Third"]
#     for u in urls:
#         # Note: more threads is not automatically better. A common rule of thumb: CPU cores * 4
#         t = Thread(target=func, args=(u, ))  # args passes parameters to the thread, but must be a tuple
#         t.start()


#
# class MyThread(Thread):  # define your own class that inherits Thread
#
#     def __init__(self, name):
#         super(MyThread, self).__init__()
#         self.name = name
#
#     def run(self):  # fixed name; you must override the run method
#         for i in range(1000):
#             print(self.name, i)
#
#
# if __name__ == '__main__':
#     t1 = MyThread("Thread 1")
#     t2 = MyThread("Thread 2")
#
#     t1.start()
#     t2.start()

#######################################
#######################################   Thread pool
#######################################
from concurrent.futures import ThreadPoolExecutor
import time
import random
# def func(name):
#     for i in range(100):
#         print(name, i)
#
#
# if __name__ == '__main__':
#     with ThreadPoolExecutor(5) as t:
#         t.submit(func, "Thread 1")  # Submit submit
#         t.submit(func, "Thread 2")  # Submit submit
#         t.submit(func, "Thread 3")  # Submit submit
#         t.submit(func, "Thread 4")  # Submit submit
#         t.submit(func, "Thread 5")  # Submit submit
#         t.submit(func, "Thread 6")  # Submit submit
#         t.submit(func, "Thread 7")  # Submit submit
#         t.submit(func, "Thread 8")  # Submit submit
#         t.submit(func, "Thread 9")  # Submit submit


def func(name):
    # for i in range(100):
    #     print(name, i)
    time.sleep(random.randint(1,3))
    return name

def fn(res):
    print(res.result())  # with callbacks, results arrive in completion order, not submission order


if __name__ == '__main__':
    task_list = ["Thread 2", "Thread 3", "Thread 4", "Thread 5", "Thread 6"]
    with ThreadPoolExecutor(3) as t:
        # for task in task_list:
        #     t.submit(func, task).add_done_callback(fn)  # Submit task directly
        result = t.map(func, task_list)  # Submit a bunch of tasks directly
        for r in result:
            print(r)


#######################################
#######################################  Thread pool application
#######################################
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor
import time

# csv: comma-separated values, e.g.
# Zhou Runfa,Li Jiacheng,Li Jiaqi,
f = open("data.csv", mode="w", encoding='utf-8')


def download_xinfadi(url):
    resp = requests.get(url)
    content = resp.text
    tree = etree.HTML(content)
    # tr_list = tree.xpath("//table[@class='hq_table']/tr")[1:]
    tr_list = tree.xpath("//table[@class='hq_table']/tr[position()>1]")
    for tr in tr_list:   # Each line
        tds = tr.xpath("./td/text()")
        f.write(",".join(tds))
        f.write("\n")


if __name__ == '__main__':
    start = time.time()
    with ThreadPoolExecutor(30) as t:
        for i in range(1, 16):
            url = f"http://www.xinfadi.com.cn/marketanalysis/0/list/{i}.shtml"
            # download_xinfadi(url)
            t.submit(download_xinfadi, url)
    print("Multithreading is used", time.time() - start)
    start = time.time()
    for i in range(1, 16):
        url = f"http://www.xinfadi.com.cn/marketanalysis/0/list/{i}.shtml"
        download_xinfadi(url)
    print("Single threaded", time.time() - start)

    f.close()



#######################################
#######################################   process
#######################################

from multiprocessing import Process
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def func(name):
    for i in range(1000):
        print(name, i)


if __name__ == '__main__':
    p1 = Process(target=func, args=("Process 1",))
    p2 = Process(target=func, args=("Process 2",))
    p1.start()
    p2.start()
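
ProcessPoolExecutor is imported above and mirrors the ThreadPoolExecutor API; a minimal sketch (note that the worker function must live at module top level so the child processes can import it):

from concurrent.futures import ProcessPoolExecutor  # already imported above


def square(n):
    return n * n


if __name__ == '__main__':
    with ProcessPoolExecutor(4) as p:
        for r in p.map(square, range(10)):  # same map/submit interface as ThreadPoolExecutor
            print(r)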


# Tasks that are highly similar: use multithreading
# Tasks that are almost unrelated: use multiprocessing
# Example: building an IP proxy pool
# 1. Grab IPs from the major free proxy sites
# 2. Verify that each grabbed IP actually works
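
A minimal sketch of that proxy-pool idea, assuming hypothetical fetch_free_proxies and check_proxies helpers; the actual site scraping and validation requests are left out:

from multiprocessing import Process, Queue


def fetch_free_proxies(q):
    # hypothetical producer: would scrape candidate IPs from free proxy sites
    for ip in ["1.2.3.4:8080", "5.6.7.8:3128"]:  # placeholder data, not real proxies
        q.put(ip)
    q.put(None)  # sentinel: no more candidates


def check_proxies(q):
    # hypothetical consumer: verify each IP before it enters the pool
    while True:
        ip = q.get()
        if ip is None:  # sentinel reached: producer is done
            break
        print("would verify and store", ip)


if __name__ == '__main__':
    q = Queue()
    p1 = Process(target=fetch_free_proxies, args=(q,))
    p2 = Process(target=check_proxies, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()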



#######################################
#######################################   Process application
#######################################
"""

The following remarks apply only to today's case:

Process 1: visit the main page and collect the detail-page urls from it,
    then visit each detail page and extract the picture download address.

Process 2: download the pictures in batches.



Communication between the processes:
  a queue

"""
import requests
from urllib import parse  # for joining urls (urljoin)
from lxml import etree
from multiprocessing import Process, Queue
from concurrent.futures import ThreadPoolExecutor


def get_img_src(q):
    url = "http://www.591mm.com/mntt/6.html"
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    # print(resp.text)
    tree = etree.HTML(resp.text)
    href_list = tree.xpath("//div[@class='MeinvTuPianBox']/ul/li/a[1]/@href")
    for href in href_list:
        # http://www.591mm.com/mntt/6.html
        # /mntt/hgmn/307626.html
        # join the relative href onto the base url
        child_url = parse.urljoin(url, href)
        # print(child_url)
        resp_child = requests.get(child_url)
        resp_child.encoding = "utf-8"
        child_tree = etree.HTML(resp_child.text)
        src = child_tree.xpath("//img[@id='mouse_src']/@src")[0]
        q.put(src)  # put the address into the queue
    q.put("OK Yes")  # sentinel: tells the consumer nothing more is coming


def download(url):
    file_name = url.split("/")[-1]
    with open(file_name, mode="wb") as f:
        resp = requests.get(url)
        f.write(resp.content)  # Download complete


def download_all(q):
    # Create thread pool in process
    with ThreadPoolExecutor(10) as t:
        while 1:
            src = q.get()  # take an address from the queue
            if src == "OK Yes":  # sentinel reached: the producer is done
                break
            print(src)
            t.submit(download, src)


if __name__ == '__main__':
    q = Queue()
    p1 = Process(target=get_img_src, args=(q,))
    p2 = Process(target=download_all, args=(q,))

    p1.start()
    p2.start()