Python crawler tutorial 13-100: Doutula emoticon pack multi-threaded crawling

Posted by jackie111 on Mon, 02 Dec 2019 17:25:43 +0100

A few words up front

Browsing CSDN today, I noticed that a lot of people are crawling a website called Doutula, which hosts a huge number of emoticon packs, and the implementations vary quite a bit. So today I'll write a multi-threaded (asynchronous, really) version for you. The key technology is aiohttp; you can take a look at my earlier article on it first and then follow along here.

I won't analyze the website in depth; we just need to find the URL pattern, splice the page URLs together, match the key elements, and crawl.

On to the code

First, quickly import the modules we need. Unlike other write-ups, I save all images belonging to the same expression pack into their own folder, so the os module is needed as well.

import asyncio
import aiohttp
from lxml import etree
import os
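The os module is only used for folder handling. As a side note, a minimal sketch of an alternative you could use further down (the ensure_dir name is my own, not from the original code): os.makedirs with exist_ok=True creates missing parent directories too, so the biaoqings folder would not have to be created by hand first.

import os

def ensure_dir(path):
    # Create the folder and any missing parents; do nothing if it already exists
    os.makedirs(path, exist_ok=True)

ensure_dir('./biaoqings/demo-title')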

Write the main entry method

if __name__ == '__main__':
    url_format = "http://www.doutula.com/article/list/?page={}"
    # Splice together the list-page URLs, pages 1 to 585
    urls = [url_format.format(index) for index in range(1, 586)]
    loop = asyncio.get_event_loop()
    tasks = [x_get_face(url) for url in urls]
    results = loop.run_until_complete(asyncio.wait(tasks))
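One caveat: passing bare coroutines to asyncio.wait was removed in Python 3.11. If you are on a newer Python, a minimal equivalent entry point (same URLs, same x_get_face as defined below) would look like this:

async def main():
    url_format = "http://www.doutula.com/article/list/?page={}"
    urls = [url_format.format(index) for index in range(1, 586)]
    # gather accepts coroutines directly, and asyncio.run manages the event loop
    await asyncio.gather(*(x_get_face(url) for url in urls))

if __name__ == '__main__':
    asyncio.run(main())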

We are here to learn, not to hammer someone else's server, so limit the number of concurrent requests with a semaphore.

sema = asyncio.Semaphore(3)  # At most 3 list pages are processed at the same time

async def x_get_face(url):
    # "async with" is the idiomatic way to acquire an asyncio.Semaphore;
    # the old "with (await sema)" form was deprecated and removed in Python 3.10
    async with sema:
        await get_face(url)
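If you want to convince yourself the limit actually works, here is a tiny standalone sketch (the _worker and _demo names are made up for this illustration, not part of the crawler): at most three "start" lines should be printed before the first "done" appears.

import asyncio

async def _worker(sem, i):
    async with sem:
        print("start", i)
        await asyncio.sleep(1)   # Pretend to download something
        print("done", i)

async def _demo():
    sem = asyncio.Semaphore(3)   # Created inside the running loop
    await asyncio.gather(*(_worker(sem, i) for i in range(10)))

if __name__ == '__main__':
    asyncio.run(_demo())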

Finally, go at it like a tiger and finish off the rest of the code. There is nothing new in this part: find the image links and download them.

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}
async def get_face(url):
    print("In operation{}".format(url))
    async with aiohttp.ClientSession() as s:
        async with s.get(url,headers=headers,timeout=5) as res:
            if res.status==200:
                html = await res.text()
                html_format = etree.HTML(html)

                hrefs = html_format.xpath("//a[@class='list-group-item random_list']")

                for link in hrefs:
                    url = link.get("href")
                    title = link.xpath("div[@class='random_title']/text()")[0]  # Get the title of the expression pack

                    path = './biaoqings/{}'.format(title.strip())  # Hard-coded path: create a biaoqings folder in the project root first

                    if not os.path.exists(path):
                        os.mkdir(path)

                    async with s.get(url, headers=headers, timeout=3) as res:
                        if res.status == 200:
                            new_html = await res.text()

                            new_html_format = etree.HTML(new_html)
                            imgs = new_html_format.xpath("//div[@class='artile_des']")
                            for img in imgs:
                                try:
                                    img = img.xpath("table//img")[0]
                                    img_down_url = img.get("src")
                                    img_title = img.get("alt")
                                except Exception as e:
                                    print(e)
                                    continue  # Skip this item, otherwise img_down_url/img_title below would be undefined or stale

                                async with s.get(img_down_url, timeout=3) as res:
                                    img_data = await res.read()
                                    try:
                                        with open("{}/{}.{}".format(path,img_title.replace('\r\n',""),img_down_url.split('.')[-1]),"wb+") as file:
                                            file.write(img_data)
                                    except Exception as e:
                                        print(e)

                        else:
                            pass


            else:
                print("Page access failed")

Now just wait, and a huge pile of emoticon packs will come rolling into your folder.

Crawler source download address

Topics: Python Windows