Up front
Today, browsing CSDN, I noticed a lot of people are crawling a website called doutula, which hosts a huge number of expression packs (sticker packs), and their write-ups use all sorts of approaches. So today I'll implement an asynchronous, coroutine-based version for you. The key technology is aiohttp; if it's new to you, take a look at my earlier article on it first, then come back.
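In case you haven't read that article, here is a minimal aiohttp sketch of mine, just to show the call shape (the URL is only a placeholder and any page works):

```python
import asyncio
import aiohttp

async def fetch(url):
    # One ClientSession per request is wasteful in real code, but it
    # keeps this demo self-contained.
    async with aiohttp.ClientSession() as session:
        async with session.get(url, timeout=5) as res:
            return await res.text()

loop = asyncio.get_event_loop()
html = loop.run_until_complete(fetch("http://www.doutula.com"))
print(len(html))
```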
I won't analyze the site in depth. The plan is simply: find the URL pattern, splice together the page URLs, match the key elements, and crawl.
On to the code
First, quickly import the modules we need. Unlike other write-ups, I save each sticker set into its own folder, so the os module is needed as well.
```python
import asyncio
import os

import aiohttp
from lxml import etree
```
Next, write the main entry point.
```python
if __name__ == '__main__':
    url_format = "http://www.doutula.com/article/list/?page={}"
    urls = [url_format.format(index) for index in range(1, 586)]
    loop = asyncio.get_event_loop()
    tasks = [x_get_face(url) for url in urls]
    results = loop.run_until_complete(asyncio.wait(tasks))
```
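A side note: on Python 3.7 and newer the same entry point can be written with asyncio.run and asyncio.gather. This is my sketch, not part of the original, and it assumes the x_get_face defined below:

```python
async def main():
    url_format = "http://www.doutula.com/article/list/?page={}"
    urls = [url_format.format(index) for index in range(1, 586)]
    # gather returns results in order and propagates exceptions
    await asyncio.gather(*(x_get_face(url) for url in urls))

asyncio.run(main())  # one call replaces the get_event_loop boilerplate
```

On some Python versions, asyncio primitives like the semaphore below are happiest when created while the loop is already running, so moving `sema = asyncio.Semaphore(3)` inside `main()` is the safer layout.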
We want to learn, not attack someone else's server, so cap the number of concurrent requests with a semaphore.
```python
sema = asyncio.Semaphore(3)

async def x_get_face(url):
    # acquire one of the three slots before crawling a page
    async with sema:
        await get_face(url)
```
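If the semaphore's effect isn't obvious, here is a tiny self-contained demo of mine (not from the article): nine workers, but only three are ever inside the block at the same time.

```python
import asyncio

async def worker(sema, i):
    async with sema:            # only 3 workers get past this line at once
        print("start", i)
        await asyncio.sleep(1)
        print("done", i)

async def demo():
    sema = asyncio.Semaphore(3)
    await asyncio.gather(*(worker(sema, i) for i in range(9)))

asyncio.get_event_loop().run_until_complete(demo())  # prints in waves of three
```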
Finally, the part where the operation is as fierce as a tiger: the rest of the code in one go. There is nothing new in this part — find the image links and download them.
```python
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
}

async def get_face(url):
    print("Fetching {}".format(url))
    async with aiohttp.ClientSession() as s:
        async with s.get(url, headers=headers, timeout=5) as res:
            if res.status != 200:
                print("Page access failed")
                return
            html = await res.text()
            html_format = etree.HTML(html)
            hrefs = html_format.xpath("//a[@class='list-group-item random_list']")
            for link in hrefs:
                detail_url = link.get("href")
                # the set title becomes the folder name
                title = link.xpath("div[@class='random_title']/text()")[0]
                # hard-coded: create a biaoqings folder in the project root first
                path = './biaoqings/{}'.format(title.strip())
                if not os.path.exists(path):
                    os.mkdir(path)
                async with s.get(detail_url, headers=headers, timeout=3) as detail_res:
                    if detail_res.status != 200:
                        continue
                    new_html = await detail_res.text()
                    new_html_format = etree.HTML(new_html)
                    imgs = new_html_format.xpath("//div[@class='artile_des']")
                    for img in imgs:
                        try:
                            img = img.xpath("table//img")[0]
                            img_down_url = img.get("src")
                            img_title = img.get("alt")
                        except Exception as e:
                            print(e)
                            continue  # no image in this block, skip it
                        async with s.get(img_down_url, timeout=3) as img_res:
                            img_data = await img_res.read()
                            try:
                                name = "{}/{}.{}".format(
                                    path,
                                    img_title.replace('\r\n', ""),
                                    img_down_url.split('.')[-1])
                                with open(name, "wb+") as file:
                                    file.write(img_data)
                            except Exception as e:
                                print(e)
```
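One wrinkle worth flagging: the alt text that becomes the file name can contain characters that are illegal in Windows file names (`:`, `?`, `*` and friends). A small helper can be dropped in where the file name is built; `safe_name` is my hypothetical name, not part of the original code:

```python
import re

def safe_name(title):
    # strip characters that are invalid in Windows file names,
    # plus the stray newlines the original handled with replace()
    return re.sub(r'[\\/:*?"<>|\r\n]', "", title).strip()
```

Then build the path with `safe_name(img_title)` instead of `img_title.replace('\r\n', "")`.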
Now just wait, and a mountain of sticker packs comes rolling into my bowl.