Review of the "100 Crawler Cases" column, article 3 in the series
Case 9: collecting data from the Hebei Sunshine Governance complaint section
Unfortunately, the original target website is no longer accessible. The new module introduced in this case is lxml, so the case is centered on learning that module.
Since the original page can't be reached, switch to another channel on the same site: http://yglz.tousu.hebnews.cn/shss-1.html.
In the original case, the collected data was finally stored in MongoDB. This review focuses on capturing the data; for the storage step you can refer to the original case (a minimal pymongo sketch also follows the code below).
import requests
import random
from lxml import etree  # Import etree from lxml

ua = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
]
headers = {
    'user-agent': random.choice(ua)  # Pick a random User-Agent from the pool
}

# Test loop: crawl pages 1-9
for i in range(1, 10):
    response = requests.get(f"http://yglz.tousu.hebnews.cn/shss-{i}.html", headers=headers)
    html = response.content.decode("utf-8")
    print("*" * 200)
    tree = etree.HTML(html)  # Parse the HTML
    divs = tree.xpath('//div[@class="listcon"]')  # Select the list-area divs
    for div in divs:  # Iterate over the items in this area
        try:
            # Note: the xpath lookups below are relative to div; wrap them in try
            # so one malformed item does not stop the whole loop
            shouli = div.xpath('span[1]/p/a/text()')[0]   # Accepting unit
            content = div.xpath('span[2]/p/a/text()')[0]  # Complaint content
            datetime = div.xpath('span[3]/p/text()')[0].replace("\n", "")  # Time
            status = div.xpath('span[5]/p/text()')[0].replace("\n", "")    # Status
            one_data = {
                "shouli": shouli,
                "content": content,
                "datetime": datetime,
                "status": status,
            }
            print(one_data)  # Print the data; store it in MongoDB if needed
        except Exception as e:
            print("Internal data error")
            print(div)
            continue
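For the storage step referred to above, a minimal pymongo sketch is shown below. It assumes a locally running MongoDB instance; the database name "yglz" and collection name "complaints" are placeholders, not the original case's values.

from pymongo import MongoClient

# Minimal MongoDB storage sketch, assuming a local instance.
# The database/collection names are placeholders.
client = MongoClient("mongodb://localhost:27017")
collection = client["yglz"]["complaints"]

def save_one(one_data):
    # one_data is the dict built in the crawling loop above
    collection.insert_one(one_data)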
The target website has switched from asynchronous requests to synchronous data loading, and the original anti-crawling restrictions have been removed. The site now loads noticeably faster, which is great.
Case 10: image crawler
This case mainly uses Queue. The original blog post was missing some of the global variables involved and had one variable name misspelled; both have been fixed in this review.
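Since the repaired source is linked below, only a minimal sketch of the Queue-based producer/consumer pattern the case relies on is given here. The image URL, file naming, and thread count are placeholder assumptions, not the original case's values.

import threading
import queue
import requests

img_queue = queue.Queue()  # Image URLs waiting to be downloaded

def producer(image_urls):
    # In the real case the producer parses listing pages and extracts image URLs;
    # in this sketch it simply pushes already-known URLs into the queue.
    for url in image_urls:
        img_queue.put(url)

def consumer():
    while True:
        try:
            img_url = img_queue.get(timeout=5)  # Stop once the queue stays empty
        except queue.Empty:
            break
        res = requests.get(img_url, timeout=10)
        with open(img_url.split("/")[-1], "wb") as f:  # Save under the file name from the URL
            f.write(res.content)
        img_queue.task_done()

if __name__ == "__main__":
    producer(["https://example.com/demo-1.jpg"])  # Placeholder URL
    workers = [threading.Thread(target=consumer) for _ in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()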
Original case address: https://dream.blog.csdn.net/article/details/83017079
Source code after the review: https://codechina.csdn.net/hihell/scrapy (Case 10)
Case 11: the Zhou Du reading-site crawler, modified into a Shuban (Book Companion) crawler
Sites like this mainly collect and share e-books, which often raises copyright issues. The site's owner has since moved to a WeChat official account and started charging for content, so this case is no longer usable as written.
Reviving the case is simple: just pick any online e-book platform instead, for example the Shuban (Book Companion) site: http://www.shuban.net/list-19-1.html
This case uses the asyncio and aiohttp modules. During the review I found that the original introduction of these modules was thin; the detailed knowledge points are covered systematically in the newer "120 crawler cases" series. For this review, the goal is simply to get the case adjusted and working again.
The core code is as follows. When reading it, focus on how the tasks list is built; the data-parsing part is incomplete and is just basic lxml extraction.
import requests
from lxml import etree
# Import the coroutine modules
import asyncio
import aiohttp

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Host": "www.shuban.net",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}


async def get_content(url):
    print("Operating: {}".format(url))
    # Create a session and fetch the page
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as res:
            if res.status == 200:
                source = await res.text()  # Wait for the response text
                tree = etree.HTML(source)
                await async_content(tree)


async def async_content(tree):
    title = tree.xpath("//h1[@class='title']/a/text()")[0]
    print(title)
    # If the page carries no information, return directly
    # if title == '':
    #     return
    # else:
    #     try:
    #         description = tree.xpath("//div[@class='hanghang-shu-content-font']")
    #         author = description[0].xpath("p[1]/text()")[0].replace("Author:", "") if description[0].xpath("p[1]/text()")[0] is not None else None
    #         cate = description[0].xpath("p[2]/text()")[0].replace("Category:", "") if description[0].xpath("p[2]/text()")[0] is not None else None
    #         douban = description[0].xpath("p[3]/text()")[0].replace("Douban score:", "") if description[0].xpath("p[3]/text()")[0] is not None else None
    #         # The meaning of this field is unclear, so it is not recorded
    #         # des = description[0].xpath("p[5]/text()")[0] if description[0].xpath("p[5]/text()")[0] is not None else None
    #         download = tree.xpath("//a[@class='downloads']")
    #     except Exception as e:
    #         print(title)
    #         return
    #     ls = [
    #         title, author, cate, douban, download[0].get('href')
    #     ]
    #     return ls


if __name__ == '__main__':
    url_format = "https://www.shuban.net/read-{}.html"
    full_urllist = [url_format.format(i) for i in range(50773, 50783)]  # Ten detail pages; widen the range yourself for more data
    loop = asyncio.get_event_loop()
    tasks = [asyncio.ensure_future(get_content(url)) for url in full_urllist]
    results = loop.run_until_complete(asyncio.wait(tasks))
Source code after the review: https://codechina.csdn.net/hihell/scrapy (Case 11)
Case 12: Banciyuan (bcy.net) COS crawler
First, test the target website; it still opens. No need to panic, at least the site is still there.
Next, analyze the data request interface. It has changed, and now looks like this:
https://bcy.net/apiv3/rank/list/itemInfo?p=1&ttype=cos&sub_type=week&date=20210726&_signature=Rt2KvAAA*****bdiqAACYm
https://bcy.net/apiv3/rank/list/itemInfo?p=2&ttype=cos&sub_type=week&date=20210726&_signature=Rt2KvAAA*****bdiqAACYm
https://bcy.net/apiv3/rank/list/itemInfo?p=3&ttype=cos&sub_type=week&date=20210726&_signature=Rt2KvAAA*****bdiqAACYm
Note that the address above has three core parameters: p is the page number, date is the date, and _signature appears to be an authorization token.
It turns out the data can still be retrieved after removing the last parameter (_signature), which makes things much simpler.
Use the interface response to decide when to switch dates: once a date has no more data, the top_list_item_info field in the JSON below comes back empty, and at that point you can move on to another date (see the sketch after the JSON).
{ "code": 0, "msg": "", "data": { "top_list_item_info": [] } }
Summary of today's review
In today's review, I found that some small knowledge points had been omitted due to a lack of writing experience at the time. Those omitted details have been compiled into the newer "120 Python crawler cases" series, where they will be filled in gradually.
A conscientious blogger who hasn't missed an update in three years.
Time to bookmark
Let's attempt an impossible task: once bookmarks reach 400, Eraser will reply to everyone in the comment section and send out a mysterious code.
Today is day 191 of 200 in my continuous-writing challenge.
You are welcome to follow me, like, comment, and bookmark.