One article, four websites to collect: Sunshine Administration, Picture Insect network, Book Companion network, and Semi-Dimensional network

Posted by jmarais on Fri, 14 Jan 2022 05:32:53 +0100

Article 3 in the review series for the "100 Crawler Cases" column

Case 9: Collecting data from the complaint section of Hebei Sunshine Administration

Unfortunately, the original website is no longer accessible. The new module introduced in this case is lxml, so the case is mainly an exercise in learning that module.

Since the original section cannot be reached, this review switches to the site's "truth" channel instead.

In the original case, the collected data was stored in MongoDB. This review stops once the data has been captured; for the storage part, refer to the original case.

import random

import requests
from lxml import etree  # Import etree from lxml

# Pool of User-Agent strings
ua = ['Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
      'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36']
# Assumption: the headers dict is truncated in the source; a random User-Agent is the likely intent
headers = {"User-Agent": random.choice(ua)}

# Test loop over pages 1 to 9
for i in range(1, 10):
    # The base URL is elided in the source; only the page suffix survives
    response = requests.get(f"{i}.html", headers=headers)
    html = response.content.decode("utf-8")

    tree = etree.HTML(html)  # Parse the HTML
    divs = tree.xpath('//div[@class="listcon"]')  # Select the list-area divs
    for div in divs:  # Loop over each entry
        # The XPath lookups below are relative to div; try/except reports entries that fail
        try:
            shouli = div.xpath('span[1]/p/a/text()')[0]  # Accepting unit
            content = div.xpath('span[2]/p/a/text()')[0]  # Complaint content
            datetime = div.xpath('span[3]/p/text()')[0].replace("\n", "")  # Time
            status = div.xpath('span[5]/p/text()')[0].replace("\n", "")  # Status
            one_data = {"shouli": shouli, "content": content,
                        "datetime": datetime, "status": status}
            print(one_data)  # Print the record; this is where it could be stored in MongoDB
        except Exception as e:
            print("Internal data error")

The target website has switched from asynchronous requests to synchronous data loading. The original anti-crawling limit has also been removed, and the site now loads noticeably faster, which is great.
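
Because the live site may change again, the relative XPath lookups from Case 9 can be tried offline against a small hand-written fragment. The following is a minimal sketch: the HTML sample and the `first_text` and `parse_rows` helpers are assumptions made for illustration, shaped only by the XPath expressions in the case code and not by the real page.

```python
from lxml import etree

# Hypothetical sample markup, modeled on the XPath expressions used in Case 9
SAMPLE_HTML = """
<html><body>
<div class="listcon">
  <span><p><a>Unit A</a></p></span>
  <span><p><a>Street light broken</a></p></span>
  <span><p>2022-01-10
</p></span>
  <span><p>ignored</p></span>
  <span><p>Replied
</p></span>
</div>
<div class="listcon">
  <span><p><a>Unit B</a></p></span>
</div>
</body></html>
"""


def first_text(node, path):
    """Return the first text match for a relative XPath, or None if nothing matches."""
    found = node.xpath(path)
    return found[0].replace("\n", "").strip() if found else None


def parse_rows(html):
    """Extract one dict per listcon div; missing fields become None instead of raising."""
    tree = etree.HTML(html)
    rows = []
    for div in tree.xpath('//div[@class="listcon"]'):
        rows.append({
            "shouli": first_text(div, 'span[1]/p/a/text()'),
            "content": first_text(div, 'span[2]/p/a/text()'),
            "datetime": first_text(div, 'span[3]/p/text()'),
            "status": first_text(div, 'span[5]/p/text()'),
        })
    return rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```

Returning None for a missing field, rather than indexing `[0]` directly, keeps one incomplete entry from discarding the whole record.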

Code download

Case 10: image crawler

This case mainly uses Queue. The original blog post was missing some of the global variables it relied on, and one variable name was misspelled. Both problems have been fixed in this review.
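
As a rough illustration of the Queue pattern this case relies on, here is a minimal producer/consumer sketch using only the standard library. The `fake_download` function and the example URLs are hypothetical stand-ins; the original case's actual download logic is not reproduced here.

```python
import queue
import threading


def fake_download(url):
    """Hypothetical stand-in for a real image download; returns a fake file name."""
    return url.rsplit("/", 1)[-1]


def worker(tasks, results):
    """Consume URLs from the queue until a None sentinel arrives."""
    while True:
        url = tasks.get()
        if url is None:
            tasks.task_done()
            break
        results.append(fake_download(url))  # list.append is thread-safe in CPython
        tasks.task_done()


def run(urls, n_workers=3):
    tasks = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for url in urls:   # producer side: enqueue the work
        tasks.put(url)
    for _ in threads:  # one sentinel per worker so each one exits
        tasks.put(None)
    tasks.join()       # block until every queued item is marked done
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print(run([f"https://example.com/img/{i}.jpg" for i in range(5)]))
```

Passing the queue and the result list as arguments, instead of using globals, avoids the kind of missing-global problem mentioned for the original post.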

Original case address:

Source code after the fix: Case 10

Case 11: Zhou Read network crawler, reworked as a Book Companion network crawler

Sites like this mainly collect and share e-books, so they often run into copyright infringement issues. The site's owner has since moved to an official account and started charging, so the original case no longer works.

Reviving the case is simple: pick any online e-book platform, for example Book Companion network:

This case uses the asyncio and aiohttp modules. During the review I found that the original post did not introduce them in enough detail. Those details are covered systematically in the "120 crawlers" series; in this review, the case is simply adjusted so that it runs again.

The core code is as follows. When reading it, focus on the tasks list. The data parsing part is left incomplete; it is basic lxml extraction.

import requests
from lxml import etree
# Import the coroutine-related modules
import asyncio
import aiohttp

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
           "Host": "",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"}

async def get_content(url):
    # Create a session to get data
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as res:
            if res.status == 200:
                source = await res.text()  # Wait for the response body
                tree = etree.HTML(source)
                await async_content(tree)

async def async_content(tree):

    title = tree.xpath("//h1[@class='title']/a/text()")[0]
    # If there is no information on the page, you can return directly
    # if title == '':
    #     return
    # else:
    #     try:
    #         description = tree.xpath("//div[@class='hanghang-shu-content-font']")
    #         author = description[0].xpath("p[1]/text()")[0].replace("Author:", "") if description[0].xpath("p[1]/text()")[0] is not None else None
    #         cate = description[0].xpath("p[2]/text()")[0].replace("classification:", "") if description[0].xpath("p[2]/text()")[0] is not None else None
    #         douban = description[0].xpath("p[3]/text()")[0].replace("Douban score:", "") if description[0].xpath("p[3]/text()")[0] is not None else None
    #         # The content of this part is unclear, so it is not recorded
    #         #des = description[0].xpath("p[5]/text()")[0] if description[0].xpath("p[5]/text()")[0] is not None else None
    #         download = tree.xpath("//a[@class='downloads']")
    #     except Exception as e:
    #         print(title)
    #         return

    # ls = [
    #     title,author,cate,douban,download[0].get('href')
    # ]
    # return ls

if __name__ == '__main__':
    url_format = "{}.html"
    full_urllist = [url_format.format(i) for i in range(
        50773, 50783)]  # 10 consecutive IDs as a test; widen the range to collect more
    # Create the loop explicitly instead of relying on asyncio.get_event_loop() to create one
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    tasks = [loop.create_task(get_content(url)) for url in full_urllist]
    results = loop.run_until_complete(asyncio.wait(tasks))
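
The tasks pattern in this case can also be written with `asyncio.run` and `asyncio.gather`, with a semaphore added to cap concurrency. This is a sketch only: `fake_fetch` simulates a request with `asyncio.sleep` instead of calling aiohttp, and the limit of 3 is an arbitrary assumption.

```python
import asyncio


async def fake_fetch(url, sem):
    """Simulated request: the semaphore caps how many run at the same time."""
    async with sem:
        await asyncio.sleep(0.01)  # stands in for the network round trip
        return f"fetched {url}"


async def crawl(urls, limit=3):
    sem = asyncio.Semaphore(limit)  # created inside the running loop
    # gather schedules all coroutines and returns results in input order
    return await asyncio.gather(*(fake_fetch(u, sem) for u in urls))


if __name__ == "__main__":
    # Same elided URL pattern and ID range as in the case code
    urls = [f"{i}.html" for i in range(50773, 50783)]
    for line in asyncio.run(crawl(urls)):
        print(line)
```

Capping concurrency is a courtesy to the target site as much as a stability measure; in a real crawler the body of `fake_fetch` would hold the aiohttp request.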

Source code after the fix: Case 11

Case 12: COS crawler for the Semi-Dimensional website

First, a quick test shows that the target website still opens. No need to worry: at least it is still there.

Next, analyzing the data request interface shows that it has changed, as follows: *****bdiqAACYm*****bdiqAACYm*****bdiqAACYm

Note that the address above has three core parameters: p is the page number, date is the date, and _signature is presumably used for authorization.

Testing shows that the data can still be fetched after the last parameter (_signature) is removed, which makes things much simpler.

Whether to switch to another date is decided from the interface data. When there is no more data, top_list_item_info in the JSON below comes back empty, and the date can be switched at that point.
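
A request URL carrying only the two remaining parameters could be assembled as follows. The base URL is a placeholder (the real interface address is elided in this article), `build_url` is a hypothetical helper, and the `YYYYMMDD` date format is an assumption; only the parameter names `p` and `date` come from the text.

```python
from urllib.parse import urlencode

# Placeholder: the real interface address is not shown in the article
BASE_URL = "https://example.com/api/toppost"


def build_url(page, date, base=BASE_URL):
    """Build the request URL with p (page number) and date, omitting _signature."""
    return f"{base}?{urlencode({'p': page, 'date': date})}"


if __name__ == "__main__":
    print(build_url(1, "20220113"))
```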

{
  "code": 0,
  "msg": "",
  "data": {
    "top_list_item_info": []
  }
}
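
The switching rule for Case 12 can be sketched as a loop that pages through one date until `top_list_item_info` is empty and then moves to another day. The `fetch_page` argument is a hypothetical callable that returns the decoded JSON; stepping backwards one day at a time and the `YYYYMMDD` date format are assumptions, not confirmed by the interface.

```python
from datetime import date, timedelta


def collect(fetch_page, start, days):
    """Page through each date until top_list_item_info comes back empty."""
    items = []
    current = start
    for _ in range(days):
        page = 1
        while True:
            payload = fetch_page(page, current.strftime("%Y%m%d"))
            batch = (payload.get("data") or {}).get("top_list_item_info") or []
            if not batch:  # empty list (or null): this date is exhausted
                break
            items.extend(batch)
            page += 1
        current -= timedelta(days=1)  # switch to the previous day
    return items


if __name__ == "__main__":
    # Stub interface: two pages of data per date, then an empty list
    def stub(page, day):
        data = [f"{day}-p{page}"] if page <= 2 else []
        return {"code": 0, "msg": "", "data": {"top_list_item_info": data}}

    print(collect(stub, date(2022, 1, 13), days=2))
```

Injecting `fetch_page` keeps the loop testable without network access; a real version would wrap a `requests.get(...).json()` call.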

Summary of today's review

Today's review showed that some small knowledge points were left out of the original posts because of my limited writing experience at the time. These omitted details have been gathered into the new "120 Python crawler cases" series and will be filled in gradually.

A conscientious blogger who has stayed online for three years.

Bookmark time

An impossible mission: once this post reaches 400 bookmarks, Eraser will reply to everyone in the comment section and send out a mystery code.

Today is day 191 of 200 of continuous writing.
Feel free to follow, like, comment, and bookmark.


Topics: Python Python crawler