1, Requirements analysis
Our goal is to crawl the 24-hour hot sales list of novels on the Qidian website: https://www.qidian.com/rank/hotsales?style=1
As our first crawler, we only crawl the first page, extracting each novel's name, author, genre, and serialization status.
2, Create project
Open the command line, switch to the directory where you want to save the project, and enter scrapy startproject qidian_hot to create a project named qidian_hot.
Open that directory, and you can see that a qidian_hot folder has been generated, in which the project's files are saved.
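The layout generated by Scrapy's default template looks roughly like this (file names come from the template; only the spiders directory matters for now):

```
qidian_hot/
    scrapy.cfg            # deployment configuration
    qidian_hot/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # our crawler files go here
            __init__.py
```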
3, Analyze the page
Use Chrome or another browser with developer tools to open the 24-hour hot sales list of Qidian: https://www.qidian.com/rank/hotsales?style=1
Then open the developer tools.
Refresh the page to see the HTML code of the page response.
Click the element-picker arrow in the developer tools toolbar, hover over a novel's information block, and click; the code panel will automatically jump to the HTML corresponding to that information.
We can see that there are four tags holding this novel's information.
Expand the tags to see their contents. Next, we just need to create a crawler that fetches the response and extracts the information we want.
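As a rough illustration (the real markup may have changed since this was written), each novel's block looked something like this, with the four tags we care about inside it:

```html
<div class="book-mid-info">
    <h4><a href="...">Lord of the Mysteries</a></h4>
    <p class="author">
        <a class="name" href="...">Cuttlefish That Loves Diving</a>
        <a href="...">Fantasy</a>
        <span>Serialized</span>
    </p>
    ...
</div>
```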
4, Implement crawler
Open the crawler project we created earlier and create the crawler file (qidian_spider.py) in the spiders directory. The following imports are required:
```python
from scrapy import Request
from scrapy.spiders import Spider
```
Create a crawler class that inherits from Spider and contains the crawler name, the target page, and the parsing method:
```python
class HotSaleSpider(Spider):
    name = "hot"  # crawler name
    start_urls = ["https://www.qidian.com/rank/hotsales?style=1"]  # target URL

    def parse(self, response):  # parses the response data
        pass
```
Notice that the parse method takes a response parameter. This is because the crawler framework fetches the page response for us automatically: we don't need to write any code to send a request, because Scrapy has already sent it internally and hands the response straight to parse.
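This is also why Request is imported above even though our code never calls it directly. Behind the scenes, Scrapy's default behavior is roughly equivalent to the following simplified sketch of a start_requests method:

```python
# Simplified sketch of what Scrapy does automatically: build a Request
# for each start URL and route its response to the parse callback.
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, callback=self.parse)
```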
The following figure shows the architecture of the Scrapy framework. What we need to complete is the Spider part: extracting information once the page response arrives. Steps 1 to 6 shown in the figure are completed inside the framework itself.
So we focus on the parse method. Based on our earlier page analysis, we use the xpath method to extract the information.
First, we locate the elements with class="book-mid-info"; each novel's information is saved under a div of this class.
```python
list_selector = response.xpath("//div[@class='book-mid-info']")
```
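Before writing the extraction loop, you can check the XPath interactively in the Scrapy shell (assuming the page is reachable from your machine):

```
scrapy shell "https://www.qidian.com/rank/hotsales?style=1"
>>> response.xpath("//div[@class='book-mid-info']")
>>> response.xpath("//div[@class='book-mid-info']/h4/a/text()").extract()
```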
Then we traverse the list to extract each novel's fields. Locating them works the same way: inspect the tag that holds each piece of information. The novel's title ("Lord of the Mysteries" in the screenshot) is saved in the a tag under the h4 tag of the book-mid-info div, and the author ("Cuttlefish That Loves Diving") is saved in the first a tag under the p tag.
```python
for one_selector in list_selector:
    # extract the novel's information
    name = one_selector.xpath("h4/a/text()").extract()[0]  # text under the current node
    author = one_selector.xpath("p[1]/a[1]/text()").extract()[0]
    type = one_selector.xpath("p[1]/a[2]/text()").extract()[0]
    form = one_selector.xpath("p[1]/span/text()").extract()[0]
```
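Note that extract()[0] raises an IndexError when an XPath matches nothing. If you want a more defensive variant, Scrapy's extract_first accepts a default value; a sketch of the same extraction:

```python
# extract_first returns the default instead of raising when nothing matches
name = one_selector.xpath("h4/a/text()").extract_first(default="")
author = one_selector.xpath("p[1]/a[1]/text()").extract_first(default="")
```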
Finally, we save each novel's information in a dictionary and return it through a generator:
```python
    # save the novel's information as a dictionary
    hot_dict = {
        "name": name,
        "author": author,
        "type": type,
        "form": form,
    }
    yield hot_dict  # the generator yields each novel's information
```
For the meaning of the yield keyword, please refer to: Python keyword: yield generator.
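As a minimal illustration: a function containing yield becomes a generator, producing values one at a time instead of returning once:

```python
def count_up(n):
    for i in range(n):
        yield i  # pause here and hand back one value at a time

for value in count_up(3):
    print(value)  # prints 0, 1, 2
```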
Full code:
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Time   : 2020/2/20 11:45
# @Author : bz
# @File   : qidian_spider.py
from scrapy import Request
from scrapy.spiders import Spider


class HotSaleSpider(Spider):
    name = "hot"  # crawler name
    start_urls = ["https://www.qidian.com/rank/hotsales?style=1"]  # target URL

    def parse(self, response):  # parses the response data
        # use XPath to locate each novel's info block
        list_selector = response.xpath("//div[@class='book-mid-info']")
        for one_selector in list_selector:
            # extract the novel's information
            name = one_selector.xpath("h4/a/text()").extract()[0]  # text under the current node
            author = one_selector.xpath("p[1]/a[1]/text()").extract()[0]
            type = one_selector.xpath("p[1]/a[2]/text()").extract()[0]
            form = one_selector.xpath("p[1]/span/text()").extract()[0]
            # save the novel's information as a dictionary
            hot_dict = {
                "name": name,
                "author": author,
                "type": type,
                "form": form,
            }
            yield hot_dict  # the generator yields each novel's information
```
5, Run crawler
Go to the folder where you created the project, then run the crawler and save the results to qidianhot.csv by entering scrapy crawl hot -o qidianhot.csv
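Scrapy infers the export format from the file extension, so the same command can save other formats too, for example:

```
scrapy crawl hot -o qidianhot.json
scrapy crawl hot -o qidianhot.xml
```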
After the crawler finishes running, open qidianhot.csv to see the crawled information.