Crawling every joke on Qiushibaike (the "Embarrassing Encyclopedia") with Scrapy: a write-up

Posted by sebastienp on Tue, 01 Feb 2022 15:05:50 +0100

A detailed introduction to Scrapy, the Python crawler framework, and single-page crawling can be found in my earlier tutorial:

 

A beginner's tutorial on crawling Bilibili ("Station B") with Scrapy, and the result was totally unexpected!

 

Today let's jump straight into a hands-on exercise and crawl every joke on Qiushibaike. First, let's take a look at the results we end up with:

Console

 

JSON file

 

1. Determine the goal: open Qiushibaike and go to the text jokes section. We have five target fields this time: author name, author level, joke content, number of likes, and number of comments.

Home page link:

https://www.qiushibaike.com/text/
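
Before writing any code, you can optionally poke at the page with scrapy shell to confirm it is reachable and to explore its HTML. This is just a quick interactive check, not one of the steps below:

scrapy shell "https://www.qiushibaike.com/text/"

# inside the shell:
response.css('title::text').get()    # confirm the page downloaded
view(response)                       # open the downloaded HTML in a browser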

 

2. Create the project with the command:

scrapy startproject qiushibaike
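
For reference, startproject generates a project skeleton roughly like the one below (the exact files may vary slightly between Scrapy versions):

qiushibaike/
    scrapy.cfg            # deploy/run configuration
    qiushibaike/
        __init__.py
        items.py          # item definitions (step 4)
        middlewares.py
        pipelines.py      # item pipelines (step 6)
        settings.py       # project settings (step 8)
        spiders/
            __init__.py   # spider_bk.py will be generated here in step 3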

 

3. Then we use the command

scrapy genspider spider_bk www.qiushibaike.com/text/

to generate the spider_bk.py file, which will hold the actual crawler logic.
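
For reference, the generated spider_bk.py starts out as a bare skeleton along these lines (the exact template depends on your Scrapy version); we fill in parse() in step 5:

import scrapy

class SpiderBkSpider(scrapy.Spider):
    name = 'spider_bk'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        pass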

 

 

4. Next, let's implement the item class in the items.py file. The fields we want to collect are as follows.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class QiushibaikeItem(scrapy.Item):
    # define the fields for your item here like:
    #Author
    author = scrapy.Field()

    #Author level
    level = scrapy.Field()

    #Content
    context = scrapy.Field()

    #Number of people who agree
    star = scrapy.Field()

    #Number of comments
    comment = scrapy.Field()

 

5. Let's start by implementing single-page scraping in the spider file spider_bk.py.

import scrapy
from qiushibaike.items import QiushibaikeItem

class SpiderBkSpider(scrapy.Spider):
    name = 'spider_bk'
    allowed_domains = ['www.qiushibaike.com']    # domains only, no paths
    start_urls = ['https://www.qiushibaike.com/text/']


    def parse(self, response):

        # Instantiate the item
        item = QiushibaikeItem()

        # Get all joke <div> blocks on the current page
        divs = response.xpath("//div[@class='col1 old-style-col1']/div")

        for div in divs:
            item['author'] = div.xpath('./div[@class="author clearfix"]/a/h2/text()').get().strip()      #Author
            item['level'] = div.xpath('./div[@class="author clearfix"]/div/text()').get()        #Author level

            content = div.xpath(".//div[@class='content']//text()").getall()
            content = " ".join(content).strip()                     #Content
            item['context'] = content

            item['star'] = div.xpath('./div/span/i/text()').get()                               #Number of people who agree
            item['comment'] = div.xpath('./div/span/a/i/text()').get()                           #Number of comments

            yield item
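
Note that .get() returns None when an XPath matches nothing, and calling .strip() on None raises an AttributeError. If some posts lack a regular author block, a slightly more defensive version of that line (a sketch, keeping everything else the same) is:

item['author'] = (div.xpath('./div[@class="author clearfix"]/a/h2/text()').get() or '').strip()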

 

6. In pipelines.py we save the scraped data into the Embarrassing Encyclopedia.json file. To make the data easy to inspect, we print each item before saving it.

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json

class QiushibaikePipeline:
    def process_item(self, item, spider):
        print(item['author'])
        print(item['level'])
        print(item['context'])
        print(item['star'])
        print(item['comment'])

        #Save file locally
        with open('./Embarrassing Encyclopedia.json', 'a+', encoding='utf-8') as f:
            lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(lines)

        return item
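
Opening the file once per item works, but Scrapy pipelines also provide open_spider/close_spider hooks, so the file can be opened a single time per crawl. A minimal alternative sketch that produces the same one-JSON-object-per-line output:

import json

class QiushibaikePipeline:
    def open_spider(self, spider):
        # open the output file once when the crawl starts
        self.f = open('./Embarrassing Encyclopedia.json', 'a+', encoding='utf-8')

    def close_spider(self, spider):
        # close it when the crawl finishes
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item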

 

7. We create a main script that runs the crawler and prints the data to the console, which saves us from typing the command on the command line every time.

from scrapy import cmdline
cmdline.execute('scrapy crawl spider_bk -s LOG_FILE=all.log'.split())
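
Running this script is equivalent to typing the crawl command yourself in the project directory. Scrapy can also export items straight to a JSON file through its built-in feed export, which is an alternative to the pipeline above (the output file name here is just an example):

scrapy crawl spider_bk
scrapy crawl spider_bk -o items.json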

 

8. Next, enable the following settings in settings.py:

from fake_useragent import UserAgent

BOT_NAME = 'qiushibaike'

SPIDER_MODULES = ['qiushibaike.spiders']
NEWSPIDER_MODULE = 'qiushibaike.spiders'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'User-Agent': str(UserAgent().random),
}

ITEM_PIPELINES = {
    'qiushibaike.pipelines.QiushibaikePipeline': 300,
}
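
Two practical notes: fake_useragent is a third-party package (install it with pip install fake-useragent), and if you want to crawl more gently you can also throttle requests. These extra settings are optional and not part of the original configuration:

# optional: wait between requests and cap per-domain concurrency
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 8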

 

9. Finally, for the single-page crawl, we run the main script to print the data to the console.

 

You can see that the single-page data has been obtained successfully. Qiushibaike has 13 pages of jokes in total, and our goal is to get all the data from those 13 pages.

 

10. In the spider file spider_bk.py, we add a start_requests method that loops over all the pages to fetch the full data set.

    def start_requests(self):
        # Generate the request URL for each of the 13 pages
        for page in range(1, 13 + 1):
            url = 'https://www.qiushibaike.com/text/page/{}/'.format(str(page))    # page-turning link
            yield scrapy.Request(url, callback=self.parse)
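
Hard-coding 13 pages works for a snapshot of the site, but the page count can change. An alternative is to follow the site's "next page" link from parse() instead; the XPath below is only an assumption about the pager markup and would need to be checked against the live page:

    def parse(self, response):
        # ... yield the items exactly as before, then:
        next_href = response.xpath('//span[@class="next"]/parent::a/@href').get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)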

 

Run the main script again, and all 325 items from the 13 pages are saved locally.
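
Since the pipeline writes one JSON object per line, you can quickly verify the count by counting the lines of the output file (assuming the file name used in the pipeline above):

with open('./Embarrassing Encyclopedia.json', encoding='utf-8') as f:
    print(sum(1 for _ in f))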

Topics: Python xpath csv