Quickly crawling and storing Qiushibaike (Embarrassing Encyclopedia) data and crawling multiple pages with the Scrapy framework [Advanced introduction to Python crawlers] (17)

Posted by nicky77uk1 on Mon, 10 Jan 2022 01:27:15 +0100

Hello, I'm Manon Feige. Thank you for reading this article.

Preface

The last article gave a brief introduction to crawling with the Scrapy framework: Quick start to the Scrapy framework, taking Qiushibaike as an example [advanced introduction to python crawler] (16). However, that article only covered crawling data from a single page; saving the data and crawling multiple pages were not covered. This article introduces data storage and crawling multiple pages.

Returning data

import scrapy


class SpiderQsbkSpider(scrapy.Spider):
    # Name that identifies the crawler
    name = 'spider_qsbk'
    # Domains the crawler is allowed to crawl
    allowed_domains = ['qiushibaike.com']
    # Start page
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        print(type(response))
        # SelectorList
        div_list = response.xpath('//div[@class="article block untagged mb15 typs_hot"]')
        print(type(div_list))
        for div in div_list:
            # Selector
            author = div.xpath('.//h2/text()').get().strip()
            print(author)
            content = div.xpath('.//div[@class="content"]//text()').getall()
            content = "".join(content).strip()
            duanzi = {'author': author, 'content': content}
            yield duanzi

Here we mainly look at the following two lines of code: the author and content are put into a dictionary, which is then yielded, so parse returns its data as a generator.

 duanzi = {'author': author, 'content': content}
 yield duanzi

This is roughly equivalent to collecting the items into a list and returning it at the end:

    items = []
    items.append(item)
    return items
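
To make the generator point concrete, here is a tiny standalone illustration (plain Python, not Scrapy code; the parse_as_list and parse_as_generator names and the sample rows are made up for this sketch) showing that yielding inside a loop produces the same items as building and returning a list:

def parse_as_list(rows):
    # Collect everything in a list and return it once at the end
    items = []
    for author, content in rows:
        items.append({'author': author, 'content': content})
    return items


def parse_as_generator(rows):
    # Yield each item as soon as it is built; the caller iterates lazily
    for author, content in rows:
        yield {'author': author, 'content': content}


rows = [('Alice', 'joke one'), ('Bob', 'joke two')]
assert parse_as_list(rows) == list(parse_as_generator(rows))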

Data storage

In the Scrapy framework, the code for storing data lives in pipelines.py; that is, the pipeline receives the items returned by the spider.

QsbkPipeline

The QsbkPipeline class has three methods:

  1. open_spider(self, spider): called when the spider is opened.
  2. process_item(self, item, spider): called for every item passed on by the spider.
  3. close_spider(self, spider): called when the spider is closed.

To activate the pipeline, it must be registered under ITEM_PIPELINES in settings.py, as shown below.
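A minimal settings.py entry might look like the following sketch (the qsbk package name is an assumption based on the class names in this article; use your own project's package name):

# settings.py ("qsbk" is an assumed package name; adjust it to your project)
ITEM_PIPELINES = {
    # The value (0-1000) controls the order in which pipelines run; lower runs first
    'qsbk.pipelines.QsbkPipeline': 300,
}

With the pipeline registered, the QsbkPipeline used in this article is: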
import json


class QsbkPipeline:
    def __init__(self):
        self.fp = open('duanzi.json', 'w', encoding='utf-8')

    # Called when the spider is opened
    def open_spider(self, spider):
        print('The crawler has started.....')

    def process_item(self, item, spider):
        item_json = json.dumps(item, ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        self.fp.close()
        print('The crawler has finished.....')

The constructor opens a duanzi.json file, which is used to store the items passed from the spider. The json.dumps method converts the dictionary into a JSON string, and self.fp.write(item_json + '\n') writes each JSON string on its own line.

Data transfer optimization - using Item as the data model

The item returned by the spider above is a plain dictionary, which Scrapy does not recommend. The recommended approach is to encapsulate the data in a scrapy.Item subclass. In this example the Item class is QsbkItem:

import scrapy
class QsbkItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

Then, in the spider SpiderQsbkSpider, the data is wrapped in a QsbkItem object before being returned.

  item = QsbkItem(author=author, content=content)
  # Return as generator
  yield item

The advantage of this approach is that it standardizes the data structure of the item, and the code looks more concise.
In QsbkPipeline, only a small change is needed when serializing the data: item_json = json.dumps(dict(item), ensure_ascii=False), because a scrapy.Item must be converted to a dict before json.dumps can serialize it.
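
Put together, the adjusted pipeline might look like this (a minimal sketch; compared with the earlier version, only process_item actually changes):

import json


class QsbkPipeline:
    def __init__(self):
        self.fp = open('duanzi.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # A scrapy.Item is not directly JSON-serializable, so convert it to a dict first
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()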

Data storage optimization - using Scrapy's exporter classes

Using the JsonItemExporter class

The first option is the JsonItemExporter class. It writes in binary form, so the file passed to it must be opened in 'wb' mode. The encoding is set to utf-8 and ensure_ascii is disabled, so Chinese text is written as-is instead of being ASCII-escaped.

from scrapy.exporters import JsonItemExporter


class QsbkPipeline:
    def __init__(self):
        self.fp = open('duanzi.json', 'wb')
        self.exporter = JsonItemExporter(self.fp, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    # Called when the spider is opened
    def open_spider(self, spider):
        print('The crawler has started.....')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
        print('The crawler has finished.....')

self.exporter.start_exporting() must be called before the spider starts exporting items, and self.exporter.finish_exporting() must be called after the spider finishes to complete the export.
After running, duanzi.json contains all the items as a single JSON array.

The way this exporter works is that each call to export_item appends the item to an in-memory list, and only when finish_exporting is called is that list written to the file in one go. With a large amount of data this consumes a lot of memory, so there is a better way.

Using the JsonLinesItemExporter class

With JsonLinesItemExporter, each call to export_item writes the item to disk immediately. There is no need to call the start_exporting and finish_exporting methods.

from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline:
    def __init__(self):
        self.fp = open('duanzi.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, encoding='utf-8', ensure_ascii=False)

    # Called when the spider is opened
    def open_spider(self, spider):
        print('The crawler has started.....')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        self.fp.close()
        print('The crawler has finished.....')

Summary

When saving JSON data, the JsonItemExporter and JsonLinesItemExporter classes make the operation easier.

  1. JsonItemExporter: items are collected in memory and written to disk only at the end. The advantage is that the stored data is a single document that satisfies the JSON format; the disadvantage is that a large amount of data consumes a lot of memory.
  2. JsonLinesItemExporter: each call to export_item writes the item to disk immediately. The disadvantage is that each dictionary sits on its own line, so the file as a whole is not a valid JSON document. The advantage is that every item is saved to disk as soon as it is processed, so memory use stays low and the data is safer.

Of course, the Scrapy framework also provides other exporter classes such as XmlItemExporter and CsvItemExporter.
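
As a quick illustration, a pipeline built on CsvItemExporter would look very similar to the ones above (a sketch only; the fields_to_export list assumes the QsbkItem fields defined in this article, and the QsbkCsvPipeline class name and duanzi.csv file name are made up here):

from scrapy.exporters import CsvItemExporter


class QsbkCsvPipeline:
    def __init__(self):
        # Exporters expect a file opened in binary mode
        self.fp = open('duanzi.csv', 'wb')
        self.exporter = CsvItemExporter(self.fp, encoding='utf-8',
                                        fields_to_export=['author', 'content'])
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()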

Crawling multiple pages

Single-page crawling and storage are now done; the next step is to crawl multiple pages. To do that, we get the link to the next page from the current page and keep following it until the last page.

import scrapy
# Adjust the package name below to your own project; "qsbk" is assumed here
from qsbk.items import QsbkItem


class SpiderQsbkSpider(scrapy.Spider):
    base_domain = 'https://www.qiushibaike.com'
    # Name that identifies the crawler
    name = 'spider_qsbk'
    # Domains the crawler is allowed to crawl
    allowed_domains = ['qiushibaike.com']
    # Start page
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        print(type(response))
        # SelectorList
        div_list = response.xpath('//div[@class="article block untagged mb15 typs_hot"]')
        print(type(div_list))
        for div in div_list:
            # Selector
            author = div.xpath('.//h2/text()').get().strip()
            print(author)
            content = div.xpath('.//div[@class="content"]//text()').getall()
            content = "".join(content).strip()
            item = QsbkItem(author=author, content=content)
            # Return as generator
            yield item
        # Get the link to the next page of the current page
        next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
        if not next_url:
            return
        else:
            yield scrapy.Request(self.base_domain + next_url, callback=self.parse)

Here we mainly look at the following code.

  next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
  if not next_url:
      return
  else:
      yield scrapy.Request(self.base_domain + next_url, callback=self.parse)

The next-page link extracted by the crawler is a relative path such as /text/page/2/, so the domain name has to be prepended to form a complete URL.
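
As an aside, response.urljoin can do this joining for you, so the hard-coded base_domain is not strictly needed; a minimal variant of the same logic:

        next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
        if next_url:
            # urljoin resolves the relative path against the URL of the current response
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)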

The spider loops by yielding scrapy.Request(self.base_domain + next_url, callback=self.parse) to request the next page; the callback parameter specifies the method that will handle the response.
You also need to set DOWNLOAD_DELAY = 1 in settings.py so that there is a one-second delay between downloads.
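
The corresponding setting in settings.py looks like this:

# settings.py
DOWNLOAD_DELAY = 1  # wait one second between consecutive requests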

Summary

This article briefly introduced how to store the data crawled from Qiushibaike and how to crawl multiple pages with the Scrapy framework.


Topics: Python crawler scrapy