# Introduction to Python Crawlers: Item Pipeline (with code to download a website's images locally)

Posted by SheDesigns on Thu, 06 Jan 2022 08:24:04 +0100

1 Item Pipeline

After a spider crawls an item, the item is sent to the Item Pipeline, where it is processed sequentially by several components. Each Item Pipeline component is a Python class that implements a simple method: it receives an item, performs an operation on it, and decides whether the item should continue through the pipeline or be dropped and processed no further.
The typical uses of Item Pipeline are:
1. Clean up HTML data
2. Verify the crawled data (check whether items contain certain fields)
3. Check for duplicates (and drop them)
4. Store the crawled item in a database

1.1 write your own Item Pipeline

Each Item Pipeline is a Python class that must implement the following method:
process_item(self, item, spider)
This method is called for every Item Pipeline component. process_item() must either return a dict, return an Item (or any subclass) object, return a Twisted Deferred, or raise a DropItem exception; a dropped item is not processed by any further Item Pipeline components.
Parameter meaning:
item: the Item object or dict that was crawled
spider: the Spider object that crawled this item
In addition, a pipeline may implement the following methods:
open_spider(self, spider)
Called when the spider is opened. The spider parameter is the spider that was opened.
close_spider(self, spider)
Called when the spider is closed.
from_crawler(cls, crawler)
If present, this class method is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all the core Scrapy components, such as settings and signals; this is how a pipeline can access them and hook its functionality into Scrapy.
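
Putting these hooks together, a minimal skeleton pipeline might look like the sketch below. The class name StatsPipeline and the PIPELINE_TAG setting are made up for illustration; only process_item() is mandatory.

class StatsPipeline(object):

    def __init__(self, tag):
        self.tag = tag
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Read a value from the project settings through the crawler
        return cls(tag=crawler.settings.get('PIPELINE_TAG', 'default'))

    def open_spider(self, spider):
        # Called once when the spider is opened
        self.count = 0

    def close_spider(self, spider):
        # Called once when the spider is closed
        spider.logger.info("[%s] processed %d items", self.tag, self.count)

    def process_item(self, item, spider):
        # Called for every item; must return the item or raise DropItem
        self.count += 1
        return item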

1.2 Pipeline example

1.2.1 price verification example

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

1.2.2 write json file

The following pipeline stores all scraped items (from all spiders) in a single items.jl file, with one JSON-serialized item per line:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

1.2.3 write to MongoDB

In this example, we will use pymongo to write items to MongoDB. The MongoDB address and database name are specified in the Scrapy settings, and the MongoDB collection is named after the item class. The main purpose of this example is to show how to use the from_crawler() method and how to clean up resources properly.

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
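
To wire this pipeline up, you would define the MongoDB settings it reads in from_crawler() and register the class in ITEM_PIPELINES. The URI and the myproject.pipelines module path below are placeholder values, not part of the original example:

# settings.py (placeholder values)
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'items'

ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}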

1.2.4 Duplicates filter

A filter that looks for duplicate items and drops those that have already been processed. Assume that our items have a unique id, but our spider returns multiple items with the same id:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

1.2.5 activate Item Pipeline

To activate an Item Pipeline component, you must add it to the ITEM_PIPELINES setting, for example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

In this setting, the integer value assigned to each class determines the order in which they run: items pass through the pipelines from lower to higher values. The integer values are conventionally chosen in the 0-1000 range.

2 Feed exports

One of the features most frequently needed when running a crawl is to store the scraped data properly. Scrapy provides this out of the box, allowing a feed to be generated in several serialization formats.

2.1 serialization format

The main formats available for serializing the scraped data are:

  • JSON
  • JSON lines
  • CSV
  • XML

You can also extend the supported formats through the FEED_EXPORTERS setting (a sketch follows the format list below).
JSON
FEED_FORMAT: json
Class used: JsonItemExporter
JSON lines
FEED_FORMAT: jsonlines
Class used: JsonLinesItemExporter
CSV
FEED_FORMAT: csv
Class used: CsvItemExporter
XML
FEED_FORMAT: xml
Class used: XmlItemExporter
Pickle
FEED_FORMAT: pickle
Class used: PickleItemExporter
Marshal
FEED_FORMAT: marshal
Class used: MarshalItemExporter
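
As a rough sketch, registering your own exporter class through FEED_EXPORTERS could look like the following; the format name and module path are hypothetical:

# settings.py -- map a new format name to a custom exporter class (hypothetical path)
FEED_EXPORTERS = {
    'myformat': 'myproject.exporters.MyItemExporter',
}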

2.2 Usage

Enter the project directory and execute the command:

scrapy crawl tushu -o tushu.json

Use the -o option followed by the output file name; the output format is deduced from the file extension.
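
For example, with the same tushu spider, changing the file extension changes the output format:

scrapy crawl tushu -o tushu.csv    # CSV output
scrapy crawl tushu -o tushu.xml    # XML output
scrapy crawl tushu -o tushu.jl     # JSON lines output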

3 Downloading and processing files and images

Scrapy provides reusable item pipelines for downloading files attached to particular items (for example, when you scrape products and want to download their images locally). These pipelines share some functionality and structure (we call them media pipelines), but you will typically use either the Files Pipeline or the Images Pipeline.
Both pipelines implement these features:

  • Avoid re-downloading recently downloaded media
  • Specify where to store the media (file system directory, etc.)

The Images Pipeline has some additional functions for processing images:

  • Convert all downloaded images to a common format (JPG) and mode (RGB)
  • Generate thumbnails
  • Check the image width/height to make sure they meet the minimum size constraints

The pipelines keep an internal queue of the media URLs currently being downloaded and connect responses that refer to the same media to that queue, which avoids downloading the same media more than once when it is shared by several items.

3.1 using Files Pipeline

The typical workflow for using the Files Pipeline is as follows:
1. In a spider, you scrape an item and put the desired URLs into its file_urls field;
2. The item is returned from the spider and enters the Item Pipeline;
3. When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, so they are processed before other pages are crawled. The item remains "locked" in this particular pipeline until the downloads complete (or fail for some reason).
4. When the files have been downloaded, another field (files) is populated with the results. This field contains a list of dicts with information about each downloaded file, such as the download path, the original URL (taken from the file_urls field) and the file checksum. The files in the files field keep the same order as the original file_urls field. If a file fails to download, the error is logged and that file does not appear in the files field. (A minimal item sketch follows this list.)
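
As a minimal sketch, an item meant for the Files Pipeline only needs the two fields mentioned above (the class name MyFileItem is invented for illustration):

import scrapy

class MyFileItem(scrapy.Item):
    # URLs to download, filled in by the spider
    file_urls = scrapy.Field()
    # Download results, filled in by the Files Pipeline
    files = scrapy.Field()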

3.2 using Images Pipeline

Using the Images Pipeline is much like using the Files Pipeline, except that the default field names differ: it reads the image URLs of an item from image_urls and fills an images field with information about the downloaded images.
The advantage of using the ImagesPipeline for image files is that you can configure some extra features, such as generating thumbnails and filtering images by their size.
The Images Pipeline uses Pillow to normalize images to JPEG/RGB format and to generate thumbnails, so you also need to install Pillow. PIL works in most cases, but it is known to cause trouble in some setups, so we recommend using Pillow instead.
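
If Pillow is not already installed, it can usually be added with pip:

pip install Pillow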

3.3 using Media Pipeline

If you want to use a Media Pipeline, you must add it to the ITEM_PIPELINES setting of your project. For the Images Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

For the Files Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}

Note: Images Pipeline and Files Pipeline can be used at the same time.

Then configure the target storage setting to a valid value that will be used to store the downloaded files or images; otherwise the pipeline remains disabled, even if it is listed in ITEM_PIPELINES.
For the Files Pipeline, add FILES_STORE to the settings:

FILES_STORE = '/path/to/valid/dir'

For the Images Pipeline, add IMAGES_STORE to the settings:

IMAGES_STORE = '/path/to/valid/dir'

3.4 supported storage

At present, the only officially supported storage is the file system, but backends such as Amazon S3 and Google Cloud Storage are also supported.
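
For example, the storage setting can point at a bucket instead of a local directory. The bucket names and project id below are placeholders; for Google Cloud Storage the project id is set separately:

# Amazon S3 (placeholder bucket)
IMAGES_STORE = 's3://my-bucket/images/'

# Google Cloud Storage (placeholder bucket and project id)
# IMAGES_STORE = 'gs://my-bucket/images/'
# GCS_PROJECT_ID = 'my-project-id'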

3.5 examples

1. To use the media pipeline, first enable and configure it in the settings:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

2. Then define the images and image_urls fields:

import scrapy

class MyItem(scrapy.Item):

    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()

3. Add the download path to the settings:

# Image download storage path
IMAGES_STORE = 'E:\\'

To avoid re-downloading recently downloaded files, you can set FILES_EXPIRES or IMAGES_EXPIRES to configure the cache expiration time (in days):

# Expires after 120 days
FILES_EXPIRES = 120

# Expires in 30 days
IMAGES_EXPIRES = 30

The Images Pipeline can automatically create thumbnails of the downloaded images; add the IMAGES_THUMBS setting, a dictionary whose keys are the thumbnail names and whose values are their dimensions:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

If you want to filter out small images, set IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH to specify the minimum image size:

IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

These values do not affect thumbnail generation.
With the configuration above, we can add image downloading to our crawler.

4 A small crawler example

After everything said above you may still feel confused, so let's walk through a small project to make things concrete. The website we want to crawl is the second-hand housing section of SouFun (Fang.com), from which we will download the pictures of the listed houses.

The goal is to fetch the images from the detail page of each listing in the page list.

4.1 enable pipeline

In settings.py, add the following:

# New content starts here#################################################################
# Enable the pipelines
ITEM_PIPELINES = {
    # Note: if you want to customize the image file name, comment out this entry, otherwise the custom name will not take effect
    'scrapy.pipelines.images.ImagesPipeline': 1,
    # After customizing the image name, uncomment this entry
    # 'sp.pipelines.SpDownimagePipeline': 200,
}
# Image storage path
IMAGES_STORE = 'E:\\'
# Images expire after 30 days
IMAGES_EXPIRES = 30
# Thumbnail settings
# IMAGES_THUMBS = {
#     'small': (50, 50),
#     'big': (270, 270),
# }
# Filter out small images
# IMAGES_MIN_HEIGHT = 110
# IMAGES_MIN_WIDTH = 110
# Allow media redirects
MEDIA_ALLOW_REDIRECTS = True
# Throttle: wait 3 seconds between downloads
DOWNLOAD_DELAY = 3
# Request user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
# The new content ends here#################################################################

4.2 configuring items

Define an image field for the page name and an image_urls field for the image links found in the crawled page. The items.py code is as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class SpItem(scrapy.Item):
    """
    Define the item fields
    """
    # Page name
    image = scrapy.Field()
    # Image links found in the page
    image_urls = scrapy.Field()

4.3 spider

Our spider ftx.py is as follows:

# -*- coding: utf-8 -*-
import scrapy
from sp.items import SpItem

class MyBlog(scrapy.Spider):
    name = 'ftx'
    start_urls = ['http://esf.fang.com']

    def parse(self, response):
        """
        Crawling initial start_urls List item url(Relative path),adopt response.follow generate
        request,Pass back function as parameter parse_item
        """
        # Get the link address (relative address) of all second-hand houses on the home page
        page_urls = response.css("p.title a::attr(href)").extract()
        for page_url in page_urls:
            # The relative address uses response Stitching URLs in follow mode
            request = response.follow(page_url, callback=self.parse_item)
            # If the connection obtained is an absolute address, use the following method
            # request = scrapy.Request(page_url, callback=self.parse_item)
            yield request

    def parse_item(self, response):
        """
        handle item function
        :param response: Requested page content
        :return: item
        """
        # Import item class
        item = SpItem()
        # Pictures in each personal page
        # Image is the title of each detail page, image_urls is the url of the picture in each detail page
        item['image'] = response.css("div.floatl::text").extract_first().strip()
        item['image_urls'] = response.css("img.loadimg::attr(data-src)").extract()
        yield item

4.4 custom image pipeline

Let's go straight to the code in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class SpDownimagePipeline(ImagesPipeline):
    """
    Custom picture download class
    """
    def get_media_requests(self, item, info):
        """
        ImagesPipeline Class method,Each image must be returned URL of Request
        :param item:Acquired item
        """
        # Get the picture url from item and send the request, image_urls is items Fields defined in PY
        for image_url in item['image_urls']:
            # The function of meta is to pass the value of item to the next function for use, which is similar to caching first
            yield scrapy.Request(image_url, meta={'item': item})

    def item_completed(self, results, item, info):
        """
        There is no change here, just ImagesPipeline The method must be returned item
        """
        if isinstance(item, dict) or self.images_result_field in item.fields:
            item[self.images_result_field] = [x for ok, x in results if ok]
        return item

    def file_path(self, request, response=None, info=None):
        """
        file_path by ImagePipeline The built-in method. Here we rewrite this method,
        In order to customize the name of the picture, if it is not rewritten, SHA1 hash Format, similar full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
        """
        # Get item from get_media_requests from the Request
        item = request.meta['item']
        # The name of the picture. After the first version is split ('/'), the last value is - 1. Here, - 1 is not used because the last field of the picture is not a random number
        # Is the length multiplied by the width, e.g. 452x340c Jpg, easy to duplicate, so use - 2, the penultimate field
        image_guid = request.url.split('/')[-2] + '.jpg'
        # Full name, including path and picture name
        fullname = "full/%s/%s" % (item['image'], image_guid)
        return fullname

Here is an explanation of the two ImagesPipeline methods, get_media_requests and item_completed:
get_media_requests(item, info)
The pipeline gets the image URLs from the item and downloads them, so we can override the get_media_requests method and return a Request for each URL:

def get_media_requests(self, item, info):
    for file_url in item['file_urls']:
        yield scrapy.Request(file_url)

These requests are processed by the pipeline. When the downloads complete, the results are sent to the item_completed method as a list of 2-element tuples. Each tuple contains (success, file_info_or_error), in a form similar to the following:

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]

success: a boolean; True if the file was downloaded successfully, False if it failed. file_info_or_error: a dict containing url, path and checksum if the download succeeded, or a Twisted Failure if there was a problem.

  • url is where the file was downloaded from; this is the URL of the request returned by get_media_requests
  • path is the file storage path
  • checksum is the MD5 hash of the image content

item_completed(results, item, info)
This method is called when all file requests for a single item have completed (finished downloading or failed). The results parameter is the list of results of the downloads triggered by get_media_requests. item_completed must return the output that will be sent to the next pipeline stage, so you must return (or drop) the item, just as in any other pipeline.
In the following example, we store the downloaded file paths (passed in results) in a file_paths item field, and drop the item if it contains no files.

from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    file_paths = [x['path'] for ok, x in results if ok]
    if not file_paths:
        raise DropItem("Item contains no files")
    item['file_paths'] = file_paths
    return item

The following is a complete example of a custom Images Pipeline:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

Topics: Python crawler