1 Item Pipeline
After an item is scraped by the spider, it is sent to the Item Pipeline, where it is processed sequentially by several components. Each Item Pipeline component is a Python class that implements a simple method. It receives an item, performs an operation on it, and decides whether the item should continue through the pipeline or be dropped and no longer processed.
The typical uses of Item Pipeline are:
1. Clean up HTML data
2. Verify the crawled data (check whether items contain some fields)
3. Check for duplicates (and drop them)
4. Store the item data in the database
1.1 write your own Item Pipeline
Each Item Pipeline is a Python class that must implement the following methods:
process_item(self, item, spider)
This method is called for every Item Pipeline component. process_item() must either return a dict, return an Item (or any subclass) object, return a Twisted Deferred, or raise a DropItem exception; a dropped item is not processed by any further pipeline components.
Parameters:
item: the scraped item, an Item object or a dict
spider: the Spider object that scraped the item
In addition, a pipeline can implement the following methods:
open_spider(self, spider)
Called when the spider is opened. The spider parameter is the spider that was opened.
close_spider(self, spider)
Called when the spider is closed.
from_crawler(cls, crawler)
If present, this class method is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all core Scrapy components, such as settings and signals; this is how a pipeline can access them and hook its functionality into Scrapy.
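Putting these methods together, here is a skeletal sketch of a pipeline; the class name and the settings key MYPIPELINE_SETTING are made up for illustration:

class MyPipeline(object):

    def __init__(self, some_setting):
        self.some_setting = some_setting

    @classmethod
    def from_crawler(cls, crawler):
        # read a (hypothetical) setting from the crawler and build the instance
        return cls(some_setting=crawler.settings.get('MYPIPELINE_SETTING'))

    def open_spider(self, spider):
        pass  # acquire resources here (open files, connections, ...)

    def close_spider(self, spider):
        pass  # release those resources here

    def process_item(self, item, spider):
        return item  # or raise DropItem(...) to discard the item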
1.2 Pipeline example
1.2.1 price verification example
from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
1.2.2 write json file
The following pipeline stores all scraped items (from all spiders) into a single items.jl file, with one item per line serialized in JSON format:
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
1.2.3 write to MongoDB
In this example we use pymongo to write items to MongoDB. The MongoDB address and database name are specified in the Scrapy settings, and the MongoDB collection is named after the item class. The main point of this example is to show how to use the from_crawler() method and how to clean up resources properly.
import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
1.2.4 Duplicates filter
A filter that looks for duplicate items and drops those that have already been processed. Assume our items have a unique id, but our spider returns multiple items with the same id:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
1.2.5 activate Item Pipeline
To activate an Item Pipeline component, you must add it to the ITEM_PIPELINES setting, for example:
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
The integer values assigned to the classes in this setting determine the order in which they run: items pass through the pipelines from lower to higher values. The values are customarily chosen in the 0-1000 range.
2 Feed exports
One of the features most often needed when running a crawl is to store the scraped data properly. Scrapy provides this out of the box with Feed exports, which can generate a feed in several serialization formats.
2.1 serialization format
The serialization formats supported out of the box are mainly the following:
- JSON
- JSON lines
- CSV
- XML
You can also use the FEED_EXPORTERS setting to extend the supported formats.
JSON
FEED_FORMAT: json
Class used: JsonItemExporter
JSON lines
FEED_FORMAT: jsonlines
Class used: JsonLinesItemExporter
CSV
FEED_FORMAT: csv
Class used: CsvItemExporter
XML
FEED_FORMAT: xml
Class used: XmlItemExporter
Pickle
FEED_FORMAT: pickle
Class used: PickleItemExporter
Marshal
FEED_FORMAT: marshal
Class used: MarshalItemExporter
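As mentioned above, FEED_EXPORTERS maps a format name to an exporter class; a minimal sketch of registering a custom format (the module path myproject.exporters.MyCsvItemExporter is hypothetical):

FEED_EXPORTERS = {
    # override or add a format by pointing it at an ItemExporter subclass
    'csv': 'myproject.exporters.MyCsvItemExporter',
}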
2.2 Usage
Enter the project directory and execute the command:
scrapy crawl tushu -o tushu.json
Use the -o parameter followed by the output file name; the serialization format is inferred from the file extension.
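The same export can also be configured in settings.py instead of on the command line; a minimal sketch using the FEED_FORMAT/FEED_URI settings referenced above (the encoding line is optional):

# equivalent to "scrapy crawl tushu -o tushu.json"
FEED_URI = 'tushu.json'
FEED_FORMAT = 'json'
# write non-ASCII characters readably instead of as \u escapes
FEED_EXPORT_ENCODING = 'utf-8'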
3 download and process files and images
Scrapy provides reusable item pipelines for downloading files attached to particular items (for example, when you scrape products and want to download their images locally). These pipelines share a bit of functionality and structure (we call them media pipelines), but you will typically use either the Files Pipeline or the Images Pipeline.
Both pipelines implement these features:
- Avoid re-downloading recently downloaded media
- Specify where to store the media (a file system directory, etc.)
The Images Pipeline has some additional functions for processing images:
- Convert all downloaded images to common format (JPG) and mode (RGB)
- Generate thumbnails
- Check the image width / height to ensure that they meet the minimum constraints
The pipelines keep an internal queue of the media URLs currently being downloaded and connect responses containing the same media to that queue, which avoids downloading the same media more than once when it is shared by several items.
3.1 using Files Pipeline
The typical workflow of using Files Pipeline is as follows:
1. In a spider, you scrape an item and put the URLs of the desired files into its file_urls field;
2. The item is returned from the spider and enters the item pipeline;
3. When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, so they are processed before other pages are crawled. The item remains "locked" in this particular pipeline until the downloads are finished (or fail for some reason).
4. When the files have been downloaded, another field (files) is populated with the results. It contains a list of dicts with information about each downloaded file, such as the download path, the original URL (taken from the file_urls field) and the file checksum. The files in the files field keep the same order as the original file_urls field. If a file fails to download, the error is logged and that file is not recorded in the files field. A minimal sketch of this workflow follows.
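The sketch below assumes the FilesPipeline is enabled as described in section 3.3; the item class, spider name and the a.download selector are made-up placeholders:

import scrapy

class FileItem(scrapy.Item):
    file_urls = scrapy.Field()  # input: URLs of the files to download
    files = scrapy.Field()      # output: filled in by the FilesPipeline

class FilesSpider(scrapy.Spider):
    name = 'files_example'
    start_urls = ['http://www.example.com/files/']

    def parse(self, response):
        item = FileItem()
        # collect absolute URLs of the files linked on the page
        item['file_urls'] = [response.urljoin(href)
                             for href in response.css('a.download::attr(href)').extract()]
        yield item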
3.2 using Images Pipeline
Using the Images Pipeline is much like using the Files Pipeline, except that the default field names are different: image_urls holds the image URLs of an item, and the pipeline fills in an images field with information about the downloaded images.
The advantage of using ImagesPipeline for processing image files is that you can configure some additional functions, such as generating thumbnails and filtering images according to their size.
The Images Pipeline uses the Pillow library to convert images to JPEG/RGB format, so you also need to install Pillow. Historically PIL was used, but it is known to cause trouble in some setups, so Pillow is recommended instead of PIL.
3.3 using Media Pipeline
If you want to use a media pipeline, you must add it to the project's ITEM_PIPELINES setting. For the Images Pipeline, use:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
For the Files Pipeline, use:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
Note: Images Pipeline and Files Pipeline can be used at the same time.
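If you need both at once, a sketch of enabling them together (the priority values are arbitrary):

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 2,
}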
Then configure the target storage setting to a valid value that will be used to store the downloaded files or images. Otherwise the pipeline stays disabled, even if it is included in ITEM_PIPELINES.
For the Files Pipeline, add FILES_STORE to the settings:
FILES_STORE = '/path/to/valid/dir'
For the Images Pipeline, add IMAGES_STORE to the settings:
IMAGES_STORE = '/path/to/valid/dir'
3.4 supported storage
At present the file system is the only officially supported storage backend, but Amazon S3 and Google Cloud Storage are also supported.
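S3 or Google Cloud Storage is selected by using the corresponding URI scheme in the store setting; a hedged sketch (bucket and project names are placeholders, S3 requires botocore and GCS requires the google-cloud-storage package):

# Amazon S3
IMAGES_STORE = 's3://my-example-bucket/images'

# Google Cloud Storage (also requires a project id)
# IMAGES_STORE = 'gs://my-example-bucket/images'
# GCS_PROJECT_ID = 'my-example-project'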
3.5 examples
1. To use a media pipeline, first enable and configure it in the settings:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
2. Then define the images and image_urls fields:
import scrapy

class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()
3. Add the download path to the settings:
# Image download storage path
IMAGES_STORE = 'E:\\'
To avoid re-downloading files that were downloaded recently, you can set FILES_EXPIRES or IMAGES_EXPIRES to configure the cache expiration time (in days):
# Expire after 120 days
FILES_EXPIRES = 120
# Expire after 30 days
IMAGES_EXPIRES = 30
The Images Pipeline can automatically create thumbnails of the downloaded images; add the IMAGES_THUMBS setting, a dictionary whose keys are the thumbnail names and whose values are their dimensions:
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
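With this setting every downloaded image produces, in addition to the full-size copy, one file per thumbnail size; the layout under IMAGES_STORE then looks roughly like this (the hash shown is illustrative):

<IMAGES_STORE>/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg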
If you want to filter out small images, set IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH to specify the minimum allowed size:
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
These values do not affect thumbnail generation.
With the configuration above, we can add image downloading to our crawler.
4 A small crawler example
After all of the above you may still feel a bit lost, so let's go through a small project. The site we want to crawl is the second-hand house page of SouFun (fang.com), with pictures of the various houses.
The goal is to grab the pictures from the detail page of each listing on the list page.
4.1 enable pipeline
In settings.py, add the following:
# New content starts here #############################################################
# Enable the pipelines
ITEM_PIPELINES = {
    # Note: if you want to customize the image names, comment out this entry,
    # otherwise the custom image names will not take effect
    'scrapy.pipelines.images.ImagesPipeline': 1,
    # After customizing the image names, uncomment this entry
    # 'sp.pipelines.SpDownimagePipeline': 200,
}
# Image storage path
IMAGES_STORE = 'E:\\'
# Images expire after 30 days
IMAGES_EXPIRES = 30
# Thumbnails
# IMAGES_THUMBS = {
#     'small': (50, 50),
#     'big': (270, 270),
# }
# Filter out small images
# IMAGES_MIN_HEIGHT = 110
# IMAGES_MIN_WIDTH = 110
# Allow redirects
MEDIA_ALLOW_REDIRECTS = True
# Throttling: wait 3 seconds between downloads
DOWNLOAD_DELAY = 3
# Request user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
# The new content ends here ###########################################################
4.2 configuring items
Define an image field for the title of the page being crawled and an image_urls field for the image links found in the page. The items.py code is as follows:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class SpItem(scrapy.Item):
    """Define the item fields"""
    # Page title
    image = scrapy.Field()
    # Image links in the page
    image_urls = scrapy.Field()
4.3 spider
The code of our spider, ftx.py, is as follows:
# -*- coding: utf-8 -*-
import scrapy
from sp.items import SpItem

class MyBlog(scrapy.Spider):
    name = 'ftx'
    start_urls = ['http://esf.fang.com']

    def parse(self, response):
        """
        Crawl the item URLs (relative paths) from the initial start_urls list,
        generate requests with response.follow and pass parse_item as the callback.
        """
        # Get the links (relative addresses) of all second-hand houses on the home page
        page_urls = response.css("p.title a::attr(href)").extract()
        for page_url in page_urls:
            # response.follow joins the relative address with the base URL
            request = response.follow(page_url, callback=self.parse_item)
            # If the link obtained were an absolute address, you could use:
            # request = scrapy.Request(page_url, callback=self.parse_item)
            yield request

    def parse_item(self, response):
        """
        Handle the item
        :param response: the content of the requested page
        :return: item
        """
        # Instantiate the item class
        item = SpItem()
        # Pictures on each detail page:
        # image is the title of each detail page, image_urls are the image URLs on that page
        item['image'] = response.css("div.floatl::text").extract_first().strip()
        item['image_urls'] = response.css("img.loadimg::attr(data-src)").extract()
        yield item
4.4 custom image pipeline
Going straight to the code of pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class SpDownimagePipeline(ImagesPipeline):
    """Custom image download class"""

    def get_media_requests(self, item, info):
        """
        ImagesPipeline method; it must return a Request for every image URL.
        :param item: the scraped item
        """
        # Get the image URLs from the item and issue requests;
        # image_urls is the field defined in items.py
        for image_url in item['image_urls']:
            # meta passes the item on to the next method, similar to a cache
            yield scrapy.Request(image_url, meta={'item': item})

    def item_completed(self, results, item, info):
        """
        Unchanged here; item_completed is an ImagesPipeline method that must return the item.
        """
        if isinstance(item, dict) or self.images_result_field in item.fields:
            item[self.images_result_field] = [x for ok, x in results if ok]
        return item

    def file_path(self, request, response=None, info=None):
        """
        file_path is a built-in ImagesPipeline method. We override it here to
        customize the image names; without overriding it, names are SHA1 hashes,
        e.g. full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
        """
        # Get the item passed through request.meta in get_media_requests
        item = request.meta['item']
        # The image name: split the URL on '/'. We do not take the last segment (-1),
        # because it is not a random value but the dimensions (width x height,
        # e.g. 452x340c.jpg), which easily collide, so we use -2 instead.
        image_guid = request.url.split('/')[-2] + '.jpg'
        # Full name, including path and image name
        fullname = "full/%s/%s" % (item['image'], image_guid)
        return fullname
Here is an explanation of the two ImagesPipeline methods, get_media_requests and item_completed:
get_media_requests(item, info)
The pipeline gets the image URLs from the item and downloads them, so we can override the get_media_requests method and return a Request for each URL:
def get_media_requests(self, item, info):
    for file_url in item['file_urls']:
        yield scrapy.Request(file_url)
These requests are processed by the pipeline. When the downloads are finished, the results are sent to the item_completed method as a list of 2-element tuples, where each tuple contains (success, file_info_or_error), similar to the following:
[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False, Failure(...))]
success: a boolean, True if the file was downloaded successfully and False if it failed. file_info_or_error: a dict containing url, path and checksum if the download succeeded, or a Twisted Failure if there was a problem.
- url: where the file was downloaded from; this is the URL of the request returned by get_media_requests
- path: the path where the file is stored
- checksum: the MD5 hash of the file contents
item_completed(results, item, info)
This method is called when all the file requests for a single item have completed (either downloaded or failed). The results parameter holds the download results returned for the requests from get_media_requests. item_completed must return the output that is sent to the next pipeline stage, so you must return (or drop) the item, just as in any other pipeline.
In the following example, we store the downloaded file paths (passed in results) in a file_paths item field, and drop the item if it does not contain any files.
from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    file_paths = [x['path'] for ok, x in results if ok]
    if not file_paths:
        raise DropItem("Item contains no files")
    item['file_paths'] = file_paths
    return item
The following is a complete example of a custom Image pipeline:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item