Get started with Python programming quickly (continuously updated...)
Python crawlers from beginner to master
Scrapy crawler framework
1. Understanding Scrapy's log output
(omitted)
2. Common Scrapy settings
ROBOTSTXT_OBEY: whether to comply with the robots protocol; the default is to comply.
About the robots protocol:
1. In Baidu search you cannot find the detail page of a specific product on Taobao; this is the robots protocol at work.
2. Robots protocol: through robots.txt a website tells search engines which pages may be crawled and which may not; it is only a common convention on the Internet.
3. For example, Taobao's robots protocol: https://www.taobao.com/robots.txt (a quick programmatic check is sketched below).
In practice the protocol is usually not obeyed; just comment the line out:
# ROBOTSTXT_OBEY = True
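As a quick way to see what a site's robots protocol allows, here is a minimal sketch using Python's standard urllib.robotparser; the checked URL and the user-agent string are illustrative assumptions, not part of the original text.

from urllib.robotparser import RobotFileParser

# minimal sketch: ask robots.txt whether a given user agent may crawl a given url
rp = RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')  # Taobao's robots protocol, as mentioned above
rp.read()

# the url below is an illustrative placeholder, not a real product page
print(rp.can_fetch('Baiduspider', 'https://www.taobao.com/some/product/page'))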
USER_AGENT: sets the UA (user agent).
DEFAULT_REQUEST_HEADERS: sets the default request headers; a User-Agent added here will not take effect.
ITEM_PIPELINES: pipelines; the key is the pipeline path and the value is its weight; the smaller the weight, the earlier the pipeline runs (a combined settings sketch follows this list).
SPIDER_MIDDLEWARES: spider middleware; configured the same way as pipelines.
DOWNLOADER_MIDDLEWARES: downloader middleware.
COOKIES_ENABLED: True by default, meaning cookie passing is enabled, i.e. each request carries the previous cookies to keep the session state.
COOKIES_DEBUG: False by default, meaning the cookie passing process is not shown in the log.
LOG_LEVEL: DEBUG by default; controls the log level, e.g.
a. LOG_LEVEL = "WARNING"
LOG_FILE: sets the path of the log file. If it is set, log information is written to the file, is no longer shown in the terminal, and is still subject to the LOG_LEVEL limit, e.g.
b. LOG_FILE = "./test.log"
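Putting the options above together, a settings.py might look roughly like this; every value is illustrative, and the 'myproject.*' paths are hypothetical placeholders.

# settings.py sketch for the options discussed above (all values are illustrative)
ROBOTSTXT_OBEY = False                  # do not follow the robots protocol
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # illustrative UA string
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,      # hypothetical path; smaller weight runs first
}
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyDownloaderMiddleware': 543,  # hypothetical path
}
COOKIES_ENABLED = True                  # default: pass cookies between requests
COOKIES_DEBUG = True                    # show the cookie passing process in the log
LOG_LEVEL = "WARNING"                   # only log warnings and above
LOG_FILE = "./test.log"                 # write the log to a file instead of the terminal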
3. scrapy_redis configuration
1. DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # fingerprint generation and deduplication class
2. SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # scheduler class
3. SCHEDULER_PERSIST = True  # persist the request queue and fingerprint set
4. ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400}  # pipeline that stores data in redis
5. REDIS_URL = "redis://host:port"  # redis url
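With that configuration in place, a distributed spider reads its start urls from Redis instead of start_urls; a minimal sketch follows, where the spider name, domain and redis key are illustrative assumptions.

from scrapy_redis.spiders import RedisSpider

class BookSpider(RedisSpider):
    # name, domain and redis_key are illustrative
    name = 'book'
    allowed_domains = ['example.com']
    redis_key = 'book:start_urls'  # the spider waits until a url is pushed to this key

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

# seed the queue from redis-cli:
#   lpush book:start_urls 'https://example.com'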
4. scrapy_splash configuration
SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
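With this configuration a spider renders pages through the Splash service by yielding SplashRequest instead of a plain Request; a minimal sketch follows, where the spider name and url are illustrative assumptions.

import scrapy
from scrapy_splash import SplashRequest

class JsPageSpider(scrapy.Spider):
    name = 'js_page'  # illustrative name

    def start_requests(self):
        # render the page in Splash before it reaches the spider
        yield SplashRequest('https://example.com',
                            callback=self.parse,
                            args={'wait': 5},        # seconds Splash waits for js to run
                            endpoint='render.html')  # return the rendered html

    def parse(self, response):
        # response.body now contains the js-rendered html
        self.logger.info(response.css('title::text').get())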
5. Combined scrapy_redis and scrapy_splash configuration
5.1 Principle
1. scrapy_redis sets DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter", which conflicts with the DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' required by scrapy_splash!
2. Looking at the source code of scrapy_splash.SplashAwareDupeFilter, it inherits from scrapy.dupefilters.RFPDupeFilter and overrides the request_fingerprint() method.
3. Comparing the request_fingerprint() methods of scrapy.dupefilters.RFPDupeFilter and scrapy_redis.dupefilter.RFPDupeFilter shows they are identical, so we rewrite a SplashAwareDupeFilter that inherits from scrapy_redis.dupefilter.RFPDupeFilter instead, leaving the rest of the code unchanged.
5.2 Rewrite the dupefilter deduplication class and use it in settings.py
5.2.1 The rewritten deduplication class
from __future__ import absolute_import

from copy import deepcopy

from scrapy.utils.request import request_fingerprint
from scrapy.utils.url import canonicalize_url

from scrapy_splash.utils import dict_hash

from scrapy_redis.dupefilter import RFPDupeFilter


def splash_request_fingerprint(request, include_headers=None):
    """ Request fingerprint which takes 'splash' meta key into account """

    fp = request_fingerprint(request, include_headers=include_headers)
    if 'splash' not in request.meta:
        return fp

    splash_options = deepcopy(request.meta['splash'])
    args = splash_options.setdefault('args', {})

    if 'url' in args:
        args['url'] = canonicalize_url(args['url'], keep_fragments=True)

    return dict_hash(splash_options, fp)


class SplashAwareDupeFilter(RFPDupeFilter):
    """
    DupeFilter that takes 'splash' meta key in account.
    It should be used with SplashMiddleware.
    """
    def request_fingerprint(self, request):
        return splash_request_fingerprint(request)


"""The above is the rewritten deduplication class; below is the spider code"""

from scrapy_redis.spiders import RedisSpider
from scrapy_splash import SplashRequest


class SplashAndRedisSpider(RedisSpider):
    name = 'splash_and_redis'
    allowed_domains = ['baidu.com']

    # start_urls = ['https://www.baidu.com/s?wd=13161933309']
    redis_key = 'splash_and_redis'
    # lpush splash_and_redis 'https://www.baidu.com'

    # The distributed start url cannot go through the splash service!
    # The dupefilter deduplication class has to be rewritten!
    def parse(self, response):
        yield SplashRequest('https://www.baidu.com/s?wd=13161933309',
                            callback=self.parse_splash,
                            args={'wait': 10},       # maximum timeout, in seconds
                            endpoint='render.html')  # fixed parameter for the splash service

    def parse_splash(self, response):
        with open('splash_and_redis.html', 'w') as f:
            f.write(response.body.decode())
5.2.2 scrapy_redis and scrapy_splash configuration
# url of the rendering service
SPLASH_URL = 'http://127.0.0.1:8050'
# http cache using Splash
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# deduplication filter
# DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # fingerprint generation and deduplication
# when mixing scrapy_redis and scrapy_splash, use splash's DupeFilter!
DUPEFILTER_CLASS = 'test_splash.spiders.splash_and_redis.SplashAwareDupeFilter'  # location of the rewritten deduplication class
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # scheduler class
SCHEDULER_PERSIST = True  # persist the request queue and fingerprint set
ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400}  # pipeline that stores data in redis
REDIS_URL = "redis://127.0.0.1:6379"  # redis url
Note:
1. A scrapy_redis distributed crawler does not exit automatically after the business logic finishes.
2. The rewritten dupefilter deduplication class can be defined in any location; the corresponding path just has to be written in the settings file.
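Regarding note 1, the original text gives no solution; one possible workaround (an assumption, not taken from the source) is to let Scrapy's built-in CloseSpider extension stop the crawl after a fixed running time or item count:

# assumption / workaround sketch, not part of the original text:
# Scrapy's CloseSpider extension can force the distributed crawl to stop,
# e.g. in settings.py
CLOSESPIDER_TIMEOUT = 3600     # close the spider after running for 3600 seconds
CLOSESPIDER_ITEMCOUNT = 10000  # or after 10000 items have been scraped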
6. Other Scrapy settings
1. CONCURRENT_REQUESTS sets the number of concurrent requests; the default is 16.
2. DOWNLOAD_DELAY sets the download delay in seconds; there is no delay by default.
3. For other settings, see: https://www.jianshu.com/p/df9c0d1e9087
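As a small illustration of the two settings above (values are arbitrary examples):

# illustrative values, in settings.py
CONCURRENT_REQUESTS = 8   # default is 16
DOWNLOAD_DELAY = 1        # delay in seconds between requests; default is 0 (no delay)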