Get started with Python programming quickly (continuously updated...)
Python crawlers from beginner to master
Scrapy crawler framework
1. Understanding Scrapy's log output
(omitted)
2. Common Scrapy settings
ROBOTSTXT_OBEY: whether to comply with the robots protocol; the default is to comply.
About the robots protocol:
1. In Baidu search you cannot find the detail page of a specific product on Taobao; this is the robots protocol at work.
2. Robots protocol: through robots.txt a website tells search engines which pages may be crawled and which may not; it is only a common convention on the Internet.
3. For example, Taobao's robots protocol: https://www.taobao.com/robots.txt (a quick programmatic check is sketched below).
In practice the protocol is usually not obeyed; just comment the line out:
# ROBOTSTXT_OBEY = True
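As a quick way to see what a site's robots protocol allows, here is a minimal sketch using Python's standard urllib.robotparser; the checked URL and the user-agent string are illustrative assumptions, not part of the original text.

from urllib.robotparser import RobotFileParser

# minimal sketch: ask robots.txt whether a given user agent may crawl a given url
rp = RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')  # Taobao's robots protocol, as mentioned above
rp.read()

# the url below is an illustrative placeholder, not a real product page
print(rp.can_fetch('Baiduspider', 'https://www.taobao.com/some/product/page'))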
USER_AGENT: sets the UA (user agent).
DEFAULT_REQUEST_HEADERS: sets the default request headers; a User-Agent added here will not take effect.
ITEM_PIPELINES: pipelines; the key is the pipeline path and the value is its weight; the smaller the weight, the earlier the pipeline runs (a combined settings sketch follows this list).
SPIDER_MIDDLEWARES: spider middleware; configured the same way as pipelines.
DOWNLOADER_MIDDLEWARES: downloader middleware.
COOKIES_ENABLED: True by default, meaning cookie passing is enabled, i.e. each request carries the previous cookies to keep the session state.
COOKIES_DEBUG: False by default, meaning the cookie passing process is not shown in the log.
LOG_LEVEL: DEBUG by default; controls the log level, e.g.
a. LOG_LEVEL = "WARNING"
LOG_FILE: sets the path of the log file. If it is set, log information is written to the file, is no longer shown in the terminal, and is still subject to the LOG_LEVEL limit, e.g.
b. LOG_FILE = "./test.log"
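Putting the options above together, a settings.py might look roughly like this; every value is illustrative, and the 'myproject.*' paths are hypothetical placeholders.

# settings.py sketch for the options discussed above (all values are illustrative)
ROBOTSTXT_OBEY = False                  # do not follow the robots protocol
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # illustrative UA string
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,      # hypothetical path; smaller weight runs first
}
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyDownloaderMiddleware': 543,  # hypothetical path
}
COOKIES_ENABLED = True                  # default: pass cookies between requests
COOKIES_DEBUG = True                    # show the cookie passing process in the log
LOG_LEVEL = "WARNING"                   # only log warnings and above
LOG_FILE = "./test.log"                 # write the log to a file instead of the terminal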
3. scrapy_redis configuration
1. DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # fingerprint generation and deduplication class
2. SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # scheduler class
3. SCHEDULER_PERSIST = True  # persist the request queue and fingerprint set
4. ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400}  # pipeline that stores data in redis
5. REDIS_URL = "redis://host:port"  # redis url
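With that configuration in place, a distributed spider reads its start urls from Redis instead of start_urls; a minimal sketch follows, where the spider name, domain and redis key are illustrative assumptions.

from scrapy_redis.spiders import RedisSpider

class BookSpider(RedisSpider):
    # name, domain and redis_key are illustrative
    name = 'book'
    allowed_domains = ['example.com']
    redis_key = 'book:start_urls'  # the spider waits until a url is pushed to this key

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

# seed the queue from redis-cli:
#   lpush book:start_urls 'https://example.com'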
4. scrapy_splash configuration
SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
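With this configuration a spider renders pages through the Splash service by yielding SplashRequest instead of a plain Request; a minimal sketch follows, where the spider name and url are illustrative assumptions.

import scrapy
from scrapy_splash import SplashRequest

class JsPageSpider(scrapy.Spider):
    name = 'js_page'  # illustrative name

    def start_requests(self):
        # render the page in Splash before it reaches the spider
        yield SplashRequest('https://example.com',
                            callback=self.parse,
                            args={'wait': 5},        # seconds Splash waits for js to run
                            endpoint='render.html')  # return the rendered html

    def parse(self, response):
        # response.body now contains the js-rendered html
        self.logger.info(response.css('title::text').get())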
5. Combined scrapy_redis and scrapy_splash configuration
5.1 Principle
1. scrapy_redis sets DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter", which conflicts with the DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' required by scrapy_splash!
2. Looking at the source code of scrapy_splash.SplashAwareDupeFilter, it inherits from scrapy.dupefilters.RFPDupeFilter and overrides the request_fingerprint() method.
3. Comparing the request_fingerprint() methods of scrapy.dupefilters.RFPDupeFilter and scrapy_redis.dupefilter.RFPDupeFilter shows they are identical, so we rewrite a SplashAwareDupeFilter that inherits from scrapy_redis.dupefilter.RFPDupeFilter instead, leaving the rest of the code unchanged.
5.2 Rewrite the dupefilter deduplication class and use it in settings.py
5.2.1 The rewritten deduplication class
from __future__ import absolute_import

from copy import deepcopy

from scrapy.utils.request import request_fingerprint
from scrapy.utils.url import canonicalize_url

from scrapy_splash.utils import dict_hash

from scrapy_redis.dupefilter import RFPDupeFilter


def splash_request_fingerprint(request, include_headers=None):
    """ Request fingerprint which takes 'splash' meta key into account """

    fp = request_fingerprint(request, include_headers=include_headers)
    if 'splash' not in request.meta:
        return fp

    splash_options = deepcopy(request.meta['splash'])
    args = splash_options.setdefault('args', {})

    if 'url' in args:
        args['url'] = canonicalize_url(args['url'], keep_fragments=True)

    return dict_hash(splash_options, fp)


class SplashAwareDupeFilter(RFPDupeFilter):
    """
    DupeFilter that takes 'splash' meta key in account.
    It should be used with SplashMiddleware.
    """
    def request_fingerprint(self, request):
        return splash_request_fingerprint(request)


"""The above is the rewritten deduplication class; below is the spider code"""

from scrapy_redis.spiders import RedisSpider
from scrapy_splash import SplashRequest


class SplashAndRedisSpider(RedisSpider):
    name = 'splash_and_redis'
    allowed_domains = ['baidu.com']

    # start_urls = ['https://www.baidu.com/s?wd=13161933309']
    redis_key = 'splash_and_redis'
    # lpush splash_and_redis 'https://www.baidu.com'

    # The distributed start url cannot go through the splash service!
    # The dupefilter deduplication class has to be rewritten!
    def parse(self, response):
        yield SplashRequest('https://www.baidu.com/s?wd=13161933309',
                            callback=self.parse_splash,
                            args={'wait': 10},       # maximum timeout, in seconds
                            endpoint='render.html')  # fixed parameter for the splash service

    def parse_splash(self, response):
        with open('splash_and_redis.html', 'w') as f:
            f.write(response.body.decode())
5.2.2 scrapy_redis and scrapy_splash configuration
# url of the rendering service
SPLASH_URL = 'http://127.0.0.1:8050'
# http cache using Splash
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# deduplication filter
# DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # fingerprint generation and deduplication
# when mixing scrapy_redis and scrapy_splash, use splash's DupeFilter!
DUPEFILTER_CLASS = 'test_splash.spiders.splash_and_redis.SplashAwareDupeFilter'  # location of the rewritten deduplication class
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # scheduler class
SCHEDULER_PERSIST = True  # persist the request queue and fingerprint set
ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400}  # pipeline that stores data in redis
REDIS_URL = "redis://127.0.0.1:6379"  # redis url
Note:
1. A scrapy_redis distributed crawler does not exit automatically after the business logic finishes.
2. The rewritten dupefilter deduplication class can be defined in any location; the corresponding path just has to be written in the settings file.
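Regarding note 1, the original text gives no solution; one possible workaround (an assumption, not taken from the source) is to let Scrapy's built-in CloseSpider extension stop the crawl after a fixed running time or item count:

# assumption / workaround sketch, not part of the original text:
# Scrapy's CloseSpider extension can force the distributed crawl to stop,
# e.g. in settings.py
CLOSESPIDER_TIMEOUT = 3600     # close the spider after running for 3600 seconds
CLOSESPIDER_ITEMCOUNT = 10000  # or after 10000 items have been scraped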
6. Other Scrapy settings
1. CONCURRENT_REQUESTS sets the number of concurrent requests; the default is 16.
2. DOWNLOAD_DELAY sets the download delay in seconds; there is no delay by default.
3. For other settings, see: https://www.jianshu.com/p/df9c0d1e9087
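As a small illustration of the two settings above (values are arbitrary examples):

# illustrative values, in settings.py
CONCURRENT_REQUESTS = 8   # default is 16
DOWNLOAD_DELAY = 1        # delay in seconds between requests; default is 0 (no delay)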