84 Crawler: scrapy-redis source code analysis (dupefilter)

Posted by techmeister on Mon, 02 Sep 2019 06:30:14 +0200

This module is responsible for request deduplication, and the implementation is quite clever: it uses Redis's set data structure. Note, however, that the scheduler does not store pending requests under the dupefilter key used in this module; it uses the queues implemented in the queue.py module.

When a request is not a duplicate, it is stored in the queue and popped out when it is scheduled.
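To see this split concretely: when a spider runs under the scrapy-redis scheduler, the seen fingerprints and the pending requests live under separate Redis keys (by default scrapy-redis uses '%(spider)s:requests' for the queue and '%(spider)s:dupefilter' for the fingerprint set). A minimal inspection sketch, assuming a local Redis server and a spider named myspider:

import redis

server = redis.StrictRedis()  # assumes Redis is reachable on localhost:6379
# The fingerprint set and the request queue are separate structures.
print(server.scard("myspider:dupefilter"))  # number of fingerprints seen so far
print(server.type("myspider:requests"))     # the scheduler queue: a list or zset, depending on the queue class

Here is the full source of dupefilter.py: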

import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

from .connection import get_redis_from_settings


DEFAULT_DUPEFILTER_KEY = "dupefilter:%(timestamp)s"

logger = logging.getLogger(__name__)


# TODO: Rename class to RedisDupeFilter.
class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.
    This class can also be used with default Scrapy's scheduler.
    """

    logger = logger

    def __init__(self, server, key, debug=False):
        """Initialize the duplicates filter.
        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.
        """
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.
        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.
        Parameters
        ----------
        settings : scrapy.settings.Settings
        Returns
        -------
        RFPDupeFilter
            An RFPDupeFilter instance.
        """
        server = get_redis_from_settings(settings)
        # XXX: This creates a one-time key, needed to support using this
        # class as a standalone dupefilter with Scrapy's default scheduler;
        # if Scrapy passed the spider to the open() method, this wouldn't be needed.
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = DEFAULT_DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.
        Parameters
        ----------
        crawler : scrapy.crawler.Crawler
        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.
        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen.
        Parameters
        ----------
        request : scrapy.http.Request
        Returns
        -------
        bool
        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.
        Parameters
        ----------
        request : scrapy.http.Request
        Returns
        -------
        str
        """
        return request_fingerprint(request)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.
        Parameters
        ----------
        reason : str, optional
        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.
        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider
        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

This file may look as though it needlessly rewrites the request deduplication that Scrapy itself already implements. But when Scrapy runs on its own, it only needs to check the in-memory request queue, or the request queue persisted to disk (Scrapy's default persistence, via JOBDIR, writes serialized requests to local files, not to a database), to decide whether the request it is about to issue has already been requested or is currently being scheduled; everything is read locally. For distributed crawling, the schedulers on all hosts need to connect to the same request pool in the same database to decide whether a request is a duplicate.
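Wiring that shared pool up is done through the project settings. The snippet below is a minimal sketch following the scrapy-redis README; the Redis URL is a placeholder for whatever server all hosts share:

# settings.py -- minimal scrapy-redis configuration (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # schedule requests through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # deduplicate through Redis
REDIS_URL = "redis://localhost:6379"                        # placeholder: the shared Redis server
SCHEDULER_PERSIST = True                                    # keep the queue and fingerprint set between runs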

In this file, Redis-based deduplication is realized by inheriting from BaseDupeFilter and overriding its methods. As the source shows, scrapy-redis reuses Scrapy's own fingerprint function, request_fingerprint. This interface is interesting: according to the Scrapy documentation, it uses a hash to decide whether two URLs are the same (the same URL always produces the same hash), and even when two URLs point to the same address with the same GET parameters in a different order, they still produce the same hash, because the URL is canonicalized before hashing. (This is quite neat.) So scrapy-redis relies on this URL fingerprint to decide whether a request has already been made.
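A quick sketch of that canonicalization behaviour (the URL is a made-up example):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# Same address, same GET parameters, only the parameter order differs.
fp1 = request_fingerprint(Request("http://www.example.com/search?id=111&cat=222"))
fp2 = request_fingerprint(Request("http://www.example.com/search?cat=222&id=111"))
print(fp1 == fp2)  # True -- the URL is canonicalized (query arguments sorted) before hashing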

This class connects to Redis and inserts fingerprints into a Redis set under a single key (the key is the same for the same spider; Redis is a key-value database, so the same key always accesses the same value). The key is built from the spider name plus the dupefilter suffix, so crawler instances on different hosts access the same set as long as they belong to the same spider, and that set is their shared URL deduplication pool. If SADD returns 0, the fingerprint already exists in the set (a set holds no duplicate values), so request_seen returns True and the request is treated as a duplicate. If SADD returns 1, a new fingerprint was added to the set, meaning the request has not been seen before, so request_seen returns False, and the new fingerprint has already been stored in the database along the way. The dupefilter is used in the scheduler class: every request is checked for duplicates before it enters scheduling, and a duplicate does not need to take part in scheduling at all; it is simply dropped, otherwise it would just waste resources.
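The SADD return values are easy to verify by hand; a minimal sketch, assuming a local Redis server and a made-up key and fingerprint:

import redis

server = redis.StrictRedis()               # assumes Redis on localhost:6379
key = "myspider:dupefilter"                # hypothetical key: spider name + ':dupefilter'

print(server.sadd(key, "some-fingerprint"))  # 1 -- new fingerprint added, request_seen() returns False
print(server.sadd(key, "some-fingerprint"))  # 0 -- fingerprint already present, request_seen() returns True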

Topics: Redis Database JSON