Scrapy custom commands

Posted by wolfcry044 on Tue, 19 May 2020 10:00:43 +0200

Preface

Scrapy ships with many built-in commands, but sometimes we want to define our own, because typing a single command is more convenient than writing a separate script every time.

However, the Scrapy documentation says almost nothing about custom commands. There is only one sentence: you can also add custom project commands via the COMMANDS_MODULE setting; for examples of how to implement commands, see the built-in commands in scrapy/commands. So let's look at the source code.

Starting a crawler from a script

From the source code we can see how to start a crawler from a Python file instead of running scrapy crawl xxx. See the example:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'items.json'
})
process.crawl(MySpider)
# process.crawl(MySpider1)  # call crawl() several times to run multiple spiders simultaneously
process.start()

Here the settings argument of CrawlerProcess is the configuration used when the crawler starts; it plays the same role as the options you pass after scrapy crawl xxx, for example FEED_FORMAT='json' with FEED_URI='items.json' roughly corresponds to scrapy crawl xxx -o items.json.

You can also implement this with CrawlerRunner:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())  # stop the Twisted reactor when the crawl finishes
# d1 = runner.crawl(MySpider1)
# d1.addBoth(lambda _: reactor.stop())
# It can also be written as follows:
# runner.crawl(MySpider)
# runner.crawl(MySpider1)
# d = runner.join()
# d.addBoth(lambda _: reactor.stop())
reactor.run() 

If you don't want the spiders to run at the same time, you can run them one after another like this (by the way, the ... in the code has the same effect as pass):

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()

Custom commands

crawlall

First, let's look at the most commonly used custom command: crawlall.py, which runs all the spiders in a project.

# -*- coding: utf-8 -*-
from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
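        # schedule every spider registered in the project, then start them all together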
        spider_list = self.crawler_process.spider_loader.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()

This code comes up with a casual Baidu search, and it basically never changes.
Create a commands folder in the same directory as settings.py and put crawlall.py inside it, then add COMMANDS_MODULE = 'newspider.commands' to settings.py, where newspider is the Scrapy project name and commands is the folder we just created.
You can then run all the spiders from the command line with scrapy crawlall.
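For reference, a rough sketch of the resulting project layout, assuming the project is called newspider as above (note that the commands folder needs an empty __init__.py so Scrapy can import it):

newspider/
    scrapy.cfg
    newspider/
        settings.py          # add COMMANDS_MODULE = 'newspider.commands' here
        commands/
            __init__.py
            crawlall.py
        spiders/
            ...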

A hypothetical requirement

Suppose there is a requirement: write a general-purpose crawler to scrape some static web pages, and let other people handle the data-extraction part. But these other people don't know Scrapy at all; they are only responsible for writing XPath and regular expressions.

Writing such a crawler with Scrapy should be very simple. The first idea is to give them a template file and ask them to change only the parts that need changing, but in practice this may not go smoothly: even if only a small part has to be modified, the "other people" will see so much code and insist they can't do it, so the task stalls; and if they accidentally touch the surrounding code, nobody notices and it becomes hard to manage.

So what can we do? We can simplify the template file: strip out the code and leave only a dictionary describing what needs to be modified (of course it doesn't have to be a dictionary; any agreed format works, a dictionary is just easy to manage). For example: {'title': ['//title/text()',]}. That looks simple enough; they only need to watch the matching brackets and commas. Then we just create spider files from these dictionaries. But a new problem appears: how do they test whether their XPath is correct? We can't test it and rewrite it for them; that would be far too inefficient.

This finally brings us to the main topic. There are two approaches: the custom script shown above, and a custom command. The custom script is simpler, but since the point here is to show how custom commands work, we will use a custom command.

Effect of the command: given a dictionary file describing what to scrape, create a spider file from a template file plus the dictionary file, and then run that spider.

The effect is like a combination of genspider (create a spider from a template) and runspider (run a spider), so we can look directly at the source code of those two commands. The code is too long to paste here; you can read it locally (with Anaconda it lives in Anaconda\Lib\site-packages\scrapy\commands).
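If you are not sure where that directory lives on your machine, here is a quick sketch to locate it (assuming Scrapy is importable in your environment):

import os
import scrapy.commands

# scrapy.commands is a package, so its __file__ points at its __init__.py;
# the containing directory holds the built-in command implementations
print(os.path.dirname(scrapy.commands.__file__))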

After reading it, I found that the genspider command uses string.Template to create spider files. It is easy to use and essentially equivalent to format:

import string
a = '$a dadafsfas $b'
d = {'a':1, 'b': 'dsada'}
new_a = string.Template(a).substitute(d)
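# new_a is now '1 dadafsfas dsada'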

After reading the runspider code, our own command can be written as follows:

import sys
import os
import json
import string
import logging
from importlib import import_module
from scrapy.utils.spider import iter_spider_classes
from scrapy.commands import ScrapyCommand
from scrapy.exceptions import UsageError


logger = logging.getLogger(__name__)

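# fill the spider template with values from the dictionary file and write a new spider module into the spiders directory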
def create_spider(setting_rule, fname):
    d = {
        'spidername': fname,
        'title': setting_rule.get('title'),
    }
    with open('../tempspider.py', 'r', encoding='utf-8') as f:
        tempstr = f.read()
    with open(f'../spiders/{fname}_spider.py', 'w', encoding='utf-8') as fw:
        fw.write(string.Template(tempstr).substitute(d).replace('true', 'True').replace('false', 'False').replace('null', 'None'))

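# import the dictionary file, generate a spider from it, then import and return the generated spider module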
def _import_file(filepath):
    abspath = os.path.abspath(filepath)
    dirname, file = os.path.split(abspath)
    logging.info(dirname) 
    fname, fext = os.path.splitext(file)
    if fext != '.py':
        raise ValueError("Not a Python source file: %s" % abspath)
    if dirname:
        sys.path = [dirname] + sys.path
    try:
        module = import_module(fname)
    except Exception as e:
        logger.error('Template file may have syntax error, please check and try again!(%s)' % str(e))
    else:
        create_spider(module.setting_rule, fname)
        sys.path = [dirname+'/../spiders'] + sys.path
        spider_module = import_module(f'{fname}_spider')
        return spider_module
    finally:
        if dirname:
            sys.path.pop(0)
            sys.path.pop(0)


class Command(ScrapyCommand):

    requires_project = True
    
    def syntax(self):
        return "<spider_file>"

    def short_desc(self):
        return "Run a self-contained spider (without creating a project)"

    
    def run(self, args, opts):
        if len(args) != 1:
            raise UsageError()
        filename = args[0]
        if not os.path.exists(filename):
            raise UsageError("File not found: %s\n" % filename)
        try:
            spider_module = _import_file(filename)
        except (ImportError, ValueError) as e:
            raise UsageError("Unable to load %r: %s\n" % (filename, e))

        spclasses = list(iter_spider_classes(spider_module))
        if not spclasses:
            raise UsageError("No spider found in file: %s\n" % filename)
        spidercls = spclasses.pop()

        self.crawler_process.crawl(spidercls, **opts.__dict__)
        self.crawler_process.start()

        if self.crawler_process.bootstrap_failed:
            self.exitcode = 1

Why does the code look so complicated? Because I copied it straight from runspider.py. In fact, the core code runspider uses to run a spider is only a few lines:

from importlib import import_module
from scrapy.utils.spider import iter_spider_classes

spider_module = import_module(module_name)  # import the spider module
# iter_spider_classes returns an iterator over the spider classes defined in the module,
# i.e. it keeps only the spider classes and ignores unrelated functions and variables
spclasses = list(iter_spider_classes(spider_module))
spidercls = spclasses.pop()  # there is only one spider in the module
self.crawler_process.crawl(spidercls, **opts.__dict__)
self.crawler_process.start()  # run it

We save the command code above as test.py and put it in the commands directory; then running scrapy test template.py will test whether the dictionary file they wrote can actually parse the data. To fit the hypothetical requirement even better, we could also tweak Scrapy's logging so the output reads more like plain language and less like program output.
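For completeness, the command above relies on two files that are not shown in this post: the dictionary file the colleagues edit (it must expose a module-level setting_rule dict, which _import_file() imports) and the spider template ../tempspider.py that create_spider() fills in. Below is a rough sketch of what they might look like; the spider fields such as start_urls are purely illustrative and not taken from any real template:

# template.py -- the dictionary file the colleagues edit
setting_rule = {
    'title': ['//title/text()'],
}

# tempspider.py -- the spider template; $spidername and $title are
# string.Template placeholders that create_spider() substitutes
import scrapy

class TempSpider(scrapy.Spider):
    name = '$spidername'
    start_urls = ['http://example.com']

    def parse(self, response):
        for xpath in $title:
            yield {'title': response.xpath(xpath).getall()}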

Friendly reminder: the code above is for reference only and may throw errors when run, most likely because of the directory handling. I haven't found a better way to deal with the directories yet; simply concatenating '/../spiders' onto the parent directory doesn't feel elegant.
