Preface
Scrapy has many built-in commands, but sometimes we want to customize commands of our own, because writing a separate script is not as convenient as typing a command.
However, the official Scrapy documentation says almost nothing about custom commands. There is only one sentence: you can also add custom project commands via the COMMANDS_MODULE setting; for an example of how commands are implemented, see the Scrapy commands in scrapy/commands. So let's look at the source code.
Starting a crawler from a script
From the source code we can see how to start a crawler in a Python file instead of using scrapy crawl xxx. See the example:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'items.json'
})

process.crawl(MySpider)
# process.crawl(MySpider1)  # calling crawl() several times runs multiple spiders simultaneously
process.start()
Here, the settings passed to CrawlerProcess are the configuration used when the crawler starts, roughly equivalent to the options after scrapy crawl xxx -o.
You can also implement this with CrawlerRunner:
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())  # stop the twisted reactor
# d1 = runner.crawl(MySpider1)
# d1.addBoth(lambda _: reactor.stop())

# It can also be written as follows:
# runner.crawl(MySpider)
# runner.crawl(MySpider1)
# d = runner.join()
# d.addBoth(lambda _: reactor.stop())

reactor.run()
If you don't want them to run at the same time but one after another, write it like this (by the way, the ... in the code has roughly the same effect as pass):
from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()
Custom commands
crawlall
First, let's look at the most commonly used custom command, the one that runs all spiders: crawlall.py
# -*- coding: utf-8 -*-
from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
You can find this with a casual Baidu search, and the code barely changes from one post to another.
Create a commands folder in the same directory as settings.py, put crawlall.py inside it, and then add COMMANDS_MODULE = 'newspider.commands' to settings.py, where newspider is the Scrapy project name and commands is the directory we just created.
You can then run all the spiders from the command line with scrapy crawlall.
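For reference, this is a minimal sketch of what the layout and setting might look like, assuming the project is called newspider as above (adjust the names to your own project):

newspider/                # project root, contains scrapy.cfg
    scrapy.cfg
    newspider/            # project package
        __init__.py
        settings.py       # add: COMMANDS_MODULE = 'newspider.commands'
        spiders/
        commands/
            __init__.py   # needed so the folder is importable as a package
            crawlall.py

After that, scrapy crawlall behaves like running scrapy crawl once for every spider in the project.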
A hypothetical requirement
Now suppose there is a requirement: write a general crawler to grab some static web pages, and let other people handle the data extraction part. But these other people don't understand Scrapy at all; they are only responsible for writing XPath and regular expressions.
Writing such a crawler with Scrapy should be very simple. The first thought is to give them a template file and ask them to modify the parts that need changing, but in practice that may not go smoothly: even if only a small part needs to be changed, the "others", seeing that much code, will say they can't do it, and the task stalls. And if they accidentally touch the surrounding code, we won't know, and it becomes hard to manage.
So what to do? We can simplify the template file: strip out the code and keep only a dictionary of the content that needs to be modified (it doesn't have to be a dictionary, any agreed format works; a dictionary is just convenient to manage). For example: {'title': ['//title/text()']}. That looks simple enough, they only need to watch the matching brackets and commas. We then just generate crawler files from these dictionaries. But a new problem appears: how do they test whether their XPath is right? We can't test it and rewrite it for them, that would be far too inefficient.
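As a sketch, the file handed to the analysts could be as small as this (the file name template.py and the variable name setting_rule are chosen here to match the command code further down; the exact format is up to you):

# template.py -- the only file the "other people" need to edit (hypothetical example)
setting_rule = {
    'title': ['//title/text()'],   # field name: list of XPath expressions
    # add more fields here as needed
}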
This finally brings us to the main topic. There are two approaches: the custom script shown above, or a custom command. Although a custom script is simpler, since the point here is to show how to use custom commands, we'll go with a custom command.
The effect we want from the command: given a dictionary file describing what to grab, create a crawler file from the template file and the dictionary file, and then run that crawler.
In other words, it works like a combination of creating a crawler from a template (genspider) and running a crawler (runspider), so we can look directly at the source code of those two commands. The code is too long to paste here; you can read it locally (with Anaconda it is under Anaconda\Lib\site-packages\scrapy\commands).
After reading them, I found that the genspider command uses string.Template to create spider files. This method is easy to use and is essentially equivalent to format:
import string

a = '$a dadafsfas $b'
d = {'a': 1, 'b': 'dsada'}
new_a = string.Template(a).substitute(d)
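For comparison, a rough str.format equivalent of the same substitution (only to illustrate the point above, not part of the command code):

# the same substitution with str.format: placeholders are written {a} instead of $a
a = '{a} dadafsfas {b}'
new_a = a.format(a=1, b='dsada')  # '1 dadafsfas dsada'

One practical reason to prefer string.Template when generating Python source is that the generated code is full of literal braces, which str.format would force you to escape as {{ and }}.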
After also reading the runspider code, our command can be written as follows:
import sys
import os
import json
import string
import logging
from importlib import import_module

from scrapy.utils.spider import iter_spider_classes
from scrapy.commands import ScrapyCommand
from scrapy.exceptions import UsageError

logger = logging.getLogger(__name__)


def create_spider(setting_rule, fname):
    d = {
        'spidername': fname,
        'title': setting_rule.get('title')
    }
    with open('../tempspider.py', 'r', encoding='utf-8') as f:
        tempstr = f.read()
    with open(f'../spiders/{fname}_spider.py', 'w', encoding='utf-8') as fw:
        fw.write(string.Template(tempstr).substitute(d)
                 .replace('true', 'True')
                 .replace('false', 'False')
                 .replace('null', 'None'))


def _import_file(filepath):
    abspath = os.path.abspath(filepath)
    dirname, file = os.path.split(abspath)
    logging.info(dirname)
    fname, fext = os.path.splitext(file)
    if fext != '.py':
        raise ValueError("Not a Python source file: %s" % abspath)
    if dirname:
        sys.path = [dirname] + sys.path
    try:
        module = import_module(fname)
    except Exception as e:
        logger.error('Template file may have syntax error, please check and try again!(%s)' % str(e))
    else:
        create_spider(module.setting_rule, fname)
        sys.path = [dirname + '/../spiders'] + sys.path
        spider_module = import_module(f'{fname}_spider')
        return spider_module
    finally:
        if dirname:
            sys.path.pop(0)
            sys.path.pop(0)


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return "<spider_file>"

    def short_desc(self):
        return "Run a self-contained spider (without creating a project)"

    def run(self, args, opts):
        if len(args) != 1:
            raise UsageError()
        filename = args[0]
        if not os.path.exists(filename):
            raise UsageError("File not found: %s\n" % filename)
        try:
            spider_module = _import_file(filename)
        except (ImportError, ValueError) as e:
            raise UsageError("Unable to load %r: %s\n" % (filename, e))
        spclasses = list(iter_spider_classes(spider_module))
        if not spclasses:
            raise UsageError("No spider found in file: %s\n" % filename)
        spidercls = spclasses.pop()
        self.crawler_process.crawl(spidercls, **opts.__dict__)
        self.crawler_process.start()
        if self.crawler_process.bootstrap_failed:
            self.exitcode = 1
Why does the code look so complicated? Because I copied it straight from runspider.py. In fact, the core code runspider uses to run a spider is just a few lines:
from importlib import import_module
from scrapy.utils.spider import iter_spider_classes

# import the crawler module
spider_module = import_module(module_name)
# iterate over the spider classes in the module, i.e. keep only the spider classes
# and drop any unrelated functions and variables
spclasses = list(iter_spider_classes(spider_module))
spidercls = spclasses.pop()  # because there is only one spider
self.crawler_process.crawl(spidercls, **opts.__dict__)
self.crawler_process.start()  # run
We save the command code above as test.py, put it in the commands directory, and then scrapy test template.py will test whether the dictionary they wrote can actually parse the data. To fit the hypothetical requirement even better, we could also tweak Scrapy's logging so that the output reads more like something meant for humans than for programs.
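For completeness, here is a hypothetical sketch of what tempspider.py (the template that create_spider fills in) might look like. The $spidername and $title placeholders correspond to the keys substituted in create_spider above; everything else, such as the start URL and the parse logic, is an assumption for illustration:

# tempspider.py -- hypothetical spider template; $spidername and $title are
# string.Template placeholders, so this file only becomes valid Python after substitution
import scrapy

class ${spidername}Spider(scrapy.Spider):
    name = '$spidername'
    # assumed: in a real template the start URLs would probably also come from the rule file
    start_urls = ['http://example.com']

    def parse(self, response):
        # $title is replaced with the list of XPath expressions from the rule file
        titles = []
        for xpath in $title:
            titles.extend(response.xpath(xpath).getall())
        yield {'title': titles}

With something like this in place, scrapy test template.py would generate spiders/template_spider.py from the template and the dictionary and then run it, so the analysts can immediately see whether their XPath extracts anything.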
A friendly reminder: the code above is for reference only and may throw errors when run, most likely because of the directory handling. I haven't found a more reasonable way to deal with the paths yet; simply gluing '../spiders' onto the parent directory doesn't feel very elegant.
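For what it's worth, one possible direction is sketched below, assuming the commands directory sits inside the project package next to spiders/ and tempspider.py lives at the package root, as the relative paths in the code above suggest; the paths are derived from the command file's own location instead of hard-coded '../' strings:

from pathlib import Path

# location of this command file, e.g. newspider/commands/
COMMANDS_DIR = Path(__file__).resolve().parent
PROJECT_DIR = COMMANDS_DIR.parent            # e.g. newspider/
TEMPLATE_PATH = PROJECT_DIR / 'tempspider.py'
SPIDERS_DIR = PROJECT_DIR / 'spiders'

# create_spider() could then use open(TEMPLATE_PATH, encoding='utf-8') and
# SPIDERS_DIR / f'{fname}_spider.py' instead of the relative '../' paths.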