1 Create a project
Install the required environment: Python >= 3.6 is needed.
Installation command:
pip install asyncpy
After the installation is complete, you can start creating a crawler project.
Create project command:
asyncpy genspider demo
Create a project called demo.
After the project is created successfully, open the project folder. The project structure is as follows:
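The original screenshot of the generated files is omitted here; based on the modules imported later in this tutorial (the spider, settings, middlewares and pipelines files), the layout looks roughly like this:

```
demo
├── demo.py         # the spider file
├── settings.py     # global configuration
├── middlewares.py  # request middleware
└── pipelines.py    # item pipelines
```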
2 Send GET and POST requests
2.1 Send a GET request with start_urls
Add a link to the start_urls list.
In the parse method, print the response status code and content.
```python
# -*- coding: utf-8 -*-
from asyncpy.spider import Spider
import settings


class DemoSpider(Spider):
    name = 'demo'
    settings_attr = settings
    start_urls = ['http://httpbin.org/get']

    async def parse(self, response):
        print(response.status)
        print(response.text)


DemoSpider.start()
```
Right-click and run the file to send the request and fetch the response.
2.2 Send a POST request with start_requests
Import Asyncpy's Request module, clear start_urls, and override the start_requests method to send the POST request.
```python
from asyncpy.spider import Spider
import settings
from asyncpy.spider import Request


class DemoSpider(Spider):
    name = 'demo'
    settings_attr = settings
    start_urls = []

    async def start_requests(self):
        url = 'http://httpbin.org/post'
        yield Request(callback=self.parse, url=url, method="POST", data={"Say": "Hello Asyncpy"})

    async def parse(self, response):
        print(response.status)
        print(response.text)


DemoSpider.start()
```
Response:
You can see the parameters submitted in our POST request.
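httpbin.org/post returns the request details as JSON, so a quick way to check the submitted form data is to parse response.text. A minimal sketch, assuming the standard httpbin response layout with a "form" key:

```python
import json

async def parse(self, response):
    data = json.loads(response.text)
    # httpbin echoes the POSTed form fields under the "form" key
    print(response.status)
    print(data.get("form"))  # expected: {'Say': 'Hello Asyncpy'}
```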
3 Custom request headers
Here we take modifying the User-Agent in the request headers as an example. From the response above, you can see that the current User-Agent is aiohttp's default.
3.1 Set the request header in settings
Open the settings file, find the USER_AGENT option near the bottom, uncomment it, and set it to a browser UA.
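After uncommenting, the line might look like this (a sketch of the settings file, reusing the UA string from the middleware example below):

```python
# settings.py
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36"
```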
3.2 Add request headers in middlewares
Open the middlewares file and find the UserAgentMiddleware function (provided by default), or define your own.
```python
# -*- coding: utf-8 -*-
from asyncpy.middleware import Middleware
from asyncpy.request import Request
from asyncpy.spider import Spider

middleware = Middleware()


@middleware.request
async def UserAgentMiddleware(spider: Spider, request: Request):
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36"
    request.headers.update({"User-Agent": ua})
```
Then go to the crawler file (demo.py), import middleware from the middlewares file, and pass it into the start method.
```python
# -*- coding: utf-8 -*-
from asyncpy.spider import Spider
import settings
from asyncpy.spider import Request
from middlewares import middleware


class DemoSpider(Spider):
    name = 'demo'
    settings_attr = settings
    start_urls = []

    async def start_requests(self):
        url = 'http://httpbin.org/post'
        yield Request(callback=self.parse, url=url, method="POST", data={"Say": "Hello Asyncpy"})

    async def parse(self, response):
        print(response.text)


DemoSpider.start(middleware=middleware)
```
Run demo.py and you can see that the User-Agent has been changed to our custom UA.
3.3 Add a proxy IP
As in the previous step, open the middlewares file and add the proxy inside the middleware function (or define a new one).
Note that the proxy is added through aiohttp_kwargs. Remember to pass middleware into the start method.
```python
@middleware.request
async def UserAgentMiddleware(spider: Spider, request: Request):
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36"
    request.headers.update({"User-Agent": ua})
    request.aiohttp_kwargs.update({"proxy": "http://49.85.98.209:4253"})
```
Run demo.py and you can see that the request now goes out through our custom proxy IP. (This proxy IP has since expired.)
4 Modify concurrency, delay, retries and other configuration
4.1 Modify the configuration in settings
The settings file supports the following configuration options, which you can modify as needed.
""" CREATE YOUR DEFAULT_CONFIG ! Some configuration: CONCURRENT_REQUESTS Concurrent quantity RETRIES retry count DOWNLOAD_DELAY Download delay RETRY_DELAY Retry delay DOWNLOAD_TIMEOUT Timeout limit USER_AGENT user agent LOG_FILE Log path LOG_LEVEL Log level """
4.2 Modify the configuration of a specific crawler file
If different crawler files need different configurations, use custom_settings to define a custom configuration inside the crawler file.
You also need to pass custom_settings into the yielded Request for the custom configuration to take effect.
```python
# -*- coding: utf-8 -*-
from asyncpy.spider import Spider
import settings
from asyncpy.spider import Request
from middlewares import middleware


class DemoSpider(Spider):
    name = 'demo'
    settings_attr = settings
    custom_settings = {
        "DOWNLOAD_TIMEOUT": 60,
        "RETRIES": 3
    }
    start_urls = []

    async def start_requests(self):
        url = 'http://httpbin.org/post'
        yield Request(callback=self.parse, url=url,
                      method="POST",
                      data={"Say": "Hello Asyncpy"},
                      custom_settings=self.custom_settings
                      )

    async def parse(self, response):
        print(response.text)


DemoSpider.start(middleware=middleware)
```
5 Generate log files
5.1 Modify the settings configuration
```python
# '''generate log file'''
# LOG_FILE = '../asyncpy.log'
# LOG_LEVEL = 'DEBUG'
```
A global log file can be generated by uncommenting these options in the settings file.
5.2 Specify log files for individual crawlers
```python
custom_settings = {
    "LOG_FILE": "../asyncpy.log"
}
```
This works in the same way as section 4.2: to give a specific crawler file its own log, remove the log configuration from settings and configure it through custom_settings instead. If LOG_LEVEL is not specified, the log level defaults to INFO.
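For example, to send one crawler's output to its own file at DEBUG level, a sketch using the two options above:

```python
custom_settings = {
    "LOG_FILE": "../asyncpy.log",
    "LOG_LEVEL": "DEBUG"
}
```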
6 Parse the response and extract data
The parsel parsing module (the same library Scrapy uses) was introduced in the overview, so the default parsing methods here work the same way.
You can also choose another parsing approach if you prefer.
```python
async def parse(self, response):
    print(response.text)
    print(response.xpath('//text()'))
    print(response.css(''))
```
6.1 response.text
Returns the text content of the page, which can be matched with the standard re module.
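A minimal sketch that matches the raw text with the re module (the pattern is only illustrative):

```python
import re

async def parse(self, response):
    # match directly against the raw page text
    match = re.search(r'<title>(.*?)</title>', response.text)
    if match:
        print(match.group(1))
```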
6.2 response.xpath('')
- getall(): returns a list of all matching strings
- get(): returns a single string, the first one in the list
- extract(): same as getall()
- extract_first(): same as get()
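A minimal sketch showing these methods with the xpath expression from the snippet above:

```python
async def parse(self, response):
    all_text = response.xpath('//text()').getall()   # list of all matching strings
    first_text = response.xpath('//text()').get()    # only the first string
    # extract() / extract_first() behave the same as getall() / get()
    print(first_text)
    print(len(all_text))
```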
6.3 response.css(), response.re()
These correspond to parsel's CSS selectors and its re() method; see the parsel documentation for details on how to use them.
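A short sketch, assuming response.css and response.re behave like parsel's Selector.css and Selector.re (the selector and pattern are only examples):

```python
async def parse(self, response):
    # CSS selector with parsel's ::attr() syntax: collect all link targets
    links = response.css('a::attr(href)').getall()
    # apply a regular expression to the response
    titles = response.re(r'<title>(.*?)</title>')
    print(links)
    print(titles)
```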
7 Save data with pipelines
When a callback yields an object, Asyncpy checks whether it is a dict-type item; if it is, the data is passed to the pipeline to be saved.
- First define an item (a dict), then yield it from the callback.
- The project's pipelines file contains a SpiderPipeline class by default. Import SpiderPipeline and pass it to start() to enable the pipeline.
```python
# -*- coding: utf-8 -*-
# Crawler file
from asyncpy.spider import Spider
import settings
from asyncpy.spider import Request
from middlewares import middleware
from pipelines import SpiderPipeline


class DemoSpider(Spider):
    name = 'demo'
    settings_attr = settings
    start_urls = []

    async def start_requests(self):
        url = 'http://httpbin.org/post'
        yield Request(callback=self.parse, url=url,
                      method="POST",
                      data={"Say": "Hello Asyncpy"},
                      custom_settings=self.custom_settings
                      )

    async def parse(self, response):
        item = {}
        item['text'] = response.text
        yield item


DemoSpider.start(middleware=middleware, pipelines=SpiderPipeline)
```
```python
# -*- coding: utf-8 -*-
# pipelines file
class SpiderPipeline():
    def __init__(self):
        pass

    def process_item(self, item, spider_name):
        print(item)
```
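process_item receives every dict item yielded by the spider, so saving data just means writing the item out here. A sketch that appends each item to a local file (the file name result.txt is only an example, not part of the framework):

```python
# -*- coding: utf-8 -*-
# pipelines file -- sketch that writes each item to disk
import json


class SpiderPipeline():
    def process_item(self, item, spider_name):
        # append the item as one JSON line; "result.txt" is just an example path
        with open('result.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
        print(item)
```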
8 Start multiple crawlers
Currently, multiple crawler files can be started using multiple processes.
Create a test file, import the Spider classes from the two crawler files, and start them with multiprocessing.
```python
from Demo.demo import DemoSpider
from Demo.demo2 import DemoSpider2
import multiprocessing


def open_DemoSpider2():
    DemoSpider2.start()


def open_DemoSpider():
    DemoSpider.start()


if __name__ == "__main__":
    p1 = multiprocessing.Process(target=open_DemoSpider)
    p2 = multiprocessing.Process(target=open_DemoSpider2)
    p1.start()
    p2.start()
```
You are also welcome to visit the project on GitHub and give it a star!