Asyncpy usage documentation

Posted by xnor82 on Thu, 11 Jun 2020 05:39:52 +0200

1 Create a project

Install the required environment: the Python version must be >= 3.6.

Installation command:

pip install asyncpy

After the installation is complete, you can start creating a crawler project.

Create project command:

asyncpy genspider demo

Create a project called demo.

After the project is created successfully, open the project folder. The project structure looks like the following:
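The screenshot of the generated structure is not reproduced here. As a rough sketch (the file names are inferred from the files referenced later in this article, so treat this as an assumption rather than the exact output):

demo
├── demo.py         # the spider file
├── settings.py     # global configuration
├── middlewares.py  # request middleware
└── pipelines.py    # item pipelines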

2 Send GET and POST requests

2.1 Sending GET requests with start_urls

Add a link to the start_urls list.
In parse, print the response status code and content.

# -*- coding: utf-8 -*-

from asyncpy.spider import Spider
import settings


class DemoSpider(Spider):
    name = 'demo'
    settings_attr = settings

    start_urls = ['http://httpbin.org/get']

    async def parse(self, response):
        print(response.status)
        print(response.text)

DemoSpider.start()

Right-click and run the file to send the request.

2.2 Sending POST requests with start_requests

Import Asyncpy's Request class, clear start_urls, and then override the start_requests method to send a POST request.

from asyncpy.spider import Spider
import settings
from asyncpy.spider import Request

class DemoSpider(Spider):
    name = 'demo'
    settings_attr = settings

    start_urls = []

    async def start_requests(self):
        url = 'http://httpbin.org/post'
        yield Request(callback=self.parse,url=url,method="POST",data={"Say":"Hello Asyncpy"})

    async def parse(self, response):
        print(response.status)
        print(response.text)

DemoSpider.start()

Response results:
In the response you can see the parameters submitted by our POST.
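The response screenshot is not reproduced here. Since httpbin.org/post simply echoes the request, the submitted form data should appear in the response body roughly like this (other fields such as headers and origin are omitted):

{
  "form": {
    "Say": "Hello Asyncpy"
  },
  ...
}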

3 Custom request headers

Here we take modifying the User-Agent request header as an example. From the previous response you can see that the current User-Agent is aiohttp's default one.

3.1 Setting request headers in settings

Open the settings file and find the USER_AGENT parameter at the bottom. Uncomment it and fill in a browser UA.
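A minimal sketch of what the uncommented line in settings.py might look like (the UA string is just the example used later in this article):

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36"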

3.2 Adding request headers in middlewares

Open the middlewares file and find the UserAgentMiddleware function (provided by default), or define your own.

# -*- coding: utf-8 -*-
from asyncpy.middleware import Middleware
from asyncpy.request import Request
from asyncpy.spider import Spider

middleware = Middleware()

@middleware.request
async def UserAgentMiddleware(spider:Spider, request: Request):
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36"
    request.headers.update({"User-Agent": ua})

Then go to the spider file (demo.py), import the middleware object from the middlewares file, and pass it to the start() method.

# -*- coding: utf-8 -*-

from asyncpy.spider import Spider
import settings
from asyncpy.spider import Request
from middlewares import middleware

class DemoSpider(Spider):
    name = 'demo'
    settings_attr = settings

    start_urls = []

    async def start_requests(self):
        url = 'http://httpbin.org/post'
        yield Request(callback=self.parse,url=url,method="POST",data={"Say":"Hello Asyncpy"})

    async def parse(self, response):
        print(response.text)

DemoSpider.start(middleware=middleware)

Run demo.py and you can see that the User-Agent has been changed to our custom UA.

3.3 Adding a proxy IP

Similar to 3.2, open the middlewares file and add the proxy inside the middleware function (you can also define a new function).
Note that the proxy is added through aiohttp_kwargs. Remember to pass middleware to the start() method.

@middleware.request
async def UserAgentMiddleware(spider:Spider, request: Request):
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36"
    request.headers.update({"User-Agent": ua})
    request.aiohttp_kwargs.update({"proxy": "http://49.85.98.209:4253"})

Run demo.py and you can see that the request IP has been changed to our custom proxy. (The proxy IP in this example has expired.)

4 Modify concurrency, delay, retries, and other configurations

4.1 Modify the configuration in settings

The settings file supports the following configuration options, which you can modify as needed.

"""
CREATE YOUR DEFAULT_CONFIG !

Some configuration:
        CONCURRENT_REQUESTS     Concurrent quantity
        RETRIES                 retry count
        DOWNLOAD_DELAY          Download delay
        RETRY_DELAY             Retry delay
        DOWNLOAD_TIMEOUT        Timeout limit
        USER_AGENT              user agent 
        LOG_FILE                Log path
        LOG_LEVEL               Log level
"""

4.2 Modify the configuration of a specific crawler file

If different crawler files need different configurations, you can define custom_settings in the crawler file.
You also need to pass custom_settings into the yielded Request for the custom configuration to take effect.

# -*- coding: utf-8 -*-

from asyncpy.spider import Spider
import settings
from asyncpy.spider import Request
from middlewares import middleware

class DemoSpider(Spider):
    name = 'demo'
    settings_attr = settings
    custom_settings = {
        "DOWNLOAD_TIMEOUT":60,
        "RETRIES":3
    }

    start_urls = []

    async def start_requests(self):
        url = 'http://httpbin.org/post'
        yield Request(callback=self.parse,url=url,
                      method="POST",
                      data={"Say":"Hello Asyncpy"},
                      custom_settings=self.custom_settings
                      )

    async def parse(self, response):
        print(response.text)

DemoSpider.start(middleware=middleware)

5 Generate log files

5.1 Modify the settings configuration

# generate log file
# LOG_FILE = '../asyncpy.log'
# LOG_LEVEL = 'DEBUG'

A global log can be configured by uncommenting these lines in the settings file.

5.2 Specify log files per crawler

    custom_settings = {
        "LOG_FILE" : "../asyncpy.log"
    }

This works the same way as above. To give a specific crawler file its own log, delete the log configuration from settings and configure it through custom_settings instead. If LOG_LEVEL is not specified, the log level defaults to INFO.
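As a hedged sketch (both option names appear in the settings docstring in section 4.1), a per-crawler configuration that sets the log file and the log level together might look like:

    custom_settings = {
        "LOG_FILE": "../asyncpy.log",
        "LOG_LEVEL": "DEBUG"
    }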

6 Parse the response and extract data

Asyncpy uses the parsel parsing module (introduced in the overview article), so the default parsing methods are the same as in Scrapy.
You can also choose other parsing methods yourself.

    async def parse(self, response):
        print(response.text)
        print(response.xpath('//text()'))
        print(response.css('title'))

6.1 response.text

Returns the text content of the page, which can then be matched with the re module.
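For example, a minimal sketch that matches the raw HTML with the re module (add `import re` at the top of the spider file; the pattern is only an illustration):

    async def parse(self, response):
        # pull the page title out of the raw HTML with a regular expression
        match = re.search(r'<title>(.*?)</title>', response.text)
        print(match.group(1) if match else None)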

6.2 response.xpath('')

  • getall(): returns a list of strings (see the example below)
  • get(): returns a single string, the first one in the list
  • extract() is equivalent to getall()
  • extract_first() is equivalent to get()
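A minimal sketch of the difference (assuming, as in the snippet above, that the response exposes parsel-style selectors):

    async def parse(self, response):
        # get() / extract_first(): the first match as a single string
        title = response.xpath('//title/text()').get()
        # getall() / extract(): every match as a list of strings
        paragraphs = response.xpath('//p/text()').getall()
        print(title, paragraphs)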

6.3 response.css(), response.re()

These correspond to parsel's CSS selectors and parsel.re; see the parsel documentation for details on how to use them.
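A short sketch of both (the selector and the pattern are only illustrative):

    async def parse(self, response):
        # CSS selector, chained the same way as in parsel / Scrapy
        title = response.css('title::text').get()
        # regular-expression extraction over the response
        links = response.re(r'href="(.*?)"')
        print(title, links)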

7 Save data with pipelines

A callback checks whether the yielded item is of dict type; if it is, the pipelines are used to save the data.

    • First define an item, then yield it in the callback.
    • The project's pipelines file contains a SpiderPipeline class by default. Import SpiderPipeline and pass it to start() to enable the pipeline.
# -*- coding: utf-8 -*-
# Crawler file
from asyncpy.spider import Spider
import settings
from asyncpy.spider import Request
from middlewares import middleware
from pipelines import SpiderPipeline

class DemoSpider(Spider):
    name = 'demo'
    settings_attr = settings
    start_urls = []

    async def start_requests(self):
        url = 'http://httpbin.org/post'
        yield Request(callback=self.parse, url=url,
                      method="POST",
                      data={"Say": "Hello Asyncpy"}
                      )

    async def parse(self, response):
        item = {}
        item['text'] = response.text
        yield item

DemoSpider.start(middleware=middleware, pipelines=SpiderPipeline)

# -*- coding: utf-8 -*-
# pipelines file
class SpiderPipeline():
    def __init__(self):
        pass

    def process_item(self, item, spider_name):
        print(item)
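The default pipeline above only prints each item. As a hedged sketch using the same process_item interface (the file name and format are just examples, and no shutdown hook is shown because the article does not describe one), a pipeline that appends items to a JSON-lines file could look like:

# -*- coding: utf-8 -*-
# example pipelines file
import json

class SaveToFilePipeline():
    def __init__(self):
        # append mode, so repeated runs keep adding to the same file
        self.file = open('items.jsonl', 'a', encoding='utf-8')

    def process_item(self, item, spider_name):
        # one JSON object per line; flush so data is not lost if the process is killed
        self.file.write(json.dumps(item, ensure_ascii=False) + '\n')
        self.file.flush()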

8 Start multiple crawlers

Currently, multiple crawler files can be started using multiple processes.
Create a test file, import the spiders from the two crawler files, and start them with multiprocessing.

from Demo.demo import DemoSpider
from Demo.demo2 import DemoSpider2
import multiprocessing

def open_DemoSpider2():
    DemoSpider2.start()

def open_DemoSpider():
    DemoSpider.start()

if __name__ == "__main__":
    p1 = multiprocessing.Process(target=open_DemoSpider)
    p2 = multiprocessing.Process(target=open_DemoSpider2)
    p1.start()
    p2.start()

You can also visit the project on GitHub and give it a star!

Link: https://github.com/lixi5338619/asyncpy

Topics: Python Windows github pip