Crawler practice notes: building and using an IP pool with requests and Scrapy (with a detailed explanation of Scrapy's proxy middleware)

Posted by developer on Mon, 02 Mar 2020 08:14:57 +0100

Get free proxy IP

In this part, I want to grab some free IPs from mainstream proxy websites for my own use. Since free IPs are much less reliable than paid proxies, I verify each proxy right after fetching it and save only the working ones locally. I also want to be able to update the IP list.

Required modules
import requests
from lxml import etree

import time
import datetime
import random

import os
from pathlib import Path

IP address detection

Next, we write the IPDetector class. Its methods fetch proxy IPs and save them locally, in files whose names record the source website and the date. (IPValidator is the validation class; it is shown in the next section.)

class IPDetector:
    """
    IP Address detection class
    
    """

    @staticmethod
    def detector_of_xicidaili():

        # Xici proxy (xicidaili) IP list page URL
        url = 'https://www.xicidaili.com/nn/'

        # Create a file stream
        fp = open(os.path.dirname(__file__) + '/IP_pool_cache/IP-xicidaili-' + str(datetime.date.today()) +
                  '.txt', 'w', encoding='utf-8')

        # Get the IP address of the first 9 pages
        for i in range(1, 10):

            # request
            with requests.get(url + str(i)) as response:

                # If the request fails, skip to the next page
                if response.status_code != 200:
                    continue

                # Parse to xml tree
                html = etree.HTML(response.content)

                # Traversal from the second tr tag
                j = 2
                while True:

                    # Stop when the row cannot be found
                    if not html.xpath('//*[@id="ip_list"]/tr[%d]/td[2]' % j):
                        break

                    ip = html.xpath('//*[@id="ip_list"]/tr[%d]/td[2]/text()' % j)[0]

                    port = html.xpath('//*[@id="ip_list"]/tr[%d]/td[3]/text()' % j)[0]

                    # Verify IP validity
                    if IPValidator.validate(ip, port):
                        fp.write(ip + ':' + port)
                        fp.write('\n')

                    j += 1

        # Close file stream
        fp.close()

    @staticmethod
    def detector_of_kuaidaili():

        # Kuaidaili proxy IP list page URL
        url = 'https://www.kuaidaili.com/free/inha/'

        # Create a file stream
        fp = open(os.path.dirname(__file__) + '/IP_pool_cache/IP-kuaidaili-' + str(datetime.date.today()) + '.txt', 'w',
                  encoding='utf-8')

        # Get the IP address of the first 4 pages
        for i in range(1, 5):

            # request
            with requests.get(url + str(i)) as response:

                # If the request fails, skip to the next page
                if response.status_code != 200:
                    continue

                html = etree.HTML(response.content)

                j = 1
                while True:

                    # Stop when the row cannot be found
                    if not html.xpath('//div[@id="list"]//tbody/tr[%d]/td[1]' % j):
                        break

                    ip = html.xpath('//div[@id="list"]//tbody/tr[%d]/td[1]/text()' % j)[0]

                    port = html.xpath('//div[@id="list"]//tbody/tr[%d]/td[2]/text()' % j)[0]

                    if IPValidator.validate(ip, port):
                        fp.write(ip + ':' + port)
                        fp.write('\n')
                    j += 1

            # Kuaidaili bans requests that come too close together, so sleep between pages
            time.sleep(random.randint(1, 5))

        # Close file stream
        fp.close()

This part of the code is fairly easy to follow. Two points deserve a beginner's attention. First, when building the XPath, delete the tbody element from the path copied out of the browser. Second, Kuaidaili blocks requests that arrive too close together, so a short random sleep between pages is enough to get around it.
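As a quick illustration of the tbody point, here is a minimal sketch with a made-up HTML snippet: the path copied from browser devtools contains a tbody node that lxml's parser typically does not insert, so it matches nothing, while the tbody-free path works.

from lxml import etree

# Hypothetical snippet standing in for the real page
html = etree.HTML('<table id="ip_list"><tr><td>1.2.3.4</td></tr></table>')

# Path as copied from browser devtools: no match, because lxml did not add a tbody
print(html.xpath('//*[@id="ip_list"]/tbody/tr[1]/td[1]/text()'))   # []

# Same path with tbody removed: works
print(html.xpath('//*[@id="ip_list"]/tr[1]/td[1]/text()'))         # ['1.2.3.4']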

IP validity test

In this section I write the IPValidator class, which checks whether a proxy IP is usable. The principle is simple: request Baidu (or a custom website) through the proxy and see whether a 200 status comes back.

class IPValidator:
    """
    IP Address validity test

    """

    '''
    Parameters are the IP address and port number.
    If you need a specific test website, set it with the domain parameter (Baidu by default).
    '''

    @staticmethod
    def validate(ip, port, domain='https://www.baidu.com'):

        ip_and_port = str(ip) + ":" + str(port)
        # Register the proxy for both schemes; otherwise an https test URL
        # would bypass the proxy entirely and the check would always pass
        proxies = {'http': 'http://' + ip_and_port,
                   'https': 'http://' + ip_and_port}

        try:
            response = requests.get(domain, proxies=proxies, timeout=3)
            if response.status_code == 200:
                return True

        except requests.RequestException:
            return False

        return False
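For a quick standalone check, you can call the validator directly; the address below is made up:

# Returns True only if the test site answers with 200 through this (hypothetical) proxy
print(IPValidator.validate('117.88.176.38', '3000'))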

Now you can call IPDetector.detector_of_xicidaili() to fetch the day's available IPs and save them locally.
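For example, a small refresh script can be as simple as the sketch below (it assumes the IP_pool_cache directory already exists next to the module):

if __name__ == '__main__':
    # Refresh today's cache files for both proxy sites
    IPDetector.detector_of_xicidaili()
    IPDetector.detector_of_kuaidaili()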

Get IPs from the local list

This part requires little crawler knowledge; it is mainly file reading and writing. The IPGetter class provides four methods that return either a single IP or a list of IPs, as 'host:port' strings or wrapped in requests-style proxies dictionaries.

class IPGetter:

    @staticmethod
    def get_an_ip():

        # agent_domain selects the provider cache file, e.g. 'xicidaili' or 'kuaidaili'
        # If there is an IP list obtained today, read from today's list
        try:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' / ('IP-' + str(agent_domain) + '-' +
                      str(datetime.date.today()) + '.txt'), 'r', encoding='utf-8')

        # Otherwise read from yesterday's IP list
        except IOError:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' / ('IP-' + str(agent_domain) + '-' +
                      str(datetime.date.today() - datetime.timedelta(days=1)) + '.txt'), 'r', encoding='utf-8')

        # Read from file to list
        ip_list = fp.readlines()

        # Not available if list length is 0, read from alternate list
        if len(ip_list) == 0:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' / 'IP-alternate.txt', 'r', encoding='utf-8')
            ip_list = fp.readlines()

        # Close file stream
        fp.close()

        # Return a random IP (strip the trailing newline left by readlines())
        return random.choice(ip_list).strip()

    @staticmethod
    def get_ip_list():

        # If there is an IP list obtained today, read it from today's list
        try:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' / ('IP-' + str(agent_domain) + '-' +
                      str(datetime.date.today()) + '.txt'), 'r', encoding='utf-8')

        # Otherwise, read from yesterday's IP list
        except IOError:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' / ('IP-' + str(agent_domain) + '-' +
                      str(datetime.date.today() - datetime.timedelta(days=1)) + '.txt'), 'r', encoding='utf-8')

        # Read from file to list
        ip_list = fp.readlines()

        # Not available if list length is 0, read from alternate list
        if len(ip_list) == 0:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' / 'IP-alternate.txt', 'r', encoding='utf-8')
            ip_list = fp.readlines()

        # Close file stream
        fp.close()

        # Return the IP list (strip trailing newlines)
        return [line.strip() for line in ip_list]

    @staticmethod
    def get_a_proxy():
        return {'http': IPGetter.get_an_ip()}

    @staticmethod
    def get_proxy_list():
        return [{'http': i} for i in IPGetter.get_ip_list()]
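To make the return shapes concrete, here is roughly what the four methods give back, with made-up addresses:

ip = IPGetter.get_an_ip()               # e.g. '117.88.176.38:3000'
proxy = IPGetter.get_a_proxy()          # e.g. {'http': '117.88.176.38:3000'}

ip_list = IPGetter.get_ip_list()        # e.g. ['117.88.176.38:3000', '61.128.208.94:3128', ...]
proxy_list = IPGetter.get_proxy_list()  # e.g. [{'http': '117.88.176.38:3000'}, ...]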

Because the crawler has to run on different operating systems, this part of the code uses the pathlib library, mainly to smooth over differences in path formats between systems.
Now just import this class and call its methods to use a proxy IP.

Usage example under requests

# Import IPGetter from the file where the classes above are saved
# (ip_pool is a placeholder module name; use your own)
from ip_pool import IPGetter

response = requests.get(domain, proxies=IPGetter.get_a_proxy())
Usage guide under Scrapy

We use middleware to handle proxies in Scrapy. First, let's look at Scrapy's built-in proxy middleware, HttpProxyMiddleware.
Scrapy's native HttpProxyMiddleware supports three environment variables, http_proxy, https_proxy and no_proxy, for configuring proxy IPs. But if we want every request to go out through a different IP to disguise the crawler, that approach is hard to work with.
That is why the last paragraph of the HttpProxyMiddleware documentation describes setting the proxy meta key on the spider's requests.

yield Request(url=page, callback=self.parse, meta={'proxy': 'http://' + IPGetter.get_an_ip()})

The IPGetter here is the IPGetter written above.
But written this way we would have to modify every parse function, so instead we customize a proxy middleware: open middlewares.py and create one.

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Attach a fresh proxy IP to every outgoing request
        request.meta['proxy'] = 'http://' + IPGetter.get_an_ip()

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Open settings.py, enable the custom proxy middleware, disable the built-in one, and give the custom middleware a priority close to that of the built-in proxy middleware (the built-in priorities can be looked up in the Scrapy documentation).

DOWNLOADER_MIDDLEWARES = {
    # Turn off the default proxy middleware and replace it with your own
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'digikey_spider.middlewares.ProxyMiddleware': 551,
}

In this way, every request in Scrapy can go out with a different IP.
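As a sanity check, a minimal spider like the hypothetical one below needs no proxy handling of its own; the custom middleware fills in request.meta['proxy'] for every request it schedules:

import scrapy


class DemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the middleware
    name = 'demo'
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        # With a working proxy, the echoed origin should be the proxy's IP, not yours
        self.logger.info(response.text)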

Supplementary notes on Scrapy's proxy middleware

The approach above is the one I find most convenient, but since we have come this far, we might as well dig a little deeper. If you are a bit of a completist and would rather solve this with Scrapy's native proxy middleware, this section walks through its source code as well. If you don't want to read the source analysis, skip straight to the conclusion at the end.

Let's look at the scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware class, starting with the constructor.

    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = {}
        for type_, url in getproxies().items():
            self.proxies[type_] = self._get_proxy(url, type_)

Let's focus on the proxies attribute, a dictionary filled in by the loop in the constructor. The getproxies() function it iterates over comes from the urllib.request module; its environment-variable-based implementation, getproxies_environment(), is shown below.

# Proxy handling
def getproxies_environment():
    """Return a dictionary of scheme -> proxy server URL mappings.

    Scan the environment for variables named <scheme>_proxy;
    this seems to be the standard convention.  If you need a
    different way, you can pass a proxies dictionary to the
    [Fancy]URLopener constructor.

    """
    proxies = {}
    # in order to prefer lowercase variables, process environment in
    # two passes: first matches any, second pass matches lowercase only
    for name, value in os.environ.items():
        name = name.lower()
        if value and name[-6:] == '_proxy':
            proxies[name[:-6]] = value
    # CVE-2016-1000110 - If we are running as CGI script, forget HTTP_PROXY
    # (non-all-lowercase) as it may be set from the web server by a "Proxy:"
    # header from the client
    # If "proxy" is lowercase, it will still be used thanks to the next block
    if 'REQUEST_METHOD' in os.environ:
        proxies.pop('http', None)
    for name, value in os.environ.items():
        if name[-6:] == '_proxy':
            name = name.lower()
            if value:
                proxies[name[:-6]] = value
            else:
                proxies.pop(name[:-6], None)
    return proxies

As you can see, this function scans the environment variables for names whose last six characters are '_proxy' (case does not matter, with lowercase names preferred) and whose values are non-empty. For each match it strips the '_proxy' suffix, uses what remains as the dictionary key, stores the value, and finally returns the dictionary.

For example, suppose the environment contains the following key-value pairs:
http_proxy:0.0.0.0:0000, https_proxy:1.1.1.1:1111, aa:2.2.2.2:2222.

Then getproxies_environment() returns the following dictionary:
{'http': '0.0.0.0:0000', 'https': '1.1.1.1:1111'}
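You can check this behaviour directly in an interpreter; the sketch below assumes no other *_proxy variables are already set in your environment:

import os
from urllib.request import getproxies_environment

os.environ['http_proxy'] = '0.0.0.0:0000'
os.environ['https_proxy'] = '1.1.1.1:1111'
os.environ['aa'] = '2.2.2.2:2222'          # ignored: the name does not end with '_proxy'

print(getproxies_environment())            # {'http': '0.0.0.0:0000', 'https': '1.1.1.1:1111'}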

Now go back to the constructor of scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware. The loop parses each environment variable it read into a proxy type plus a (credentials, proxy URL) pair via the _get_proxy(url, type_) method, and stores the result in self.proxies.
Next, the process_request() method. It first checks whether the spider's request already carries a proxy meta key; our custom proxy middleware uses exactly this meta-key mechanism. If the meta key is present, that proxy is used for the request directly; if not, the proxy found in the environment variables is used.
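As a rough sketch of that logic (not the verbatim Scrapy source, which also handles proxy credentials and the no_proxy list, and differs a little between versions):

    def process_request(self, request, spider):
        # A proxy already chosen by the spider or another middleware wins
        if 'proxy' in request.meta:
            return

        # Otherwise fall back to the proxies read from the environment in
        # __init__, keyed by the scheme of the URL being requested
        scheme = urlparse_cached(request).scheme
        if scheme in self.proxies:
            self._set_proxy(request, scheme)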
Now let's look at the _set_proxy() method.

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds

So in the end it still sets the proxy IP through the meta key anyway.

So if you want to use the native middleware to solve the proxy problem, set http_proxy in the environment before the crawler starts. Note, though, that the middleware reads the environment variables only once, in its constructor, so changing them at runtime will not rotate the proxy per request.

import os

# Must be set before Scrapy instantiates the middleware, e.g. in settings.py
os.environ['http_proxy'] = 'proxy IP address'

...all things considered, it is easier to just set the meta key directly in the request.

