Getting free proxy IPs
In this part, I want to collect free IPs from mainstream proxy websites for my own use. Because free IPs are far less reliable than paid private proxies, I want to verify each proxy's availability after fetching it and save only the usable IPs locally. I also want to be able to refresh the IP list.
Required modules
import requests
from lxml import etree
import time
import datetime
import random
import os
from pathlib import Path
IP address detection
Next, we write the IPDetector class. Its methods scrape proxy IPs from the listing sites and save them to local files whose names identify the source website and the date, so they can be found again later. (IPValidator is the validation class, which is shown in the next section.)
class IPDetector:
    """IP address detection class"""

    @staticmethod
    def detector_of_xicidaili():
        # Xici proxy (xicidaili) IP list page URL
        url = 'https://www.xicidaili.com/nn/'
        # Create a file stream
        fp = open(os.path.dirname(__file__) + '/IP_pool_cache/IP-xicidaili-'
                  + str(datetime.date.today()) + '.txt', 'w', encoding='utf-8')
        # Get the IP addresses on the first 9 pages
        for i in range(1, 10):
            # Request the page
            with requests.get(url + str(i)) as response:
                # If the request fails, skip to the next page
                if response.status_code != 200:
                    continue
                # Parse the response into an element tree
                html = etree.HTML(response.content)
                # Traverse the rows starting from the second tr tag
                j = 2
                while True:
                    # Stop when the element can no longer be found
                    if not html.xpath('//*[@id="ip_list"]/tr[%d]/td[2]' % j):
                        break
                    ip = html.xpath('//*[@id="ip_list"]/tr[%d]/td[2]/text()' % j)[0]
                    port = html.xpath('//*[@id="ip_list"]/tr[%d]/td[3]/text()' % j)[0]
                    # Verify IP validity
                    if IPValidator.validate(ip, port):
                        fp.write(ip + ':' + port)
                        fp.write('\n')
                    j += 1
        # Close the file stream
        fp.close()

    @staticmethod
    def detector_of_kuaidaili():
        # Kuaidaili proxy IP list page URL
        url = 'https://www.kuaidaili.com/free/inha/'
        # Create a file stream
        fp = open(os.path.dirname(__file__) + '/IP_pool_cache/IP-kuaidaili-'
                  + str(datetime.date.today()) + '.txt', 'w', encoding='utf-8')
        # Get the IP addresses on the first 4 pages
        for i in range(1, 5):
            # Request the page
            with requests.get(url + str(i)) as response:
                # If the request fails, skip to the next page
                if response.status_code != 200:
                    continue
                html = etree.HTML(response.content)
                j = 1
                while True:
                    # Stop when the row can no longer be found
                    if not html.xpath('//div[@id="list"]//tbody/tr[%d]/td[1]' % j):
                        break
                    ip = html.xpath('//div[@id="list"]//tbody/tr[%d]/td[1]/text()' % j)[0]
                    port = html.xpath('//div[@id="list"]//tbody/tr[%d]/td[2]/text()' % j)[0]
                    if IPValidator.validate(ip, port):
                        fp.write(ip + ':' + port)
                        fp.write('\n')
                    j += 1
            # Sleep to get past kuaidaili's request-rate detection
            time.sleep(random.randint(1, 5))
        # Close the file stream
        fp.close()
This part of the code is relatively easy to follow. For beginners, there are two points worth noting. The first is that the tbody element must be deleted from the XPath when you copy it from the browser. The second is that kuaidaili temporarily bans clients whose requests come at too short an interval, so sleeping for a random number of seconds between pages is enough to deal with it.
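To illustrate the first point, the XPath copied from the browser's developer tools usually contains a tbody element that the raw HTML fetched by requests does not have (browsers insert it), so it must be removed before lxml can match anything. A hypothetical example:

# XPath as copied from the browser's developer tools -- will not match the raw HTML
copied_xpath = '//*[@id="ip_list"]/tbody/tr[2]/td[2]/text()'
# Same XPath with tbody removed -- this is the form the code above uses
fixed_xpath = '//*[@id="ip_list"]/tr[2]/td[2]/text()'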
IP validity test
In this section I write the IPValidator class, which checks whether a proxy IP is usable. The principle is very simple: request Baidu (or a website of your choice) through the proxy and see whether it returns a 200 status code.
class IPValidator:
    """
    IP address validity test

    The parameters are the IP address and the port number.
    If you need to specify the test website, set it in the domain
    parameter; it defaults to Baidu.
    """

    @staticmethod
    def validate(ip, port, domain='https://www.baidu.com'):
        ip_and_port = str(ip) + ":" + str(port)
        proxies = {'http': 'http://' + ip_and_port}
        try:
            response = requests.get(domain, proxies=proxies, timeout=3)
            if response.status_code == 200:
                return True
        except:
            return False
        return False
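For example, a quick check of a single proxy might look like this (the address below is made up):

# Check against the default test site (Baidu)
IPValidator.validate('1.2.3.4', 8080)
# Or against a site of your own choosing
IPValidator.validate('1.2.3.4', 8080, domain='https://www.example.com')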
Now you can call IPDetector.detector_of_xicidaili() to fetch the available IPs for the day and save them locally.
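For example, a small refresh script can simply call both detectors (a minimal sketch):

# Refresh today's IP lists from both sources
IPDetector.detector_of_xicidaili()
IPDetector.detector_of_kuaidaili()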
Getting IPs from the local list
This part does not require much crawler knowledge; it is mainly file reading and writing. The IPGetter class provides four methods that return IPs either as 'host:port' strings or wrapped in proxies dictionaries ready to pass to requests.
# Which site's cached list to read (the original defines this elsewhere),
# e.g. 'xicidaili' or 'kuaidaili'
agent_domain = 'xicidaili'


class IPGetter:

    @staticmethod
    def get_an_ip():
        # If there is an IP list obtained today, read from today's list
        try:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' /
                      ('IP-' + str(agent_domain) + '-' +
                       str(datetime.date.today()) + '.txt'),
                      'r', encoding='utf-8')
        # Otherwise read from yesterday's IP list
        except IOError:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' /
                      ('IP-' + str(agent_domain) + '-' +
                       str(datetime.date.today() - datetime.timedelta(days=1)) + '.txt'),
                      'r', encoding='utf-8')
        # Read the file into a list (stripping trailing newlines)
        ip_list = [line.strip() for line in fp.readlines()]
        # If the list is empty, read from the alternate list
        if len(ip_list) == 0:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' / 'IP-alternate.txt',
                      'r', encoding='utf-8')
            ip_list = [line.strip() for line in fp.readlines()]
        # Close the file stream
        fp.close()
        # Return a random IP
        return random.sample(ip_list, 1)[0]

    @staticmethod
    def get_ip_list():
        # If there is an IP list obtained today, read from today's list
        try:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' /
                      ('IP-' + str(agent_domain) + '-' +
                       str(datetime.date.today()) + '.txt'),
                      'r', encoding='utf-8')
        # Otherwise read from yesterday's IP list
        except IOError:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' /
                      ('IP-' + str(agent_domain) + '-' +
                       str(datetime.date.today() - datetime.timedelta(days=1)) + '.txt'),
                      'r', encoding='utf-8')
        # Read the file into a list (stripping trailing newlines)
        ip_list = [line.strip() for line in fp.readlines()]
        # If the list is empty, read from the alternate list
        if len(ip_list) == 0:
            fp = open(Path(os.path.dirname(__file__)) / 'IP_pool_cache' / 'IP-alternate.txt',
                      'r', encoding='utf-8')
            ip_list = [line.strip() for line in fp.readlines()]
        # Close the file stream
        fp.close()
        # Return the full IP list
        return ip_list

    @staticmethod
    def get_a_proxy():
        return {'http': IPGetter.get_an_ip()}

    @staticmethod
    def get_proxy_list():
        return [{'http': i} for i in IPGetter.get_ip_list()]
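A quick look at what each method returns (the values are illustrative):

IPGetter.get_an_ip()       # e.g. '1.2.3.4:8080'
IPGetter.get_ip_list()     # e.g. ['1.2.3.4:8080', '5.6.7.8:3128', ...]
IPGetter.get_a_proxy()     # e.g. {'http': '1.2.3.4:8080'}
IPGetter.get_proxy_list()  # e.g. [{'http': '1.2.3.4:8080'}, ...]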
Because the crawler needs to run on different operating systems, this part of the code uses the pathlib library, mainly to handle the different path formats of each system.
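A minimal illustration of what pathlib buys us here: the / operator joins path components and produces the correct separator on each operating system.

from pathlib import Path
import os

# The same expression works on Windows, Linux and macOS
cache_file = Path(os.path.dirname(__file__)) / 'IP_pool_cache' / 'IP-alternate.txt'
print(cache_file)  # backslashes on Windows, forward slashes elsewhere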
Now you just need to import this class and call its methods to use a proxy IP.
Usage example with requests
# Import IPGetter from the file in which the classes above were saved
# ("ip_pool" here is a placeholder for that file's name)
from ip_pool import IPGetter

# domain is the URL you want to crawl
response = requests.get(domain, proxies=IPGetter.get_a_proxy())
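Because free proxies fail often, it can be worth wrapping the call in a small retry loop that picks a fresh proxy on each failure. This is only a sketch, and ip_pool is again a placeholder for your own module name:

import requests
from ip_pool import IPGetter  # placeholder module name

def get_with_retry(url, attempts=3):
    # Try the request through up to `attempts` random free proxies
    for _ in range(attempts):
        try:
            response = requests.get(url, proxies=IPGetter.get_a_proxy(), timeout=5)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass  # this proxy failed, try another one
    return None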
Usage guide for Scrapy
In Scrapy, we use middleware to handle proxies. First, let's look at Scrapy's built-in proxy middleware, HttpProxyMiddleware.
Scrapy's native HttpProxyMiddleware supports setting three environment variables, http_proxy, https_proxy, and no_proxy, to choose the proxy IP. But if we want to use a different IP for every request to disguise the crawler, this approach is hard to work with.
That is why the last paragraph of the HttpProxyMiddleware documentation describes setting the proxy meta key on the spider's requests, for example:
yield Request(url=page, callback=self.parse, meta={'proxy': 'http://' + IPGetter.get_an_ip()})
The IPGetter here is the IPGetter written above.
But this way we would have to modify every request in every parse function, so instead we write a custom proxy middleware. Open middlewares.py and create one:
class ProxyMiddleware(object):

    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://' + IPGetter.get_an_ip()

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
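Note that spider_opened only runs if it is connected to the corresponding signal. The middleware template that Scrapy generates wires this up in a from_crawler classmethod, roughly as follows (add it to the class above if you want the log message):

from scrapy import signals

class ProxyMiddleware(object):
    # ... process_request and spider_opened as above ...

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        # Run spider_opened when Scrapy fires the spider_opened signal
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s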
Open settings.py, enable the custom proxy middleware, disable the native proxy middleware, and give the custom middleware the priority that the native proxy middleware had. The native middleware's priority can be looked up in the Scrapy documentation.
DOWNLOADER_MIDDLEWARES = {
    # Turn off the default proxy middleware and replace it with our own
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'digikey_spider.middlewares.ProxyMiddleware': 551,
}
In this way, every request in Scrapy can use a different IP.
Supplementary notes on Scrapy's proxy middleware
The approach above is the more convenient one for me, but since we have come this far, we might as well dig a little deeper. If you insist on using Scrapy's native proxy middleware to solve this problem, we can analyze its source code along the way. If you don't care about the source code analysis, you can skip straight to the conclusion.
Let's take a look at the scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware class.
Let's start with the constructor.
def __init__(self, auth_encoding='latin-1'):
    self.auth_encoding = auth_encoding
    self.proxies = {}
    for type_, url in getproxies().items():
        self.proxies[type_] = self._get_proxy(url, type_)
Let's focus on the proxies attribute, a dictionary filled in by the loop statement in the constructor. The getproxies() function it calls comes from the urllib.request module:
# Proxy handling
def getproxies_environment():
    """Return a dictionary of scheme -> proxy server URL mappings.

    Scan the environment for variables named <scheme>_proxy;
    this seems to be the standard convention.  If you need a
    different way, you can pass a proxies dictionary to the
    [Fancy]URLopener constructor.
    """
    proxies = {}
    # in order to prefer lowercase variables, process environment in
    # two passes: first matches any, second pass matches lowercase only
    for name, value in os.environ.items():
        name = name.lower()
        if value and name[-6:] == '_proxy':
            proxies[name[:-6]] = value
    # CVE-2016-1000110 - If we are running as CGI script, forget HTTP_PROXY
    # (non-all-lowercase) as it may be set from the web server by a "Proxy:"
    # header from the client
    # If "proxy" is lowercase, it will still be used thanks to the next block
    if 'REQUEST_METHOD' in os.environ:
        proxies.pop('http', None)
    for name, value in os.environ.items():
        if name[-6:] == '_proxy':
            name = name.lower()
            if value:
                proxies[name[:-6]] = value
            else:
                proxies.pop(name[:-6], None)
    return proxies
You can see that this function reads key/value pairs from the environment variables, looking for variables whose names end in "_proxy" (case-insensitive) and whose values are non-empty. For each variable found, the part of the name before "_proxy" becomes the dictionary key, the variable's value becomes the dictionary value, and the resulting dictionary is returned.
For example, suppose the following key/value pairs exist in the environment variables:
http_proxy=0.0.0.0:0000, https_proxy=1.1.1.1:1111, aa=2.2.2.2:2222.
Then the getproxies_environment() function returns the following dictionary:
{'http': '0.0.0.0:0000', 'https': '1.1.1.1:1111'}
Now go back to the constructor of scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware. We can see that the loop statement uses the _get_proxy(url, type_) method to parse the environment variables it has read into key/value pairs of proxy type and address, and saves them in self.proxies.
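Continuing the environment-variable example from above, and assuming no credentials are embedded in the URLs, self.proxies would end up holding something like the following (an illustration, not output copied from Scrapy):

# Each value is a (credentials, proxy URL) tuple; credentials are None here
proxies = {
    'http':  (None, 'http://0.0.0.0:0000'),
    'https': (None, 'https://1.1.1.1:1111'),
}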
Next, let's look at the process_request() method. It first checks whether there is a proxy in the meta keys of the spider's request; our custom proxy middleware works precisely by setting this meta key. If the proxy meta key is present, that proxy is used for the request directly; if not, the proxy found in the environment variables is used.
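A simplified sketch of that check order (a paraphrase, not the exact Scrapy source):

from scrapy.utils.httpobj import urlparse_cached

def process_request(self, request, spider):
    # 1. A proxy already present in the request's meta keys takes precedence
    if 'proxy' in request.meta:
        return
    # 2. Otherwise fall back to the proxy read from the environment variables
    scheme = urlparse_cached(request).scheme
    if scheme in self.proxies:
        self._set_proxy(request, scheme)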
But let's take another look at the _set_proxy() method.
def _set_proxy(self, request, scheme):
    creds, proxy = self.proxies[scheme]
    request.meta['proxy'] = proxy
    if creds:
        request.headers['Proxy-Authorization'] = b'Basic ' + creds
You will find that, in the end, it still sets the proxy IP through the meta key. Ha ha.
So if you want to solve the proxy problem with the native middleware, you just set the http_proxy environment variable and change it before every request.
import os
os.environ['http_proxy'] = 'proxy IP address'
... Isn't it better to just set the meta key directly on the request?