Previously, we noted that a common anti-crawling measure is to detect the requesting IP and limit its access frequency. We can work around this limit by routing requests through proxy IPs. Many websites offer free proxy IPs, for example https://www.xicidaili.com/nt/ , and we can collect a large number of them there. However, only a small fraction of these IPs actually work.
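To make the idea concrete, this is roughly how a proxy is passed to requests; the address below is only a placeholder for illustration, not a real proxy:

import requests

# Route the request through a proxy instead of the local IP.
# '123.45.67.89:8080' is a placeholder address, not a working proxy.
proxies = {'http': '123.45.67.89:8080'}
r = requests.get('http://example.com', proxies=proxies, timeout=10)
print(r.status_code)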
We can parse the page with BeautifulSoup and then extract the proxy IP list from it, or match the entries with regular expressions. Regular expressions are faster, so that is what the code below uses. In the code, ip_url is https://www.xicidaili.com/nt/ and random_header is a function that returns a randomly chosen request header.
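The article never shows random_header itself; here is a minimal sketch of what such a function might look like, with purely illustrative User-Agent strings:

import random

def random_header():
    # Return a randomly chosen request header so consecutive requests do not all look identical.
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
        'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0',
    ]
    return {'User-Agent': random.choice(user_agents)}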
import re
import time
import requests

def download_page(url):
    # Fetch a page using a randomly chosen request header.
    headers = random_header()
    data = requests.get(url, headers=headers)
    return data

def get_proxies(page_num, ip_url):
    available_ip = []
    for page in range(1, page_num):
        print('Grabbing page %d of proxy IPs' % page)
        url = ip_url + str(page)
        r = download_page(url)
        r.encoding = 'utf-8'
        # Each table row holds the proxy's IP and port in consecutive <td> cells.
        pattern = re.compile('<td class="country">.*?alt="Cn" />.*?</td>.*?<td>(.*?)</td>.*?<td>(.*?)</td>', re.S)
        ip_list = re.findall(pattern, r.text)
        for ip in ip_list:
            if test_ip(ip):
                print('%s:%s passed the test, added to the list of available proxies' % (ip[0], ip[1]))
                available_ip.append(ip)
        time.sleep(10)
    print('Grabbing finished')
    return available_ip
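For illustration, the scraper can then be invoked like this; page_num=3 is just an example value, and the page number is appended to ip_url exactly as the function expects:

ip_url = 'https://www.xicidaili.com/nt/'
# range(1, page_num) means page_num=3 scrapes pages 1 and 2.
available_ip = get_proxies(3, ip_url)
print('Collected %d usable proxies' % len(available_ip))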
After collecting the IPs, we still need to test each one to make sure it actually works. How? Use the proxy IP to request a website that echoes back the visitor's IP, then inspect the result of the request.
def test_ip(ip, test_url='http://ip.tool.chinaz.com/'):
    proxies = {'http': ip[0] + ':' + ip[1]}
    try_ip = ip[0]
    try:
        r = requests.get(test_url, headers=random_header(), proxies=proxies)
        if r.status_code == 200:
            r.encoding = 'gbk'
            # The page echoes the IP it saw; pull it out and compare it with the proxy IP.
            result = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', r.text)
            result = result.group()
            print(result)
            if result[:9] == try_ip[:9]:
                print('%s:%s passed the test' % (ip[0], ip[1]))
                return True
            else:
                print('%s:%s proxy did not take effect, the local IP was used instead' % (ip[0], ip[1]))
                return False
        else:
            print('%s:%s status code is not 200' % (ip[0], ip[1]))
            return False
    except Exception as e:
        print(e)
        print('%s:%s error' % (ip[0], ip[1]))
        return False
Some tutorials treat an HTTP 200 status code alone as success. That is wrong: when the proxy fails, the request can fall back to your own IP, and access with your own IP will of course succeed. That is why the code above also compares the echoed IP with the proxy IP.
Finally, test an IP again right before using it, because you never know when it will stop working. For the same reason, it is worth storing more proxy IPs than you think you need, so you are not left without one when it matters.
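A minimal sketch of that re-check, assuming available_ip is the list returned by get_proxies (get_working_proxy is a helper name of my own):

def get_working_proxy(available_ip):
    # Re-test stored proxies right before use and discard any that have gone stale.
    while available_ip:
        ip = available_ip[0]
        if test_ip(ip):
            return ip
        available_ip.pop(0)  # no longer usable, drop it
    return None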
The code in this article is based on https://blog.csdn.net/XRRRICK/article/details/78650764 , with a few modifications of my own.