Building a proxy IP pool for a crawler

Posted by jwalsh on Tue, 01 Oct 2019 15:25:30 +0200

Previously, we saw that a common anti-crawler measure is to detect a visitor's IP and limit its request frequency. We can bypass this limit by routing our requests through proxy IPs. Many websites offer free proxy IPs, such as https://www.xicidaili.com/nt/, and we can harvest a large number of them from such a site. However, not every one of these IPs works; in fact, only very few do.
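
For a quick sense of what routing a request through a proxy looks like with requests, here is a minimal sketch; the proxy address below is a placeholder, not a working one:

import requests

# Placeholder proxy address, in the same 'ip:port' form used later in this article
proxies = {'http': '123.45.67.89:8080'}
r = requests.get('http://example.com', proxies=proxies, timeout=5)
print(r.status_code)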

We can use BeautifulSoup to parse the page and extract the proxy IP list, or match it with regular expressions; regular expressions are faster. Here ip_url is https://www.xicidaili.com/nt/ and random_header is a function that returns a randomized request header.
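
random_header is not shown in the original post; a minimal sketch of it might pick a User-Agent at random from a small list (the strings below are just common examples, not taken from the source):

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0',
]


def random_header():
    # Return a request header with a randomly chosen User-Agent
    return {'User-Agent': random.choice(USER_AGENTS)}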

import re
import time

import requests


def download_page(url):
    # Fetch a page with a randomized request header
    headers = random_header()
    data = requests.get(url, headers=headers)
    return data


def get_proxies(page_num, ip_url):
    # Scrape pages 1 .. page_num-1 of the proxy list, keeping the IPs that pass testing
    available_ip = []
    for page in range(1, page_num):
        print('Grabbing page %d of proxy IPs' % page)
        url = ip_url + str(page)
        r = download_page(url)
        r.encoding = 'utf-8'
        # Each table row holds the IP in one cell and the port in the next
        pattern = re.compile('<td class="country">.*?alt="Cn" />.*?</td>.*?<td>(.*?)</td>.*?<td>(.*?)</td>', re.S)
        ip_list = re.findall(pattern, r.text)
        for ip in ip_list:
            if test_ip(ip):
                print('%s:%s passed testing, added to the list of available proxies' % (ip[0], ip[1]))
                available_ip.append(ip)
        time.sleep(10)  # be polite: pause between pages
    print('Finished grabbing')
    return available_ip
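
Put together, the functions above might be called like this (5 pages is an arbitrary choice):

if __name__ == '__main__':
    ip_url = 'https://www.xicidaili.com/nt/'
    available_ip = get_proxies(5, ip_url)
    print('Collected %d usable proxies' % len(available_ip))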

After collecting the IPs, we still need to test each one to confirm it actually works. How? Use the proxy to request a website that displays the visitor's IP, then check the result of the request.

def test_ip(ip, test_url='http://ip.tool.chinaz.com/'):
    # Request a page that echoes the visitor's IP and check that the proxy's IP comes back
    proxies = {'http': ip[0] + ':' + ip[1]}
    try_ip = ip[0]
    try:
        r = requests.get(test_url, headers=random_header(), proxies=proxies)
        if r.status_code == 200:
            r.encoding = 'gbk'
            result = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', r.text)
            result = result.group()
            print(result)
            # Compare the leading characters of the echoed IP with the proxy IP
            if result[:9] == try_ip[:9]:
                print('%s:%s passed the test' % (ip[0], ip[1]))
                return True
            else:
                print('%s:%s proxy was not used; the request went out on the local IP' % (ip[0], ip[1]))
                return False
        else:
            print('%s:%s response code is not 200' % (ip[0], ip[1]))
            return False
    except Exception as e:
        print(e)
        print('%s:%s error' % (ip[0], ip[1]))
        return False

Some tutorials treat an HTTP 200 status code alone as success. That is not enough: when the proxy fails, the request can fall back to your own IP, and your own IP can of course access the site successfully. That is why test_ip also compares the echoed IP against the proxy's.

Finally, test an IP again right before using it, because you never know when it will stop working. For the same reason, it pays to store more proxy IPs than you need, so you are not left without a working one at the moment you need it.
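
One rough sketch of that idea, assuming the test_ip function and (ip, port) tuples from above; get_working_proxy is a name introduced here, not from the original:

import random


def get_working_proxy(available_ip):
    # Re-test pool entries just before use, dropping any that have gone stale
    while available_ip:
        ip = random.choice(available_ip)
        if test_ip(ip):
            return {'http': ip[0] + ':' + ip[1]}
        available_ip.remove(ip)  # stale proxy: remove it from the pool
    return None  # pool exhausted; refill with get_proxies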

The code in this article is adapted, with a few changes, from https://blog.csdn.net/XRRRICK/article/details/78650764.

Topics: Python, encoding