Web crawler explained - the urllib library crawler - status codes - exception handling - browser masquerading technology and user agent settings

Posted by Quinton1337 on Wed, 28 Aug 2019 01:20:06 +0200

If a crawler has no exception handling, the program will crash and stop working as soon as an error occurs during the crawl; with exception handling, the program can continue running even when an error occurs.
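For example (a minimal sketch; the URLs are made up for illustration), wrapping each fetch in try/except lets a crawl loop skip a failing page and continue with the next one:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.error

urls = ['http://example.com/', 'http://example.com/missing-page']
for url in urls:
    try:
        html = urllib.request.urlopen(url).read().decode("utf-8")
        print(url, 'fetched', len(html), 'characters')
    except urllib.error.URLError as e:  #On error, report it and move on to the next URL
        print(url, 'failed:', e)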

1. Common status codes

301: Permanent redirect to a new URL
302: Temporary redirect to another URL
304: Requested resource has not been modified since the last request
400: Bad request (malformed request)
401: Request not authorized (authentication required)
403: Access forbidden
404: Requested page not found
500: Internal server error
501: The server does not support the functionality required to fulfill the request
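To see how these codes surface in urllib (a minimal sketch; httpbin.org is used here only as a convenient test service): successful responses expose the code via getcode(), redirects are followed automatically, and 4xx/5xx responses raise HTTPError.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.error

# 2xx: urlopen() returns a response object; getcode() exposes the status code
resp = urllib.request.urlopen('http://httpbin.org/status/200')
print(resp.getcode())                       # 200

# 3xx: redirects are followed automatically, so we see the final status
resp = urllib.request.urlopen('http://httpbin.org/redirect/1')
print(resp.getcode())                       # 200, after following the 302

# 4xx/5xx: urlopen() raises HTTPError, which carries the status code
try:
    urllib.request.urlopen('http://httpbin.org/status/404')
except urllib.error.HTTPError as e:
    print(e.code)                           # 404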

2. Exception handling

urllib.error.URLError is used to capture exception information

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.error

try:                                    #Try to execute what's inside
    html = urllib.request.urlopen('http://www.xiaohuar.com/').read().decode("utf-8")
    print(html)

except urllib.error.URLError as e:      #If an error occurs
    if hasattr(e,"code"):               #If there is an error code
        print(e.code)                   #Print error code
    if hasattr(e,"reason"):             #If there is an error message
        print(e.reason)                 #Print error message

#Output: the site denies crawler access
# 403
# Forbidden
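Since HTTPError is a subclass of URLError and always carries a status code, you can also catch it first to handle HTTP errors separately from network-level failures such as DNS errors (a minimal sketch of the same request):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.error

try:
    html = urllib.request.urlopen('http://www.xiaohuar.com/').read().decode("utf-8")
    print(html)
except urllib.error.HTTPError as e:    #The server replied with an error status
    print(e.code, e.reason)
except urllib.error.URLError as e:     #Network-level failure (DNS, refused connection, ...)
    print(e.reason)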

3. Browser masquerading technology

Many websites use anti-crawling techniques: behind the scenes, the server checks whether the request headers contain User-Agent browser information, and if that information is missing it concludes the request did not come from a browser and blocks it.

So we need to disguise the request with a browser header.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
url = 'https://www.qiushibaike.com/'                #URL of the page to fetch
tou = ('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')  #Set simulated browser header
b_tou = urllib.request.build_opener()               #Create an opener object
b_tou.addheaders=[tou]                              #Add the header to the opener
html = b_tou.open(url).read().decode("utf-8")       #Fetch the page
print(html)

Note: this request was made through the opener's open() method rather than urlopen(), so urlopen() does not carry the header here. Creating an opener with build_opener() for every request is cumbersome, so next we make urlopen() add the header automatically.

Making urlopen() add the header automatically, i.e. setting the user agent globally

install_opener() installs the opener (with its header information) globally, so the header is added automatically whenever urlopen() makes a request

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
#Set header information
tou = ('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')  #Set simulated browser header
b_tou = urllib.request.build_opener()               #Create an opener object
b_tou.addheaders=[tou]                              #Add the header to the opener
#Install the opener globally so urlopen() adds the header automatically
urllib.request.install_opener(b_tou)

#request
url = 'https://www.qiushibaike.com/'
html = urllib.request.urlopen(url).read().decode("utf-8")
print(html)
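If you would rather not install a global opener, an equivalent per-request approach (a minimal sketch using the same URL and header) is to attach the header to a urllib.request.Request object and pass that to urlopen():

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request

url = 'https://www.qiushibaike.com/'
#Attach the header to this request only, instead of installing a global opener
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'})
html = urllib.request.urlopen(req).read().decode("utf-8")
print(html)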

4. Creating a user agent pool

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import random   #Introducing random module files

def yh_dl():    #Create User Agent Pool
   yhdl = [
       'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
       'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
       'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
       'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
       'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
       'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
       'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
       'Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
       'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
       'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
       'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
       'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
       'Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+',
       'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)',
       'UCWEB7.0.2.37/28/999',
       'NOKIA5700/ UCWEB7.0.2.37/28/999',
       'Openwave/ UCWEB7.0.2.37/28/999',
       'Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999'
       ]
   thisua = random.choice(yhdl)                    #Randomly pick one user agent from the pool
   headers = ("User-Agent",thisua)                 #Build the header tuple
   opener = urllib.request.build_opener()          #Create an opener object
   opener.addheaders=[headers]                     #Add the header to the opener
   urllib.request.install_opener(opener)           #Install the opener globally so urlopen() adds the header automatically

#request
yh_dl()     #Execute User Agent Pool Function
url = 'https://www.qiushibaike.com/'
html = urllib.request.urlopen(url).read().decode("utf-8")
print(html)

This way the crawler uses a random user agent (a "random header"): calling yh_dl() before each request re-randomizes the User-Agent, so the header information can differ from one request to the next.
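For example (a sketch; yh_dl() is the pool function defined above, and the loop count and URL are arbitrary), re-running yh_dl() before every request installs a fresh random header each time:

for i in range(3):
    yh_dl()    #Install a new random user agent before this request
    html = urllib.request.urlopen('https://www.qiushibaike.com/').read().decode("utf-8")
    print(len(html))    #Each request goes out with a (potentially) different User-Agent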
