If a crawler has no exception handling, the program crashes and stops as soon as an error occurs during the crawl; with exception handling, it can report the error and keep going.
1. Common status codes
301: Moved permanently; the resource has been redirected to a new URL
302: Found; a temporary redirect to another URL
304: Not modified; the requested resource has not changed
400: Bad request
401: Unauthorized request
403: Forbidden; access denied
404: The requested page was not found
500: Internal server error
501: The server does not support the functionality required to fulfill the request
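The standard reason phrases for these codes also ship with Python itself, in `http.client.responses`, which is handy for turning a numeric status into a readable message:

```python
from http.client import responses

# Look up the standard reason phrase for each code listed above
for code in (301, 302, 304, 400, 401, 403, 404, 500, 501):
    print(code, responses[code])
```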
2. Exception handling
Use urllib.error.URLError to capture the exception information:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request
import urllib.error

try:
    # Try to fetch and decode the page
    html = urllib.request.urlopen('http://www.xiaohuar.com/').read().decode("utf-8")
    print(html)
except urllib.error.URLError as e:  # If an error occurs
    if hasattr(e, "code"):          # If there is an error code
        print(e.code)               # Print the error code
    if hasattr(e, "reason"):        # If there is an error message
        print(e.reason)             # Print the error message

# Output when the site refuses crawler access:
# 403
# Forbidden
```
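You can see why the snippet checks hasattr() for both attributes by constructing the exceptions by hand, with no network involved: an HTTPError (a URLError subclass) carries a status code, while a plain URLError only has a reason. The URL below is just a placeholder:

```python
import urllib.error

# An HTTPError carries both .code and .reason (placeholder URL)
e = urllib.error.HTTPError(url='http://example.com/', code=403,
                           msg='Forbidden', hdrs=None, fp=None)
print(hasattr(e, "code"), e.code)      # True 403
print(hasattr(e, "reason"), e.reason)  # True Forbidden

# A plain URLError (e.g. a DNS failure) has .reason but no .code
u = urllib.error.URLError('name resolution failed')
print(hasattr(u, "code"), u.reason)    # False name resolution failed
```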
3. Browser masquerading
Many websites use anti-crawling checks: the server inspects the request headers for a User-Agent string, and if nothing indicates the request came from a browser, it is blocked.
So we need to disguise the request with a browser header:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request

url = 'https://www.qiushibaike.com/'  # URL of the page to grab
tou = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')  # Simulated browser header
b_tou = urllib.request.build_opener()  # Create an opener object
b_tou.addheaders = [tou]               # Add the header
html = b_tou.open(url).read().decode("utf-8")  # Fetch the page
print(html)
```
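Besides build_opener(), the standard library also lets you attach a header per request by passing a headers dict to urllib.request.Request. A minimal sketch using the same URL and user agent as above (the actual fetch is left commented out):

```python
import urllib.request

url = 'https://www.qiushibaike.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}

req = urllib.request.Request(url, headers=headers)  # Header travels with this request
print(req.get_header('User-agent'))                 # Request stores header keys capitalized
# html = urllib.request.urlopen(req).read().decode("utf-8")  # Then fetch as usual
```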
Note: this page was fetched through the opener's open() method rather than urlopen(); a bare urlopen() call here would not carry our header. Since creating a build_opener() object for every request is tedious, the next step is to make urlopen() add the header automatically.
Setting an automatic request header for urlopen(), i.e. setting the user agent
install_opener() installs the opener globally, so the header is added automatically whenever urlopen() is called:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request

# Set header information
tou = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')
b_tou = urllib.request.build_opener()  # Create an opener object
b_tou.addheaders = [tou]               # Add the header to the opener

# Install the opener globally so urlopen() adds the header automatically
urllib.request.install_opener(b_tou)

# Request
url = 'https://www.qiushibaike.com/'
html = urllib.request.urlopen(url).read().decode("utf-8")
print(html)
```
4. Create a user agent pool
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request
import random  # For choosing a random user agent

def yh_dl():
    """Create a user agent pool, pick one entry at random, and install it globally."""
    yhdl = [
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
        'Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
        'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
        'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
        'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
        'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
        'Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)',
        'UCWEB7.0.2.37/28/999',
        'NOKIA5700/ UCWEB7.0.2.37/28/999',
        'Openwave/ UCWEB7.0.2.37/28/999',
        'Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999',
    ]
    thisua = random.choice(yhdl)            # Pick a user agent at random
    headers = ("User-Agent", thisua)        # Build the header tuple
    opener = urllib.request.build_opener()  # Create an opener object
    opener.addheaders = [headers]           # Add the header to the opener
    urllib.request.install_opener(opener)   # Install globally so urlopen() adds it automatically

# Request
yh_dl()  # Install a random user agent from the pool
url = 'https://www.qiushibaike.com/'
html = urllib.request.urlopen(url).read().decode("utf-8")
print(html)
```
This way the crawler picks a random user agent (a "random header") on each run, so the header information differs from request to request.
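To sanity-check that the pool is wired up correctly without hitting the network, you can inspect the opener's addheaders after installing it. A small sketch with a trimmed two-entry pool standing in for the full list:

```python
import random
import urllib.request

# Trimmed two-entry stand-in for the full pool above
yhdl = [
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
]
thisua = random.choice(yhdl)                  # Random draw, as in yh_dl()
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', thisua)]  # Replace the default Python-urllib header
urllib.request.install_opener(opener)

print(opener.addheaders)  # A single ('User-Agent', ...) pair drawn from the pool
```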