Python third party Library -- urllib learning

Posted by pkmleo on Tue, 25 Jan 2022 02:57:29 +0100

Series articles

Python learning 01 - Fundamentals of Python
Python third party Library -- urllib learning
Python learning 02 - Python crawler

Python third party Library -- urllib learning

The urllib library, which ships with Python's standard library, is mainly used to work with URLs and fetch web page content. It is commonly used for writing crawlers in Python; a few of its functions are briefly described here for the crawler articles that follow.

1. Import urllib Library

# The crawler does not need all of urllib, so we import only the required submodules
import urllib.request	# Used to initiate requests
import urllib.parse  # Used to parse and encode URLs
import urllib.error  # Exceptions that may be raised, imported so we can catch them

# HTTPS requests verify the site's certificate by default; to skip verification, import ssl and set an unverified context as the global default
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
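Overriding the global default works, but urlopen() also accepts a context parameter, so the unverified context can be limited to a single call. A minimal sketch (the URL is just an example):

# A minimal sketch: pass an unverified context to one request instead of changing the global default
import ssl
import urllib.request

ctx = ssl._create_unverified_context()  # certificate verification is skipped only for this context
response = urllib.request.urlopen("https://www.baidu.com", context=ctx)
print(response.status)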

2. GET access

# urlopen() sends a GET request by default and returns a response object
response = urllib.request.urlopen("http://www.baidu.com")

# read() reads the entire body; readline() reads one line; readlines() reads all lines and returns a list
# decode() decodes the bytes as utf-8
print(response.read().decode("utf-8"))

Running result: the HTML source of the Baidu home page.
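The comment above also mentions readline() and readlines(); a small sketch of those two (note that a response body can only be read once, so the request is reopened for the second read):

# readline() reads a single line of the body
response = urllib.request.urlopen("http://www.baidu.com")
print(response.readline().decode("utf-8"))

# readlines() reads the whole body as a list of byte strings, one per line
response = urllib.request.urlopen("http://www.baidu.com")
lines = response.readlines()
print(len(lines))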

3. POST access

# A POST request can carry form data
# The form data must be prepared first: urlencode() from urllib.parse encodes it, then bytes() converts it into a utf-8 byte string
data = bytes(urllib.parse.urlencode({"username": "Xiaobai", "password": "123456"}), encoding="utf-8")
# http://httpbin.org/post is an HTTP test endpoint
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode("utf-8"))

Running result: the JSON echo returned by httpbin, containing the form data that was submitted.

Explanation:

The address used here, http://httpbin.org/post, is a test URL provided by httpbin.

We can use the endpoints it provides to test our requests; here we use its POST endpoint.

On the httpbin page, clicking Try it out and then Execute makes the browser send a POST request to http://httpbin.org/post, and the Response body shown below it is the server's response to the browser.
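Because httpbin echoes the request back as JSON, the body returned by the POST example above can be parsed directly; a minimal sketch (form and origin are field names in httpbin's echo):

import json

data = bytes(urllib.parse.urlencode({"username": "Xiaobai", "password": "123456"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
result = json.loads(response.read().decode("utf-8"))
print(result["form"])    # the form data we submitted, echoed back by httpbin
print(result["origin"])  # the IP address the request came from, as seen by httpbin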

4. Timeout processing

In practice, a crawl may stall because of network fluctuations or because the site blocks the request, so it is worth limiting how long a request may take: once the limit is exceeded, skip the current page and move on. This only requires one extra parameter.

# Timeout handling (tested with a GET request): set timeout to 0.01 seconds; a URLError is raised when the request times out
try:
	response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)
	print(response.read().decode("utf-8"))
except urllib.error.URLError as e:
	print("overtime")

5. Web page response information

For each request, the server's response carries a status code and response headers, both of which can be read directly from the returned object.

# Get web page information
response = urllib.request.urlopen("http://baidu.com")
print(response.status)  # Status code: 200 means success, 404 means the page was not found, and 418 is often returned by sites that reject crawlers
print(response.getheaders())  # Get all response headers
print(response.getheader("Server"))  # Get a single field from the response headers

6. Simulating a browser

# Simulate a browser
url = "https://www.douban.com"
# Add request headers so the request looks like it comes from a browser
headers = {
	"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}
# Douban is accessed with GET here
# For POST it would be: urllib.request.Request(url=url, data=data, headers=headers, method="POST") (see the sketch at the end of this section)
req = urllib.request.Request(url=url, headers=headers)  # The default method is GET
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))

Note: HTTPS requests need the ssl module; execute ssl._create_default_https_context = ssl._create_unverified_context as in section 1.

Explanation:

The request headers carry extra information about our request to the target URL. The User-Agent field tells the server what client is making the request. If we do not set a User-Agent in headers, the request goes out with urllib's default value (something like Python-urllib/3.9), which immediately exposes it as a Python crawler.
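As the comment in the code above notes, the same Request object also handles POST: pass data and method together with the headers. A minimal sketch against the httpbin test URL used earlier:

url = "http://httpbin.org/post"
headers = {
	"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}
data = bytes(urllib.parse.urlencode({"username": "Xiaobai"}), encoding="utf-8")
req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))  # httpbin echoes back both the form data and the User-Agent we sent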

Topics: Python urllib