urllib is part of the official Python standard library and requires no installation; it merges the urllib and urllib2 modules of Python 2. urllib3 is a third-party library that adds thread safety, connection pooling, and other features. urllib and urllib3 complement each other.
1. urllib Library
The urllib library mainly includes four modules:
- urllib.request: request module
- urllib.error: exception handling module
- urllib.parse: URL parsing module
- urllib.robotparser: robots.txt parsing module
1.1 urllib.request module
The request module is mainly responsible for constructing and sending network requests, and for adding headers, proxies, and so on.
It can simulate how a browser initiates a request:
1. Initiate a network request.
2. Add headers.
3. Operate cookies.
4. Use proxies.
1.1.1 Initiating a network request
The urlopen method

```
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
```
Purpose: urlopen is a method for sending a simple network request; it returns the response.
Parameters:
① url: required; it can be a string or a Request object.
② data: if None, a GET request is sent; if data is provided (bytes / file object / iterable), a POST request is sent and the data is submitted as the request body (form data).
③ timeout: optional, in seconds; for example, timeout=0.1 sets a 0.1-second timeout (an exception is raised if the request takes longer; a short sketch of this follows the example below).
Return value: after sending a request, the classes and methods of the urllib library return a response object. It contains the result of the request, along with properties and methods for processing the returned data.
Example:
```python
from urllib import request

# test_url = "http://httpbin.org/get"  # note: for a GET request, data must be empty
test_url = "http://httpbin.org/post"
res = request.urlopen(test_url, data=b"spider")
print(res.read())     # the whole body as a byte string
print(res.getcode())  # get the status code
print(res.info())     # get the response headers
print(res.read())     # reading the byte string again returns empty
```
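The timeout parameter described above is not used in the example; the following is a minimal sketch of it, reusing the same httpbin.org test URL (the 0.1-second limit is only an illustrative value):

```python
from urllib import request, error

try:
    # abort the request if the server does not respond within 0.1 seconds
    res = request.urlopen("http://httpbin.org/get", timeout=0.1)
    print(res.getcode())
except error.URLError as e:
    print("request failed or timed out:", e.reason)
```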
Request object
urlopen can send the most basic request, but its parameters are not enough to build a complete request (with custom request headers and different request methods). A more complete request can be constructed with the Request class.
```python
class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):
        pass
```
Purpose: Request constructs a complete network request and returns a Request object.
Parameters:
① url: required; a string.
② data: bytes.
③ headers: request header information.
④ method: GET by default; POST, PUT, DELETE, etc. can also be used.
Return value: a Request object.
Example:
```python
from urllib import request

# the Request object
test_url = "http://httpbin.org/get"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
req = request.Request(test_url, headers=headers)
res = request.urlopen(req)
print(res.read())

# using the data and method parameters of the Request object
print("************************************")
test_url = "http://httpbin.org/put"
req = request.Request(test_url, headers=headers, data=b"updatedata", method="PUT")
res = request.urlopen(req)
print(res.read())
```
response object
After sending a network request, the classes and methods of the urllib library return a response object. It contains the result of the request, along with properties and methods for processing the returned data (a short sketch follows the list below).
- read(): gets the data returned by the response; it can only be used once
- readline(): reads one line
- info(): gets the response headers
- geturl(): gets the URL that was accessed
- getcode(): returns the status code
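A short sketch exercising these properties and methods, again against the httpbin.org test URL used elsewhere in this section:

```python
from urllib import request

res = request.urlopen("http://httpbin.org/get")
print(res.getcode())   # status code, e.g. 200
print(res.geturl())    # the URL that was actually requested
print(res.info())      # response headers
print(res.readline())  # read one line of the body
print(res.read())      # read the remaining data; a second read() would return b''
```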
1.1.2 Adding request headers
```python
from urllib import request

test_url = "http://httpbin.org/get"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
req = request.Request(test_url, headers=headers)
res = request.urlopen(req)
print(res.read())
```
1.1.3 Operating cookies
```python
from urllib import request
from http import cookiejar

# create a cookie jar object
cookie = cookiejar.CookieJar()
# create a cookie handler
cookies = request.HTTPCookieProcessor(cookie)
# pass it as a parameter to build an opener object
opener = request.build_opener(cookies)
# use this opener to send requests
res = opener.open("http://www.baidu.com")
```
1.1.4 Setting a proxy
```python
from urllib import request

url = 'http://httpbin.org/ip'
# proxy address
proxy = {'http': '180.76.111.69:3128'}
# proxy handler
proxies = request.ProxyHandler(proxy)
# create an opener object
opener = request.build_opener(proxies)
res = opener.open(url)
print(res.read().decode())
```
1.2 urllib.parse module
The parse module is a utility module that provides methods for processing and parsing URLs. A URL may only contain ASCII characters, but in practice the parameters passed through the URL in a GET request often contain special characters such as Chinese characters, so URL encoding is required.
1.2.1 Encoding a single parameter
- parse.quote(): encode Chinese characters into percent-encoded ASCII
```python
from urllib import parse

name = "动画片"  # "cartoon" in Chinese
asc_name = parse.quote(name)  # percent-encode the Chinese characters
print(asc_name)  # result: %E5%8A%A8%E7%94%BB%E7%89%87
```
- parse.unquote(): decode percent-encoded ASCII back into Chinese characters
```python
from urllib import parse

name = '%E5%8A%A8%E7%94%BB%E7%89%87'
print(parse.unquote(name))  # result: 动画片 ("cartoon")
```
1.2.2 Encoding multiple parameters
When sending a request, you often need to pass many parameters, and splicing them together with string methods is cumbersome. The parse.urlencode() method converts a dictionary into URL request parameters and handles the splicing; the parse.parse_qs() method converts them back into a dictionary.
- parse.urlencode()
- parse.parse_qs()
Example:
```python
from urllib import parse, request

# parse.urlencode() converts a dictionary into URL request parameters
params = {"name": "电影", "name2": "电视剧", "name3": "动画片"}  # movie, TV play, cartoon
asc_name = parse.urlencode(params)  # convert the dictionary into URL parameter form
print(asc_name)  # name=%E7%94%B5%E5%BD%B1&name2=%E7%94%B5%E8%A7%86%E5%89%A7&name3=%E5%8A%A8%E7%94%BB%E7%89%87
test_url = "http://httpbin.org/get?{}".format(asc_name)
print(test_url)
res = request.urlopen(test_url)
print(res.read())

# parse_qs converts the encoded parameters back into a dictionary
new_params = parse.parse_qs(asc_name)
print(new_params)  # {'name': ['电影'], 'name2': ['电视剧'], 'name3': ['动画片']}
```
1.3 urllib.error module
1.3.1 URLError and HTTPError
The error module is responsible for exception handling. If an error occurs during a request, it can be handled with this module, which mainly provides URLError and HTTPError.
- URLError: the base class of the error module's exceptions; exceptions raised by the request module can be caught with this class.
- HTTPError: a subclass of URLError, with three main attributes:
  - code: the status code of the request
  - reason: the reason for the error
  - headers: the response headers
Example:
```python
from urllib import error, request

try:
    res = request.urlopen("https://jianshu.com")
    print(res.read())
except error.HTTPError as e:
    print('Requested status code:', e.code)
    print('Cause of error:', e.reason)
    print('Response headers:', e.headers)
```

```
------------result-----------------
Requested status code: 403
Cause of error: Forbidden
Response headers: Server: Tengine
Date: Mon, 12 Jul 2021 04:40:02 GMT
Content-Type: text/html
Content-Length: 584
Connection: close
Vary: Accept-Encoding
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
```
1.4 urllib.robotparser module
The robotparser module is responsible for handling the crawler protocol file, i.e. parsing robots.txt. Since it is only a gentleman's agreement that crawlers often do not abide by, this module is rarely used.
To view a site's robots protocol, append robots.txt to the site address.
For example, Baidu's robots protocol: http://www.baidu.com/robots.txt
The full name of the robots protocol (also known as the crawler protocol or robot protocol) is the "Robots Exclusion Protocol". Through it, a website tells search engines which pages may be crawled and which may not.
The robots.txt file is a plain text file that can be created and edited with any common text editor, such as Notepad on Windows. robots.txt is a convention, not a command; it is the first file a search engine checks when visiting a website, and it tells the spider which files on the server may be viewed.
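Although the module is rarely used, a minimal sketch of urllib.robotparser against Baidu's robots.txt looks roughly like this (the user-agent names and paths are only illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")
rp.read()  # download and parse the robots.txt file
# ask whether a given user agent is allowed to fetch a given URL
print(rp.can_fetch("Baiduspider", "http://www.baidu.com/"))
print(rp.can_fetch("*", "http://www.baidu.com/baidu"))
```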
2. urllib3 Library
2.1 Features
urllib3 is a powerful, well-organized HTTP client library for Python. Many parts of the Python ecosystem already use urllib3. It provides many important features that are missing from the Python standard library:
1. Thread safety
2. Connection pooling
3. Client-side SSL/TLS verification
4. File uploads with multipart encoding
5. Helpers for retrying requests and dealing with HTTP redirects
6. Support for compression encoding (gzip, deflate)
7. HTTP and SOCKS proxy support
8. 100% test coverage
2.2 Installation
Install with the pip command:
pip install urllib3
2.3 Using urllib3
2.3.1 Basic steps for sending a request
1. Import the urllib3 library

```python
import urllib3
```

2. Instantiate a PoolManager object, which handles all the details of connection pooling and thread safety

```python
http = urllib3.PoolManager()
```

3. Send a request with the request method

```python
res = http.request("GET", "http://www.baidu.com")
```
2.3.2 The request method

```
request(self, method, url, fields=None, headers=None, **urlopen_kw)
```
Purpose: send a complete network request.
Parameters:
① method: the request method, e.g. GET, POST, PUT, DELETE.
② url: a string.
③ fields: a dictionary; converted to URL parameters in a GET request and to form data in a POST request.
④ headers: a dictionary.
Return value: a response object.
Example:
```python
import urllib3

http = urllib3.PoolManager()
url = 'http://httpbin.org/get'
headers = {'header1': 'python', 'header2': 'java'}
fields = {'name': 'you', 'passwd': '12345'}
res = http.request('GET', url, fields=fields, headers=headers)
print('Status code:', res.status)
print('Response headers:', res.headers)
print('Data:', res.data)
```
2.3.3 Proxies
You can use ProxyManager to send requests through an HTTP proxy.
```python
import urllib3

proxy = urllib3.ProxyManager('http://180.76.111.69:3128')
res = proxy.request('get', 'http://httpbin.org/ip')
print(res.data)
```
2.3.4 Request data
- For GET, HEAD, and DELETE requests, you can add query parameters by providing the dictionary parameter fields.
```python
import urllib3

http = urllib3.PoolManager()
r = http.request('get', 'http://httpbin.org/get', fields={'mydata': 'python'})
print(r.data.decode())
```
- For POST and PUT requests, query parameters need to be URL-encoded first and then spliced onto the URL.
```python
import urllib3
from urllib import parse

http = urllib3.PoolManager()
data = parse.urlencode({'myname': 'pipi'})
url = 'http://httpbin.org/post?' + data
r = http.request('post', url)
print(r.data.decode())
```
- JSON
When sending a request, you can send encoded JSON data by setting the body parameter and the Content-Type header.
```python
import urllib3
import json

http = urllib3.PoolManager()
data = {'username': 'python'}
encoded_data = json.dumps(data).encode('utf-8')
r = http.request('post',
                 'http://httpbin.org/post',
                 body=encoded_data,
                 headers={'Content-Type': 'application/json'})
print(json.loads(r.data.decode('utf-8'))['json'])
```
- Files
For file uploads, we can imitate the way a browser submits a form.
```python
import json
import urllib3

http = urllib3.PoolManager()
with open('example.txt') as fp:
    file_data = fp.read()
r = http.request('POST',
                 'http://httpbin.org/post',
                 fields={'filefield': ('example.txt', file_data)})
print(json.loads(r.data.decode('utf-8'))['files'])
```
- Binary data
For binary data uploads, we specify the body parameter and set the Content-Type request header.
```python
import urllib3
import json

http = urllib3.PoolManager()
with open('example.jpg', 'rb') as fb:
    binary_data = fb.read()
r = http.request('post',
                 'http://httpbin.org/post',
                 body=binary_data,
                 headers={'Content-Type': 'image/jpeg'})
print(json.loads(r.data.decode('utf-8')))
```
2.3.5 The response object
- The HTTP response object provides properties such as status, data, and headers.
```python
import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/ip')
print(r.status)
print(r.data)
print(r.headers)
```
- JSON content
The returned JSON-formatted data can be parsed with the json module; json.loads() converts it into a dictionary, as in the sketch below.
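A minimal sketch that decodes the body of the httpbin.org/ip response used above into a dictionary (the printed origin value is only illustrative):

```python
import json
import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/ip')
# r.data is bytes; decode it and parse the JSON into a dictionary
data = json.loads(r.data.decode('utf-8'))
print(data)        # e.g. {'origin': '1.2.3.4'}
print(type(data))  # <class 'dict'>
```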
- Binary content
The data returned by the response is of byte type. For large amounts of data, it is better to process the response as a stream.
```python
import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/bytes/10241', preload_content=False)
for chunk in r.stream(32):
    print(chunk)
```
It can also be treated as a file object
```python
import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/bytes/10241', preload_content=False)
for line in r:
    print(line)
```