Use of basic libraries
urllib
The urllib library is Python's built-in HTTP request library, meaning it can be used without additional installation. It includes the following four modules:
- request: the basic HTTP request module, which can be used to simulate sending requests. Just as you would enter a URL in the browser and press Enter, you can simulate this process by passing the URL and additional parameters to the library's methods
- error: the exception handling module. If a request error occurs, we can catch the exception and retry or take other action to ensure the program does not terminate unexpectedly
- parse: a utility module that provides many URL processing methods, such as splitting, parsing, and merging
- robotparser: mainly used to parse a website's robots.txt file and determine which pages can be crawled and which cannot
Send request
The urlopen() method in urllib.request returns an object of type HTTPResponse, which provides the following methods:
method | explain |
---|---|
read() | Returns the web page content |
getheaders() | Returns the full header information of the response |
getheader(name) | Returns the value corresponding to the parameter name in the header information of the response |
Properties:
attribute | explain |
---|---|
status | Response status code |
reason | Response reason phrase (e.g. OK) |
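A minimal sketch of these methods and attributes in use (the target URL is just for illustration):

```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)                    # response status code
print(response.getheaders())              # full header information of the response
print(response.getheader('Server'))       # value of a single header field
print(response.read().decode('utf-8'))    # web page content
```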
Optional parameters of urlopen method:
parameter | explain |
---|---|
data | Additional data (must be byte stream) |
timeout | Timeout in seconds (an exception is thrown once the set time is exceeded) |
context | Must be of ssl.SSLContext type, used to specify SSL settings |
cafile | Specify CA certificate |
capath | Path of CA certificate |
Use urllib.parse.urlencode(dict, encoding='utf-8') to convert the parameter dictionary into a string, then use bytes() (its argument cannot be a dictionary) to convert the data into a byte stream for transmission.
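A sketch of this pattern, assuming the httpbin.org test endpoint for illustration:

```python
import urllib.parse
import urllib.request

# Serialize the parameter dictionary into a string, then convert it into a byte stream
data = bytes(urllib.parse.urlencode({'word': 'hello'}, encoding='utf-8'), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=5)
print(response.read().decode('utf-8'))
```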
Building requests with the Request class
```python
import urllib.request

req = urllib.request.Request('https://python.org')
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))
```
Parameters that can be received by Request:
parameter | explain |
---|---|
url | Required parameters! URL for request |
data | Must be of bytes type |
headers | The request headers; they can be passed when constructing the request, or added later with the add_header() method |
origin_req_host | The host name or IP address of the requester |
unverifiable | Indicates whether the request is unverifiable; the default is False |
method | Used to indicate the methods used by the request, such as GET,POST, and PUT |
The most common use of request headers is to disguise the request as a browser by modifying the User-Agent. The default User-Agent is Python-urllib; changing it lets the request masquerade as a browser, as shown below.
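A sketch of both ways of setting the header (the User-Agent string and URL are placeholders):

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {
    # Disguise as a browser by overriding the default Python-urllib User-Agent
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Host': 'httpbin.org'
}
data = bytes(parse.urlencode({'name': 'germey'}), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
# Alternatively, add the header after constructing the request:
# req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```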
Handler subclasses of the BaseHandler class:
Subclass | explain |
---|---|
HTTPDefaultErrorHandler | It is used to handle HTTP response errors. All errors will throw HTTPError type exceptions |
HTTPRedirectHandler | Used to handle redirection |
HTTPCookieProcessor | Used to process Cookies |
ProxyHandler | Used to set the proxy. The default proxy is empty |
HTTPPasswordMgr | Used to manage passwords. It maintains a table of user names and passwords |
HTTPBasicAuthHandler | It is used to manage authentication. If a link needs authentication when it is opened, it can be used to solve the authentication problem |
Opener class
The Opener class can be used to create opener objects for deeper configuration. The return type of an Opener's open() method is the same as that of urlopen().
Use the Opener class and HTTPBasicAuthHandler to handle authentication:
```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()           # Instantiate the HTTPPasswordMgrWithDefaultRealm object
p.add_password(None, url, username, password)   # Add the username and password
auth_handler = HTTPBasicAuthHandler(p)          # Instantiate HTTPBasicAuthHandler with the password manager
opener = build_opener(auth_handler)             # Use the Handler to build an Opener

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
```
Proxy:
```python
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': '',    # fill in the HTTP proxy address here
    'https': ''    # fill in the HTTPS proxy address here
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
```
Create a proxy Handler with ProxyHandler, construct an Opener from it with build_opener(), and then send the request.
Cookie:
```python
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
```
- Use http.cookiejar.CookieJar() to create a CookieJar object
- Then use HTTPCookieProcessor to create a Handler
- Finally, use build_opener() to construct an Opener
- Execute the open() method to obtain the cookies of the corresponding website
Output cookies to a file:
```python
import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)   # Mozilla-format file
# cookie = http.cookiejar.LWPCookieJar(filename)     # or save as an LWP-format file
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
```
Read cookies back from the file:
```python
import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
# Assumes the file was saved in LWP format (see the commented line above)
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
Handling exceptions
- 1. URLError: inherits from the OSError class and is the base class of the error module. Exceptions raised by the request module can be handled by catching this class
- 2. HTTPError: a subclass of URLError, specially used to handle HTTP request errors, such as authentication failures. It has three attributes:
- code: returns the HTTP status code
- reason: returns the reason for the error
- headers: returns the response headers
Because URLError is the parent class of HTTPError, a good practice is to catch the subclass's errors first and then the parent class's. Also note that reason does not necessarily return a string; it may be an object, such as socket.timeout for a timeout error.
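A sketch of this pattern (the URL and timeout are illustrative):

```python
import socket
from urllib import error, request

try:
    response = request.urlopen('http://httpbin.org/status/404', timeout=1)
except error.HTTPError as e:      # catch the subclass first
    print(e.code, e.reason, e.headers, sep='\n')
except error.URLError as e:       # then the parent class
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
    else:
        print(e.reason)
else:
    print('Request Successfully')
```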
Parsing links
For a link:
http://www.baidu.com/index.html;user?id=5#comment
- The part before :// is the scheme, which represents the protocol
- The part before the first / is netloc, the domain name
- What follows is path, the access path
- The part after the semicolon ; is params, the parameters
- The part after the question mark ? is the query condition, generally used for GET-type URLs
- The part after the hash # is the anchor (fragment)
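As a quick illustration, the urlparse() method described below splits this link into exactly these six parts:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html',
#             params='user', query='id=5', fragment='comment')
```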
method | explain |
---|---|
urlparse() | Identifies and splits a URL, returning six parts |
urlunparse() | The opposite of urlparse; the parameter must be iterable with length 6, otherwise an exception is raised |
urlsplit() | Does not parse params separately (merged into path); returns 5 parts |
urlunsplit() | Similar to urlunparse; the parameter must be iterable with length 5, otherwise an exception is raised |
urljoin() | The first parameter is the base link and the second is the new link. The method analyzes the scheme, netloc and path of the base link, supplements the missing parts of the new link, and returns the result |
urlencode() | A dictionary is used to represent the parameters, then urlencode() serializes it into GET request parameters |
parse_qs() | Deserializes GET request parameters back into a dictionary |
parse_qsl() | Similar to parse_qs, but converts the parameters back into a list of tuples |
quote() | Converts content into URL-encoded format. When a URL contains Chinese parameters, garbled characters may result; quote() converts the Chinese into URL encoding |
unquote() | URL decoding |
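Short sketches of urljoin(), urlencode(), and quote()/unquote(), using placeholder values:

```python
from urllib.parse import urljoin, urlencode, quote, unquote

# urljoin(): the base link supplies scheme, netloc and path for the missing parts of the new link
print(urljoin('http://www.baidu.com', 'FAQ.html'))    # http://www.baidu.com/FAQ.html

# urlencode(): serialize a parameter dictionary into GET request parameters
params = {'name': 'germey', 'age': 22}
print('http://www.baidu.com?' + urlencode(params))    # http://www.baidu.com?name=germey&age=22

# quote()/unquote(): URL-encode Chinese parameters and decode them again
keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)
print(unquote(url))
```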
requests
GET request
```python
import requests

r = requests.get('http://httpbin.org/get')
print(r.text)
```
The returned result contains the request headers, URL, IP, and other information.
Add additional information
It can be written directly as:
```python
import requests

r = requests.get('http://httpbin.org/get?name=germey&age=22')
```
It can also be written as:
```python
import requests

data = {
    'name': 'germey',
    'age': 22
}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)
```
The return type of r.text is str, but the content is in JSON format. If you want to parse the result directly into a dictionary, you can write:
import requests r = requests.get("http://httpbin.org/get",params=data) print(r.json)
- Grab a website
```python
import requests
import re

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}
r = requests.get("https://www.zhihu.com/explore", headers=headers)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)
```
- Grab binary data
import requests r = requests.get("https://github.com/favicon.ico") print(r.text) #str type data print(r.context) #bytes type data
- Add headers
```python
import requests

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}
r = requests.get("https://www.zhihu.com/explore", headers=headers)
print(r.text)
```
You can add any other field information in the headers parameter
- POST request
```python
import requests

data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)
```
- response
Properties of the object returned by the request method:
attribute | explain |
---|---|
status_code | Response status code |
headers | Response header |
cookies | Cookies |
url | URL |
history | Request history |
requests also provides a built-in status code query object, requests.codes.
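A sketch that prints these attributes and uses the status code object (the URL is illustrative):

```python
import requests

r = requests.get('http://httpbin.org/get')
print(r.status_code)    # response status code
print(r.headers)        # response headers
print(r.cookies)        # Cookies
print(r.url)            # URL
print(r.history)        # request history
if r.status_code == requests.codes.ok:    # compare against the built-in status code object
    print('Request Successfully')
```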
Advanced Usage
- File upload
```python
import requests

files = {'file': open('favicon.ico', 'rb')}
r = requests.post("http://httpbin.org/post", files=files)
print(r.text)
```
The website returns a response in which the files field contains the uploaded file while the form field is empty, which shows that file uploads are identified by a separate files field.
- Cookies
Use requests to get Cookies
import requests r = requests.get("https://www.baidu.com") for key,value in r.cookies.items(): print(key + "=" + value)
You can call the cookies attribute to get the Cookies. Note that it is of RequestsCookieJar type; use **items()** to convert it into a list of tuples.
Set cookies
- The first method: log in to the website, copy the Cookie value from the request Headers in the browser, set it in the headers parameter, and send the request with get() (see the sketch after this list)
- The second method: construct a RequestsCookieJar object:
```python
import requests

cookies = 'own cookies'   # paste your own cookie string here
jar = requests.cookies.RequestsCookieJar()
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}
for cookie in cookies.split(';'):
    key, value = cookie.split('=', 1)
    jar.set(key, value)
r = requests.get("https://www.zhihu.com", cookies=jar, headers=headers)
```
The second method is clearly more complicated than the first, so the first is generally used; the second is simply worth knowing about.
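A minimal sketch of the first method; the Cookie string is a placeholder you would copy from your own logged-in browser session:

```python
import requests

headers = {
    'Cookie': 'paste the Cookie value copied from the browser Headers here',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
r = requests.get('https://www.zhihu.com', headers=headers)
print(r.text)
```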
- Session maintenance
Use the session object to maintain the same session
```python
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
```
Using Session lets you stay in the same session without managing Cookies manually. It is usually used to simulate subsequent operations after a successful login.
- SSL certificate validation
When sending an HTTPS request, requests checks the SSL certificate. If the website you visit does not have a valid certificate, you can modify the value of **verify (True by default)** in the get() method:
```python
import requests
from requests.packages import urllib3
# import logging

urllib3.disable_warnings()        # Ignore the warning caused by skipping certificate verification
# logging.captureWarnings(True)   # Or capture the warning into the log instead
response = requests.get('https://www.12306.cn', verify=False)   # Skip certificate verification
print(response.status_code)
```
- Proxy settings
Basic proxy:
```python
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "https://10.10.1.10:1080"
}
requests.get("https://www.taobao.com", proxies=proxies)
```
A proxy that uses HTTP Basic Auth:
```python
import requests

proxies = {
    "http": "http://user:password@10.10.1.10:3128"
}
requests.get("https://www.taobao.com", proxies=proxies)
```
Proxies also support the SOCKS protocol (this requires the socks extra, e.g. pip install "requests[socks]"):
```python
import requests

proxies = {
    "http": "socks5://user:password@host:port",
    "https": "socks5://user:password@host:port"
}
requests.get("https://www.taobao.com", proxies=proxies)
```
- Timeout setting
```python
import requests
from requests.exceptions import Timeout

try:
    r = requests.get("https://www.taobao.com", timeout=1)
    print(r.status_code)
except Timeout:
    print('Time out')
```
In fact, the request has two phases, connecting and reading. The timeout set above is used as the total timeout for both.
If you want to specify them separately, you can pass in a (connect, read) tuple, such as (5, 30).
If you want to wait indefinitely, set timeout to None or simply omit it, since the default is None.
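A sketch of the three ways of setting the timeout (the URL is illustrative):

```python
import requests

r = requests.get('https://www.taobao.com', timeout=1)          # total timeout for connecting and reading
r = requests.get('https://www.taobao.com', timeout=(5, 30))    # separate connect and read timeouts
r = requests.get('https://www.taobao.com', timeout=None)       # wait indefinitely (the default)
print(r.status_code)
```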
- Identity authentication
Use the authentication provided by requests:
```python
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://localhost:5000', auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)
# If the username and password are correct, authentication happens automatically and 200 is returned;
# if authentication fails, 401 is returned
```
The above code can be abbreviated as:
```python
import requests

r = requests.get('http://localhost:5000', auth=('username', 'password'))
print(r.status_code)
```
requests also provides other authentication methods, including via third-party packages.
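For example, OAuth1 authentication is available through the third-party requests_oauthlib package; a sketch with placeholder keys and tokens (the package must be installed separately):

```python
import requests
from requests_oauthlib import OAuth1

url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET', 'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
r = requests.get(url, auth=auth)
print(r.status_code)
```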
- Prepared Request
A request can also be represented as a data structure called a Prepared Request:
```python
from requests import Request, Session

url = 'http://httpbin.org/post'
data = {
    'name': 'germey'
}
headers = {
    'User-Agent': '...'
}
s = Session()
req = Request('POST', url, data=data, headers=headers)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)
```
- Import Request, then construct a Request object with the url, data, and headers parameters
- Then call the Session's prepare_request() method to convert it into a Prepared Request object
- Then you call the send() method to send it.
As you can see, we have achieved the same POST request effect