Use of basic library

Posted by benwilhelm on Wed, 02 Feb 2022 04:08:11 +0100

urllib

The urllib library is Python's built-in HTTP request library, so it can be used without any additional installation. It includes the following four modules:

  • request: the basic HTTP request module, which can be used to simulate sending a request, just like typing a URL in the browser and pressing Enter. You simulate this process by passing the URL and additional parameters to the library's methods
  • error: the exception handling module. If a request error occurs, we can catch the exception and then retry or take other action, so the program does not terminate unexpectedly
  • parse: a utility module that provides many URL handling methods, such as splitting, parsing, and merging
  • robotparser: mainly used to parse a website's robots.txt file and determine which pages can be crawled and which cannot

Send request

The urlopen() method in urllib.request returns an object of type HTTPResponse, which includes the following methods:

  • read(): returns the content of the web page
  • getheaders(): returns the full header information of the response
  • getheader(name): returns the value of the response header matching the name argument

Properties:

  • status: the response status code
  • reason: the response reason phrase (for example, 'OK')
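
A short sketch of these methods and attributes in use, taking https://www.python.org as an example URL:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)                        # e.g. 200
print(response.reason)                        # e.g. 'OK'
print(response.getheaders())                  # list of (name, value) tuples
print(response.getheader('Server'))           # value of a single header
print(response.read().decode('utf-8')[:100])  # first 100 characters of the page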

Optional parameters of the urlopen() method:

  • data: additional data to pass (must be a byte stream; when supplied, the request becomes a POST)
  • timeout: timeout in seconds (a timeout exception is raised once the set time is exceeded)
  • context: must be of type ssl.SSLContext, used to specify SSL settings
  • cafile: specifies a CA certificate file
  • capath: specifies the path of CA certificates

urllib.parse.urlencode() converts a parameter dictionary into a query string, and bytes(str, encoding='utf-8') (the first argument cannot be a dictionary) then converts that string into bytes so it can be passed as data.
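
A minimal sketch of passing form data this way, assuming http://httpbin.org/post as a test endpoint:

import urllib.parse
import urllib.request

# Serialize the parameter dictionary into a query string, then encode it to bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# Supplying data makes this a POST request; timeout is given in seconds
response = urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=10)
print(response.read().decode('utf-8'))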

Building requests with the Request class

import urllib.request
req = urllib.request.Request('https://python.org')
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))

Parameters that the Request class can receive:

  • url: required parameter! The URL to request
  • data: must be of bytes type if passed
  • headers: the request headers; they can be passed as a dictionary when constructing the request, or added afterwards with the add_header() method
  • origin_req_host: the host name or IP address of the requester
  • unverifiable: indicates whether the request is unverifiable; the default is False
  • method: a string indicating the method used by the request, such as GET, POST or PUT

The most common use of request headers is to disguise the program as a browser by modifying the User-Agent. The default User-Agent is Python-urllib; we can change it to a browser's string to disguise the request, as sketched below.
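
A minimal sketch of such a disguised request, assuming http://httpbin.org/post as the target and a shortened browser User-Agent string:

from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {
    # Pretend to be a browser instead of the default Python-urllib
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Host': 'httpbin.org'
}
data = bytes(parse.urlencode({'name': 'germey'}), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
# Alternatively: req.add_header('User-Agent', 'Mozilla/5.0 ...')
response = request.urlopen(req)
print(response.read().decode('utf-8'))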

Handler subclasses of the BaseHandler class:

  • HTTPDefaultErrorHandler: handles HTTP response errors; all errors raise exceptions of type HTTPError
  • HTTPRedirectHandler: handles redirects
  • HTTPCookieProcessor: handles Cookies
  • ProxyHandler: sets a proxy; the default proxy is empty
  • HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords
  • HTTPBasicAuthHandler: manages authentication; if a link requires authentication when it is opened, this handler can be used to solve the authentication problem

Opener class

Handlers can be used to build Opener objects, which allow deeper configuration than urlopen(). The return type of an Opener's open() method is the same as that of the urlopen() method.

Use the Opener class and HTTPBasicAuthHandler to solve authentication:

from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()  #Instantiate the HTTPPasswordMgrWithDefaultRealm object
p.add_password(None,url,username,password)  #Add the user name and password
auth_handler = HTTPBasicAuthHandler(p) #Instantiate HTTPBasicAuthHandler object with HTTPPasswordMgrWithDefaultRealm object
opener = build_opener(auth_handler)  #Use the established Handler to build an Opener object

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Proxy:

from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener

proxy_handler = ProxyHandler({
    'http': '',   # fill in your own proxy addresses here
    'https': ''
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Create a proxy Handler with ProxyHandler, construct an Opener from the Handler with the build_opener() method, and then send the request.

Cookie:

import http.cookiejar,urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
  • Use http.cookiejar.CookieJar() to create a CookieJar object
  • Then use HTTPCookieProcessor to create a Handler
  • Then construct an Opener with the build_opener() method
  • Finally, call the open() method to obtain the Cookies of the corresponding website

Output Cookies to a file:

import http.cookiejar,urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)  # MozillaCookieJar supports save(); a plain CookieJar does not
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)
# To save in LWP format instead:
# cookie = http.cookiejar.LWPCookieJar(filename)

Reading Cookies from the file (use the CookieJar class that matches the format in which the file was saved):

import http.cookiejar,urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt',ignore_discard=True,ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

Handling exceptions

  • 1. URLError - inherits from OSError and is the base class of the error module. Exceptions raised by the request module can be handled by catching this class
  • 2. HTTPError - a subclass of URLError, specifically for handling HTTP request errors, such as a failed authentication request. It has three attributes:
    • code: returns the HTTP status code
    • reason: returns the reason for the error
    • headers: returns the response headers

Because URLError is the parent class of HTTPError, it is better to catch the subclass's error first and then the parent class's error. Also, what reason returns is not necessarily a string; it can also be an object, such as socket.timeout for a timeout error.
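
A minimal sketch of this catch order, assuming http://httpbin.org/status/404 as a URL that returns an error status:

import socket
from urllib import request, error

try:
    response = request.urlopen('http://httpbin.org/status/404', timeout=1)
except error.HTTPError as e:
    # Catch the subclass first: HTTPError carries code, reason and headers
    print('HTTPError:', e.code, e.reason)
    print(e.headers)
except error.URLError as e:
    # Then catch the parent class; reason may be an object such as socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('Time out')
    else:
        print('URLError:', e.reason)
else:
    print('Request succeeded')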

Resolve links

For a link:

http://www.baidu.com/index.html;user?id=5#comment

  • The part before :// is the scheme, which represents the protocol
  • The part before the first / is netloc, the domain name
  • After that comes path, the access path
  • The part after the semicolon ; is params, the parameters
  • The part after the question mark ? is the query condition, generally used in GET-type URLs
  • The part after the hash # is the fragment, the anchor
  • urlparse(): identifies and splits a URL, returning six parts
  • urlunparse(): the opposite of urlparse(); the argument must be iterable and of length 6, otherwise an exception is raised
  • urlsplit(): does not parse params separately (it is merged into path) and returns five parts
  • urlunsplit(): similar to urlunparse(); the argument must be iterable but of length 5, otherwise an exception is raised
  • urljoin(): the first argument is the base link and the second the new link; the method analyzes the scheme, netloc and path of the base link, supplements the missing parts of the new link, and returns the result
  • urlencode(): a dictionary is first used to represent the parameters, then urlencode() serializes it into GET request parameters
  • parse_qs(): deserialization; converts GET request parameters back into a dictionary
  • parse_qsl(): similar to parse_qs(), but converts the parameters into a list of tuples
  • quote(): converts content into URL-encoded format; when a URL contains Chinese parameters, garbled text may result, and quote() can convert the Chinese characters into URL encoding
  • unquote(): URL decoding
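
A short sketch of a few of these methods, using the example link above:

from urllib.parse import urlparse, urlencode, quote, unquote

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path, result.params, result.query, result.fragment)
# http www.baidu.com /index.html user id=5 comment

params = urlencode({'name': 'germey', 'age': 22})
print('http://www.baidu.com?' + params)  # http://www.baidu.com?name=germey&age=22

keyword = quote('壁纸')   # percent-encode non-ASCII content
print(keyword)            # %E5%A3%81%E7%BA%B8
print(unquote(keyword))   # 壁纸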

requests

GET request

import requests

r = requests.get('http://httpbin.org/get')
print(r.text)

The returned result contains the request headers, URL, IP and other information.

Add additional information

It can be written directly as:

import requests

r = requests.get('http://httpbin.org/get?name=germey&age=22')

It can also be written as:

import requests

data = {
	'name':'germey',
	'age':22
}
r = requests.get("http://httpbin.org/get",params=data)
print(r.text)

The return value of the web page is of type str, but it is in JSON format. If you want to parse the result directly into a dictionary, you can write:

import requests

data = {'name':'germey','age':22}
r = requests.get("http://httpbin.org/get",params=data)
print(r.json())
  • Grab website
import requests
import re

headers = {
	'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}
r = requests.get("https://www.zhihu.com/explore",headers=headers)
pattern = re.compile('explore-feed.*?question_link.*>(.*?)</a>',re.S)
titles = re.findall(pattern,r.text)
print(titles)
  • Grab binary data
import requests

r = requests.get("https://github.com/favicon.ico")
print(r.text)    #str type data
print(r.content) #bytes type data
  • Add headers
import requests

headers = {
	'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}
r = requests.get("https://www.zhihu.com/explore",headers=headers)
print(r.text)

You can add any other field information in the headers parameter

  • POST request
import requests

data = {'name':'germey','age':'22'}
r = requests.post("http://httpbin.org/post",data=data)
print(r.text)
  • response

Properties of the object returned by the request method:

  • status_code: the response status code
  • headers: the response headers
  • cookies: the Cookies
  • url: the URL
  • history: the request history

Requests also provides a built-in status code query object, requests.codes.
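
A small sketch of comparing a response against this query object:

import requests

r = requests.get('http://httpbin.org/get')
# requests.codes.ok equals 200
print('Request successful' if r.status_code == requests.codes.ok else 'Request failed')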

Advanced Usage

  • File upload

import requests

files = {'file':open('favicon.ico','rb')}
r = requests.post("http://httpbin.org/post",files=files)
print(r.text)

The website returns a response that contains a files field, while the form field is empty, which shows that file uploads are identified by a separate files field.

  • Cookies

Use requests to get Cookies

import requests

r = requests.get("https://www.baidu.com")
for key,value in r.cookies.items():
	print(key + "=" + value)

You can call the cookies attribute to obtain the Cookies successfully; you can see that they are of type RequestsCookieJar. Use items() to convert them into a list of tuples.

Set cookies

  • The first method: log in to the corresponding website, copy the Cookie content from the request Headers, set it in the headers dictionary, and pass it as a parameter of get() when sending the request (a sketch is given after the second method below)
  • The second method: construct a RequestsCookieJar object:
import requests

cookies = 'own cookies'
jar = requests.cookies.RequestsCookieJar()
headers = {
	'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}
for cookie in cookies.split(';'):
    key,value = cookie.split('=',1)
    jar.set(key,value)
r = requests.get("https://www.zhihu.com",cookies=jar,headers=headers)

Undoubtedly, the second method is more complicated than the first, so the first method is generally used; the second only needs to be understood for now.
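
A minimal sketch of the first method, with a placeholder Cookie string standing in for one copied from your own logged-in session:

import requests

headers = {
    # Replace with the Cookie content copied from the browser's request headers
    'Cookie': 'own cookies',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
r = requests.get('https://www.zhihu.com', headers=headers)
print(r.text)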

  • Session maintenance

Use a Session object to maintain the same session:

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)

Using a Session makes it possible to stay in the same session without handling Cookies manually. It is usually used to simulate subsequent operations after a successful login.

  • SSL certificate validation

When sending an HTTPS request, requests checks the SSL certificate by default. When the website being visited does not have a valid certificate, you can change the value of verify (True by default) in the get() method:

import requests
from requests.packages import urllib3

urllib3.disable_warnings()  #Ignore warnings caused by no certificates
#Or ignore the warning by capturing the warning to the log
#logging.captureWarnings(True)
response = requests.get('https://www.12306.cn',verify=False)  #Skip certificate verification
print(response.status_code)
  • Proxy settings

Basic proxy:

import requests

proxies = {
    "http":"http://10.10.1.10:3128",
    "https":"https//10.10.1.10:1080"
}
requests.get("https://www.taobao.com",proxies=proxies)

If the proxy uses HTTP Basic Auth:

import  requests

proxies = {
    "http":"http://user:password@10.10.1.10:3128"
}
requests.get("https://www.taobao.com",proxies=proxies)

Proxies also support the SOCKS protocol:

import requests

proxies = {
    "http":"socks5://user:password@host:port",
    "https":"socks5://user:password@host:port"
}
requests.get("https://www.taobao.com",proxies=proxies)
  • Timeout setting

import requests

try:
    r = requests.get("https://www.taobao.com",timeout=1)
    print(r.status_code)
except requests.exceptions.Timeout:
    print('Time out')

In fact, a request has two phases: connecting and reading. The timeout set above is used as the total timeout for both.

If you want to specify them separately, you can pass in a tuple, such as (5, 30), for the connect and read timeouts.

If you want to wait permanently, you can directly set the timeout to None or leave it blank, because the default is None
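
A small sketch of the separate and unlimited timeout options, using http://httpbin.org/get as an example endpoint:

import requests

# 5-second connect timeout and 30-second read timeout
r = requests.get('http://httpbin.org/get', timeout=(5, 30))
print(r.status_code)

# timeout=None (the default) waits indefinitely
r = requests.get('http://httpbin.org/get', timeout=None)
print(r.status_code)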

  • Identity authentication

Use the authentication provided by requests:

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://localhost:5000',auth=HTTPBasicAuth('username','password'))
print(r.status_code)
#If the user name and password are correct, the request is authenticated automatically and 200 is returned; if authentication fails, 401 is returned

The above code can be abbreviated as:

import requests

r = requests.get('http://localhost:5000',auth=('username','password'))
print(r.status_code)

requests also provides other authentication mechanisms.

  • Prepared Request

A request can be represented as a data structure, which is called a Prepared Request.

from requests import Request,Session

url = 'http://httpbin.org/post'
data = {
    'name': 'germey'
}
headers = {
    'User-Agent': '...'
}
s = Session()
req = Request('POST',url,data=data,headers=headers)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)
  • Import Request, then construct a Request object with the url, data and headers parameters
  • Then call the Session's prepare_request() method to convert it into a Prepared Request object
  • Finally, call the send() method to send it

As you can see, we have achieved the same POST request effect

Topics: crawler