Preface
Python's strength lies not only in its simplicity but also in its rich, full-featured libraries, such as the basic HTTP libraries urllib and requests.
1. Introduction to the urllib Library
urllib is Python's built-in HTTP request library. It includes the following four modules:
- request: the most basic HTTP request module, used to simulate sending requests; it can also handle authorization verification, redirection, cookies, and so on
- error: the exception handling module
- parse: provides many methods for URL processing, such as splitting and merging
- robotparser: mainly used to parse a website's robots.txt file
The request module
Functions
1. urllib.request.urlopen(url, data=None, [timeout,], cafile=None, capath=None, context=None)
Parameter introduction:
- url: the target URL.
- data: optional. When you use this parameter, the data must be converted to the bytes type. It is worth mentioning that bytes() only converts from str; if your data is in dictionary form such as {"hello": "word"} rather than string form, consider converting it first with urllib.parse.urlencode(dict).
- timeout: the timeout in seconds; on timeout a URLError is raised.
- cafile, capath: specify a CA certificate and its directory, respectively. The remaining parameters are rarely needed and are omitted here.
Example:
import urllib.request
import urllib.parse
from urllib.error import URLError

data1 = bytes("hello=word", encoding="utf-8")
print(data1)
try:
    response = urllib.request.urlopen("http://httpbin.org/post", data=data1, timeout=0.1)
    print(response.read())
except URLError as e:
    print(e)
Explanation:
- Pay attention to how the modules are imported.
  Wrong: import urllib and then call urllib.request.urlopen.
  Correct: import urllib.request.
- The returned response object has a status attribute (the status code) and a read() method.
- response.read() returns the bytes type; decode it (e.g. with .decode("utf-8")) before using it as text, as in the short sketch below.
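For instance, a minimal sketch (assuming httpbin.org is reachable) of inspecting the status attribute and decoding the body:

import urllib.request

response = urllib.request.urlopen("http://httpbin.org/get", timeout=5)
print(response.status)                   # HTTP status code, e.g. 200
print(response.getheaders())             # list of (name, value) response-header tuples
print(response.read().decode("utf-8"))   # decode bytes into str before using as text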
Request class
urlopen can send requests not only through its own parameters but also through a Request object: encapsulate the required parameters in a Request instance, then pass it to urlopen(request_object). The benefit: urlopen alone can only send the most basic requests and cannot build a complete one (custom headers, request method, and so on); the Request class helps build it.
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
Parameter introduction:
- headers: the request-header dictionary
- origin_req_host: the host name or IP address of the requesting party
- unverifiable: rarely used; omitted here
- method: the request method, such as 'GET' or 'POST'
Example
import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67",
    'Host': 'httpbin.org'
}
dict_ = {'name': 'canglaoshi'}
data = bytes(urllib.parse.urlencode(dict_), encoding='utf-8')  # the encoding parameter cannot be omitted
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req, timeout=1)
print(response.read().decode("utf-8"))
'''
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "canglaoshi"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "15",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67",
    "X-Amzn-Trace-Id": "Root=1-60f13813-44c50ab14d0e6cdb0be9c906"
  },
  "json": null,
  "origin": "220.177.100.106",
  "url": "http://httpbin.org/post"
}
'''
Handler
The Request class above still can't solve problems such as cookies and proxies. Handlers can help us solve them.
General usage:
1. Pass the required parameters into the Handler class that matches your need.
2. Pass this handler into build_opener() to build an opener.
3. Call the opener's open(url) to send the request.
Handler types:
- ProxyHandler: used to set a proxy; the parameter is a dictionary whose keys are protocol types
- HTTPPasswordMgr: used to manage passwords
- HTTPBasicAuthHandler: used to manage authentication
- HTTPCookieProcessor: used to handle cookies
from urllib.request import HTTPBasicAuthHandler, HTTPPasswordMgrWithDefaultRealm, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)
try:
    response = opener.open(url)
    html = response.read().decode('utf-8')  # decode, not encode: read() returns bytes
    print(html)
except URLError as e:
    print(e.reason)
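Cookies work the same way through HTTPCookieProcessor. A minimal sketch, using www.baidu.com purely as an example of a reachable site that sets cookies:

import http.cookiejar
import urllib.request

# Collect the cookies the site sets and print them.
cookie_jar = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(handler)
opener.open('http://www.baidu.com')
for cookie in cookie_jar:
    print(cookie.name, '=', cookie.value)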
The error module (exception handling)
- URLError: raised for URL errors, such as opening a non-existent web page. It has a reason attribute that returns the cause of the error.
- HTTPError: a subclass of URLError that handles HTTP request errors, such as a failed authentication request. Attributes: code returns the status code, reason returns the cause of the error, and headers returns the response headers.
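A minimal sketch of the usual pattern, catching the more specific HTTPError before falling back to URLError (httpbin's /status/404 endpoint is used here just to force an HTTP error):

from urllib import request, error

try:
    response = request.urlopen('http://httpbin.org/status/404')
except error.HTTPError as e:
    # HTTP-level failure: code, reason, and the response headers are available.
    print(e.code, e.reason)
    print(e.headers)
except error.URLError as e:
    # Lower-level failure (DNS error, refused connection, timeout, ...).
    print(e.reason)
else:
    print('Request succeeded')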
The parse module (parsing links)
It defines a standard interface for processing URLs, such as extracting, merging, and converting the parts of a URL.
Functions
- urlparse(url) identifies and splits a URL, returning a ParseResult object with six parts: scheme (protocol), netloc (domain), path, params, query (the conditions after ?), and fragment.
from urllib.parse import urlparse

result = urlparse("https:www.baidu.com//index.html;user?id=5")
print(result)
# ParseResult(scheme='https', netloc='', path='www.baidu.com//index.html', params='user', query='id=5', fragment='')
# note: without // after the scheme, netloc stays empty
- urlunparse(parts) is the inverse of urlparse; the parameter sequence must have exactly six elements.
- urlsplit(url) is similar to urlparse, except that it does not split out params separately: params stays combined with the path, giving five parts.
- urlunsplit(parts) is the inverse of urlsplit; the sequence must have five elements.
- urljoin(base_url, new_url) parses the scheme, domain, and path of base_url and uses them to supplement the missing parts of new_url.
- urlencode(dict) serializes dictionary elements into "key=value" form, which is handy for constructing query strings (see the sketch after this list).
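A short sketch exercising urlunparse, urljoin, and urlencode (the URLs and values are illustrative):

from urllib.parse import urlunparse, urljoin, urlencode

# urlunparse: the sequence must contain exactly six parts.
parts = ['https', 'www.baidu.com', 'index.html', 'user', 'id=5', '']
print(urlunparse(parts))
# https://www.baidu.com/index.html;user?id=5

# urljoin: the scheme, domain, and path of base_url fill in what new_url lacks.
print(urljoin('https://www.baidu.com/index.html', 'FAQ.html'))
# https://www.baidu.com/FAQ.html

# urlencode: serialize a dict into key=value pairs.
params = {'name': 'canglaoshi', 'age': 3}
print('http://httpbin.org/get?' + urlencode(params))
# http://httpbin.org/get?name=canglaoshi&age=3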
Robots protocol
The robots protocol, also called the crawler protocol or robot exclusion protocol, tells crawlers and search engines which pages may be crawled. It usually takes the form of a robots.txt file.
Its usage: see the brief sketch below.
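A minimal sketch with urllib.robotparser, using www.baidu.com purely as an example target:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()  # download and parse robots.txt
# can_fetch(useragent, url): may this user agent crawl this URL?
print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))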
2. The Requests Library
urllib can be inconvenient to use; for more convenient operation, you can use the requests library.
Methods
1. requests.get(url, params=None, headers={}, proxies=None, timeout=None). Note that here the data does not need to be converted to the bytes type; a dictionary is fine. The returned object has status_code, headers, cookies, url, history (the request history), and other attributes. It is worth noting that there are two body attributes, text and content: both return the content of the response, but the former is text (str) and the latter is bytes. (Cookies can be written into the headers.) Note: not all parameters are listed here.
2. Similarly, there is a post method.
3. requests also has a built-in status-code lookup object, requests.codes; for example, requests.codes.ok == 200.
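For example, a minimal sketch of a GET request with query parameters and a status-code check (against httpbin.org):

import requests

params = {'name': 'canglaoshi'}
r = requests.get('http://httpbin.org/get', params=params, timeout=5)
# compare against the built-in status-code lookup
if r.status_code == requests.codes.ok:
    print(r.text)          # body as str
    print(len(r.content))  # body as bytes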
Example: fetch the cnblogs ("blog garden") logo
import requests

r = requests.get('https://www.cnblogs.com/images/logo.svg?v=R9M0WmLAIPVydmdzE2keuvnjl-bPR7_35oHqtiBzGsM')
with open("bokeyuan.svg", "wb") as f:
    f.write(r.content)
Usage examples
File upload
import requests

files = {"file": open('bokeyuan.svg', 'rb')}
r = requests.post('http://httpbin.org/post', files=files)
Authentication
import requests

r = requests.get("xxxx", auth=('username', 'password'))
Prepared Requests
from requests import Request, Session

url = 'http://httpbin.org/post'
data = {"name": "alex"}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67",
}
s = Session()
req = Request('POST', url, data=data, headers=headers)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)
Regular expressions
Online regular-expression testing tools (for example, the matching tool at oschina.net) are available, but they are not recommended.
XPath parsing library
Strings can be matched with regular expressions, and although web pages are also strings, extracting information from them that way is not easy. Using a parsing library solves this problem; parsing libraries include XPath (via lxml), Beautiful Soup, and pyquery.
XPath's selection functionality is very powerful. It provides concise path-selection expressions, and its expressions and built-in functions can match virtually any node.
Steps for using XPath
1. Construct an XPath parse object:
   (1) Import the etree module from the lxml library, declare a piece of HTML text, and call etree.HTML(html_text). If the HTML text is incomplete, it can be auto-completed via etree.tostring(html_object).
   (2) Or read a file directly: etree.parse(html_file_path, etree.HTMLParser()) constructs the parse object.
2. Call parse_object.xpath(expression).
Example
from lxml import etree

text = '''<ul>
<li><a>this is a</a></li>
<li class='li2'></li>
</ul>'''

html = etree.HTML(text)
# result = etree.tostring(html)
s = html.xpath("//ul//a/text()")
print(s[0])  # this is a
Common rules for XPath
- nodename: selects all child nodes of this node
- / : selects a direct child node from the current node
- // : selects descendant nodes from the current node
- @ : selects attributes
- text(): gets text
- * : matches all nodes
- .. or parent:: : selects the parent node

Some notes and examples on the rules:

When writing expressions, make sure the expression matches the structure of the HTML text. For example, to get the a tags under a ul tag, //ul/a cannot match them, because the direct children of ul are li tags; use //ul//a instead.

Attribute matching: use a tag's attributes to match with [@attribute='value']. For example, to match the a tags under ul whose class attribute is classa: //ul//a[@class='classa']. However, if a tag's class attribute has more than one value, such as <a class="classA classB"></a>, you will find that [@class='classA'] cannot match it; use [contains(@class, 'classA')] instead. Operators such as and and or are also supported inside [].

Getting attributes: @attribute selects the attribute's value.

Selecting by order: [num] selects the num-th node, [last()] the last one, and [position() < num] those at positions less than num. For example: //li[2] matches the second li and //li[last()] matches the last li.
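A short lxml sketch of attribute matching with contains() and selecting by order; the HTML snippet is made up for illustration:

from lxml import etree

text = '''<ul>
<li><a class="classA classB">first</a></li>
<li><a class="classA">second</a></li>
<li><a class="classC">third</a></li>
</ul>'''
html = etree.HTML(text)

# [@class='classA'] only matches the exact attribute value ...
print(html.xpath("//ul//a[@class='classA']/text()"))            # ['second']
# ... while contains() also matches multi-valued class attributes.
print(html.xpath("//ul//a[contains(@class, 'classA')]/text()"))  # ['first', 'second']
# Select by order: the second li and the last li.
print(html.xpath("//li[2]/a/text()"))       # ['second']
print(html.xpath("//li[last()]/a/text()"))  # ['third']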
Knowledge supplement
HTML attributes
HTML tags can have attributes. Attributes provide additional information about HTML elements.
Attributes always appear as name/value pairs, for example: name="value".
Attributes are always specified in the start tag of an HTML element.
What kind of website is http://httpbin.org?
httpbin is a free HTTP request and response testing website.