Using basic crawler libraries

Posted by killah on Sun, 16 Jan 2022 15:38:04 +0100

Preface

Python's strength lies not only in its simplicity, but also in its full-featured, rich libraries, such as the basic HTTP libraries urllib and requests.

1. Introduction to the urllib library

The urllib library is Python's built-in HTTP request library. It includes the following four modules:

  1. request: the most basic HTTP request module, used to simulate sending requests. It can also handle authorization verification, redirection, cookies, and so on
  2. error: the exception handling module
  3. parse: provides many methods for URL processing, such as splitting
  4. robotparser: mainly used to parse websites' robots.txt files

The request module

Functions

1.urllib.request.urlopen(url, data=None, [timeout,], cafile=None, capath=None, context=None)
Parameter introduction:
		url: omitted
		data: optional parameter. When you use it, you must convert the data to the bytes type. It is also worth mentioning that bytes() only converts from str, so if your data is in dictionary form rather than string form ("hello=word"), consider using the urllib.parse.urlencode(dict) function to convert it to a string first.
		timeout: sets the timeout in seconds; on timeout a URLError is raised.
		Other parameters: rarely used and omitted here. cafile and capath specify the CA certificate and its path, respectively.

Example:

import urllib.request
from urllib.error import URLError

# the POST body must be bytes
data1 = bytes("hello=word", encoding="utf-8")
print(data1)
try:
    # the very short timeout makes a URLError likely
    response = urllib.request.urlopen("http://httpbin.org/post", data=data1, timeout=0.1)
    print(response.read())
except URLError as e:
    print(e)

Explanation:

  1. Pay attention to how the module is imported.
    Wrong: import urllib and then call urllib.request.urlopen.
    Correct: import urllib.request.
  2. The returned response object has a status attribute (the returned status code) and a read() method.
  3. response.read() returns the bytes type, so it is worth noting that you usually need to decode it before use, as in the sketch below.
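A minimal sketch putting points 2 and 3 together (httpbin.org is used purely as a test endpoint):

import urllib.request

response = urllib.request.urlopen("http://httpbin.org/get")
print(response.status)               # the HTTP status code, e.g. 200
body = response.read()               # bytes
print(body.decode("utf-8"))          # decode to str before further processing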

Request class

urlopen can send requests not only through plain parameters but also through a Request object: encapsulate the required parameters in a Request instance, then pass it to urlopen(request_object). The benefit: urlopen alone can only send the most basic requests and cannot build a complete request (custom headers and so on); the Request class helps build one.

urllib.request.Request(url, data=None, headers={},
             origin_req_host=None, unverifiable=False, method=None)
	Parameter introduction:
		headers: the request-header dictionary
		origin_req_host: the host name or IP address of the requesting party
		unverifiable: rarely used, omitted here
		method: the request method, such as GET (idempotent) or POST.

Example

import urllib.request
import urllib.parse
url='http://httpbin.org/post'
headers={
    'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67"
    ,'Host':'httpbin.org'
}
dict_={
    'name':'canglaoshi'
}
data=bytes(urllib.parse.urlencode(dict_),encoding='utf-8')  # the encoding parameter is required
req=urllib.request.Request(url=url,data=data,headers=headers,method='POST')
response = urllib.request.urlopen(req,timeout=1)
print(response.read().decode("utf-8"))
'''
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "canglaoshi"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "15", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67", 
    "X-Amzn-Trace-Id": "Root=1-60f13813-44c50ab14d0e6cdb0be9c906"
  }, 
  "json": null, 
  "origin": "220.177.100.106", 
  "url": "http://httpbin.org/post"
}
'''

Handler

The Request class above still cannot solve problems such as cookies and proxies. The Handler tools can help us solve them.
General usage:
1. Fill the parameters into the Handler class that matches your need.
2. Pass this object into build_opener() to build an opener.
3. Call the opener's open(url) to send the request.

Handler types:

	ProxyHandler: used to set a proxy; the parameter is a dictionary whose keys are protocol types
	HTTPPasswordMgr: used to manage passwords
	HTTPBasicAuthHandler: used to manage authentication
	HTTPCookieProcessor: used to handle cookies
from urllib.request import HTTPBasicAuthHandler, HTTPPasswordMgrWithDefaultRealm, build_opener
from urllib.error import URLError
username='username'
password='password'
url='http://localhost:5000/'
p=HTTPPasswordMgrWithDefaultRealm()
p.add_password(None,url,username,password)
auth_handler=HTTPBasicAuthHandler(p)
opener=build_opener(auth_handler)
try:
    response=opener.open(url)
    html=response.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
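Similarly, a minimal sketch of ProxyHandler and HTTPCookieProcessor (the proxy address 127.0.0.1:9743 is only a placeholder; replace it with a working proxy before running that part):

import http.cookiejar
from urllib.request import ProxyHandler, HTTPCookieProcessor, build_opener

# ProxyHandler: a dictionary whose keys are protocol types (placeholder address)
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743',
})
proxy_opener = build_opener(proxy_handler)
# proxy_opener.open('http://httpbin.org/get')  # would send the request through the proxy

# HTTPCookieProcessor: stores the cookies the server returns in a CookieJar
cookie_jar = http.cookiejar.CookieJar()
cookie_opener = build_opener(HTTPCookieProcessor(cookie_jar))
cookie_opener.open('http://www.baidu.com')
for cookie in cookie_jar:
    print(cookie.name, '=', cookie.value)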

The error module: exception handling

URLError: URL errors, such as opening a non-existent web page. It has a reason attribute that returns the cause of the error.
HTTPError: a subclass of URLError for handling HTTP request errors, such as a failed authentication request. Its code attribute returns the status code, reason returns the cause of the error, and headers returns the response headers.
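A minimal sketch of the usual pattern: catch the subclass HTTPError first, then the base URLError (the /status/404 endpoint of httpbin.org deliberately returns a 404):

from urllib import request
from urllib.error import HTTPError, URLError

try:
    response = request.urlopen('http://httpbin.org/status/404')
except HTTPError as e:            # the subclass must be caught first
    print(e.code, e.reason)       # e.g. 404 NOT FOUND
    print(e.headers)              # the response headers
except URLError as e:             # then the base class
    print(e.reason)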

The parse module: parsing links

It defines a standard interface for handling URLs, such as extracting, merging, and converting the parts of a URL.

Functions

  • urlparse(url) identifies and splits a URL, returning a ParseResult object with six parts: the protocol (scheme), domain name (netloc), access path (path), parameters (params), the query conditions after ? (query), and the fragment.
from urllib.parse import urlparse
result = urlparse("https:www.baidu.com//index.html;user?id=5")
# note: without "//" after the scheme, netloc stays empty
print(result)  # ParseResult(scheme='https', netloc='', path='www.baidu.com//index.html', params='user', query='id=5', fragment='')
  • urlunparse(parts) is the inverse of urlparse; the argument sequence must have exactly six elements
  • urlsplit() is similar to urlparse, except that the result merges the access path and the parameters (five parts instead of six)
  • urlunsplit() is the inverse of urlsplit
  • urljoin(base_url, new_url) parses base_url and uses its protocol, domain name, and access path to supplement the missing parts of new_url
  • urlencode(dict): serializes dictionary elements into "key=value" form. A few sketches follow.
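A few sketches of these functions (all values are illustrative):

from urllib.parse import urlunparse, urljoin, urlencode

# urlunparse: the argument must be an iterable with exactly six elements
parts = ['https', 'www.baidu.com', 'index.html', 'user', 'id=5', '']
print(urlunparse(parts))   # https://www.baidu.com/index.html;user?id=5

# urljoin: the protocol, domain name and path of base_url fill in what new_url lacks
print(urljoin('https://www.baidu.com', 'index.html'))  # https://www.baidu.com/index.html

# urlencode: serializes a dictionary into key=value pairs
print(urlencode({'name': 'canglaoshi', 'age': 18}))    # name=canglaoshi&age=18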

Robots protocol section

Robots is also called the crawler protocol or robot protocol. It tells crawlers and search engines which pages may be crawled. It usually takes the form of a robots.txt file.
Its usage: see the sketch below.
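A minimal sketch with the robotparser module mentioned earlier (the target site is only an example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()   # fetch and parse robots.txt
# may a crawler with this user agent fetch the given URL?
print(rp.can_fetch('*', 'https://www.baidu.com/s?wd=python'))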

Requests Library

urllib is rather inconvenient to use. For more convenient operation, you can use the requests library.

Methods

1.requests.get(url, params=None, headers={}, proxies=None, timeout=None, ...). Note that here the data does not need to be converted to the bytes type; a dictionary works directly.
             The returned object has status_code, headers, cookies, url, history (the request history), text, content and other attributes. It is worth noting the two body attributes text and content:
             both obtain the content of the returned object, but the former is str and the latter is bytes. (Cookies can be written into headers.) Note: not all parameters are listed here.
2.Similarly, there is a post() method.
3.requests also has a built-in status-code lookup object, requests.codes; for example, requests.codes.ok == 200. A combined sketch follows this list.
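A small sketch combining these points (httpbin.org simply echoes the request back):

import requests

params = {'name': 'canglaoshi'}        # passed as the query string, no bytes conversion needed
r = requests.get('http://httpbin.org/get', params=params, timeout=5)

if r.status_code == requests.codes.ok:   # built-in status-code lookup, equal to 200
    print(r.url)                         # http://httpbin.org/get?name=canglaoshi
    print(r.text[:60])                   # body as str
    print(r.content[:60])                # body as bytes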

Example: fetch the Blog Garden (cnblogs) logo

import requests
r=requests.get('https://www.cnblogs.com/images/logo.svg?v=R9M0WmLAIPVydmdzE2keuvnjl-bPR7_35oHqtiBzGsM')
with open("bokeyuan.svg", "wb") as f:
    f.write(r.content)

Use example

File upload

import requests
files={"file":open('bokeyuan.svg','rb')}
r=requests.post('http://httpbin.org/post',files=files)

Authentication

import requests
r=requests.get("xxxx",auth=('username','password'))

Prepared requests

from requests import Request, Session
url='http://httpbin.org/post'
data={"name":"alex"}
headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67",

}
s = Session()
req=Request('POST',url,data=data,headers=headers)
prepped=s.prepare_request(req)
r=s.send(prepped)
print(r.text)

Regular expressions

Regular expressions can also match content in strings, and there are online regex testing tools (for example on oschina.net), but this matching approach is not recommended here.

XPath parsing library

String matching can be done with regular expressions, but although web pages are also strings, extracting information from them that way is not easy. Parsing libraries solve this problem; they include XPath (via lxml), Beautiful Soup, and pyquery.
XPath's selection capability is very powerful. It provides concise path-selection expressions, and its expressions and functions can match all kinds of nodes.

Steps for using XPath

1. Construct an XPath parsing object:
	(1) Import the etree module of the lxml library, declare a piece of HTML text,
and call etree.HTML(html_text) to build the object.
If the HTML text is incomplete, etree.tostring(html_object) returns the auto-completed markup.
	(2) Or read a file directly: etree.parse(html_file_path, etree.HTMLParser()) constructs the parsing object.
2. Call parsing_object.xpath(expression).

Example

text='''<ul>
<li><a>this is a</a></li>
<li class='li2'></li>

</ul>'''
from lxml import etree
html=etree.HTML(text)
# result=etree.tostring(html)
s=html.xpath("//ul//a/text()")
print(s[0])# this is a

Common rules for XPath

		nodename			Selects all child nodes of this node
		/					Selects a direct child node from the current node
		//					Selects descendant nodes from the current node
		@					Selects attributes
		text()				Gets the text
		*					Matches all nodes
		.. or parent::		Selects the parent node
 Some descriptions and examples of the rules:
	When writing an expression, make sure it matches the structure of the HTML text. For example, to get the a tags under a ul tag, //ul/a cannot match them, because the direct children of ul are li tags (use //ul//a instead).
	Attribute matching: match by a tag's attributes with [@attribute=value]. For example, to match the a tags under ul whose class attribute is 'classa': //ul//a[@class='classa']. However, if a tag's class attribute has more than one value, such as <a class="classA classB"></a>, the exact match fails to find the a tags satisfying the classA condition; use [contains(@class, 'classA')] instead. Operators such as and and or are also supported inside [].
	Getting attributes: @attribute (for example //a/@href)
	Selecting in order: [num] selects the num-th node, [last()] the last one, [position()<num] the nodes whose position is less than num. Usage: //li[2] matches the second li, //li[last()] the last.
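A small sketch of attribute matching and ordered selection under the rules above (the HTML snippet is made up for illustration):

from lxml import etree

text = '''
<ul>
  <li><a class="classA classB">first</a></li>
  <li><a class="classA">second</a></li>
  <li><a>third</a></li>
</ul>
'''
html = etree.HTML(text)

# exact attribute match: only finds a tags whose class is exactly "classA"
print(html.xpath('//ul//a[@class="classA"]/text()'))             # ['second']
# contains(): also matches multi-valued class attributes like "classA classB"
print(html.xpath('//ul//a[contains(@class, "classA")]/text()'))  # ['first', 'second']
# select in order: the text of the last li's a tag
print(html.xpath('//li[last()]/a/text()'))                       # ['third']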

Knowledge supplement

HTML attributes

HTML tags can have attributes. Attributes provide more information about HTML elements.
Attributes always appear as name/value pairs, for example: name="value".
Attributes are always specified in the start tag of an HTML element.

What kind of website is http://httpbin.org/xx?

httpbin is a free HTTP request and response testing website.

Topics: Python crawler