1. urllib Library
1.1 basic use
Use urllib to get the source code of the Baidu home page
```python
import urllib.request

# 1. Define the url, i.e. the address you want to visit
url = 'http://www.baidu.com'

# 2. Simulate the browser sending a request to the server
response = urllib.request.urlopen(url)

# 3. Get the source code of the page from the response
content = response.read().decode('utf-8')

# 4. Print the data
print(content)
```
The read method returns the body as bytes. To convert the bytes into a string, decode it: decode('encoding').
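A quick illustration of the bytes-to-string conversion (the byte literal is just a sample value):

```python
raw = b'\xe7\x99\xbe\xe5\xba\xa6'  # bytes, as returned by response.read() (UTF-8 encoding of 百度)
text = raw.decode('utf-8')         # -> '百度'
print(type(raw), type(text))       # <class 'bytes'> <class 'str'>
```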
1.2 One type and six methods
```python
import urllib.request

url = 'http://www.baidu.com'

# Simulate the browser sending a request to the server
response = urllib.request.urlopen(url)

# One type: response is an HTTPResponse object
print(type(response))

# Note: the body can only be consumed once; the read calls below
# are alternatives, not a sequence
# Read all bytes
content = response.read()
print(content)

# Read the first 5 bytes
content = response.read(5)
print(content)

# Read one line
content = response.readline()
print(content)

# Read line by line until the end
content = response.readlines()
print(content)

# A returned status code of 200 proves our logic is correct
print(response.getcode())

# Return the url address
print(response.geturl())

# Get the status information (response headers)
print(response.getheaders())
```
One type: HTTPResponse. Six methods: read, readline, readlines, getcode, geturl, getheaders.
1.3 download
```python
import urllib.request

# Download a web page
url_page = 'http://www.baidu.com'
# url is the download path, filename is the name of the saved file
urllib.request.urlretrieve(url_page, 'baidu.html')

# Download a picture
url_img = 'https://img1.baidu.com/it/u=3004965690,4089234593&fm=26&fmt=auto&gp=0.jpg'
urllib.request.urlretrieve(url=url_img, filename='lisa.jpg')

# Download a video
url_video = 'https://vd3.bdstatic.com/mda-mhkku4ndaka5etk3/1080p/cae_h264/1629557146541497769/mda-mhkku4ndaka5etk3.mp4?v_from_s=hkapp-haokan-tucheng&auth_key=1629687514-0-0-7ed57ed7d1168bb1f06d18a4ea214300&bcevod_channel=searchbox_feed&pd=1&pt=3&abtest='
urllib.request.urlretrieve(url_video, 'hxekyyds.mp4')
```
In Python, arguments can be passed positionally or by keyword, so both call styles above are equivalent.
1.4 customization of request object
```python
import urllib.request

url = 'https://www.baidu.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# urlopen cannot accept a headers dictionary, so we pass the
# headers through a customized Request object instead
request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request)

content = response.read().decode('utf-8')

print(content)
```
1.5 quote method of get request
If a get request parameter contains Chinese, it must be encoded, as shown below; if it is not encoded, an error is reported.
Goal: get the page source of https://www.baidu.com/s?wd=Jay Chou, whose encoded form is https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6
```python
import urllib.request
import urllib.parse

url = 'https://www.baidu.com/s?wd='

# Customizing the Request object is the first means of countering anti-crawling
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# Percent-encode 'Jay Chou' with urllib.parse
name = urllib.parse.quote('Jay Chou')

# Splice the encoded string onto the end of the path
url = url + name

# Customize the Request object
request = urllib.request.Request(url=url, headers=headers)

# Simulate the browser sending a request to the server
response = urllib.request.urlopen(request)

# Get the content of the response
content = response.read().decode('utf-8')

# Print the data
print(content)
```
quote is suitable for percent-encoding a single Chinese value.
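For reference, quote simply percent-encodes the UTF-8 bytes of the string; encoding 周杰伦 ('Jay Chou' in the original) reproduces the URL above:

```python
import urllib.parse

print(urllib.parse.quote('周杰伦'))
# %E5%91%A8%E6%9D%B0%E4%BC%A6
```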
1.6 urlencode method of get request
urlencode application scenario: when there are multiple parameters, as below.
https://www.baidu.com/s?wd=Jay Chou&sex=male
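urlencode joins the key-value pairs with & and percent-encodes the values in one step; with the original Chinese values this matches the encoded URL used in the example below:

```python
import urllib.parse

print(urllib.parse.urlencode({'wd': '周杰伦', 'sex': '男'}))
# wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7
```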
```python
# Get the page source of
# https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7
import urllib.request
import urllib.parse

base_url = 'https://www.baidu.com/s?'

data = {
    'wd': 'Jay Chou',
    'sex': 'male',
    'location': 'Taiwan Province, China'
}

new_data = urllib.parse.urlencode(data)

# Request resource path
url = base_url + new_data

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# Customize the Request object
request = urllib.request.Request(url=url, headers=headers)

# Simulate the browser sending a request to the server
response = urllib.request.urlopen(request)

# Get the page source
content = response.read().decode('utf-8')

# Print the data
print(content)
```
1.7 post request
```python
import urllib.request
import urllib.parse
import json

url = 'https://fanyi.baidu.com/sug'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

data = {
    'kw': 'spider'
}

# The parameters of a post request must be encoded
data = urllib.parse.urlencode(data).encode('utf-8')

request = urllib.request.Request(url=url, data=data, headers=headers)

# Simulate the browser sending a request to the server
response = urllib.request.urlopen(request)

# Get the response data
content = response.read().decode('utf-8')

# String -> json object
obj = json.loads(content)
print(obj)
```
The parameters of a post request must be encoded, and urlencode alone is not enough: the encode method must be called on the result, i.e. data = urllib.parse.urlencode(data).encode('utf-8'). Unlike a get request, post parameters are not spliced onto the URL; they are passed through the data parameter of the customized Request object: request = urllib.request.Request(url=url, data=data, headers=headers).
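A minimal sketch of the two steps, with the intermediate values noted in comments:

```python
import urllib.parse

data = urllib.parse.urlencode({'kw': 'spider'})  # 'kw=spider' (str)
data = data.encode('utf-8')                      # b'kw=spider' (bytes, which urlopen requires for post)
```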
1.8 exceptions
```python
import urllib.request
import urllib.error

url = 'http://www.doudan1.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.HTTPError:
    print('The system is being upgraded...')
except urllib.error.URLError:
    print('I said the system is being upgraded...')
```
1.9 handler
Why learn handler?
- urllib.request.urlopen(url) cannot customize the request header
- urllib.request.Request(url, data, headers) can customize the request header
- Handler: customize more advanced request headers (as business logic grows more complex, Request customization is no longer enough; dynamic cookies and proxies cannot be handled with a customized Request object)
```python
# Use a handler to visit Baidu and get the page source
import urllib.request

url = 'http://www.baidu.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

request = urllib.request.Request(url=url, headers=headers)

# handler -> build_opener -> open
# (1) Get a handler object
handler = urllib.request.HTTPHandler()

# (2) Get an opener object
opener = urllib.request.build_opener(handler)

# (3) Call the open method
response = opener.open(request)

content = response.read().decode('utf-8')

print(content)
```
1.10 proxies
Why do you need a proxy? Some websites forbid crawling, and if you crawl from your real IP it is easy to get blocked.
```python
import urllib.request

url = 'http://www.baidu.com/s?wd=ip'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# Customize the Request object
request = urllib.request.Request(url=url, headers=headers)

# Simulate browser access to the server; instead of
# response = urllib.request.urlopen(request)
# we open the request through a proxy
proxies = {
    'http': '118.24.219.151:16817'
}

# handler -> build_opener -> open
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)

# Get the response
content = response.read().decode('utf-8')

# Save it
with open('daili.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
```
Proxies can be obtained from services such as Kuaidaili (快代理). A proxy pool can be used instead of a single proxy, as sketched below.
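A minimal proxy-pool sketch of that idea; the proxy addresses are placeholders, not working proxies:

```python
import random
import urllib.request

# Placeholder proxies; replace with real ones from a proxy service
proxies_pool = [
    {'http': '118.24.219.151:16817'},
    {'http': '118.24.219.151:16818'},
]

# Pick a random proxy for each request
proxies = random.choice(proxies_pool)
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
```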
2. Parsing techniques
2.1 xpath
xpath installation and loading
1. Install lxml Library
pip install lxml -i https://pypi.douban.com/simple
2. Import lxml.etree
from lxml import etree
3. etree.parse() parses local files
html_tree = etree.parse('XX.html')
4. etree.HTML() parses server response files
html_tree = etree.HTML(response.read().decode('utf-8'))
5. Parse and obtain DOM elements
html_tree.xpath('xpath expression')
With the xpath chrome plugin installed, press ctrl + shift + x to open it.
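The example further below only demonstrates etree.parse, so here is a minimal sketch of step 4, etree.HTML on a server response, reusing the urllib pattern from section 1:

```python
import urllib.request
from lxml import etree

url = 'http://www.baidu.com'
response = urllib.request.urlopen(url)

# Build the tree from the response body instead of a local file
html_tree = etree.HTML(response.read().decode('utf-8'))

# Extract the page title with an xpath expression
print(html_tree.xpath('//title/text()'))
```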
Basic xpath syntax
1. Path query
//: find all descendant nodes, regardless of hierarchy
/: find direct child nodes
2. Predicate query
//div[@id]
//div[@id="maincontent"]
3. Attribute query
//@class
4. Fuzzy query
//div[contains(@id, "he")]
//div[starts-with(@id, "he")]
5. Content query
//div/h1/text()
6. Logical operation
//div[@id="head" and @class="s_down"]undefined//title | //price
Example:
```python
from lxml import etree

# xpath parses a local file
tree = etree.parse('test.html')

# Find the li tags under body/ul
li_list = tree.xpath('//body/ul/li')

# Find all li tags that have an id attribute; text() gets the text inside the tag
li_list = tree.xpath('//ul/li[@id]/text()')

# Find the li tag whose id is l1; mind the quotation marks
li_list = tree.xpath('//ul/li[@id="l1"]/text()')

# Find the class attribute value of the li tag whose id is l1
li = tree.xpath('//ul/li[@id="l1"]/@class')

# Find the li tags whose id contains l
li_list = tree.xpath('//ul/li[contains(@id,"l")]/text()')

# Find the li tags whose id starts with l
li_list = tree.xpath('//ul/li[starts-with(@id,"l")]/text()')

# Find the li tag whose id is l1 and class is c1
li_list = tree.xpath('//ul/li[@id="l1" and @class="c1"]/text()')

li_list = tree.xpath('//ul/li[@id="l1"]/text() | //ul/li[@id="l2"]/text()')
```
2.2 JsonPath
Note: jsonpath only parses local files (JSON already loaded into a Python object); it cannot fetch from a URL by itself.
Installation and use of jsonpath
pip installation:
pip install jsonpath
Use of jsonpath:
obj = json.load(open('xx.json', 'r', encoding='utf-8'))
ret = jsonpath.jsonpath(obj, 'jsonpath syntax')
Example:
{ "store": { "book": [ { "category": "Xiuzhen", "author": "Liudao", "title": "How do bad guys practice", "price": 8.95 }, { "category": "Xiuzhen", "author": "Silkworm potato", "title": "Break through the sky", "price": 12.99 }, { "category": "Xiuzhen", "author": "Tang family San Shao", "title": "Douluo continent", "isbn": "0-553-21311-3", "price": 8.99 }, { "category": "Xiuzhen", "author": "Third uncle of Southern Sect", "title": "Star change", "isbn": "0-395-19395-8", "price": 22.99 } ], "bicycle": { "author": "Old horse", "color": "black", "price": 19.95 } } }
Parse the above json data. For specific syntax, refer to the following blog:
https://blog.csdn.net/luxideyao/article/details/77802389
```python
import json
import jsonpath

obj = json.load(open('jsonpath.json', 'r', encoding='utf-8'))

# The authors of all the books in the store
author_list = jsonpath.jsonpath(obj, '$.store.book[*].author')

# All authors
author_list = jsonpath.jsonpath(obj, '$..author')

# All elements under store
tag_list = jsonpath.jsonpath(obj, '$.store.*')

# The price of everything in the store
price_list = jsonpath.jsonpath(obj, '$.store..price')

# The third book
book = jsonpath.jsonpath(obj, '$..book[2]')

# The last book
book = jsonpath.jsonpath(obj, '$..book[(@.length-1)]')

# The first two books
book_list = jsonpath.jsonpath(obj, '$..book[0,1]')
book_list = jsonpath.jsonpath(obj, '$..book[:2]')

# Conditional filtering: put a ?() filter inside the brackets
# Filter out all the books that have an isbn
book_list = jsonpath.jsonpath(obj, '$..book[?(@.isbn)]')

# Which books cost more than 10 yuan
book_list = jsonpath.jsonpath(obj, '$..book[?(@.price>10)]')
```
2.3 BeautifulSoup
Basic introduction
- BeautifulSoup is commonly abbreviated bs4
- What is BeautifulSoup? BeautifulSoup, like lxml, is an HTML parser; its main function is to parse and extract data
- Advantages and disadvantages
Disadvantage: less efficient than lxml
Advantage: the interface is user-friendly and easy to use
Install and create
- install
> pip install bs4 -i https://pypi.douban.com/simple
- Import
> from bs4 import BeautifulSoup
- create object
* Create the object from a server response
> soup = BeautifulSoup(response.read().decode(), 'lxml')
* Create the object from a local file
> soup = BeautifulSoup(open('1.html'), 'lxml')
Note: open uses the platform default encoding (often gbk on Windows), so you need to specify the encoding when opening the file, e.g. `open('1.html', encoding='utf-8')`
Node location
1. Find nodes by tag name
soup.a  # can only find the first a tag
soup.a.name
soup.a.attrs
2. Functions
- find (returns one object)
> find('a'): finds only the first a tag
> find('a', title='name')
> find('a', class_='name')
- find_all (returns a list)
> find_all('a'): finds all a tags
> find_all(['a', 'span']): returns all a and span tags
> find_all('a', limit=2): only the first two a tags
- select (gets node objects according to a selector) [☆☆]
* `element`: p
* `.class`: .firstname
* `#id`: #firstname
* attribute selectors:
  `[attribute]`: li = soup.select('li[class]')
  `[attribute=value]`: li = soup.select('li[class="hengheng1"]')
* level selectors:
  `div p`: descendant selector
  `div > p`: child selector (the first-level child tags of a tag)
  `div, p`: all objects of the div or p tags
Node information
- Get node content: applies when tags are nested inside tags
> obj.string
> obj.get_text() [recommended]
- Node attributes
> tag.name: gets the tag name
> tag.attrs: returns the attribute values as a dictionary
- Get node attributes
> obj.attrs.get('title') [common]
> obj.get('title')
> obj['title']
Example:
```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <div>
        <ul>
            <li id="l1">Zhang San</li>
            <li id="l2">Li Si</li>
            <li>Wang Wu</li>
            <a href="" id="" class="a1">Shang Silicon Valley</a>
            <span>Hey, hey, hey</span>
        </ul>
    </div>
    <a href="" title="a2">Baidu</a>
    <div id="d1">
        <span>Ha ha ha</span>
    </div>
    <p id="p1" class="p1">Interesting</p>
</body>
</html>
```
Parse the above html using BeautifulSoup
```python
from bs4 import BeautifulSoup

# open uses the platform default encoding (often gbk), so specify utf-8
soup = BeautifulSoup(open('bs4 Basic use of.html', encoding='utf-8'), 'lxml')

# Find a node by tag name; only the first match is returned
print(soup.a)

# Get the attributes and attribute values of the tag
print(soup.a.attrs)

# Some bs4 functions
# (1) find: returns the first matching node
print(soup.find('a'))

# Find the tag object by the value of title
print(soup.find('a', title="a2"))

# Find the tag object by the value of class; note the trailing underscore on class_
print(soup.find('a', class_="a1"))

# (2) find_all: returns a list of all a tags
print(soup.find_all('a'))

# To get the data of multiple tags, pass a list to find_all
print(soup.find_all(['a', 'span']))

# limit returns only the first few matches
print(soup.find_all('li', limit=2))

# (3) select (recommended)
# select returns a list of all matches
print(soup.select('a'))

# A leading . stands for class: the class selector
print(soup.select('.a1'))

print(soup.select('#l1'))

# Attribute selectors: find tags by attribute
# Find the li tags that have an id
print(soup.select('li[id]'))

# Find the li tag whose id is l2
print(soup.select('li[id="l2"]'))

# Level selectors
# Descendant selector: find li under div
print(soup.select('div li'))

# Child selector: the first-level child tags of a tag
print(soup.select('div > ul > li'))

# Find all objects of the a tags and li tags
print(soup.select('a,li'))

# Get node content
obj = soup.select('#d1')[0]
# If the tag contains only text, both string and get_text() work
# If the tag contains other tags besides text, string returns no data
# while get_text() still does, so get_text() is recommended
print(obj.string)
print(obj.get_text())

# Node attributes
obj = soup.select('#p1')[0]
# name is the tag name
print(obj.name)
# attrs returns the attribute values as a dictionary
print(obj.attrs)

# Get the node's attributes
print(obj.attrs.get('class'))
print(obj.get('class'))
print(obj['class'])
```