[python crawler] urllib usage and page parsing

Posted by Anarchatos on Sun, 12 Dec 2021 03:46:10 +0100

1. urllib Library

1.1 basic use

Use urllib to get the source code of Baidu home page

import urllib.request

# 1. Define a url that is the address you want to visit
url = 'http://www.baidu.com'

# 2. Simulate the browser sending a request to the server
response = urllib.request.urlopen(url)

# 3. Get the source code of the page in the response
content = response.read().decode('utf-8')

# 4. Print data
print(content)

The read method returns the response body as bytes. To convert the bytes into a string, decode them with decode('encoding'), for example decode('utf-8').
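
A minimal sketch of the difference, assuming response = urllib.request.urlopen(url) has just been called and not yet read:

raw = response.read()             # bytes, e.g. b'<html>...'
text = raw.decode('utf-8')        # str, ready to print or parse
print(type(raw), type(text))      # <class 'bytes'> <class 'str'>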

1.2 One type and six methods

import urllib.request

url = 'http://www.baidu.com'

# Impersonate the browser to send a request to the server
response = urllib.request.urlopen(url)

# One type: response is an http.client.HTTPResponse object
print(type(response))

# read() with no argument reads the entire response body (the body can only be read once, so run these read examples one at a time)
content = response.read()
print(content)

# read(5) returns at most five bytes
content = response.read(5)
print(content)

# Read one line
content = response.readline()
print(content)

# Read line by line until the end
content = response.readlines()
print(content)

# getcode() returns the status code; 200 means the request succeeded
print(response.getcode())

# The url address is returned
print(response.geturl())

# getheaders() returns the response headers
print(response.getheaders())

One type: HTTPResponse. Six methods: read, readline, readlines, getcode, geturl, getheaders.

1.3 download

import urllib.request

# Download Web page
url_page = 'http://www.baidu.com'

# urlretrieve(url, filename): url is the download address, filename is the name to save the file as
urllib.request.urlretrieve(url_page,'baidu.html')

# Download pictures
url_img = 'https://img1.baidu.com/it/u=3004965690,4089234593&fm=26&fmt=auto&gp=0.jpg'
urllib.request.urlretrieve(url= url_img,filename='lisa.jpg')

# Download Video
url_video = 'https://vd3.bdstatic.com/mda-mhkku4ndaka5etk3/1080p/cae_h264/1629557146541497769/mda-mhkku4ndaka5etk3.mp4?v_from_s=hkapp-haokan-tucheng&auth_key=1629687514-0-0-7ed57ed7d1168bb1f06d18a4ea214300&bcevod_channel=searchbox_feed&pd=1&pt=3&abtest='

urllib.request.urlretrieve(url_video,'hxekyyds.mp4')

In Python, the arguments can be passed by keyword (url=..., filename=...) or positionally.
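
For example, these two urlretrieve calls are equivalent (the file name here is arbitrary):

urllib.request.urlretrieve('http://www.baidu.com', 'baidu.html')                 # positional arguments
urllib.request.urlretrieve(url='http://www.baidu.com', filename='baidu.html')    # keyword arguments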

1.4 customization of request object

import urllib.request

url = 'https://www.baidu.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# urlopen() cannot take a headers dictionary, so to send custom headers we build a Request object
# Customization of request object
request = urllib.request.Request(url=url,headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf8')
print(content)

1.5 quote method of get request

If a GET request parameter contains Chinese characters, it must be URL-encoded as shown below; otherwise an error is raised.

Goal: get the page source of the search for Jay Chou, i.e. https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6 (the encoded form of https://www.baidu.com/s?wd=Jay Chou).

import urllib.request
import urllib.parse

url = 'https://www.baidu.com/s?wd='

# The customization of request object is the first means to solve anti crawling
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# URL-encode 'Jay Chou' (percent encoding) with urllib.parse.quote
name = urllib.parse.quote('Jay Chou')

# Splice the transcoded string to the back of the path
url = url + name

# Customization of request object
request = urllib.request.Request(url=url,headers=headers)

# Impersonate the browser to send a request to the server
response = urllib.request.urlopen(request)

# Get the content of the response
content = response.read().decode('utf-8')

# print data
print(content)

quote is suitable for percent-encoding a single value (such as Chinese text) so it can be placed in a URL.
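
For reference, quoting the original Chinese search term produces exactly the encoded form shown in the target URL above, while an ASCII string such as 'Jay Chou' simply becomes 'Jay%20Chou':

from urllib.parse import quote

print(quote('周杰伦'))    # %E5%91%A8%E6%9D%B0%E4%BC%A6
print(quote('Jay Chou'))  # Jay%20Chou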

1.6 urlencode method of get request

urlencode is used when there are multiple parameters, for example:

https://www.baidu.com/s?wd=Jay Chou&sex=male

# Goal: obtain the page source of https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7

import urllib.request
import urllib.parse

base_url = 'https://www.baidu.com/s?'

data = {
    'wd':'Jay Chou',
    'sex':'male',
    'location':'Taiwan Province, China'
}

new_data = urllib.parse.urlencode(data)

# Request resource path
url = base_url + new_data

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# Customization of request object
request = urllib.request.Request(url=url,headers=headers)

# Impersonate the browser to send a request to the server
response = urllib.request.urlopen(request)

# Get the data of web source code
content = response.read().decode('utf-8')

# print data
print(content)

1.7 POST request

import urllib.request
import urllib.parse

url = 'https://fanyi.baidu.com/sug'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

data = {
    'kw':'spider'
}

# The parameters of the post request must be encoded
data = urllib.parse.urlencode(data).encode('utf-8')
request = urllib.request.Request(url=url,data=data,headers=headers)

# Impersonate the browser to send a request to the server
response = urllib.request.urlopen(request)

# Get response data
content = response.read().decode('utf-8')

# Convert the JSON string into a Python object
import json
obj = json.loads(content)
print(obj)

The parameters of a POST request must be encoded with urlencode and then turned into bytes with encode: data = urllib.parse.urlencode(data).encode('utf-8'). Unlike a GET request, the parameters are not spliced onto the URL; they are passed through the data argument when customizing the request object: request = urllib.request.Request(url=url, data=data, headers=headers).
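
A side-by-side sketch of how parameters travel in the two cases (the URLs and headers are the same ones used in the examples above):

import urllib.parse
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}
params = {'kw': 'spider'}

# GET: parameters are percent-encoded and appended to the URL
get_url = 'https://www.baidu.com/s?' + urllib.parse.urlencode(params)
get_request = urllib.request.Request(url=get_url, headers=headers)

# POST: parameters are encoded to bytes and passed through the data argument
post_data = urllib.parse.urlencode(params).encode('utf-8')
post_request = urllib.request.Request(url='https://fanyi.baidu.com/sug', data=post_data, headers=headers)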

1.8 Exception handling

import urllib.request
import urllib.error

url = 'http://www.doudan1.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

try:
    request = urllib.request.Request(url = url, headers = headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.HTTPError:   # HTTPError is a subclass of URLError, so catch it first
    print('The system is being upgraded...')
except urllib.error.URLError:
    print('I said the system is being upgraded...')

1.9 handler

Why learn handler?

  • urllib.request.urlopen(url) cannot customize the request header
  • urllib.request.Request(url,headers,data) can customize the request header
  • Handler: customize more advanced request behavior (as business logic grows more complex, a plain Request object is no longer enough; dynamic cookies and proxies cannot be handled through the Request object alone)
# Need to use handler to visit Baidu to get the web page source code

import urllib.request

url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

request = urllib.request.Request(url = url,headers = headers)

# The three steps: handler -> build_opener -> open

# (1) Get the handler object
handler = urllib.request.HTTPHandler()

#(2) Get opener object
opener = urllib.request.build_opener(handler)

# (3) Call the open method
response = opener.open(request)
content = response.read().decode('utf-8')
print(content)

1.10 Proxies

Why use a proxy? Some websites block crawlers, and if you crawl with your real IP it is easy to get banned.

import urllib.request

url = 'http://www.baidu.com/s?wd=ip'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# Customization of request object
request = urllib.request.Request(url = url,headers= headers)

# Impersonate browser access server
# response = urllib.request.urlopen(request)

proxies = {
    'http':'118.24.219.151:16817'
}

# handler  build_opener  open
handler = urllib.request.ProxyHandler(proxies = proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)

# Get response information
content = response.read().decode('utf-8')

# Save the page to a file
with open('daili.html','w',encoding='utf-8')as fp:
    fp.write(content)

Proxies can be obtained from providers such as Kuaidaili (快代理). Instead of a single proxy, you can also use a proxy pool, as sketched below.
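
A minimal sketch of a proxy pool: keep several proxies in a list and pick one at random for each request (the addresses below are placeholders, not working proxies; request is the customized request object from above):

import random
import urllib.request

proxies_pool = [
    {'http': '118.24.219.151:16817'},
    {'http': '118.24.219.151:16817'},
]

proxies = random.choice(proxies_pool)
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode('utf-8')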

2. Parsing techniques

2.1 xpath

xpath installation and loading

1. Install lxml Library

pip install lxml -i https://pypi.douban.com/simple

2. Import etree from lxml

from lxml import etree

3.etree.parse() parses local files

html_tree = etree.parse('XX.html')

4. etree.HTML() parses a server response

html_tree = etree.HTML(response.read().decode('utf-8'))

5. Parse and obtain DOM elements

html_tree.xpath('xpath expression')

The XPath Helper plug-in for Chrome can be used to test expressions; press Ctrl + Shift + X to open it.
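
A minimal sketch of the server-response case, combining urllib with etree.HTML (the URL and the //title expression are only illustrative):

import urllib.request
from lxml import etree

url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)

# Parse the response text into an element tree and run an xpath query
html_tree = etree.HTML(response.read().decode('utf-8'))
print(html_tree.xpath('//title/text()'))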

Basic xpath syntax

1. Path query

//  find all descendant nodes, regardless of hierarchy
/   find direct child nodes

2. Predicate query

//div[@id]
//div[@id="maincontent"]

3. Attribute query

//@class

4. Fuzzy query

//div[contains(@id, "he")]
//div[starts-with(@id, "he")]

5. Content query

//div/h1/text()

6. Logical operation

//div[@id="head" and @class="s_down"]undefined//title | //price

Example:

from lxml import etree

# xpath parsing local files
tree = etree.parse('test.html')

# Find the li tags under ul under body
li_list = tree.xpath('//body/ul/li')

# Find the li tag of all attributes with id, and text() gets the content in the tag
li_list = tree.xpath('//ul/li[@id]/text()')

# Find the li tag with id l1. Pay attention to the quotation marks
li_list = tree.xpath('//ul/li[@id="l1"]/text()')

# Find the attribute value of the class of the li tag with id l1
li = tree.xpath('//ul/li[@id="l1"]/@class')

# The query id contains the li tag of l
li_list = tree.xpath('//ul/li[contains(@id,"l")]/text()')

# Query li tags whose id starts with 'c'
li_list = tree.xpath('//ul/li[starts-with(@id,"c")]/text()')

# Query li tags whose id is l1 and class is c1
li_list = tree.xpath('//ul/li[@id="l1" and @class="c1"]/text()')

li_list = tree.xpath('//ul/li[@id="l1"]/text() | //ul/li[@id="l2"]/text()')

2.2 JsonPath

JsonPath is used to parse local JSON files (the file is first loaded into a Python object with json.load).

Installation and use of jsonpath

pip installation:

pip install jsonpath

Use of jsonpath:

obj = json.load(open('xx.json', 'r', encoding='utf-8'))
ret = jsonpath.jsonpath(obj, 'jsonpath expression')

Example:

{
  "store": {
    "book": [
      {
        "category": "Xiuzhen",
        "author": "Liudao",
        "title": "How do bad guys practice",
        "price": 8.95
      },
      {
        "category": "Xiuzhen",
        "author": "Silkworm potato",
        "title": "Break through the sky",
        "price": 12.99
      },
      {
        "category": "Xiuzhen",
        "author": "Tang family San Shao",
        "title": "Douluo continent",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      {
        "category": "Xiuzhen",
        "author": "Third uncle of Southern Sect",
        "title": "Star change",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "author": "Old horse",
      "color": "black",
      "price": 19.95
    }
  }
}

Parse the above json data. For specific syntax, refer to the following blog:

https://blog.csdn.net/luxideyao/article/details/77802389

import json
import jsonpath

obj = json.load(open('jsonpath.json','r',encoding='utf-8'))

# The author of all the books in the bookstore
author_list = jsonpath.jsonpath(obj,'$.store.book[*].author')

# All authors
author_list = jsonpath.jsonpath(obj,'$..author')

# All elements under store
tag_list = jsonpath.jsonpath(obj,'$.store.*')

# price of everything in the store
price_list = jsonpath.jsonpath(obj,'$.store..price')

# The third book
book = jsonpath.jsonpath(obj,'$..book[2]')

# The last book
book = jsonpath.jsonpath(obj,'$..book[(@.length-1)]')

#  The first two books
book_list = jsonpath.jsonpath(obj,'$..book[0,1]')
book_list = jsonpath.jsonpath(obj,'$..book[:2]')

# Conditional filtering requires a ? in front of the parentheses: [?(...)]
#   Filter out all books containing isbn.
book_list = jsonpath.jsonpath(obj,'$..book[?(@.isbn)]')

# Which book is more than 10 yuan
book_list = jsonpath.jsonpath(obj,'$..book[?(@.price>10)]')

2.3 BeautifulSoup

Basic introduction

  • Beautiful soup abbreviation: bs4
  • What is BeautifulSoup? BeautifulSoup, like lxml, is an HTML parser; its main job is to parse and extract data
  • Advantages and disadvantages
Disadvantage: less efficient than lxml (lxml is faster)
Advantage: the interface is user-friendly and easy to use

Install and create

  • install
> pip install bs4 -i https://pypi.douban.com/simple
  • Import
> from bs4 import BeautifulSoup
  • create object
*   Create an object from a server response
    
    > soup = BeautifulSoup(response.read().decode(), 'lxml')
    
*   Create an object from a local file
    
    > soup = BeautifulSoup(open('1.html'), 'lxml')
    

Note: open() may default to gbk (for example on Chinese Windows), so specify the encoding explicitly when opening the file

Node location

1. Find nodes by tag name

soup.a        # finds only the first a tag
soup.a.name
soup.a.attrs

2. Function

  • find (returns an object)
> find('a'): finds only the first a tag
> 
> find('a', title='name')
> 
> find('a', class_='name')
  • find_all (returns a list)
> find_all('a'): finds all a tags
> 
> find_all(['a', 'span']): returns all a and span tags
> 
> find_all('a', limit=2): only the first two a tags
  • select (gets node objects according to a selector) [recommended]
*   `element`: p
*   `.class`: .firstname
*   `#id`: #firstname
*   `attribute selectors`:  
    `[attribute]`: li = soup.select('li[class]')  
    `[attribute=value]`: li = soup.select('li[class="hengheng1"]')
*   `Level selectors`:  
    div p: descendant selector (all p inside div)  
    div > p: child selector (direct children only)  
    div, p: all div and p objects

Node information

  • Get node content: applies when tags are nested inside tags
> obj.string
> 
> obj.get_text() [recommended]
  • Properties of nodes
> tag.name: gets the tag name
> 
> tag.attrs: returns the attributes as a dictionary
  • Get node attributes
> obj.attrs.get('title') [common]
> 
> obj.get('title')
> 
> obj['title']

Example:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <div>
        <ul>
            <li id="l1">Zhang San</li>
            <li id="l2">Li Si</li>
            <li>Wang Wu</li>
            <a href="" id="" class="a1">Shang Silicon Valley</a>
            <span>Hey, hey, hey</span>
        </ul>
    </div>
    <a href="" title="a2">Baidu</a>
    <div id="d1">
        <span>
            Ha ha ha
        </span>
    </div>
    <p id="p1" class="p1">Interesting</p>
</body>
</html>

Parse the above html using BeautifulSoup

from bs4 import BeautifulSoup

# open() may default to gbk, so specify the encoding when opening the file
soup = BeautifulSoup(open('bs4 Basic use of.html',encoding='utf-8'),'lxml')

# Find the node according to the tag name, and the first qualified data is found
print(soup.a)
# Gets the properties and property values of the tag
print(soup.a.attrs)

# Some functions of bs4
# (1) find: returns the first qualified data
print(soup.find('a'))

# Find the corresponding label object according to the value of title
print(soup.find('a',title="a2"))

# Find the tag by its class value. Note that class needs a trailing underscore: class_
print(soup.find('a',class_="a1"))

# (2) find_all: returns a list of all a tags
print(soup.find_all('a'))

# To get data for several kinds of tags, pass a list to find_all
print(soup.find_all(['a','span']))

# limit restricts the result to the first n matches
print(soup.find_all('li',limit=2))

# (3) select (recommended)
# The select method returns a list of all matching elements
print(soup.select('a'))

# A leading . stands for class; this is called a class selector
print(soup.select('.a1'))
print(soup.select('#l1'))

# Attribute selector: find the corresponding label through the attribute
# Find the tag with id in the li tag
print(soup.select('li[id]'))

# Find the label with id l2 in the li label
print(soup.select('li[id="l2"]'))

# Level selector
#  Descendant selector: find li under div
print(soup.select('div li'))

# Child selector (>): matches direct children, level by level
print(soup.select('div > ul > li'))

# Find all the objects of a tag and li tag
print(soup.select('a,li'))

# Get node content
obj = soup.select('#d1')[0]

# If the tag object contains only text, both string and get_text() can be used
# If the tag object contains other tags besides text, string returns nothing while get_text() still returns the text
# get_text() is recommended
print(obj.string)
print(obj.get_text())

# Properties of nodes
obj = soup.select('#p1')[0]
# Name is the name of the tag
print(obj.name)
# Returns the tag's attributes as a dictionary
print(obj.attrs)

# Gets the properties of the node
print(obj.attrs.get('class'))
print(obj.get('class'))
print(obj['class'])