Learning Web Crawlers Gently - Mastering the Crawler Framework (5) - Monkey Steals the Peach (1)
In the last lesson we covered the crawler's start-up process, so you should now have a basic picture of the framework. Today we move on to the next part of crawling: analyzing the page.
Think of the crawled page as a peach tree: only certain peaches on it are the data we actually need, and we do not want the rest. So how do we pick just those peaches?

This is where our parsing tool comes in: Beautiful Soup.
Beautiful Soup
Beautiful Soup is a Python library that can extract data from HTML or XML files.
Installation

Beautiful Soup is currently on its fourth major version (bs4). Install it with the following command:

```shell
pip install beautifulsoup4
```
Once it is installed, we can put it to work. Suppose we have the following HTML document:
```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```
This document is not neatly formatted, so let's tidy it up first. To do that, we need to parse the HTML, and there are two commonly used parsers:
Parsing library
| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; moderate speed; reasonably tolerant of malformed documents | Versions before Python 2.7.3 or 3.2.2 are much less fault-tolerant |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Very fast; very tolerant of malformed documents | Requires an external C library to be installed |
The first ships with Python; the second is a third-party library, which we need to install with:

```shell
pip install lxml
```
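Once lxml is installed, switching parsers is just a matter of passing a different name as the second argument to `BeautifulSoup`. A minimal sketch (the one-line snippet here is only a stand-in document):

```python
from bs4 import BeautifulSoup

# Same API as the standard library parser; only the parser name changes.
# Requires `pip install lxml` first.
soup = BeautifulSoup("<p>Hello</p>", "lxml")

# lxml wraps fragments in a full <html><body> skeleton, but tag access
# works the same way as with "html.parser".
print(soup.p.text)  # Hello
```

Nothing else in the code needs to change when you swap parsers, which makes it easy to start with the standard library and move to lxml later for speed.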
For now, let's parse with the standard library:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# Prints the following structured HTML:
"""
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""
```
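As a small preview of how we will eventually pick only the "peaches" we want, Beautiful Soup's `find_all` method (covered in detail in a later lesson) can pull out just the `<a>` tags. A minimal sketch, using a trimmed copy of the document so the snippet stands on its own:

```python
from bs4 import BeautifulSoup

html_doc = """<html><body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Pick only the "peaches": every link's URL and text.
for link in soup.find_all("a"):
    print(link.get("href"), link.text)
# http://example.com/elsie Elsie
# http://example.com/lacie Lacie
# http://example.com/tillie Tillie
```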
Beautiful Soup transforms a complex HTML document into a tree structure, where each node is a Python object.
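According to the bs4 documentation, these node objects come in four kinds: `Tag`, `NavigableString`, `BeautifulSoup`, and `Comment`. A quick sketch of all four, using a tiny made-up fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<b><!--a comment-->bold text</b>", "html.parser")

print(type(soup))                # BeautifulSoup: represents the whole document
print(type(soup.b))              # Tag: an element such as <b>
print(type(soup.b.contents[0]))  # Comment: a special kind of NavigableString
print(type(soup.b.contents[1]))  # NavigableString: the text inside a tag
```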
Tag
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.b
print(tag)
print(type(tag))
# <b>The Dormouse's story</b>
# <class 'bs4.element.Tag'>
```
Name
Every tag has a name, accessible through the `.name` attribute:

```python
print(tag.name)
# b
```
If you change a tag's name, the change is reflected in all HTML generated from the current Beautiful Soup object:

```python
tag.name = "blockquote"
print(tag)
# <blockquote>The Dormouse's story</blockquote>
```
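To see that the rename really propagates, we can render the whole soup again. A self-contained check, using a one-line stand-in fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>',
                     "html.parser")

tag = soup.b
tag.name = "blockquote"

# Rendering the whole document shows the renamed tag, not the original <b>:
print(soup)
# <p class="title"><blockquote>The Dormouse's story</blockquote></p>
```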
Since bs4 covers a lot of ground, only part of it is discussed here. Feel free to bookmark this post so you don't lose track of it.

Writing these posts takes real effort, so please leave a comment or save the article. You are also welcome to join the group chat to study and make progress together.