Learn crawlers the easy way - Skillful use of the Scrapy framework 5 - Monkey steals the peach

Posted by Braimaster on Sat, 29 Jan 2022 14:38:55 +0100

Learn crawlers the easy way - Skillful use of the Scrapy framework 5 - Monkey steals the peach (1)

In the last lesson we walked through the crawler's startup process, so you should now have some understanding of the framework. Today we move on to the next part of the crawler: parsing the page.

Think of the data we scrape as a peach tree: only the peaches are what we need, and we don't want the rest of the data. How do we pick just those peaches?

This is where our parsing tool of choice comes in: Beautiful Soup.

Beautiful Soup

Beautiful Soup is a Python library that can extract data from HTML or XML files.

Installation

Beautiful Soup is currently on its fourth major version. Install it with the following command:

pip install bs4

After installation, we can use it.

Suppose we have the following HTML document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

This document is not well-formed HTML, so let's tidy it up first. To do that we need to parse the HTML, and there are two commonly used parsers.

Parsing library

Parser: Python standard library
Usage: BeautifulSoup(markup, "html.parser")
Advantages: built into Python, moderate speed, reasonably tolerant of bad markup
Disadvantages: poor fault tolerance in Python versions before 2.7.3 or 3.2.2

Parser: lxml HTML parser
Usage: BeautifulSoup(markup, "lxml")
Advantages: very fast, very tolerant of bad markup
Disadvantages: the lxml C library has to be installed

The first parser ships with Python; the second is a third-party library that has to be installed with the following command:

pip install lxml
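
To get a quick feel for why the parser choice matters, here is a small optional sketch: different parsers repair invalid markup differently. The broken fragment below is just an illustrative example, and the exact output depends on which parsers and versions are installed.

from bs4 import BeautifulSoup

# A deliberately broken fragment (illustrative only)
broken = "<p>Unclosed paragraph<li>stray list item"

# Each parser fills in missing tags in its own way; print both results to compare
print(BeautifulSoup(broken, "html.parser"))
print(BeautifulSoup(broken, "lxml"))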

Let's use the standard library for parsing.

from bs4 import BeautifulSoup

# Build the parse tree with the built-in parser
soup = BeautifulSoup(html_doc, 'html.parser')
# prettify() returns a nicely indented string; it does not modify soup in place
print(soup.prettify())


# This prints the following structured HTML:
"""<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>"""

Beautiful Soup transforms a complex HTML document into a tree structure in which every node is a Python object.
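
For example, nodes of the tree can be reached directly as attributes of the soup object or searched with find_all(). A small sketch reusing the html_doc defined above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

# Reach a node directly as an attribute of the soup object
print(soup.title)          # <title>The Dormouse's story</title>
print(soup.title.string)   # The Dormouse's story

# Search the whole tree for every <a> tag and read its href attribute
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie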

Tag

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
# soup.b returns the first <b> tag in the document as a Tag object
tag = soup.b
print(tag)
print(type(tag))


# <b>The Dormouse's story</b>
# <class 'bs4.element.Tag'>

Name

Every tag has a name, which you can access through the .name attribute:

print(tag.name)

# b

If you change a tag's name, the change is reflected in any HTML markup generated from the current Beautiful Soup object:

tag.name = "blockquote"
print(tag)
# <blockquote>The Dormouse's story</blockquote>
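
To see that the rename really carries through the whole tree, we can look the tag up again on the same soup object (continuing the snippet above):

# The first <b> tag no longer exists under its old name...
print(soup.b)           # None
# ...but the renamed tag is still in the tree, inside the first <p>
print(soup.blockquote)  # <blockquote>The Dormouse's story</blockquote>
print(soup.p)           # <p class="title"><blockquote>The Dormouse's story</blockquote></p>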

bs4 offers far more than can be covered in one post, so only part of it is discussed here. Feel free to bookmark this article so you don't lose it.

Writing this up is not easy, so please leave a comment and bookmark the post, or join the group chat so we can study and make progress together.

Topics: Python crawler tag