Python third-party library Beautiful Soup 4: documentation study notes

Posted by yorktown on Sun, 16 Jan 2022 21:12:28 +0100

Output

Formatted output: the prettify() method formats the Beautiful Soup document tree and returns it as a Unicode string, with each XML/HTML tag on its own line

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

soup = BeautifulSoup(markup,"html.parser")

print(soup.prettify())

"""
<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>
"""

The example above calls prettify() on the BeautifulSoup object. You can also call it on any Tag node, for example:

a_tag = soup.a

print(a_tag.prettify())

"""
<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>
"""

Compact output

If you only want the result as a string and do not care about formatting, call unicode() or str() on the BeautifulSoup object or a Tag object. Note that unicode() exists only in Python 2.x; Python 3 removed the distinction between str() and unicode(), so str() is all you need

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

soup = BeautifulSoup(markup,"html.parser")

print(str(soup))

# <a href="http://example.com/">I linked to <i>example.com</i></a>

Output format

Beautiful Soup converts HTML entities such as "&ldquo;" into the corresponding Unicode characters on output:

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

print(str(soup))
# “Dammit!” he said.
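As a small illustration of the conversion described above, the following sketch (assuming the built-in "html.parser" backend; the variable names are illustrative only) shows that once the entities have become Unicode characters, encoding the document to bytes yields UTF-8 bytes rather than the original entities:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

# The entities are now real Unicode characters (U+201C and U+201D).
text = str(soup)
print(text)

# Encoding to bytes produces UTF-8; the HTML entities do not come back.
encoded = soup.encode("utf8")
print(encoded)
```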

get_text() method

To get only the text inside a tag, call the get_text() method. It returns all the text in the tag, including the text of its descendant tags, as a single Unicode string

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'

soup = BeautifulSoup(markup,"html.parser")

print(soup.get_text())

# I linked to example.com
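get_text() also accepts a separator string and a strip flag. The sketch below reuses the same markup to show how they change the result; the variable names are only for illustration:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Join the individual text fragments with a separator of our choosing.
joined = soup.get_text("|")
print(repr(joined))

# strip=True trims whitespace around each fragment and drops empty ones.
stripped = soup.get_text("|", strip=True)
print(repr(stripped))
```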

Specify document parser

If you do not specify a parser, Beautiful Soup automatically picks the best one that is installed. Because that choice depends on the environment, the same code may behave differently on different systems, so it is best to name a parser explicitly. The first argument of BeautifulSoup() is the document string or file handle to parse, and the second argument says how the document should be parsed

  • By markup type: "html", "xml", "html5"

  • By parser name: "lxml", "html5lib", "html.parser"
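One way to keep behavior predictable is to name the parser yourself and fall back to the built-in one when an optional parser is missing. The helper below is a minimal sketch, not part of Beautiful Soup: make_soup and its preferred argument are hypothetical names, and it relies on the FeatureNotFound exception that bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: use the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Keep in mind that different parsers can build different trees for malformed markup, so a fallback like this only makes sense when the input is reasonably well formed.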

Differences between parsers (the lxml examples below require the lxml library to be installed first)

Parse as an HTML structure

soup = BeautifulSoup('<a><b /></a>', "lxml")

print(soup)

# <html><body><a><b></b></a></body></html>

The same document parsed as an XML structure

soup = BeautifulSoup('<a><b /></a>',"xml")

print(soup)

# <?xml version="1.0" encoding="utf-8"?>

# <a><b/></a>

HTML parsers also differ from one another. If the HTML document is well formed, every parser returns the same correct document tree and only the parsing speed differs. If the document is malformed, however, different parsers may return different results. In the following example, the malformed document is parsed with lxml, and the </p> tag is simply ignored:

soup = BeautifulSoup("<a></p>", "lxml")

print(soup)

# <html><body><a></a></body></html>

Parsing the same document with the html5lib library gives a different result:

soup = BeautifulSoup("<a></p>","html5lib")

print(soup)

# <html><head></head><body><a><p></p></a></body></html>

Parsing the same document with Python's built-in parser gives:

soup = BeautifulSoup("<a></p>","html.parser")

print(soup)

# <a></a>
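To check which of these results you get on your own machine, you can run the same malformed markup through every parser you have installed. This is a rough sketch; the results dictionary is just an illustrative name, and parsers that are not installed are recorded as None instead of raising an error:

```python
from bs4 import BeautifulSoup, FeatureNotFound

results = {}
for parser in ("lxml", "html5lib", "html.parser"):
    try:
        results[parser] = str(BeautifulSoup("<a></p>", parser))
    except FeatureNotFound:
        # The optional parser is not installed in this environment.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```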

Encoding

After Beautiful Soup parses any HTML or XML document, the content is converted to Unicode. The original_encoding attribute of the BeautifulSoup object records the encoding that was detected automatically. You can also pass the from_encoding parameter when creating the BeautifulSoup object to specify the encoding yourself
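A minimal sketch of both features, assuming a small Latin-1 byte string as input (the markup is made up for illustration). Note that from_encoding only matters when the input is bytes, not an already decoded str:

```python
from bs4 import BeautifulSoup

# Latin-1 bytes: the "é" is the single byte 0xE9.
markup = b"<p>Sacr\xe9 bleu!</p>"

# Tell Beautiful Soup which encoding to use instead of letting it guess.
soup = BeautifulSoup(markup, "html.parser", from_encoding="latin-1")

print(soup.p.string)           # the text, now as a Unicode string
print(soup.original_encoding)  # the encoding that was used to decode the input
```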

Output encoding

When you output a document through Beautiful Soup, the result is encoded as UTF-8 regardless of the encoding of the input document

markup = b'''
<html>
 <head>
    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
 </head>
 <body>
    <p>Sacr\xe9 bleu!</p>
 </body>
</html>
'''
soup = BeautifulSoup(markup,"html5lib")

print(soup.prettify())

"""
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
 </head>
 <body>
  <p>
   Sacré bleu!
  </p>
 </body>
</html>
"""

If you do not want UTF-8 output, pass the desired encoding to the prettify() method; it then returns a byte string in that encoding
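For example, the following sketch encodes the same tag as UTF-8 and as Latin-1 and passes an encoding to prettify(). The one-line markup is an illustrative stand-in for the document above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacr\u00e9 bleu!</p>", "html.parser")

# encode() returns bytes in the requested encoding.
utf8_bytes = soup.p.encode("utf-8")
latin1_bytes = soup.p.encode("latin-1")
print(utf8_bytes)
print(latin1_bytes)

# prettify() with an encoding also returns bytes rather than a str.
pretty_latin1 = soup.prettify("latin-1")
print(pretty_latin1)
```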

Comparing objects for equality

Beautiful Soup considers two NavigableString or Tag objects equal when they represent the same HTML or XML markup, even if they sit at different positions in the tree

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"

soup = BeautifulSoup(markup, 'html.parser')

first_b, second_b = soup.find_all('b')

print(first_b == second_b)

# True

If you need to know whether two variables refer to exactly the same object, use the is operator

print(first_b is second_b)

# False
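The sketch below extends the comparison with a few made-up tags to show what "the same markup" means in practice: a difference in attributes or in text is enough to make two tags unequal, while is is true only for the very same node.

```python
from bs4 import BeautifulSoup

markup = '<p><b>pizza</b> <b>pizza</b> <b class="hot">pizza</b> <b>pasta</b></p>'
soup = BeautifulSoup(markup, "html.parser")
plain_1, plain_2, with_attr, other_text = soup.find_all("b")

same_markup = plain_1 == plain_2    # identical markup, different nodes
diff_attrs = plain_1 == with_attr   # attributes differ
diff_text = plain_1 == other_text   # text differs
same_object = plain_1 is soup.b     # soup.b is the first <b> node itself

print(same_markup, diff_attrs, diff_text, same_object)
```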

Parsing only part of a document

If you just want to find

Reference: Beautiful Soup 4.4.0 documentation
