Output
Formatted output: the prettify() method formats the Beautiful Soup document tree and returns it as a Unicode string, with each XML/HTML tag on its own line.
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
print(soup.prettify())
"""
<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>
"""
The example above calls prettify() on the soup object. You can also call it on a Tag node, for example:
a_tag = soup.a
print(a_tag.prettify())
"""
<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>
"""
Compressed output
If you only need the result as a string and don't care about formatting, call unicode() or str() on the BeautifulSoup object or a Tag object. Note that unicode() exists only in Python 2.x; Python 3 removed the distinction between str() and unicode(), and str() returns a Unicode string.
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
print(str(soup))
# <a href="http://example.com/">I linked to <i>example.com</i></a>
Output format
When Beautiful Soup parses a document, it converts HTML entities such as "&ldquo;" into the corresponding Unicode characters:
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")
print(str(soup))
# “Dammit!” he said.
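The conversion can also be reversed at output time. The following is a minimal sketch, assuming the formatter argument accepted by decode() and prettify(); the variable names are illustrative only. It turns Unicode characters back into named HTML entities where one exists.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

# Default output keeps the curly quotes as Unicode characters.
as_unicode = str(soup)

# formatter="html" asks Beautiful Soup to emit named entities where it can.
as_entities = soup.decode(formatter="html")

print(as_unicode)
print(as_entities)
```

The default formatter escapes only what is needed to keep the markup valid, so entity output is mainly useful when the result must stay ASCII-friendly.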
get_text() method
To get only the text inside a tag, call the get_text() method. It returns all the text the tag contains, including the text of its descendant tags.
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")
print(soup.get_text())
# I linked to example.com
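The plain call keeps the newlines that surround the text. As a small sketch, get_text() can also take a separator string and a strip flag to tidy the result; the variable names here are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Without arguments, whitespace from the source is preserved.
raw_text = soup.get_text()

# A separator joins the text pieces; strip=True trims each piece
# and drops the ones that are empty after trimming.
joined_text = soup.get_text("|", strip=True)

print(repr(raw_text))   # '\nI linked to example.com\n'
print(joined_text)      # I linked to|example.com
```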
Specify document parser
If you do not specify a parser, Beautiful Soup picks the best parser that is installed, but then the same code is not guaranteed to behave the same way on different systems, so it is best to name a parser explicitly. The first argument of BeautifulSoup() is the document string or open file handle to be parsed, and the second argument says how the document should be parsed.
- Markup types that can be parsed: html, xml, html5
- Parsers that can be specified: lxml, html5lib, html.parser
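Because the available parsers depend on what is installed, one option is to try them in order of preference. This is only a sketch: parse_with_fallback and its preference order are hypothetical helpers, not part of Beautiful Soup; it relies on the FeatureNotFound exception that bs4 raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_fallback(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (parser_name, soup) for the first usable parser."""
    for name in preferred:
        try:
            return name, BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


parser_name, soup = parse_with_fallback("<a><b></b></a>")
print(parser_name)
print(soup.a)
```

Since html.parser ships with Python, the last entry always succeeds, but the tree may differ depending on which parser was actually used.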
The examples below show how the parsers differ. They require the lxml library, so install it first.
Parsed as an HTML structure
soup = BeautifulSoup('<a><b /></a>', "lxml")
print(soup)
# <html><body><a><b></b></a></body></html>
The same document parsed as an XML structure
soup = BeautifulSoup('<a><b /></a>', "xml")
print(soup)
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
There are also differences among the HTML parsers. If the document being parsed is well-formed HTML, the parsers differ only in speed, and each returns the correct document tree. If the document is malformed, however, different parsers may return different results. In the following example, the malformed document is parsed with lxml, and the stray </p> tag is simply ignored:
soup = BeautifulSoup("<a></p>", "lxml")
print(soup)
# <html><body><a></a></body></html>
Parsing the same document using the html5lib library yields different results
soup = BeautifulSoup("<a></p>", "html5lib")
print(soup)
# <html><head></head><body><a><p></p></a></body></html>
Parsing the same document with Python's built-in parser gives yet another result
soup = BeautifulSoup("<a></p>", "html.parser")
print(soup)
# <a></a>
Encoding
After Beautiful Soup parses any HTML or XML document, the document is converted to Unicode. The original_encoding attribute of the BeautifulSoup object records the encoding that was detected automatically. You can also pass the from_encoding parameter when creating the BeautifulSoup object to specify the encoding yourself.
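As a minimal sketch of the two names mentioned above, the following passes from_encoding for a byte string known to be Latin-1 and then reads original_encoding back. The sample markup and variable names are assumptions for illustration; the exact spelling of the reported encoding name may vary, so no expected value is shown for it.

```python
from bs4 import BeautifulSoup

# Bytes encoded as Latin-1; the same text in UTF-8 would use different bytes.
latin1_markup = "<h1>Sacré bleu!</h1>".encode("latin-1")

# from_encoding tells Beautiful Soup which encoding to try first.
soup = BeautifulSoup(latin1_markup, "html.parser", from_encoding="latin-1")

heading_text = soup.h1.string
print(heading_text)             # Sacré bleu!
print(soup.original_encoding)   # the encoding Beautiful Soup settled on
```

Note that from_encoding only makes sense for byte input; a str is already Unicode and needs no decoding.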
Output encoding
When Beautiful Soup outputs a document, the output is encoded as UTF-8, whatever the encoding of the input document.
markup = b'''
<html>
 <head>
  <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
 </head>
 <body>
  <p>Sacr\xe9 bleu!</p>
 </body>
</html>
'''
soup = BeautifulSoup(markup, "html5lib")
print(soup.prettify())
"""
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
 </head>
 <body>
  <p>
   Sacré bleu!
  </p>
 </body>
</html>
"""
If you don't want the output encoded as UTF-8, pass the desired encoding to the prettify() method.
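A short sketch of choosing the output encoding follows. It uses encode() on a tag alongside prettify() with an encoding argument; both return bytes. The sample markup and variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacré bleu!</p>", "html.parser")

# encode() returns bytes in the requested encoding.
utf8_bytes = soup.p.encode("utf-8")
latin1_bytes = soup.p.encode("latin-1")

# prettify() with an encoding also returns bytes rather than a str.
pretty_latin1 = soup.prettify("latin-1")

print(utf8_bytes)    # b'<p>Sacr\xc3\xa9 bleu!</p>'
print(latin1_bytes)  # b'<p>Sacr\xe9 bleu!</p>'
```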
Comparing objects for equality
Beautiful Soup considers two NavigableString or Tag objects equal when they represent the same HTML or XML markup.
markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")
first_b, second_b = soup.find_all("b")
print(first_b == second_b)
# True
To check whether two variables refer to exactly the same object, use is:
print(first_b is second_b)
# False
Parsing part of a document
If you just want to find