Beautiful Soup makes parsing HTML simple, and its API is designed to be easy to read and use. It supports CSS selectors and can work with the HTML parser in the Python standard library as well as the HTML and XML parsers in lxml.
Compared with regular expressions, it is simpler to use.
Example:
First, import BeautifulSoup from the bs4 library:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the content of the soup object in a formatted way
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
Four categories of objects
Beautiful Soup converts a complex HTML document into a tree structure in which each node is a Python object. All of these objects fall into four types:
- Tag
- NavigableString
- BeautifulSoup
- Comment
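As a rough illustration of the four types, the sketch below parses a small fragment and prints the type of one node of each kind. The fragment and the variable names are made up for this example, and it assumes the standard library's html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment, not the document used elsewhere in this article
snippet = '<p class="title"><b>Hello</b><!-- a note --></p>'
soup = BeautifulSoup(snippet, "html.parser")

tag = soup.b                   # a Tag
text = soup.b.string           # a NavigableString
comment = soup.p.contents[1]   # a Comment (the second child of <p>)

# Print the class name of each kind of node
for node in (soup, tag, text, comment):
    print(type(node).__name__)
```

Each of the four types is described in turn in the sections that follow.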
1. Tag
A Tag corresponds to a tag in the HTML document, such as:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

Each HTML tag above (title, head, a, p, and so on), together with the content it encloses, is a Tag. Let's use Beautiful Soup to get some Tags:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the title tag
print(soup.title)
# Print the head tag
print(soup.head)
# Print the first a tag
print(soup.a)
# Print the first p tag
print(soup.p)
# Print the type of soup.p
print(type(soup.p))

Output:

<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>
We can retrieve these tags simply by accessing soup with the tag name as an attribute. The objects returned are of type bs4.element.Tag. Note, however, that this only finds the first matching tag in the document. Querying all matching tags is covered later, in the section on searching the document tree.
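To make the "first match only" behaviour concrete, here is a small sketch on a made-up fragment. It assumes html.parser instead of lxml so that it has no extra dependency, and it uses find() and find_all(), the search methods discussed later in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two <a> tags
snippet = '<p><a id="link1">one</a> <a id="link2">two</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

first = soup.a                # attribute access: the first <a> only
same = soup.find("a")         # find() also returns the first match
every = soup.find_all("a")    # find_all() returns every match

print(first["id"])
print(first is same)
print([a["id"] for a in every])
print(soup.table)             # a tag that does not exist gives None
```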
A Tag has two important attributes: name and attrs.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# The soup object is special: its name is [document]
print(soup.name)
# For other tags, the name is the tag name itself
print(soup.head.name)
# Print all attributes of the p tag; the result is a dictionary
print(soup.p.attrs)
# Print the class attribute of the p tag
print(soup.p['class'])
# The get method takes the attribute name and is equivalent to the line above
print(soup.p.get('class'))
print(soup.p)
# Modify an attribute
soup.p['class'] = "newClass"
print(soup.p)
# Delete an attribute
del soup.p['class']
print(soup.p)

Output:

[document]
head
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
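Indexing and get() give the same result when the attribute exists, but they differ when it is missing. The sketch below shows that difference on a made-up p tag; the fragment is hypothetical and html.parser is assumed only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a <p> with two classes and no id
snippet = '<p class="title big" name="dromouse">text</p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

print(p["class"])     # class can hold several values, so it is a list
print(p["name"])      # ordinary attributes are plain strings
print(p.get("id"))    # missing attribute: get() returns None

try:
    p["id"]           # missing attribute: indexing raises KeyError
except KeyError:
    print("no id attribute")
```

Using get() is usually the safer choice when an attribute may be absent on some tags.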
2. NavigableString
Now that we can get a tag, the next question is how to get the text inside it. That is easy: use the .string attribute. For example:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the text content of the p tag
print(soup.p.string)
# Print the type of soup.p.string
print(type(soup.p.string))

Output:

The Dormouse's story
<class 'bs4.element.NavigableString'>
3. BeautifulSoup
A BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a Tag object; it is a special kind of Tag. We can look at its name, the type of that name, and its attributes:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# Type of the name
print(type(soup.name))
# Name
print(soup.name)
# Attributes
print(soup.attrs)

Output:

<class 'str'>
[document]
{}
4. Comment
A Comment object is a special type of NavigableString whose output does not include the comment delimiters.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>

The content of the a tag is actually a comment, but when we print it with .string the comment delimiters have been removed.
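Because .string hides the comment delimiters, it can be worth checking the type before treating the value as visible text. The following is one possible way to do that, sketched on a made-up fragment with html.parser assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment: the first link holds only a comment
snippet = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(snippet, "html.parser")

visible = []
for a in soup.find_all("a"):
    if isinstance(a.string, Comment):
        continue                    # skip links whose content is a comment
    visible.append(str(a.string))   # convert NavigableString to a plain str

print(visible)
```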
Traversing Document Tree
1. Direct child nodes: the .contents and .children attributes
.contents
A Tag's .contents attribute returns the Tag's child nodes as a list.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# The output is a list
print(soup.head.contents)
print(soup.head.contents[0])

Output:

[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
.children
.children does not return a list. It returns an iterator, and we can get all the child nodes by looping over it.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# The result is an iterator object, not a list
print(soup.head.children)
# Obtain all child nodes by looping
for child in soup.head.children:
    print(child)

Output:

<list_iterator object at 0x008FF950>
<title>The Dormouse's story</title>
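The practical difference between the two attributes is that .contents can be indexed and measured, while .children can only be looped over. Below is a short sketch on a made-up paragraph (html.parser assumed); note that text between tags counts as a child node too.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: text, a tag, then more text
snippet = "<p>Hello <b>bold</b> world</p>"
soup = BeautifulSoup(snippet, "html.parser")

kids = soup.p.contents          # a real list: supports len() and indexing
print(len(kids))
print(repr(kids[0]), kids[1].name)

# .children yields the same nodes one at a time
for child in soup.p.children:
    print(type(child).__name__)

# Materialising the iterator gives the same nodes as .contents
same_nodes = list(soup.p.children) == kids
print(same_nodes)
```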
2. All descendants: the .descendants attribute
The .contents and .children attributes described above contain only the direct children of a Tag. The .descendants attribute iterates recursively over all of a Tag's descendants. As with .children, we need to loop over it to get the contents.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# The result is a generator object
print(soup.head.descendants)
# Obtain all descendant nodes by looping
for child in soup.head.descendants:
    print(child)

Output:

<generator object descendants at 0x00519AB0>
<title>The Dormouse's story</title>
The Dormouse's story
3. Node content: the .string attribute
If a Tag has only one child node and that child is a NavigableString, then .string returns that child. If a Tag has only one child node and that child is itself a Tag, .string can still be used, and the result is the same as the .string of that only child.
Put simply, if a tag contains no other tags, .string returns its text. If a tag contains exactly one tag, .string returns that inner tag's text. For example:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.head.string)
print(soup.head.title.string)

Output:

The Dormouse's story
The Dormouse's story
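The rule above also implies that .string has nothing sensible to return when a tag has several children; in that case it is None. The sketch below shows this on a made-up paragraph, together with stripped_strings and get_text() as two alternatives for collecting all the text. The fragment is hypothetical and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> has three children, <b> has one
snippet = "<p>Hello <b>bold</b> world</p>"
soup = BeautifulSoup(snippet, "html.parser")

print(soup.b.string)                   # one text child, so .string works
print(soup.p.string)                   # three children, so .string is None
print(list(soup.p.stripped_strings))   # every text fragment, whitespace trimmed
print(soup.p.get_text())               # all text joined into one str
```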
Search Document Tree
1. find_all(name, attrs, recursive, text, **kwargs)
1) name parameter
The name parameter finds all Tags with the given name; string objects are automatically ignored.
a. Passing a string
The simplest filter is a string. When a string is passed to the search method, Beautiful Soup finds every tag whose name matches the string exactly and returns the matches as a list.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.find_all("b"))
print(soup.find_all("a"))

Output:

[<b>The Dormouse's story</b>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
b. Passing a regular expression
If a regular expression is passed in, Beautiful Soup filters tag names against it using the regular expression's search() method.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

Output:

body
b
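Whether the pattern is anchored makes a visible difference. The sketch below compares an anchored pattern with an unanchored one on a tiny made-up document; it assumes html.parser, which keeps the tags exactly as written in the string.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical minimal document
snippet = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

# "^b" only matches tag names that start with b
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]
# "t" matches any tag name that contains a t anywhere
contains_t = [t.name for t in soup.find_all(re.compile("t"))]

print(starts_with_b)
print(contains_t)
```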
c. Passing a list
If a list is passed in, Beautiful Soup returns everything that matches any element of the list.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.find_all(['a', 'b']))
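A list filter returns its matches in document order, not in the order of the names in the list. The following sketch, on a made-up fragment with html.parser assumed, illustrates that.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mixing <b>, <a> and <i> tags
snippet = '<p><b>bold</b> <a id="link1">one</a> <i>it</i> <a id="link2">two</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

# "a" is listed first, but <b> appears first in the document
matches = soup.find_all(["a", "b"])
names = [t.name for t in matches]
print(names)
```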
2) keyword parameters
Any keyword argument that is not one of the named parameters above is treated as a filter on the tag attribute with that name. For example, id="link1" matches tags whose id attribute is link1.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.find_all(id="link1"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
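Keyword filters are not limited to id. The sketch below, on a made-up fragment with html.parser assumed, shows a few common variations: class_ (because class is a reserved word in Python), a regular expression as the attribute value, and the attrs dictionary for attribute names that are not valid Python keywords.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment; the data-role attribute is invented for illustration
snippet = (
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<div data-role="note">n</div>'
)
soup = BeautifulSoup(snippet, "html.parser")

by_id = soup.find_all(id="link2")
by_class = soup.find_all("a", class_="sister")         # class_ instead of class
by_href = soup.find_all(href=re.compile("elsie"))      # regex on an attribute value
by_data = soup.find_all(attrs={"data-role": "note"})   # name with a hyphen

print(len(by_id), len(by_class), len(by_href), len(by_data))
```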
3) text parameter
The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, or a list. Newer versions of Beautiful Soup call this parameter string and keep text as the older name.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# String
print(soup.find_all(text=" Elsie "))
# List
print(soup.find_all(text=["Tillie", " Elsie ", "Lacie"]))
# Regular expression
print(soup.find_all(text=re.compile("Dormouse")))

Output:

[' Elsie ']
[' Elsie ', 'Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
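For readers on a recent release, the same kind of search can be written with the string argument. This is only a sketch: it assumes a Beautiful Soup version that accepts string (4.4.0 or later), a made-up fragment, and html.parser. It also shows that combining a tag name with string returns tags instead of strings.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment
snippet = "<p><b>The Dormouse's story</b> <a>Lacie</a> <a>Tillie</a></p>"
soup = BeautifulSoup(snippet, "html.parser")

exact = soup.find_all(string="Lacie")               # exact match on the text
pattern = soup.find_all(string=re.compile("ll"))    # regex match on the text
tags = soup.find_all("a", string="Tillie")          # tags whose .string matches

print(exact)
print(pattern)
print(tags)
```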
CSS Selector
This is another way to search, similar to the find_all() method.
- In CSS, a tag name is written without any prefix, a class name is prefixed with ., and an id is prefixed with #.
- We can filter elements in the same way with soup.select(), which returns a list.
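The three prefixes can be tried side by side. The sketch below uses a made-up fragment and assumes html.parser; select_one(), which returns only the first match, is available in reasonably recent Beautiful Soup releases and is included here as an optional convenience.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one id and two classes
snippet = '<p class="title" id="intro"><b>Hi</b></p><p class="story">...</p>'
soup = BeautifulSoup(snippet, "html.parser")

print(len(soup.select("p")))            # tag name, no prefix
print(soup.select(".story")[0].text)    # class name, prefixed with .
print(soup.select("#intro")[0].b)       # id, prefixed with #
print(soup.select_one("p")["class"])    # select_one() gives the first match only
print(soup.select("table"))             # no match gives an empty list
```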
(1) Search by tag name

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("title"))
print(soup.select("b"))
print(soup.select("a"))

Output:

[<title>The Dormouse's story</title>]
[<b>The Dormouse's story</b>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
(2) Search by class name

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select(".title"))

Output:

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
(3) Search by id

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("#link1"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
(4) Combined search

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("p #link1"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
(5) Attribute lookup
Attribute conditions can also be added when searching. They must be enclosed in square brackets. Because the attribute and the tag name refer to the same node, no space may be placed between them; otherwise nothing will match.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("a[class='sister']"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Likewise, attribute conditions can be combined with the lookup methods above. Selectors for different nodes are separated by a space; selectors that apply to the same node are written without a space.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("p a[class='sister']"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
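The effect of the space is easiest to see by running a few selectors against the same fragment. The sketch below is a made-up example (html.parser assumed, and ids() is a hypothetical helper defined only for this illustration); it also includes the standard CSS child combinator >, which this article does not otherwise cover.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link directly in <p>, one nested in a <span>
snippet = (
    '<div><p class="story">'
    '<a class="sister" id="link1">Elsie</a>'
    '<span><a class="sister" id="link2">Lacie</a></span>'
    '</p></div>'
)
soup = BeautifulSoup(snippet, "html.parser")

def ids(selector):
    """Return the id of every element matched by a CSS selector."""
    return [tag["id"] for tag in soup.select(selector)]

print(ids("p a"))              # space: any <a> somewhere below a <p>
print(ids("p > a"))            # >: only direct children of a <p>
print(ids("a.sister#link2"))   # no space: all conditions on one node
print(ids("a [id]"))           # space before [: looks for a node inside <a>
```

The last selector returns nothing because the space makes it look for an element with an id inside an a tag, which is the mismatch the text above warns about.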
(6) Getting the content
The select() results above are all lists. We can loop over such a list and call get_text() on each item to obtain its text.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("p a[class='sister']"))
for item in soup.select("p a[class='sister']"):
    print(item.get_text())

Output:

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Lacie
Tillie

Note: <!-- Elsie --> is a comment, so get_text() does not output it and only an empty line is printed for the first link.
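If the empty results are unwanted, they can be filtered out after calling get_text(). The sketch below is one way to do it, on a made-up fragment with html.parser assumed; strip=True is a get_text() option that trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first link contains only a comment
snippet = (
    '<p><a id="link1"><!-- Elsie --></a>'
    '<a id="link2"> Lacie </a></p>'
)
soup = BeautifulSoup(snippet, "html.parser")

texts = [a.get_text() for a in soup.select("a")]
trimmed = [a.get_text(strip=True) for a in soup.select("a")]
non_empty = [t for t in trimmed if t]   # drop links with no visible text

print(texts)
print(non_empty)
```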