Crawler - Beautiful Soup4 parser

Posted by rbama on Tue, 11 Jun 2019 23:57:40 +0200

Beautiful Soup makes parsing HTML simple, and its API is very user-friendly. It supports CSS selectors, the HTML parser in the Python standard library, and the lxml HTML and XML parsers.

Compared with regular expressions, it is simpler to use.

Examples:

First, import the bs4 library:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'


from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# Pretty-print the content of the soup object
print(soup.prettify())

Operation results

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Four categories of objects

BeautifulSoup converts complex HTML documents into a complex tree structure. Each node is a Python object. All objects can be summarized into four types:

Tag
NavigableString
BeautifulSoup
Comment

1.Tag

A Tag corresponds to a tag in HTML, such as:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

HTML tags such as title, head, a, and p, together with the content they enclose, are Tags. Let's use Beautiful Soup to get some Tags:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the title tag
print(soup.title)

# Print the head tag
print(soup.head)

# Print the first a tag
print(soup.a)

# Print the first p tag
print(soup.p)

# Print the type of soup.p
print(type(soup.p))

Operation results

<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>

We can easily retrieve a tag by accessing its name as an attribute of soup. The type of these objects is bs4.element.Tag. Note, however, that this returns only the first matching tag in the document. Querying all matching tags is covered later with find_all().
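The "first match only" behavior can be checked with a small sketch. The shortened HTML and the use of "html.parser" (instead of the lxml parser used in this article) are assumptions made so the snippet runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption for brevity)
html = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""

# "html.parser" ships with Python; the article itself uses "lxml"
soup = BeautifulSoup(html, "html.parser")

first = soup.a                    # attribute access returns only the first <a>
same = soup.find("a")             # find() returns that same first match
all_links = soup.find_all("a")    # find_all() returns every match as a list

first_id = first["id"]
all_ids = [a["id"] for a in all_links]

print(first_id)
print(first is same)
print(all_ids)
```

In other words, soup.a is a shortcut for soup.find("a"); use find_all() when every match is needed.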

For Tag, it has two important attributes, name and attrs.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# The soup object is special, and its name is [document]
print(soup.name)

# For other internal tags, the output value is the name of the tag itself.
print(soup.head.name)

# Print all attributes of the p tag, the type of which is a dictionary
print(soup.p.attrs)

# Print the class attribute of the p tag
print(soup.p['class'])
# You can also use the get method with the attribute name; here it returns the same value as above.
print(soup.p.get('class'))

print(soup.p)

# modify attribute
soup.p['class'] = "newClass"
print(soup.p)

# Delete attributes
del soup.p['class']
print(soup.p)

Operation results

[document]
head
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
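Indexing and get() return the same value when the attribute exists, but they differ when it is missing. The sketch below illustrates this on a one-line document; the HTML string, the fallback text "no-id", and the "html.parser" choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" name="dromouse"><b>text</b></p>', "html.parser")
p = soup.p

# get() returns None (or a supplied default) for a missing attribute
missing = p.get("id")
fallback = p.get("id", "no-id")

# Indexing raises KeyError for a missing attribute
try:
    p["id"]
    raised = False
except KeyError:
    raised = True

# has_attr() tests for presence without raising
has_class = p.has_attr("class")

print(missing, fallback, raised, has_class)
```

When an attribute may be absent, as is common on scraped pages, get() is usually the safer choice.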

2.NavigableString

Now that we have the tag, a question arises: how do we get the text inside it? That's easy: use .string, for example:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the content of the p tag
print(soup.p.string)

# Print the type of soup.p.string
print(type(soup.p.string))

Operation results

The Dormouse's story
<class 'bs4.element.NavigableString'>

3.BeautifulSoup

A BeautifulSoup object represents the whole document. Most of the time it can be treated as a special Tag object, so we can get its type, name, and attributes.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# Type of the name attribute
print(type(soup.name))

# Name
print(soup.name)

# Attributes
print(soup.attrs)

Operation results

<class 'str'>
[document]
{}

4.Comment

A Comment object is a special type of NavigableString whose output does not include the comment markers.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.a)

print(soup.a.string)

print(type(soup.a.string))

Operation results

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>

The content of the a tag is actually a comment, but when we output it with .string, the comment markers <!-- and --> have been removed.
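Because the markers are stripped, a comment can be mistaken for ordinary text. One way to tell them apart is an isinstance check against Comment, sketched below. The two-link HTML string, the variable name visible, and the "html.parser" choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

visible = []
for a in soup.find_all("a"):
    s = a.string
    # Comment is a subclass of NavigableString, so check for it explicitly
    if isinstance(s, Comment):
        continue
    visible.append(str(s))

print(visible)
```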

Traversing Document Tree

1. Direct child nodes: the .contents and .children attributes

.contents

A Tag's .contents attribute returns the Tag's child nodes as a list.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# The output is a list
print(soup.head.contents)

print(soup.head.contents[0])

Operation results

[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>

.children

It does not return a list, but we can get all the child nodes by traversing.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# .children is an iterator, not a list
print(soup.head.children)

# Obtain all child nodes by traversal
for child in soup.head.children:
    print(child)

Operation results

<list_iterator object at 0x008FF950>
<title>The Dormouse's story</title>

2. All descendants: the .descendants attribute

The .contents and .children attributes mentioned above contain only the direct child nodes of a Tag. The .descendants attribute recursively iterates over all descendants of a Tag. As with .children, we need to traverse it to retrieve the contents.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# .descendants is a generator object
print(soup.head.descendants)

# Obtain all descendant nodes by traversal
for child in soup.head.descendants:
    print(child)

Operation results

<generator object descendants at 0x00519AB0>
<title>The Dormouse's story</title>
The Dormouse's story
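To make the difference between direct children and all descendants concrete, the following sketch counts both for the head tag. The one-line HTML and the "html.parser" choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = soup.head

# Direct children only: the <title> tag
children = list(head.children)

# All descendants: the <title> tag, then the string inside it
descendants = list(head.descendants)

print(len(children), len(descendants))
print([type(node).__name__ for node in descendants])
```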

3. Node content: the .string attribute

If a Tag has only one child node and that child is a NavigableString, .string returns that child. If a Tag's only child is another Tag, .string returns the same result as that child's .string.

Put simply, if a tag contains no other tags, .string returns the text of that tag. If a tag contains exactly one nested tag, .string returns the text of the innermost tag. For example:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.head.string)

print(soup.head.title.string)

Operation results

The Dormouse's story
The Dormouse's story
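The opposite case is worth knowing too: when a tag has more than one child, .string is None, and get_text() can be used to collect all the text instead. A minimal sketch, assuming a shortened story paragraph and "html.parser":

```python
from bs4 import BeautifulSoup

html = '<p class="story">Once upon a time <a id="link2">Lacie</a> and <a id="link3">Tillie</a>;</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Several children, so .string cannot pick a single one
single = p.string

# get_text() concatenates every string under the tag
full_text = p.get_text()

print(single)
print(full_text)
```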

Search Document Tree

1.find_all(name, attrs, recursive, text, **kwargs)

1)name parameter

The name parameter finds all Tags with the given name; string objects are automatically ignored.

a. Passing strings

The simplest filter is a string. When a string parameter is passed into the search method, Beautiful Soup finds everything that matches the string and returns a list.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.find_all("b"))

print(soup.find_all("a"))

Operation results

[<b>The Dormouse's story</b>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

b. Passing regular expressions

If a compiled regular expression is passed in, Beautiful Soup matches tag names against it using the pattern's search() method.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

Operation results

body
b
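Beautiful Soup applies a compiled pattern with search() semantics, so a pattern without the ^ anchor matches anywhere in the tag name. The sketch below contrasts an anchored and an unanchored pattern; the small HTML string and the "html.parser" choice are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: only tag names that start with "b"
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any tag name that contains "t"
unanchored = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)
print(unanchored)
```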

c. Passing a list

If a list parameter is passed in, Beautiful Soup returns a list of contents that match any element in the list.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.find_all(['a', 'b']))

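2)keyword parameters

A keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name.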

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.find_all(id="link1"))

Operation results

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
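Keyword filters are not limited to id. The sketch below shows three common variants: class_ (with a trailing underscore, because class is a reserved word in Python), a regular expression on href, and the attrs dictionary. The shortened HTML and the "html.parser" choice are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by CSS class; note the trailing underscore
by_class = soup.find_all("a", class_="sister")

# Filter an attribute with a regular expression
by_href = soup.find_all(href=re.compile("lacie"))

# Filter with an explicit attribute dictionary
by_attrs = soup.find_all(attrs={"id": "link3"})

print(len(by_class))
print([a["id"] for a in by_href])
print([a["id"] for a in by_attrs])
```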

3)text parameter

The text parameter searches the string content of the document. Like the name parameter, it accepts strings, regular expressions, and lists.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

# String
print(soup.find_all(text = " Elsie "))

# list
print(soup.find_all(text = ["Tillie", " Elsie ", "Lacie"]))

# regular expression
print(soup.find_all(text = re.compile("Dormouse")))

Operation results

[' Elsie ']
[' Elsie ', 'Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
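In Beautiful Soup 4.4.0 and later, the same filter is also available under the name string. The sketch below uses that spelling and also shows that combining it with a tag name returns the tags rather than the strings. The short HTML and the "html.parser" choice are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

html = '<p><a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# string= alone returns the matching strings
strings = [str(s) for s in soup.find_all(string=re.compile("ie$"))]

# Combined with a tag name, it returns tags whose .string matches
tags = soup.find_all("a", string="Lacie")
tag_ids = [t["id"] for t in tags]

print(strings)
print(tag_ids)
```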

CSS Selector

This is another search method that is similar to the find_all() method.

In CSS, tag names are written as-is, class names are prefixed with ., and id names are prefixed with #.
We can filter elements the same way with soup.select(), which returns a list.

(1) Search by tag name

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("title"))

print(soup.select("b"))

print(soup.select("a"))

Operation results

[<title>The Dormouse's story</title>]
[<b>The Dormouse's story</b>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(2) Searching by Class Name

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select(".title"))

Operation results

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]

(3) Finding by id name

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("#link1"))

Operation results

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(4) Combination Search

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("p #link1"))

Operation results

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(5) Attribute lookup

Attribute conditions can also be added when searching. They must be enclosed in square brackets. Because the attribute and the tag name belong to the same node, there must be no space between them; otherwise nothing will match.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("a[class='sister']"))

Operation results

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Similarly, attribute conditions can be combined with the lookup methods above: selectors for different nodes are separated by spaces, while selectors for the same node are written without spaces.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("p a[class='sister']"))

Operation results

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(6) Access to content

The select() method returns its results as a list. We can iterate over the list and call get_text() on each element to obtain its content.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create Beautiful Soup object, specify lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("p a[class='sister']"))

for item in soup.select("p a[class='sister']"):
    print(item.get_text())

Operation results

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Lacie
Tillie

Note: <!-- Elsie --> is a comment, so get_text() does not output it.
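In a crawler, the link target is usually wanted along with the text. A minimal sketch of collecting both from select() results follows, with select_one() for the case where only the first match is needed. The shortened HTML, the variable names, and the "html.parser" choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Pair each link's text with its href attribute
links = [(a.get_text(), a["href"]) for a in soup.select("p a.sister")]

# select_one() returns the first match, or None if nothing matches
tillie = soup.select_one("#link3")
nothing = soup.select_one("#no-such-id")

print(links)
print(tillie["href"])
print(nothing)
```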
