Getting a tag's name, attributes, content, and comments

Posted by etoast on Wed, 19 Feb 2020 17:41:47 +0100

How to get a tag's name, attributes, content, and comments, and perform other operations, with the Python scraping library BeautifulSoup
1, Tag object

1. A Tag object corresponds to a tag in the original XML or HTML document.

from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>','lxml')
tag = soup.b
type(tag)
bs4.element.Tag
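
As a small sketch of how tag lookup behaves: attribute-style access such as soup.b returns the first matching tag, and returns None when the document has no such tag. The snippet assumes Python's built-in html.parser instead of lxml, only so that no extra package is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: html.parser replaces lxml here to avoid an extra dependency.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

first_b = soup.b          # attribute-style access: the first <b> tag
same_b = soup.find('b')   # equivalent lookup with find()
missing = soup.table      # the document has no <table>, so this is None

print(isinstance(first_b, Tag))
print(first_b is same_b)
print(missing)
```

Checking for None before using the result avoids an AttributeError on pages where the tag is absent.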

2. The name attribute of a tag

Every tag has a name, which is available as .name

tag.name
'b'
tag.name = "blockquote" # Modify the original document
tag
<blockquote class="boldest">Extremely bold</blockquote>
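
Since assigning to .name changes the generated markup, the same idea can be applied to every tag of one kind. A minimal sketch, assuming the built-in html.parser and a made-up two-tag document:

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so that lxml is not required.
soup = BeautifulSoup('<p><b>one</b> and <b>two</b></p>', 'html.parser')

# Rename every <b> tag to <strong>; text and attributes are left as they are.
for tag in soup.find_all('b'):
    tag.name = 'strong'

renamed = str(soup)
print(renamed)
# <p><strong>one</strong> and <strong>two</strong></p>
```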

3. Attributes of a tag

Get a single attribute

tag['class']
['boldest']

Get all attributes as a dictionary

tag.attrs
{'class': ['boldest']}

Add or modify an attribute

tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
<blockquote class="verybold" id="1">Extremely bold</blockquote>

Delete an attribute

del tag['class']
del tag['id']
tag
<blockquote>Extremely bold</blockquote>
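
Subscript access raises KeyError once an attribute is gone, so reading attributes that may be absent is usually safer with .get(). A hedged sketch, assuming the built-in html.parser and an invented link tag:

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so that lxml is not required.
soup = BeautifulSoup('<a href="/home" class="nav">Home</a>', 'html.parser')
link = soup.a

href = link.get('href')                           # '/home'
title = link.get('title')                         # None: no such attribute
title_or_default = link.get('title', 'untitled')  # fallback value
has_class = link.has_attr('class')                # True

try:
    link['title']          # subscript access on a missing attribute
except KeyError:
    missing_raises = True
else:
    missing_raises = False

print(href, title, title_or_default, has_class, missing_raises)
```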

4. Multi-valued attributes of a tag

Multi-valued attributes are returned as a list

css_soup = BeautifulSoup('<p class="body strikeout"></p>','lxml')
print(css_soup.p['class'])
['body', 'strikeout']
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>','lxml')
print(rel_soup.a['rel'])
['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
<p>Back to the <a rel="index contents">homepage</a></p>
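
Because some attributes come back as lists and others as plain strings, code that handles both can use get_attribute_list(), which always returns a list. A short sketch, assuming the built-in html.parser and a made-up paragraph tag:

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so that lxml is not required.
soup = BeautifulSoup('<p class="body strikeout" id="intro"></p>', 'html.parser')
p = soup.p

classes = p.get_attribute_list('class')  # multi-valued attribute
ids = p.get_attribute_list('id')         # single value, wrapped in a list
raw_id = p['id']                         # plain subscript access: a string

# Joining the list gives back the attribute as written in the markup.
class_string = ' '.join(p['class'])

print(classes, ids, raw_id, class_string)
```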

If the document is parsed as XML, there are no multi-valued attributes, and the value is returned as a single string

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

'body strikeout'

2, NavigableString object

1. Text inside a tag is usually a string, and Beautiful Soup wraps it in the NavigableString class

from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>','lxml')
tag = soup.b
print(tag.string)
print(type(tag.string))
Extremely bold
<class 'bs4.element.NavigableString'>
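
One thing to watch: .string is only set when a tag has a single child. The sketch below, which assumes the built-in html.parser and an invented paragraph, shows .string returning None for a tag with several children, with get_text() as the alternative.

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so that lxml is not required.
soup = BeautifulSoup('<p>Back to the <a>homepage</a></p>', 'html.parser')

only_child = soup.a.string     # <a> has exactly one string child
ambiguous = soup.p.string      # <p> has two children, so this is None
full_text = soup.p.get_text()  # joins every string inside <p>

print(only_child)
print(ambiguous)
print(full_text)
```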

2. A NavigableString behaves like a Python str. It can be converted to a plain str with the str() function

unicode_string = str(tag.string)
print(unicode_string)
print(type(unicode_string))
Extremely bold
<class 'str'>
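
The conversion matters mostly when the text is used outside Beautiful Soup: a NavigableString keeps a link to its place in the parse tree, while the plain str does not. A minimal sketch of the difference, assuming the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so that lxml is not required.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
nav = soup.b.string
plain = str(nav)

is_str_subclass = isinstance(nav, str)       # NavigableString subclasses str
same_text = (nav == plain)                   # the characters are identical
nav_parent_name = nav.parent.name            # still attached to the <b> tag
plain_has_parent = hasattr(plain, 'parent')  # a plain str has no tree link

print(is_str_subclass, same_text, nav_parent_name, plain_has_parent)
```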

3. The string contained in a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method

tag.string.replace_with("No longer bold")
tag
<b class="boldest">No longer bold</b>
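
Two details may be useful here: replace_with() returns the string that was removed, and assigning to tag.string is another way to swap the content. A hedged sketch, assuming the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so that lxml is not required.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

# replace_with() hands back the node it removed from the tree.
old = tag.string.replace_with('No longer bold')
after_replace = str(tag)

# Assigning to .string replaces the tag's content as well.
tag.string = 'Bold again'
after_assign = str(tag)

print(old)
print(after_replace)
print(after_assign)
```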

3, BeautifulSoup object

The BeautifulSoup object represents the whole content of a document.

Most of the time you can treat it as a Tag object: it supports most of the methods for navigating and searching the document tree.
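
A brief sketch of what "treating it as a Tag" looks like in practice, assuming the built-in html.parser and a made-up two-paragraph document. The object has no real HTML name, so its .name is the special value '[document]'.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: html.parser is used so that lxml is not required.
soup = BeautifulSoup('<p>one</p><p>two</p>', 'html.parser')

is_tag = isinstance(soup, Tag)   # BeautifulSoup is a subclass of Tag
doc_name = soup.name             # special name for the whole document
texts = [p.get_text() for p in soup.find_all('p')]  # search works as on a Tag

print(is_tag, doc_name, texts)
```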

4, Comment and special string objects

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup,'lxml')
comment = soup.b.string
type(comment)
bs4.element.Comment

The Comment object is a special type of NavigableString object

comment
'Hey, buddy. Want to buy a used parser?'
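
Because a Comment is a NavigableString, it shows up wherever strings do, which is often unwanted when extracting page text. One possible way to find and drop comments, assuming the built-in html.parser and an invented snippet of markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Assumption: html.parser is used so that lxml is not required.
markup = '<p>Visible text<!-- hidden note --></p>'
soup = BeautifulSoup(markup, 'html.parser')

# Match only the strings that are Comment objects.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
is_navigable = isinstance(comments[0], NavigableString)

# extract() removes each comment from the tree.
for c in comments:
    c.extract()

cleaned = str(soup)
print(cleaned)
```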

