How to get a tag's name, attributes, content, and comments, and perform other operations, with the Python crawler library BeautifulSoup
1, Tag object
1. A Tag object corresponds to a tag in the original XML or HTML document.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
type(tag)
bs4.element.Tag
2. The name property of a tag
Each tag has a name, which is accessed through .name
tag.name
'b'
tag.name = "blockquote"  # Renaming the tag changes the markup generated from it
tag
<blockquote class="boldest">Extremely bold</blockquote>
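The renaming step above can be sketched as a standalone script. This is a minimal sketch, assuming the built-in html.parser backend instead of lxml so that it runs without an extra install; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used here so the sketch runs without lxml installed.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

original_name = tag.name      # name as parsed from the markup
tag.name = "blockquote"       # rename the tag in place
renamed_markup = str(soup)    # serialize the whole document again

print(original_name)
print(renamed_markup)
```

After the rename, soup.b no longer finds anything, because the tree now contains a blockquote tag instead.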
3. Attributes of a tag
Get a single attribute
tag['class']
['boldest']
Get all attributes as a dictionary
tag.attrs
{'class': ['boldest']}
Add or modify attributes
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
<blockquote class="verybold" id="1">Extremely bold</blockquote>
Delete attributes
del tag['class']
del tag['id']
tag
<blockquote>Extremely bold</blockquote>
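Indexing a tag with a missing attribute raises KeyError, so it can help to look attributes up defensively. The following is a small sketch of that idea, assuming html.parser as the backend; the id value 'intro' is made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so the sketch needs no extra parser package.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

missing = tag.get('id')             # returns None; tag['id'] would raise KeyError
has_class = tag.has_attr('class')   # check for an attribute without reading it

tag['id'] = 'intro'                 # hypothetical value, added for illustration
attrs_after_add = dict(tag.attrs)   # copy, so later changes do not affect it

del tag['id']
attrs_after_delete = dict(tag.attrs)

print(missing, has_class)
print(attrs_after_add)
print(attrs_after_delete)
```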
4. Multi-valued attributes of a tag
A multi-valued attribute such as class is returned as a list
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
print(css_soup.p['class'])
['body', 'strikeout']
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'lxml')
print(rel_soup.a['rel'])
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
['index']
<p>Back to the <a rel="index contents">homepage</a></p>
If the document is parsed as XML, no attributes are treated as multi-valued, and the value is returned as a single string
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
'body strikeout'
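The difference between list-valued and string-valued attributes can also be seen without the XML parser. The sketch below assumes html.parser and an invented id value; it uses get_attribute_list() and the multi_valued_attributes=None constructor argument from the Beautiful Soup documentation.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the id value "my id" is made up for this sketch.
markup = '<p class="body strikeout" id="my id"></p>'

soup = BeautifulSoup(markup, 'html.parser')
class_value = soup.p['class']   # class is multi-valued in HTML, so it is a list
id_value = soup.p['id']         # id is not multi-valued, so it stays one string
id_as_list = soup.p.get_attribute_list('id')   # always returns a list

# Turning multi-valued handling off makes an HTML parser keep plain strings.
plain_soup = BeautifulSoup(markup, 'html.parser', multi_valued_attributes=None)
plain_class = plain_soup.p['class']

print(class_value, id_value, id_as_list, plain_class)
```

get_attribute_list() is convenient when calling code wants to loop over values without checking whether it received a list or a string.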
2, NavigableString
1. Text is usually contained inside tags, and Beautiful Soup uses the NavigableString class to wrap that text
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
print(tag.string)
print(type(tag.string))
Extremely bold
<class 'bs4.element.NavigableString'>
2. A NavigableString behaves like a Python str (it is a subclass of str) and can be converted to a plain str with str()
unicode_string = str(tag.string)
print(unicode_string)
print(type(unicode_string))
Extremely bold
<class 'str'>
3. A string contained in a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method
tag.string.replace_with("No longer bold")
tag
<b class="boldest">No longer bold</b>
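The three points about NavigableString can be checked together in one short script. This is only a sketch, assuming html.parser as the backend; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: html.parser is used so the sketch runs without lxml installed.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

text = tag.string
is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)   # NavigableString is a subclass of str
plain = str(text)                # plain copy with no reference to the tree

text.replace_with("No longer bold")   # swap the string inside the tag
result = str(tag)

print(is_navigable, is_str, plain)
print(result)
```

Converting to a plain str is useful when the text needs to outlive the soup, since a NavigableString keeps a reference to the parsed tree it came from.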
3, BeautifulSoup object
The BeautifulSoup object represents the whole content of a document.
Most of the time you can treat it as a Tag object: it supports most of the methods for navigating and searching the document tree.
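As a rough illustration of treating the whole document like a tag, the sketch below searches and navigates directly from the soup object. It assumes html.parser, which does not wrap the fragment in html and body tags; with lxml the first child would differ.

```python
from bs4 import BeautifulSoup

# Assumption: html.parser keeps the fragment as-is (no <html>/<body> wrapper).
soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')

doc_name = soup.name                    # the document has the special name '[document]'
links = soup.find_all('a')              # searching works as it does on a Tag
first_child_name = soup.contents[0].name   # navigating works too
link_text = links[0].get_text()

print(doc_name, first_child_name, link_text)
```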
4, Comment and special string objects
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
type(comment)
bs4.element.Comment
The Comment object is a special type of NavigableString object
comment
'Hey, buddy. Want to buy a used parser?'
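Because Comment is a subclass of NavigableString, a crawler usually has to tell the two apart by type. The following sketch shows one way to do that, assuming html.parser and a made-up snippet that mixes visible text with a comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Illustrative markup: visible text and a comment inside the same tag.
markup = "<p>Visible text<!--Hey, buddy. Want to buy a used parser?--></p>"
soup = BeautifulSoup(markup, 'html.parser')

# Collect only the comments by checking the type of each string node.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Collect ordinary text nodes, skipping anything that is a Comment.
visible = [
    node for node in soup.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]

print(comments)
print(visible)
```

Filtering by type like this keeps comment text out of the extracted page content.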