Parsel - Crawler page parsing tool
[Statement]: This article is a repost, but the address of the original source could not be found.
Parsel is a library from the Scrapy project and is Scrapy's built-in selector engine. It supports regular expression, CSS, and XPath selectors, and it can extract data from, or remove elements from, HTML and XML documents. Using this library keeps parsing code concise.
1 Installation
pip install parsel or easy_install parsel
2 Usage
2.1 Create a Selector from the HTML or XML to be parsed

    from parsel import Selector

    text = """
    <html>
        <body>
            <h1>Hello, Parsel!</h1>
            <div>
                <ul>
                    <li class="item-0"> <a href="link1.html">first item</a> </li>
                    <li class="item-1"> <a href="link2.html">second item</a> </li>
                    <li class="item-inactive"> <a href="link3.html">third item</a> </li>
                    <li class="item-1"> <a href="link4.html">fourth item</a> </li>
                    <li class="item-0"> <a href="link5.html">fifth item</a> </li>
                </ul>
            </div>
        </body>
    </html>
    """
    selector = Selector(text=text)  # In Python 2, text must be a unicode string.

2.2 Then use CSS or XPath expressions to extract the required information

    selector.css('h1::text')
    # [<Selector xpath='descendant-or-self::h1/text()' data='Hello, Parsel!'>]

    selector.css('h1::text').get()
    # 'Hello, Parsel!'

    selector.xpath('//h1/text()').getall()
    # ['Hello, Parsel!']

    # Combine selectors with regular expressions
    selector.xpath('//h1/text()').re_first(r'(H\w+)lo')
    # 'Hel'

    selector.xpath('//h1/text()').re(r'H(\w+)lo, Par(\w+el)')
    # ['el', 'sel']

    # XPath also supports the EXSLT regular expression extension (re: prefix)
    selector.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
    # ['link1.html', 'link2.html', 'link4.html', 'link5.html']
    # Without the predicate, selector.xpath('//li/a/@href').getall() returns all five links.

3 CSS syntax
3.1 Universal selector
Syntax: *
Description: * will match all elements of the document.
Example: div * matches all descendant elements inside the <div> element.

    from parsel import Selector

    text = """
    <div class="table">
        <plate />
        <plate />
    </div>
    """
    selector = Selector(text=text)
    selector.css('div *').getall()
    # ['<plate></plate>', '<plate></plate>']

3.2 Type selector
Syntax: elementname
Example: plate matches all <plate> elements.

    from parsel import Selector

    text = """
    <div class="table">
        <plate />
        <bento />
        <plate />
    </div>
    """
    selector = Selector(text=text)
    selector.css('plate').getall()
    # ['<plate></plate>', '<plate></plate>']

3.3 Class selector
Syntax: .classname
Example: .small matches any element whose class attribute contains the class "small".

    from parsel import Selector

    text = """
    <div class="table">
        <apple />
        <apple class="small" />
        <plate>
            <apple class="small" />
        </plate>
        <plate />
    </div>
    """
    selector = Selector(text=text)
    selector.css('.small').getall()
    # ['<apple class="small"></apple>', '<apple class="small"></apple>']

3.4 ID selector
Syntax: #idname
Example: #fancy matches the element with id "fancy".

    from parsel import Selector

    text = """
    <div class="table">
        <plate id="fancy" />
        <plate />
        <bento />
    </div>
    """
    selector = Selector(text=text)
    selector.css('#fancy').getall()
    # ['<plate id="fancy"></plate>']

3.5 Attribute selector
Syntax: [attr] [attr=value] [attr~=value] [attr|=value] [attr^=value] [attr$=value] [attr*=value]
Example | Effect |
---|---|
p[attr] | Select all p elements that have an attr attribute. |
p[attr=value] | Select all p elements whose attr attribute equals "value". |
p[attr~=value] | Select all p elements whose attr attribute contains the whitespace-separated word "value". |
p[attr\|=value] | Select all p elements whose attr attribute equals "value" or starts with "value-". |
p[attr^=value] | Select all p elements whose attr attribute value starts with "value". |
p[attr$=value] | Select all p elements whose attr attribute value ends with "value". |
p[attr*=value] | Select all p elements whose attr attribute value contains the substring "value". |

    from parsel import Selector

    text = """
    <div class="table">
        <apple class="small" />
        <bento for="Hayato">
            <pickle />
        </bento>
        <apple for="Ryota" />
        <plate for="Minato">
            <orange />
        </plate>
        <pickle class="small" />
    </div>
    """
    selector = Selector(text=text)
    selector.css('[for$="o"]').getall()
    # ['<bento for="Hayato">\n \t<pickle></pickle>\n </bento>', '<plate for="Minato">\n \t<orange></orange>\n </plate>']

3.6 Selector list
Description: the comma (,) combines different selectors. It selects all nodes that are matched by any selector in the list.
Syntax: element1,element2
Example: plate,bento matches all <plate> and <bento> elements.

    from parsel import Selector

    text = """
    <div class="table">
        <pickle class="small" />
        <pickle />
        <plate>
            <pickle />
        </plate>
        <bento>
            <pickle />
        </bento>
        <plate>
            <pickle />
        </plate>
        <pickle />
        <pickle class="small" />
    </div>
    """
    selector = Selector(text=text)
    selector.css('plate,bento').getall()
    # ['<plate>\n \t\t<pickle></pickle>\n </plate>', '<bento>\n \t<pickle></pickle>\n </bento>', '<plate>\n \t<pickle></pickle>\n </plate>']

3.7 Descendant combinator
Syntax: element1 element2
Description: the space combinator selects descendants of the preceding element. element1 element2 matches all element2 elements nested at any depth inside an element1 element.

    from parsel import Selector

    text = """
    <div class="table">
        <bento />
        <plate>
            <apple />
        </plate>
        <apple />
    </div>
    """
    selector = Selector(text=text)
    selector.css('plate apple').getall()
    # ['<apple></apple>']

3.8 Child combinator
Syntax: element1 > element2
Description: the > combinator selects direct children of the preceding element. element1 > element2 matches all element2 elements that are nested directly inside an element1 element.

    from parsel import Selector

    text = """
    <div class="table">
        <plate>
            <bento>
                <apple />
            </bento>
        </plate>
        <plate>
            <apple />
        </plate>
        <plate />
        <apple />
        <apple class="small" />
    </div>
    """
    selector = Selector(text=text)
    selector.css('plate>apple').getall()
    # ['<apple></apple>']

3.9 General sibling combinator
Syntax: element1~element2
Description: with the ~ combinator, the second element comes anywhere after the first one and shares the same parent node. element1~element2 matches all element2 elements that follow an element1 element under the same parent element.

    from parsel import Selector

    text = """
    <div class="table">
        <pickle />
        <bento>
            <orange class="small" />
        </bento>
        <pickle class="small" />
        <pickle />
        <plate>
            <pickle />
        </plate>
        <plate>
            <pickle class="small" />
        </plate>
    </div>
    """
    selector = Selector(text=text)
    selector.css('pickle~pickle').getall()
    # ['<pickle class="small"></pickle>', '<pickle></pickle>']

3.10 Adjacent sibling combinator
Syntax: element1+element2
Description: with the + combinator, the second element immediately follows the first one and shares the same parent node. element1+element2 matches all element2 elements that come directly after an element1 element.

    from parsel import Selector

    text = """
    <div class="table">
        <bento>
            <apple class="small" />
        </bento>
        <plate />
        <apple class="small" />
        <plate />
        <apple />
        <apple class="small" />
        <apple class="small" />
    </div>
    """
    selector = Selector(text=text)
    selector.css('plate+apple').getall()
    # ['<apple class="small"></apple>', '<apple></apple>']

3.11 Pseudo selectors
a Pseudo-classes
Pseudo-classes, written with a single colon (:), select elements based on state information that is not contained in the document tree.
Syntax: :first-child
Example: p:first-child selects every <p> element that is the first child of its parent element.
Syntax: :nth-child(n)
Example: p:nth-child(2) selects every <p> element that is the second child of its parent element.
Syntax: :nth-of-type(n)
Example: p:nth-of-type(2) selects every <p> element that is the second <p> element of its parent element.
Syntax: :not(selector)
Example: :not(p) selects every element that is not a <p> element.
b Pseudo-elements
Pseudo-elements, written with a double colon (::), represent entities that cannot be expressed in HTML semantics.
Example: p::first-line matches the first line of all <p> elements.
3.12 Fun exercises
![Screenshot 2020-07-07 10.33.05](https://img-blog.csdnimg.cn/img_convert/bb2205fc525ea28b76954c97078ae2e0.png)
http://flukeout.github.io/
4 XPath syntax
XPath uses path expressions to select nodes or node sets in XML documents. Nodes are selected by following paths or steps.

    <?xml version="1.0" encoding="UTF-8"?>
    <bookstore>
        <book>
            <title lang="eng">Harry Potter</title>
            <price>29.99</price>
        </book>
        <book>
            <title lang="eng">Learning XML</title>
            <price>39.95</price>
        </book>
    </bookstore>

4.1 Selecting nodes
Expression | Description |
---|---|
/ | Select from the root node |
// | Select matching nodes anywhere in the document, regardless of their position |
. | Select the current node |
.. | Select the parent node of the current node |
.// | Select matching nodes among all descendants of the current node |
@ | Select attributes |
nodename | Select all child elements of the current node that are named nodename |

Path expression | Effect |
---|---|
/bookstore | Select the root element bookstore |
//book | Select all book elements, regardless of where they are in the document |
bookstore | Select all bookstore elements that are children of the current node |
bookstore/book | Select all book elements that are children of bookstore |
bookstore//book | Select all book elements that are descendants of the bookstore element, regardless of where they are located under bookstore |
//@href | Select all attributes named href |
4.2 Predicates
A predicate is used to find a specific node or a node that contains a specified value. Predicates are written in square brackets.
Path expression | Effect |
---|---|
/bookstore/book[1] | Select the first book element that is a child of bookstore |
/bookstore/book[last()] | Select the last book element that is a child of bookstore |
/bookstore/book[last()-1] | Select the second-to-last book element that is a child of bookstore |
/bookstore/book[position()<3] | Select the first two book elements that are children of bookstore |
//title[@lang] | Select all title elements that have an attribute named lang |
//title[@lang='eng'] | Select all title elements that have a lang attribute with a value of eng |
/bookstore/book[price>35.00] | Select all book elements of bookstore whose price element has a value greater than 35.00 |
/bookstore/book[price>35.00]//title | Select all title elements inside the book elements of bookstore whose price element has a value greater than 35.00 |
4.3 Selecting several paths
You can select several paths by using the | operator in the path expression.
Path expression | Effect |
---|---|
//book/title \| //book/price | Select all title and price elements of all book elements |
//title \| //price | Select all title and price elements in the document |
/bookstore/book/title \| //price | Select all title elements of the book elements under bookstore, plus all price elements in the document |
4.4 Other expressions
Path expression | Effect |
---|---|
//input[@type='text' or @name='wd'] | Select all input elements in the document whose type attribute is text or whose name attribute is wd |
//*[contains(@class, 'ip')] | Match all elements in the document whose class attribute value contains the substring ip |
//*[starts-with(@id, 'xx')] | Match all elements in the document whose id attribute value starts with xx |
string(.) | Return the concatenated text of the current node and all of its descendants |