Parsel -- crawler page parsing tool

Posted by stevehossy on Mon, 24 Jan 2022 01:23:03 +0100

[Statement]: This article is a repost; the address of the original source could not be found.

parsel is a library from the Scrapy project and serves as Scrapy's built-in selector engine. It supports regular-expression (re), CSS and XPath selectors, and it can extract and remove data from HTML and XML documents. Using it keeps scraping code concise.

1 Installation

pip install parsel or easy_install parsel

2 Usage

2.1 Create a Selector from the HTML or XML to be parsed

from parsel import Selector
text = """
<html>
    <body>
    	<h1>Hello, Parsel!</h1>
        <div>
          <ul>
             <li class="item-0">
             	<a href="link1.html">first item</a>
             </li>
             <li class="item-1">
             	<a href="link2.html">second item</a>
             </li>
             <li class="item-inactive">
             	<a href="link3.html">third item</a>
             </li>
             <li class="item-1">
             	<a href="link4.html">fourth item</a>
             </li>
             <li class="item-0">
             	<a href="link5.html">fifth item</a>
             </li>
          </ul>
 		</div>
    </body>
</html>
"""
selector = Selector(text=text)  # In Python 2, text must be a unicode string.

2.2 Then use CSS or XPath expressions to extract the required information

selector.css('h1::text')
# [<Selector xpath='descendant-or-self::h1/text()' data='Hello, Parsel!'>]
selector.css('h1::text').get()
# 'Hello, Parsel!'
selector.xpath('//h1/text()').getall()
# ['Hello, Parsel!']
# Combine selectors with regular expressions
selector.xpath('//h1/text()').re_first(r'(H\w+)lo')
# 'Hel'
selector.xpath('//h1/text()').re(r'H(\w+)lo, Par(\w+el)')
# ['el', 'sel']
# XPath expressions can also use the EXSLT regular-expression extension (re:test)
selector.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
# ['link1.html', 'link2.html', 'link4.html', 'link5.html']
# Without the predicate, selector.xpath('//li/a/@href').getall() would also return 'link3.html'.

3 CSS syntax

3.1 Universal selector

Syntax: *

Description: * matches every element in the document.

Example: div * matches every element inside the <div> element.

from parsel import Selector
text = """
<div class="table">
  <plate />
  <plate />
</div>
"""
selector = Selector(text=text)
selector.css('div *').getall()
# ['<plate></plate>', '<plate></plate>']

3.2 Type selector

Syntax: elementname

Example: plate matches all < plate > elements.

from parsel import Selector
text = """
<div class="table">
  <plate />
  <bento />
  <plate />
</div>
"""
selector = Selector(text=text)
selector.css('plate').getall()
# ['<plate></plate>', '<plate></plate>']

3.3 Class selector

Syntax: .classname

Example: .small matches every element whose class attribute contains the class "small".

from parsel import Selector
text = """
<div class="table">
  <apple />
	<apple class="small" />
	<plate>
		<apple class="small" />
	</plate>
<plate />

</div>
"""
selector = Selector(text=text)
selector.css('.small').getall()
# ['<apple class="small"></apple>', '<apple class="small"></apple>']

3.4 ID selector

Syntax: #idname

Example: #fancy matches the element with id "fancy".

from parsel import Selector
text = """
<div class="table">
  <plate id="fancy" />
  <plate />
  <bento />
</div>
"""
selector = Selector(text=text)
selector.css('#fancy').getall()
# ['<plate id="fancy"></plate>']

3.5 Attribute selector

Syntax: [attr] [attr=value] [attr~=value] [attr|=value] [attr^=value] [attr$=value] [attr*=value]

Example           Effect
p[attr]           Selects all p elements that have an attr attribute.
p[attr=value]     Selects all p elements whose attr attribute is exactly "value".
p[attr~=value]    Selects all p elements whose attr attribute contains the word "value" in a whitespace-separated list.
p[attr|=value]    Selects all p elements whose attr attribute is exactly "value" or starts with "value-".
p[attr^=value]    Selects all p elements whose attr attribute starts with "value".
p[attr$=value]    Selects all p elements whose attr attribute ends with "value".
p[attr*=value]    Selects all p elements whose attr attribute contains the substring "value".

from parsel import Selector
text = """
<div class="table">
  <apple class="small" />
  <bento for="Hayato">
  	<pickle />
  </bento>
  <apple for="Ryota" />
  <plate for="Minato">
  	<orange />
  </plate>
  <pickle class="small" />
</div>
"""
selector = Selector(text=text)
selector.css('[for$="o"]').getall()
# ['<bento for="Hayato">\n  \t<pickle></pickle>\n  </bento>', '<plate for="Minato">\n  \t<orange></orange>\n  </plate>']

3.6 Selector list

Note: the comma (,) combines several selectors into a list. The list selects every node that is matched by at least one of its selectors.

Syntax: element1,element2

Example: plate,bento matches all <plate> elements and all <bento> elements.

from parsel import Selector
text = """
<div class="table">
  <pickle class="small" />
  <pickle />
  <plate>
 		<pickle />
  </plate>
  <bento>
  	<pickle />
  </bento>
  <plate>
  	<pickle />
  </plate>
  <pickle />
  <pickle class="small" />
</div>
"""
selector = Selector(text=text)
selector.css('plate,bento').getall()
# ['<plate>\n \t\t<pickle></pickle>\n  </plate>', '<bento>\n  \t<pickle></pickle>\n  </bento>', '<plate>\n  \t<pickle></pickle>\n  </plate>']

3.7 Descendant combinator

Syntax: element1 element2

Note: the space is the descendant combinator. element1 element2 matches every element2 that is a descendant of an element1, at any depth.

from parsel import Selector
text = """
<div class="table">
  <bento />
  <plate>
  	<apple />
  </plate>
  <apple />
</div>
"""
selector = Selector(text=text)
selector.css('plate apple').getall()
# ['<apple></apple>']

3.8 Child combinator

Syntax: element1 > element2

Note: > is the child combinator. element1 > element2 matches every element2 that is a direct child of an element1.

from parsel import Selector
text = """
<div class="table">
  <plate>
		<bento>
			<apple />
		</bento>
	</plate>
	<plate>
		<apple />
	</plate>
	<plate />
	<apple />
	<apple class="small" />
</div>
"""
selector = Selector(text=text)
selector.css('plate>apple').getall()
# ['<apple></apple>']

3.9 General sibling combinator

Syntax: element1~element2

Description: ~ is the general sibling combinator. element1~element2 matches every element2 that comes anywhere after an element1 and shares the same parent element.

from parsel import Selector
text = """
<div class="table">
    <pickle />
    <bento>
    	<orange class="small" />
    </bento>
    <pickle class="small" />
    <pickle />
    <plate>
    	<pickle />
    </plate>
    <plate>
    	<pickle class="small" />
    </plate>
</div>
"""
selector = Selector(text=text)
selector.css('pickle~pickle').getall()
# ['<pickle class="small"></pickle>', '<pickle></pickle>']

3.10 Adjacent sibling combinator

Syntax: element1+element2

Description: + is the adjacent sibling combinator. element1+element2 matches every element2 that immediately follows an element1 and shares the same parent element.

from parsel import Selector
text = """
<div class="table">
    <bento>
    	<apple class="small" />
    </bento>
    <plate />
    <apple class="small" />
    <plate />
    <apple />
    <apple class="small" />
    <apple class="small" />
</div>
"""
selector = Selector(text=text)
selector.css('plate+apple').getall()
# ['<apple class="small"></apple>', '<apple></apple>']

3.11 Pseudo selectors

a. Pseudo-classes
A pseudo-class, written with a single colon (:), selects elements based on state information that is not contained in the document tree.

Syntax: :first-child
Example: p:first-child selects every <p> element that is the first child of its parent.

Syntax: :nth-child(n)
Example: p:nth-child(2) selects every <p> element that is the second child of its parent.

Syntax: :nth-of-type(n)
Example: p:nth-of-type(2) selects every <p> element that is the second <p> element of its parent.

Syntax: :not(selector)
Example: :not(p) selects every element that is not a <p> element.

b. Pseudo-elements
A pseudo-element, written with a double colon (::), represents an entity that cannot be expressed in HTML markup itself.

Example: p::first-line matches the first line of every <p> element.

3.12 Fun exercises

http://flukeout.github.io/

4 XPath syntax

XPath uses path expressions to select nodes or node sets in XML documents. Nodes are selected by following paths or steps.

<?xml version="1.0" encoding="UTF-8"?>
 
<bookstore>
 
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
 
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
 
</bookstore>

4.1 Selecting nodes

Expression    Description
/             Selects from the root node.
//            Selects matching nodes anywhere in the document, regardless of their position.
.             Selects the current node.
..            Selects the parent of the current node.
.//           Selects matching nodes anywhere below the current node.
@             Selects attributes.
nodename      Selects all child elements named nodename.

Path expression    Effect
/bookstore         Selects the root element bookstore.
//book             Selects all book elements, wherever they are in the document.
bookstore          Selects all bookstore elements that are children of the current node.
bookstore/book     Selects all book elements that are children of a bookstore element.
bookstore//book    Selects all book elements that are descendants of a bookstore element, at any depth.
//@href            Selects all attributes named href.

4.2 Predicates

A predicate selects a specific node or a node that contains a specific value. Predicates are written in square brackets.

Path expression                        Effect
/bookstore/book[1]                     Selects the first book element that is a child of bookstore.
/bookstore/book[last()]                Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]              Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]          Selects the first two book elements that are children of bookstore.
//title[@lang]                         Selects all title elements that have an attribute named lang.
//title[@lang='eng']                   Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]           Selects all book children of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]//title    Selects all title elements inside the book children of bookstore whose price element has a value greater than 35.00.

4.3 Multipath selection

You can select several paths by using the | operator in the path expression.

Path expression                    Effect
//book/title | //book/price        Selects all title and all price elements that are children of book elements.
//title | //price                  Selects all title and all price elements in the document.
/bookstore/book/title | //price    Selects all title elements of the book children of bookstore, and all price elements in the document.

4.4 Others

Path expression                        Effect
//input[@type='text' or @name='wd']    Selects all input elements whose type attribute is text or whose name attribute is wd.
//*[contains(@class, 'ip')]            Selects all elements whose class attribute contains the substring ip.
//*[starts-with(@id, 'xx')]            Selects all elements whose id attribute starts with xx.
string(.)                              Returns the string value of the current node: the concatenated text of the node and all of its descendants.

Reference material:
1 https://parsel.readthedocs.io/en/latest/index.html
2 https://www.w3school.com.cn/cssref/css_selectors.asp
3 https://www.runoob.com/xpath/xpath-syntax.html

Topics: Python html crawler xpath