Python Crawler Introduction: pyquery Library Basics

Posted by Elizabeth on Sat, 05 Oct 2019 15:25:15 +0200

Basic usage of pyquery

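  • find() finds descendant nodes that match a selector
  • children() gets direct child nodes
  • parent() gets the direct parent node
  • parents() gets all ancestor nodes
  • siblings() gets sibling nodes
  • items() returns a generator of PyQuery objects for iterating over the matches
  • attr() gets an attribute value
  • text() extracts the text content
  • html() gets the inner HTML as a string
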
html = """
<div>
<ul class="list">
<li class="item-0">one</li>
<li class="item-1"><a href="www.csdn.net">two</a></li>
<li class="item-0" id="three"><span class="bold"><a href="www.baidu.com">three</a></span></li>
<li class="item-1 active"><a href="www.csdn.net">four</a></li>
<li class="item-0"><a href="www.csdn.net">five</a></li>
</ul>
</div>
"""
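from pyquery import PyQuery as pq
# Passing a URL makes pyquery fetch and parse the page (requires network access)
doc = pq(url='https://www.qq.com')
print(doc('title'))
# Parse the HTML string defined above
doc = pq(html)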
print(doc.find('ul').children('.item-0'))
print('*'*20+'text'+'*'*20)
# Indexing returns a raw lxml element, so .text is an attribute here, not a method
print(doc.find('ul').children('.item-0')[0].text)
print('*'*20+'parent'+'*'*20)
print(doc.find('ul').children('.item-0').parent())
print('*'*20+'parents'+'*'*20)
print(doc.find('ul').children('.item-0').parents())
print('*'*20+'siblings'+'*'*20)
print(doc.find('ul').children('.item-1.active').siblings())
print('*'*20+'traverse with items()'+'*'*20)
for item in doc.find('ul').children('.item-0').items():
    print(item)
print('*'*20+'get attributes with attr()'+'*'*20)
print(doc.find('ul').children('.item-1.active').attr('class'))
# attr also supports item access, equivalent to the call above
print(doc.find('ul').children('.item-1.active').attr['class'])
print('*'*20+'get text content with text()'+'*'*20)
print(doc.find('ul').children('.item-1.active').text())
print('*'*20+'get inner HTML with html()'+'*'*20)
print(doc.find('ul').children('.item-1.active').html())

Node operations

  • addClass() adds a class
  • removeClass() removes a class
  • attr('name', 'value') sets an attribute when given two arguments; with one argument it reads the attribute
  • text('new text') replaces the text content when given an argument
  • html('new html') replaces the inner HTML when given an argument
  • remove() removes elements
  • append() inserts content at the end of the selected element
  • empty() removes all child nodes of the selected element
  • prepend() inserts content at the beginning of the selected element

html = """
<div>
<ul class="list">
<li class="item-0">one</li>
<li class="item-1"><a href="www.csdn.net">two</a></li>
<li class="item-0" id="three"><span class="bold"><a href="www.baidu.com">three</a></span></li>
<li class="item-1 active"><a href="www.csdn.net">four</a></li>
<li class="item-0"><a href="www.csdn.net">five</a></li>
</ul>
</div>
"""
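from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-1.active')
print(li)
# Remove the "active" class
li.removeClass('active')
print(li)
# Add a new class
li.addClass('activate')
print(li)
# Set the name attribute
li.attr('name','link')
print(li)
# Replace the content with plain text
li.text('changed item')
print(li)
# Replace the content with HTML
li.html('<span>changed item</span>')
print(li)
# Insert an image at the end, then another at the beginning
li.append('<img src="img.png"/>')
print(li)
li.prepend('<img src="img2.png"/>')
print(li)
# Remove the descendant elements that match the selector
li.remove('img')
print(li)
# Remove all remaining child nodes
li.empty()
print(li)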

Pseudo-class selectors

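  • :first-child matches an element that is the first child of its parent
  • :last-child matches an element that is the last child of its parent
  • :nth-child(2) matches an element that is the second child of its parent
  • :gt(2) matches elements whose zero-based index is greater than 2, i.e. from the fourth element onward (a jQuery-style extension, not standard CSS)
  • :nth-child(2n) matches every second child, i.e. the 2nd, 4th, 6th...
  • :contains("four") matches elements whose text contains the given string
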
html = """
<div class="wrap">
<div class="container">
<ul class="list">
<li class="item-0">one</li>
<li class="item-1"><a href="www.csdn.net">two</a></li>
<li class="item-0" id="three"><span class="bold"><a href="www.baidu.com">three</a></span></li>
<li class="item-1 active"><a href="www.csdn.net">four</a></li>
<li class="item-0"><a href="www.csdn.net">five</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
doc = pq(html)
# Get the li that is the first child of its parent
li = doc('li:first-child')
print(li)
# Get the li that is the last child of its parent
li = doc('li:last-child')
print(li)
# Get the li that is the second child of its parent
li = doc('li:nth-child(2)')
print(li)
# Get the elements after index 2 (the fourth onward)
li = doc('li:gt(2)')
print(li)
# Get every second element (the 2nd and 4th)
li = doc('li:nth-child(2n)')
print(li)
# Get the elements whose text contains "four"
li = doc('li:contains("four")')
print(li)

Topics: Python