Regex Parsing and bs4 Parsing for Web Crawlers

Posted by steved on Tue, 06 Aug 2019 13:53:36 +0200

Regex parsing (re module)

Single character:

.: Any character except a newline
[]: [aoe] [a-w] matches any one character in the set
\d: Digit [0-9]
\D: Non-digit
\w: Digits, letters, underscores, Chinese characters
\W: Anything \w does not match
\s: Any whitespace character, including spaces, tabs, form feeds, and so on. For ASCII text it is equivalent to [ \f\n\r\t\v].
\S: Non-whitespace
Quantifiers:
*: Any number of times, >= 0
+: At least once, >= 1
?: Optional, 0 or 1 times
{m}: Exactly m times
{m,}: At least m times, e.g. hello{3,} matches hell followed by three or more o
{m,n}: Between m and n times
Boundaries:
$: Ends with
^: Starts with
Grouping:
(ab)
Greedy mode: .*
Non-greedy (lazy) mode: .*?
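
A minimal sketch of a few of the symbols above in action. The sample strings and variable names are made up for illustration; only re.findall from the standard library is used.

```python
import re

# Made-up sample text for illustration
text = "tel: 010-12345678, zip 100086"

# \d+ grabs each run of one or more digits
digits = re.findall(r"\d+", text)

# \d{6} between word boundaries matches runs of exactly six digits
six_digits = re.findall(r"\b\d{6}\b", text)

html = "<b>one</b><b>two</b>"

# Greedy .* runs on to the last </b>
greedy = re.findall(r"<b>(.*)</b>", html)
# Non-greedy .*? stops at the first </b> it can
lazy = re.findall(r"<b>(.*?)</b>", html)

print(digits)
print(six_digits)
print(greedy)
print(lazy)
```

The greedy pattern returns a single long match that swallows the tags in the middle, while the non-greedy one returns each tag body separately. This is why the crawler examples in this post use .*? throughout.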

Usage

re.I: Ignore case
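
A small sketch of what the flag changes, using a made-up sample string:

```python
import re

# Made-up sample text for illustration
text = "Python, PYTHON and python"

# Without the flag only the exact lowercase spelling matches
case_sensitive = re.findall("python", text)
# With re.I every spelling matches
ignore_case = re.findall("python", text, re.I)

print(case_sensitive)
print(ignore_case)
```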

re.M: Multi-line matching (^ and $ match at the start and end of every line)

string = '''fall in love with you
i love you very much
i love she
i love her
'''
re.findall('^i.*',string,re.M)

re.S: Single-line matching (. also matches newlines, so the whole string is treated as one line)

# Let . match across all lines
string = """<div>Reflect on Terror
Your teammates are reading books
Your enemies are sharpening their knives
Your girlfriend is losing weight
Lao Wang next door is training his waist
</div>
"""
re.findall('.*',string,re.S)

re.sub(pattern, replacement, string)
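
A short sketch of re.sub with a made-up sample string. The second call assumes you want to reuse part of the match, which is done with a group and a \1 backreference in the replacement.

```python
import re

# Made-up sample text for illustration
text = "price: 100 yuan, discount: 20 yuan"

# Replace every run of digits with a placeholder
masked = re.sub(r"\d+", "N", text)

# A group captured by the pattern can be reused in the replacement as \1
tagged = re.sub(r"(\d+) yuan", r"CNY \1", text)

print(masked)
print(tagged)
```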

Example

Crawl the pictures from Qiushibaike (qiushibaike.com)

import requests
import re
import urllib.request
import os

url = 'https://www.qiushibaike.com/pic/page/%d/?s=5216960'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
# Create a folder for storing pictures
if not os.path.exists('./qiutu'):
    os.mkdir('./qiutu')
start_page = int(input('enter a start pageNum:'))
end_page = int(input('enter a end pageNum:'))
for page in range(start_page,end_page+1):
    # Fill the page number into the URL template
    new_url = url % page
    page_text = requests.get(url=new_url,headers=headers).text
    # re.S lets . match newlines, so the pattern can span a whole <div> block
    img_url_list = re.findall('<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>',page_text,re.S)
    for img_url in img_url_list:
        # Prepend the scheme to the scraped src value
        img_url = "https:"+img_url
        imgName = img_url.split('/')[-1]
        imgPath = 'qiutu/'+imgName
        urllib.request.urlretrieve(url=img_url,filename=imgPath)
        print(imgPath,'Download success!')
print("over!")
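
The extraction step can be tried offline. The sketch below runs the same pattern against a hypothetical page fragment (the markup and the pic.example.com URLs are invented for illustration, not copied from the real site) and shows why re.S is needed.

```python
import re

# Hypothetical fragment shaped like the markup the pattern expects
page_text = '''
<div class="thumb">
<a href="/article/1" target="_blank">
<img src="//pic.example.com/a/one.jpg" alt="first">
</a>
</div>
<div class="thumb">
<a href="/article/2" target="_blank">
<img src="//pic.example.com/b/two.jpg" alt="second">
</a>
</div>
'''

pattern = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'

# Without re.S the dot cannot cross a newline, so nothing matches
without_s = re.findall(pattern, page_text)
# With re.S the dot also matches newlines and each block yields one src
img_url_list = re.findall(pattern, page_text, re.S)

for img_url in img_url_list:
    full_url = "https:" + img_url
    img_name = full_url.split("/")[-1]
    print(full_url, "->", img_name)
```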

bs4 parsing

Install bs4 and lxml

pip install bs4
pip install lxml

How parsing works

Load the source code to be parsed into a BeautifulSoup object
Call methods or attributes of that object to locate the relevant tags in the source code
Get the text or attribute values of the located tags
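
The three steps can be sketched as follows. The HTML string is made up, and the sketch passes Python's built-in 'html.parser' instead of 'lxml' so that it runs without the extra dependency; this is a swap made only for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = '<html><body><a href="/page/1" title="first">Chapter 1</a></body></html>'

# Step 1: load the source into a BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# Step 2: locate a tag through a method of that object
link = soup.find("a")

# Step 3: read the text and an attribute value from the located tag
link_text = link.text
link_href = link["href"]

print(link_text, link_href)
```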

Basic use

Workflow:
- Import: from bs4 import BeautifulSoup
- Usage: convert the HTML into a BeautifulSoup object, then find the nodes you need through its methods or attributes.
(1) Convert a local file:
- soup = BeautifulSoup(open('local file'),'lxml')
(2) Convert a network response:
- soup = BeautifulSoup('string or bytes','lxml')
(3) Printing the soup object displays the content of the HTML document
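
A minimal sketch of both ways of building the object. The file name demo.html and the markup are hypothetical, a temporary directory stands in for a real local file, and 'html.parser' is used in place of 'lxml' so the example has no extra dependency.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up markup for illustration
html = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"

# (1) From a local file: write a temporary file first so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(html)
    # BeautifulSoup reads the open file handle while the object is built
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

# (2) From data already in memory, either str or bytes
soup_from_str = BeautifulSoup(html, "html.parser")
soup_from_bytes = BeautifulSoup(html.encode("utf-8"), "html.parser")

print(soup_from_file.title.string)
print(soup_from_bytes.p.text)
```

With requests, response.text gives a str and response.content gives bytes; either can be passed as the first argument.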

Basic method calls:
(1) Search by tag name
- soup.a finds only the first tag that matches
(2) Get attributes
- soup.a.attrs gets all attributes and attribute values of a and returns a dictionary
- soup.a.attrs['href'] gets the href attribute
- soup.a['href'] is the shorthand form
(3) Get content
- soup.a.string
- soup.a.text
- soup.a.get_text()
[Note] If the tag has more than one child node (for example nested tags next to text), string returns None, while the other two return all the text.
(4) find: find the first tag that meets the requirements
- soup.find('a') finds the first matching tag
- soup.find('a',title="xxx")
- soup.find('a',alt="xx")
- soup.find('a',class_="xx") (class is a Python keyword, so the argument is spelled class_)
- soup.find('a',id="xxxx")
(5) find_all: find all tags that meet the requirements
- soup.find_all('a')
- soup.find_all(['a','b']) finds all a and b tags
- soup.find_all('a', limit=2) returns only the first two
(6) select: pick content with a CSS selector
- soup.select('#feng')
- Common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors
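
The calls listed above can be tried on a small document. The markup below is made up (the id feng is borrowed from the selector example), and 'html.parser' replaces 'lxml' only so the sketch runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<div id="feng">
  <a href="/one" title="first" class="nav">One</a>
  <a href="/two" class="nav">Two <b>bold</b></a>
  <p class="note">Plain paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# (1) and (2): first <a> tag and its attributes
first_a = soup.a
attrs = first_a.attrs
href = soup.a["href"]

# (3): .string is None once a tag has more than one child node
second_a = soup.find_all("a")[1]
second_string = second_a.string
second_text = second_a.get_text()

# (4): find returns the first match only; note class_ with an underscore
by_title = soup.find("a", title="first")
by_class = soup.find("a", class_="nav")

# (5): find_all returns a list, optionally capped with limit
names = [tag.name for tag in soup.find_all(["a", "p"])]
limited = soup.find_all("a", limit=1)

# (6): select takes CSS selectors and always returns a list
by_id = soup.select("#feng")
by_css_class = soup.select(".nav")
children = soup.select("#feng > a")   # direct children only
descendants = soup.select("div b")    # any depth below div

print(href, second_string, second_text)
print(names, len(by_css_class), len(children), len(descendants))
```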

Example

Download the novel Romance of the Three Kingdoms from the ancient poetry and literature site (shicimingju.com)

import requests
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
page_text = requests.get(url=url,headers=headers).text

soup = BeautifulSoup(page_text,'lxml')

a_list = soup.select('.book-mulu > ul > li > a')

with open('sanguo.txt','w',encoding='utf-8') as fp:
    for a in a_list:
        # Each <a> holds a chapter title and a relative link to the chapter page
        title = a.string
        detail_url = 'http://www.shicimingju.com' + a['href']
        detail_page_text = requests.get(url=detail_url,headers=headers).text
        # Parse the chapter page and pull the text out of the content <div>
        detail_soup = BeautifulSoup(detail_page_text,'lxml')
        content = detail_soup.find('div',class_='chapter_content').text
        fp.write(title+'\n'+content)
        print(title,'Download finished')
print('over')
