Crawler learning notes

Posted by kumschick on Fri, 22 Oct 2021 15:42:17 +0200

1, What is a crawler?

In essence, a crawler is an application that sends requests to a website or URL, fetches the resources, and then analyzes them to extract useful data. It can be used to obtain text, or to download pictures or music. Crawlers can also verify hyperlinks and HTML code. Web search engines and other sites use crawler software to update their own content or to index other websites.

2, The working steps of a crawler

(1) Obtain data

Send a request to the server for the given web address.

Determine the target URL of the data to be crawled, the data to carry with the request, and the various HTTP headers. The standard-library urllib package and the third-party requests library are commonly used for sending requests.
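As a minimal standard-library illustration of building such a request (the URL is the one used later in this post; the requests library used below works similarly):

```python
from urllib.request import Request

# Build a request carrying a custom User-Agent header.
# Passing it to urlopen(req) would actually send it when online.
req = Request("https://www.ctbu.edu.cn/",
              headers={"User-Agent": "Mozilla/5.0"})
print(req.full_url)                  # the target URL
print(req.get_header("User-agent"))  # the header we attached
```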

(2) Analyze data

Analyze the web page structure in the browser developer tools (HTML, network requests): first check the initial document request to see whether the required data is already present as elements in the HTML, then look at XHR requests for dynamically refreshed data.

Analyze the returned data and extract the required information.

Commonly used parsing libraries include html.parser, which ships with Python, and the installable third-party libraries BeautifulSoup4 and lxml.

BeautifulSoup4 parses web pages by their structure and attributes, and converts document encodings automatically.
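The stdlib html.parser mentioned above can do this kind of extraction without any third-party installs. A minimal sketch that pulls `src` attributes out of `img` tags (sample HTML is made up for illustration):

```python
from html.parser import HTMLParser

class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.srcs.append(value)

parser = ImgSrcParser()
parser.feed('<div><img src="/logo.png"><img src="http://x.com/a.jpg"></div>')
print(parser.srcs)  # → ['/logo.png', 'http://x.com/a.jpg']
```

BeautifulSoup4 (used in the example below) offers a much more convenient query API on top of the same idea.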

(3) Store data

The extracted data sometimes needs to be cleaned first; it can then be stored in a database, written to disk files (such as CSV files), or cached.

When the amount of crawled data is small, it can be stored as plain documents such as TXT, JSON, or CSV files.

When the amount of data is large, a database such as MySQL can be used.
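For the small-data case, the stdlib csv module is enough. A sketch with hypothetical sample rows (the file name and contents are made up):

```python
import csv

# Hypothetical sample rows: picture name and source URL
rows = [["name", "url"],
        ["logo.png", "https://www.ctbu.edu.cn/logo.png"]]

# Write the rows to a CSV file
with open("images.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Read them back to confirm
with open("images.csv", newline="", encoding="utf-8") as f:
    print(list(csv.reader(f)))
```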

3, Crawler example (crawling the website of Chongqing Technology and Business University)

1. Install the modules

In cmd, enter:

pip install requests
pip install beautifulsoup4
pip install lxml

2. Import the modules and set some basic variables

import re                      # regular expressions
import requests                # HTTP requests for fetching page content
from bs4 import BeautifulSoup  # HTML parsing (from the bs4 package)
 
# The target web address
weburl = r"https://www.ctbu.edu.cn/"

# Directory for saving pictures; create it in advance
# (note: the name shadows the built-in dir())
dir = "images"

# Regular expression to check whether a picture address is an absolute URL starting with http
reg = re.compile(r'^http', re.I)
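To see what that pattern distinguishes, a quick standalone check (example addresses are made up):

```python
import re

reg = re.compile(r'^http', re.I)
print(bool(reg.match("https://example.com/a.png")))  # → True: absolute address
print(bool(reg.match("/images/logo.png")))           # → False: relative address
```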

3. Use the requests module to request the specified website and obtain its HTML.

# Set request headers to disguise the request as a browser and accept Chinese content
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
    'Accept-Language': "zh-CN,zh;q=0.9"
}
 
# Request the page
req = requests.get(weburl, headers=headers)
req.encoding = req.apparent_encoding    # guess the response encoding to avoid garbled text
 
# If the status code is OK, keep the returned HTML
if req.status_code == 200:
    html = req.text   # the page source as a str

4. Extract the picture addresses from the HTML.

# Parse the HTML with BeautifulSoup and collect the picture addresses
bs = BeautifulSoup(html, 'lxml')  # the lxml parser must be installed
images = bs.select("img")  # find all img tags on the page
imgSrc = []
for img in images:
    url = img.get("src")
    if url is None:         # skip img tags without a src attribute
        continue
    if not reg.match(url):  # no leading http: a relative address, so prepend the site URL to complete it
        imgSrc.append(weburl + url)
    else:
        imgSrc.append(url)
print(imgSrc)  # all picture addresses
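Note that plain string concatenation can mishandle addresses such as ../a.png or //cdn.example.com/a.jpg. The standard library's urljoin is a more robust way to complete relative addresses; a sketch (not part of the original code, example paths made up):

```python
from urllib.parse import urljoin

weburl = "https://www.ctbu.edu.cn/"
# Relative addresses are resolved against the base URL
print(urljoin(weburl, "/static/logo.png"))
# Absolute addresses pass through unchanged
print(urljoin(weburl, "http://cdn.example.com/a.jpg"))
```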

5. Loop over the picture address list imgSrc and request each picture address in turn with requests, saving each response to disk.

for i in imgSrc:
    imgName = i.split("/")[-1]          # picture file name
    imgSaveDir = dir + "/" + imgName    # path to save the picture
    imgRq = requests.get(i)             # request the picture
    with open(imgSaveDir, "wb") as f:   # binary mode for image data
        f.write(imgRq.content)
print("Picture download complete")

With the code above you can crawl the pictures from the Chongqing Technology and Business University website; change the web address to fetch pictures from other sites.

Topics: Python crawler search engine