Blog Post and Source Code Download: Python Crawls Douban Movie TOP250 + Data Visualization
Preface
At my sister's invitation, I looked into Python web crawlers a while ago. I have to say that Python's syntax is concise and elegant: readable, close to natural language, and very well suited to programming beginners.
Before we start, let me explain what a crawler is:
A web crawler, also known as a spider, is a web robot used to automatically browse the World Wide Web. - Wikipedia
In other words, a crawler is a program or script that replaces manual browsing of web pages and extracts information from them. The extracted information is usually stored and analyzed to obtain something of value.
Crawlers are nothing new, and almost any programming language can implement one, but Python's concise syntax and rich set of third-party libraries make it especially friendly for beginners and well suited to writing crawlers quickly and efficiently. Below, crawling the Douban Movie TOP250 page serves as an example of how to use Python for crawling and data visualization.
Implementation Steps
1. HTTP Requests
Import the third-party library requests and call requests.get() to send a GET request to the Douban Movie TOP250 page and obtain the HTML of the response.
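A minimal sketch of this step (the User-Agent header here is a shortened placeholder; the full browser string used in the project appears later in main.py):

import requests

url = "https://movie.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder UA; a full browser UA is used in main.py

response = requests.get(url=url, headers=headers)  # send the GET request
html = response.text                               # HTML of the response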
2. Data Extraction
Open the Douban Movie TOP250 page in a browser, enter developer mode, and copy the XPath of the node you want to extract. Import etree from the third-party library lxml, call etree.HTML(html) to convert the HTML string into an element object, and then call element.xpath(xpath) to get the text of the target node (regular-expression matching would also work).
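For example, a small sketch that extracts the title of the first movie on the page, using the same XPath that appears later in html_parser.py (assuming the request succeeds):

import requests
from lxml import etree

html = requests.get("https://movie.douban.com/top250",
                    headers={"User-Agent": "Mozilla/5.0"}).text
selector = etree.HTML(html)
# XPath copied from the browser's developer tools (li[1] = the first movie on the page)
title = selector.xpath(
    '//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]
print(title)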
3. Data Storage
Import the third-party library openpyxl, call openpyxl.Workbook() to create a new workbook, use sheet = workbook.active to get a reference to the active worksheet, write data to a given cell with sheet["A1"] = value, and finally call workbook.save("excel file name") to save the data. No database is used here, so that readers without a programming background can easily view the data.
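The corresponding openpyxl calls, sketched minimally (the file name movies.xlsx matches the one used later in the project):

import openpyxl

workbook = openpyxl.Workbook()   # create a new workbook
sheet = workbook.active          # reference to the active worksheet
sheet["A1"] = "Name"             # write a value to a specific cell
workbook.save("movies.xlsx")     # save the workbook to disk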
4. Continue Crawling
Repeat the steps above according to the URL parameters of the page being crawled. The Douban Movie TOP250 page, for example, has two parameters, start and filter. start is the ranking offset at which the page begins, and each page lists 25 movies; that is, the first page has start=0, the second page start=25, and so on. filter is used for filtering and is left unused here. To request the next page, simply increase the start parameter by 25.
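Putting the paging rule into code, roughly (url and headers as in step 1, with a placeholder User-Agent):

import requests

url = "https://movie.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder UA

for page in range(10):
    param = {"start": page * 25, "filter": ""}  # start = 0, 25, 50, ..., 225
    html = requests.get(url=url, params=param, headers=headers).text
    # ... extract and store the 25 movies on this page ...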
5. Data Visualization
Call openpyxl.load_workbook("excel file name") to open the excel file that holds the data, use sheet = workbook.active to get a reference to the worksheet, and read the data at a given cell with data = sheet["A1"].value. Then import the third-party library pyecharts and, following its documentation, call the appropriate APIs to generate charts for data visualization.
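A rough sketch of reading a cell back and rendering a bar chart with pyecharts (it assumes the movies.xlsx file from step 3 exists; the axis data is made up purely for illustration):

import openpyxl
from pyecharts import options as opts
from pyecharts.charts import Bar

book = openpyxl.load_workbook("movies.xlsx")  # open the saved workbook
sheet = book.active                           # reference to the worksheet
data = sheet["A1"].value                      # read a specific cell

# Render a simple bar chart to an HTML file (dummy data for illustration)
Bar() \
    .add_xaxis(["9.0", "9.5"]) \
    .add_yaxis("count", [3, 5]) \
    .set_global_opts(title_opts=opts.TitleOpts(title="demo")) \
    .render("demo.html")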
Crawler Development
Create a project named Crawler and install the third-party libraries that will be used:
pip install requests
pip install lxml
pip install openpyxl
pip install pyecharts
Next, create a new file html_parser.py in the project directory:
from lxml import etree


class HTMLParser:
    # HTML parsing class
    def __init__(self):
        pass

    # Return the data of the movie with the given serial number on the Douban Movie Top250 page
    def parse(self, html, number):
        movie = {"title": "", "actors": "", "classification": "", "score": "", "quote": ""}
        selector = etree.HTML(html)
        movie["title"] = selector.xpath(
            '//*[@id="content"]/div/div[1]/ol/li[' + str(number) + ']/div/div[2]/div[1]/a/span[1]/text()')[0]
        movie["actors"] = selector.xpath(
            '//*[@id="content"]/div/div[1]/ol/li[' + str(number) + ']/div/div[2]/div[2]/p[1]/text()[1]')[0]
        movie["classification"] = selector.xpath(
            '//*[@id="content"]/div/div[1]/ol/li[' + str(number) + ']/div/div[2]/div[2]/p[1]/text()[2]')[0]
        movie["score"] = selector.xpath(
            '//*[@id="content"]/div/div[1]/ol/li[' + str(number) + ']/div/div[2]/div[2]/div/span[2]/text()')[0]
        # Use the quote if it exists, otherwise leave it empty
        movie["quote"] = selector.xpath(
            '//*[@id="content"]/div/div[1]/ol/li[' + str(number) + ']/div/div[2]/div[2]/p[2]/span/text()')[0] if len(
            selector.xpath('//*[@id="content"]/div/div[1]/ol/li[' + str(
                number) + ']/div/div[2]/div[2]/p[2]/span/text()')) > 0 else ""
        return movie
This module encapsulates a method that parses the Douban Movie TOP250 page and extracts the data for a given movie.
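Used on its own, the parser might look roughly like this (a quick check, assuming the request succeeds):

import requests
from html_parser import HTMLParser

html = requests.get("https://movie.douban.com/top250",
                    headers={"User-Agent": "Mozilla/5.0"}).text
parser = HTMLParser()
movie = parser.parse(html, 1)          # data for the first movie on the page
print(movie["title"], movie["score"])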
Then create a new file, excel_handler.py:
import openpyxl


class ExcelHandler:
    # excel file processing class
    __book = None
    __sheet = None

    def __init__(self):
        self.book = None
        self.sheet = None

    # Create a workbook and get a reference to the worksheet
    def startHandleExcel(self):
        self.book = openpyxl.Workbook()
        self.sheet = self.book.active

    # Add data to the specified row of columns A, B, C, D, E
    def handleExcel(self, row, A, B, C, D, E):
        self.sheet["A" + str(row)] = str(A).strip()
        self.sheet["B" + str(row)] = str(B).strip()
        self.sheet["C" + str(row)] = str(C).strip()
        self.sheet["D" + str(row)] = str(D).strip()
        self.sheet["E" + str(row)] = str(E).strip()
        return True

    # Save the excel file after processing
    def endHandleExcel(self, fileName):
        self.book.save(fileName)

    # Open the excel file and get a reference to the worksheet
    def startReadExcel(self, fileName):
        self.book = openpyxl.load_workbook(fileName)
        self.sheet = self.book.active

    # Read data from the specified location in the excel file
    def readExcel(self, coordinate):
        return str(self.sheet[coordinate].value)

    # Reading complete
    def endReadExcel(self):
        pass
This module encapsulates the writing and reading of the excel file.
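A short usage sketch of the class (the second row below contains placeholder values, not real data):

from excel_handler import ExcelHandler

handler = ExcelHandler()
handler.startHandleExcel()  # create a new workbook
handler.handleExcel(1, "Name", "performer", "classification", "score", "Introduction")  # header row
handler.handleExcel(2, "some title", "some actors", "some genres", "9.9", "some quote")  # placeholder row
handler.endHandleExcel("movies.xlsx")  # save to disk

handler.startReadExcel("movies.xlsx")  # reopen the file
print(handler.readExcel("A2"))         # read back a cell
handler.endReadExcel()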
Next, write the following in main.py:
import requests
import random
import time
from html_parser import HTMLParser
from excel_handler import ExcelHandler
from pyecharts import options as opts
from pyecharts.charts import Bar


def crawling():
    # Crawl the Douban Movie Top250
    # Row of the excel table currently being processed
    excelRow = 1
    # Instantiate the excel processing class
    excelHandler = ExcelHandler()
    # Begin processing excel
    excelHandler.startHandleExcel()
    # Add the header row to the excel table
    excelHandler.handleExcel(excelRow, "Name", "performer", "classification", "score", "Introduction")
    # Processed row +1
    excelRow += 1
    # Address of the Douban Movie Top250 page and the User-Agent used (masquerading as a normal browser)
    url = "https://movie.douban.com/top250"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                             "Chrome/89.0.4389.82 Safari/537.36"}
    # Instantiate the page parsing class
    htmlParser = HTMLParser()
    print("Start crawling the Douban Movie Top250...")
    # Loop over the ten pages of the Douban Movie Top250
    for page in range(10):
        # Define the URL parameters
        param = {"start": page * 25, "filter": ""}
        # Send the GET request
        response = requests.get(url=url, params=param, headers=headers).text
        # Save the 25 movies on the requested page in excel
        for list in range(25):
            print("Processing page " + str(page + 1) + ", movie " + str(list + 1) + "...")
            # Parse the information of one movie
            movie = htmlParser.parse(response, list + 1)
            # Save the parsed result in excel
            excelHandler.handleExcel(excelRow, movie["title"], movie["actors"], movie["classification"],
                                     movie["score"], movie["quote"])
            # Processed row +1
            excelRow += 1
        print("Page " + str(page + 1) + " crawled!")
        # Wait 5-20 seconds before crawling again to simulate human behavior
        time.sleep(random.randint(5, 20))
    # Save the excel file
    excelHandler.endHandleExcel("movies.xlsx")
    print("Douban Movie Top250 crawl complete!")


def getCharts():
    # Draw the score distribution chart
    # Dictionary of score levels
    scoreLevel = {}
    # Instantiate the excel processing class
    excelHandler = ExcelHandler()
    # Start reading the excel table
    excelHandler.startReadExcel("movies.xlsx")
    print("Start reading the score column of the excel table...")
    # Loop through the score column of the excel table
    for row in range(250):
        # Read the score column from the excel table as the key of the dictionary
        key = excelHandler.readExcel("D" + str(row + 2))
        # +1 if the key exists
        if key in scoreLevel:
            scoreLevel[key] += 1
        # Otherwise, initialize the key with a value of 1
        else:
            scoreLevel[key] = 1
    # End of reading excel
    excelHandler.endReadExcel()
    # List of all keys (i.e. scores) in the scoreLevel dictionary
    keys = []
    # List of all values (i.e. number of movies with each score) in the scoreLevel dictionary
    values = []
    # Extract the keys and values of scoreLevel
    for key in scoreLevel:
        keys.append(key)
        values.append(scoreLevel[key])
    print("Reading score data complete!")
    print("Start plotting the score data...")
    # Draw the bar chart
    c = (
        Bar()
        .add_xaxis(keys)
        .add_yaxis("Number of movies per score", values)
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Douban Movie Top250 Scores"),
            toolbox_opts=opts.ToolboxOpts(),
            legend_opts=opts.LegendOpts(is_show=False),
        )
        .render("movies_score.html")
    )
    print("Score data chart complete!")


# Crawl the Douban Movie TOP250
crawling()
# Draw the rating data chart
getCharts()
The two functions in main.py implement, respectively, crawling and storing the Douban Movie TOP250 data, and reading it back for data visualization. The process is detailed in the comments.
Source code
To download the source code, visit the original blog post: Python Crawls Douban Movie TOP250 + Data Visualization
Related Links
Basic usage of the Requests library for Python web crawlers
Finally
Python has many excellent crawler frameworks that you can explore on your own if you are interested, for example:
Eight of the most efficient Python crawler frameworks, how many have you used?
Finally, a crawler cannot crawl just anything. As the saying goes, "write crawlers well, and you'll be eating prison food early." Before crawling any content, be sure to review the target site's relevant specifications and restrictions.