Python Crawling Douban + Data Visualization

Posted by mtlhd on Tue, 21 Dec 2021 13:28:39 +0100

Blog text and source code download: Python Crawling Douban + Data Visualization

Preface

At my sister's suggestion, I took a look at Python web crawlers a while ago. I have to say that Python's syntax is concise and elegant, highly readable, close to natural language, and very well suited to programming beginners.

Before we start, let's explain what a crawler is:

A web crawler, also known as a spider, is a web robot used to automatically browse the World Wide Web. - Wikipedia

In short, a crawler is a program or script that replaces manually browsing web pages and extracting information from them. The extracted information is usually stored and analyzed to obtain something of value.

Crawlers are nothing new, and almost any programming language can write one, but Python is especially beginner-friendly: its concise syntax and rich third-party libraries make it possible to write crawlers quickly and efficiently. Below, we use the Douban Movie Top250 page as an example to show how to crawl with Python and visualize the data.

Implementation Steps

1. HTTP Requests

Import the third-party library requests and call requests.get() to send a GET request to the Douban Movie Top250 page and obtain the HTML of the response.
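A minimal sketch of this step (the URL and the browser User-Agent header are the same ones used in the full script later in this post):

import requests

# Send a GET request to the Douban Movie Top250 page;
# a browser-like User-Agent makes the request look like a normal visit
url = "https://movie.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"}
html = requests.get(url=url, headers=headers).text
print(html[:200])  # print the first 200 characters of the response HTML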

2. Data Extraction

Open the Douban Movie Top250 page in a browser, enter developer mode, and copy the XPath of the node to be grabbed. Import etree from the third-party library lxml, call etree.HTML(html) to convert the HTML into an element object, and then use the element's xpath() method to get the text of the target node (PS: regular expressions can also be used to match the text).
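As a quick illustration, a sketch that extracts the first movie's title using the same XPath that appears in the full parser later in the post (the html variable is assumed to hold the page source from the previous step):

from lxml import etree

# Convert the HTML string into an element object and extract the first movie's title
selector = etree.HTML(html)
title = selector.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]
print(title)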

3. Data Storage

Import the third-party library openpyxl and call openpyxl.Workbook() to create a new workbook. sheet = workbook.active gets a reference to the active worksheet, sheet["A1"] = value writes data to a specified cell of the Excel table, and finally workbook.save("filename.xlsx") saves the data. A database is not used here so that the data can be viewed easily without any programming background.
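A minimal sketch of writing and saving a workbook with openpyxl (the file name movies.xlsx matches the one used later in the post):

import openpyxl

workbook = openpyxl.Workbook()   # create a new workbook
sheet = workbook.active          # get a reference to the active worksheet
sheet["A1"] = "Name"             # write a value to the specified cell
workbook.save("movies.xlsx")     # save the workbook to disk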

4. Continue Crawling

Repeat the above steps based on the URL parameters of the page being crawled. The Douban Movie Top250 page takes two parameters, start and filter. start indicates the ranking at which the page begins, and each page shows 25 movies, so the first page uses start=0, the second start=25, and so on. filter is used for filtering and is left unused here. To request the next page, simply increase the GET parameter start by 25.
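A sketch of paging through the ranking by passing the start parameter to requests.get(), reusing the url and headers variables from the first sketch:

# start = 0, 25, 50, ..., 225 covers the ten pages of the Top250
for page in range(10):
    params = {"start": page * 25, "filter": ""}
    html = requests.get(url=url, params=params, headers=headers).text
    # ... parse and store the 25 movies on this page ...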

5. Data Visualization

Call openpyxl.load_workbook("filename.xlsx") to open the Excel table that holds the data. sheet = workbook.active gets a reference to the worksheet, and data = sheet["A1"].value reads the data at a specified cell. Then import the third-party library pyecharts and, following its documentation, call the appropriate APIs to generate charts for data visualization.
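A minimal sketch of this step, assuming the data was saved as movies.xlsx; the pyecharts calls mirror the ones used in the full script below, while the axis values here are just placeholders:

import openpyxl
from pyecharts import options as opts
from pyecharts.charts import Bar

workbook = openpyxl.load_workbook("movies.xlsx")  # open the saved Excel table
sheet = workbook.active
first_score = sheet["D2"].value                   # read the data at a specified cell

# Draw a simple bar chart and render it to an HTML file
bar = Bar()
bar.add_xaxis(["9.7", "9.6", "9.5"])              # placeholder x-axis categories
bar.add_yaxis("Number of movies", [1, 3, 5])      # placeholder series values
bar.set_global_opts(title_opts=opts.TitleOpts(title="Example chart"))
bar.render("example.html")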

Crawler Development

Create a project named Crawler and install the third-party libraries that will be used:

pip install requests
pip install lxml
pip install openpyxl
pip install pyecharts

Next, create a new file html_parser.py in the project directory:

from lxml import etree


class HTMLParser:  # HTML parsing class
    def __init__(self):
        pass

    # Return the data of the movie with the given ordinal number on the Douban Movie Top250 page
    def parse(self, html, number):
        movie = {"title": "", "actors": "", "classification": "", "score": "", "quote": ""}
        selector = etree.HTML(html)
        # Common XPath prefix of the list item for this movie
        base = '//*[@id="content"]/div/div[1]/ol/li[' + str(number) + ']/div/div[2]'
        movie["title"] = selector.xpath(base + '/div[1]/a/span[1]/text()')[0]
        movie["actors"] = selector.xpath(base + '/div[2]/p[1]/text()[1]')[0]
        movie["classification"] = selector.xpath(base + '/div[2]/p[1]/text()[2]')[0]
        movie["score"] = selector.xpath(base + '/div[2]/div/span[2]/text()')[0]
        # Use the quote if it exists, otherwise leave it empty
        quote = selector.xpath(base + '/div[2]/p[2]/span/text()')
        movie["quote"] = quote[0] if quote else ""
        return movie

This module encapsulates a method that parses the Douban Movie Top250 page and extracts the movie data.

Then create a new file, excel_handler.py:

import openpyxl


class ExcelHandler:  # Excel file processing class
    def __init__(self):
        self.book = None
        self.sheet = None

    # Get the workbook and get references to the worksheet
    def startHandleExcel(self):
        self.book = openpyxl.Workbook()
        self.sheet = self.book.active

    # Add data to specified rows of columns A, B, C, D, E
    def handleExcel(self, row, A, B, C, D, E):
        self.sheet["A" + str(row)] = str(A).strip()
        self.sheet["B" + str(row)] = str(B).strip()
        self.sheet["C" + str(row)] = str(C).strip()
        self.sheet["D" + str(row)] = str(D).strip()
        self.sheet["E" + str(row)] = str(E).strip()
        return True

    # Save excel after processing
    def endHandleExcel(self, fileName):
        self.book.save(fileName)

    # Read excel and get a reference to the worksheet
    def startReadExcel(self, fileName):
        self.book = openpyxl.load_workbook(fileName)
        self.sheet = self.book.active

    # Read data from excel specified location
    def readExcel(self, coordinate):
        return str(self.sheet[coordinate].value)

    # Read complete
    def endReadExcel(self):
        pass

This module encapsulates how data is written to and read from the Excel table.

Next, write the following in main.py:

import requests
import random
import time

from html_parser import HTMLParser
from excel_handler import ExcelHandler
from pyecharts import options as opts
from pyecharts.charts import Bar


def crawling():  # Crawl Douban Movie Top250 Function
    # Define the rows of the excel table being processed
    excelRow = 1
    # Instantiate excel processing class
    excelHandler = ExcelHandler()
    # Begin processing excel
    excelHandler.startHandleExcel()
    # Add header row to excel table
    excelHandler.handleExcel(excelRow, "Name", "Actors", "Classification", "Score", "Quote")
    # Processed Line+1
    excelRow += 1

    # Define the address of the Top250 page of the Douban movie and the user-agent used (masquerading as a normal browser)
    url = "https://movie.douban.com/top250"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                             "Chrome/89.0.4389.82 Safari/537.36"}
    # Instantiate Page Resolution Class
    htmlParser = HTMLParser()
    print("Start crawling the Douban movie Top250...")
    # Loop through the ten pages of the Douban Movie Top250
    for page in range(10):
        # Define the URL parameters
        param = {"start": page * 25, "filter": ""}
        # Initiate GET Request
        response = requests.get(url=url, params=param, headers=headers).text
        # Save the information of the 25 movies on the requested page to Excel
        for index in range(25):
            print("Processing page " + str(page + 1) + ", movie " + str(index + 1) + "...")
            # Parse the information of one movie
            movie = htmlParser.parse(response, index + 1)
            # Save parsed results in excel
            excelHandler.handleExcel(excelRow, movie["title"], movie["actors"], movie["classification"], movie["score"],
                                     movie["quote"])
            # Processed Line+1
            excelRow += 1
        print("No." + str(page + 1) + "Page crawl complete!")
        # Wait 5-20 seconds and crawl again to simulate human action
        time.sleep(random.randint(5, 20))
    # Save the Excel file
    excelHandler.endHandleExcel("movies.xlsx")
    print("Douban movie Top250 Crawl complete!")


def getCharts():  # Draw score data graph function
    # Dictionary defining ratings
    scoreLevel = {}
    # Instantiate excel processing class
    excelHandler = ExcelHandler()
    # Start reading excel table
    excelHandler.startReadExcel("movies.xlsx")
    print("Start reading excel Score column in table...")
    # Loop through excel table score columns
    for row in range(250):
        # Read score column from excel table as key of dictionary
        key = excelHandler.readExcel("D" + str(row + 2))
        # +1 if the key exists
        if key in scoreLevel:
            scoreLevel[key] += 1
        # Otherwise, initialize the key with a value of 1
        else:
            scoreLevel[key] = 1
    # End of reading excel
    excelHandler.endReadExcel()

    # Define a list to represent all keys (i.e. ratings) in the scoreLevel dictionary
    keys = []
    # Define a list to represent all values (that is, the number of scores) in the scoreLevel dictionary
    values = []
    # Extract key and value of scoreLevel
    for key in scoreLevel:
        keys.append(key)
        values.append(scoreLevel[key])
    print("Read rating data complete!")

    print("Start plotting score data...")
    # Draw Column Chart
    c = (
        Bar()
        .add_xaxis(keys)
        .add_yaxis("Number of movies", values)
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Douban Movie Top250 Ratings"),
            toolbox_opts=opts.ToolboxOpts(),
            legend_opts=opts.LegendOpts(is_show=False),
        )
        .render("movies_score.html")
    )
    print("Score data graph drawing complete!")


# Crawl Douban Movie TOP250
crawling()

# Map rating data
getCharts()

The two functions in main.py respectively crawl the Douban Movie Top250 and store the data, and then read the data back to produce the visualization. The process is detailed in the comments.

Source code

To download the source code, visit the original blog post: Python Crawling Douban + Data Visualization

Related Links

requests

lxml

openpyxl

pyecharts

Basic usage of the Requests library for Python web crawlers

python-lxml.etree parses html

Openpyxl Tutorial

Finally

Python also has many excellent crawler frameworks that you can explore on your own, for example:

Eight of the most efficient Python crawler frameworks, how many have you used?

Finally, a crawler cannot crawl just anything. As the saying goes, "write crawlers well, and you'll end up in prison early." Before crawling any content, be sure to check the target site's crawling rules, such as its robots protocol (robots.txt), and other related specifications and restrictions.
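For example, a minimal sketch of checking a site's robots protocol with Python's standard library urllib.robotparser before crawling (the "*" user agent and the URLs here are just illustrative):

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt, then ask whether a URL may be fetched
rp = RobotFileParser()
rp.set_url("https://movie.douban.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://movie.douban.com/top250"))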

Topics: Python Excel crawler data visualization xpath