Crawler 2: Python + BS4 + regular expressions grab Douban movie data 2.0

Posted by anoopmail on Mon, 17 Jan 2022 07:12:52 +0100

Preface

This post optimizes the code of crawler 1 from a few days ago: it adds a table style that centers the data, and finally reads the data back from the spreadsheet in tabular form.

1, Overview

Beautiful Soup transforms a complex HTML document into a tree structure in which every node is a Python object. The parser extracts each movie's `<div class="item">` block, and regular expressions then pull the required fields out of that block. The data is saved in a list, written to a spreadsheet, and finally read back to the output window for viewing. As before, we are crawling Douban movie data.
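As a quick illustration of this pipeline, here is a minimal sketch that runs BS4 and a regex over an inline snippet (the HTML below is a made-up example, not real Douban markup):

from bs4 import BeautifulSoup
import re

# Made-up HTML fragment for illustration only
html = '<div class="item"><em class="">1</em><img alt="Movie A"/></div>'
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all('div', class_='item'):
    item = str(item)  # regular expressions work on the raw HTML string
    print(re.findall(r'<em class="">(.*?)</em>', item))  # ['1']
    print(re.findall(r'<img alt="(.*?)"', item))         # ['Movie A']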

2, Usage steps

1. Import the libraries

Import the required libraries: openpyxl handles the spreadsheet, and re handles the regular expressions.

# -*- coding: GBK -*-
from bs4 import BeautifulSoup
from openpyxl.styles import Alignment
from openpyxl import Workbook, load_workbook
import requests
import re

# Define the regular expressions
findsort = re.compile(r'<em class="">(.*?)</em>')  # ranking
findhref = re.compile(r'<a href="(.*?)">')  # address
findname = re.compile(r'<img alt="(.*?)"')  # name
findsrc = re.compile(r'src="(.*?)"')  # poster
findinfo = re.compile(r'<p class="">(.*?)</p>', re.S)  # movie information
findring = re.compile(r'<span class="rating_num" property="v:average">(.*?)</span>')  # score
findjudge = re.compile(r'<span>(\d*)')  # number of comments
findquote = re.compile(r'<span class="inq">(\w*)')  # movie theme
datalist = []  # master list that collects one entry per movie
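Note the re.S flag on findinfo: the `<p class="">` block in the page source spans several lines, and without re.S the dot does not match newlines. A small demonstration:

import re

text = '<p class="">line one\nline two</p>'
print(re.findall(r'<p class="">(.*?)</p>', text))        # [] - '.' stops at the newline
print(re.findall(r'<p class="">(.*?)</p>', text, re.S))  # ['line one\nline two']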

2. Code analysis

1. Loop through the pages to get the movie data

def GetData(s):  # s: base URL to which the page offset is appended
    for u in range(10):  # 10 pages, 25 movies per page
        url = s + str(u * 25)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
        html = requests.get(url=url, headers=headers).content
        soup = BeautifulSoup(html, 'html.parser')
        for item in soup.find_all('div', class_='item'):  # returns a list of items
            item = str(item)
            sort = re.findall(findsort, item)
            href = re.findall(findhref, item)
            name = re.findall(findname, item)
            src = re.findall(findsrc, item)
            rating = re.findall(findring, item)
            judge = re.findall(findjudge, item)
            quote = re.findall(findquote, item)
            info = re.findall(findinfo, item)
            info = ' '.join(info[0].split())
            info = info.replace("<br/>", " ")
            src.append(info)  # keep poster and info together so the column order matches the header
            if len(quote) == 0:
                quote.append('No data')
            row = sort + name + href + src + rating + judge + quote  # avoid shadowing the built-in list
            datalist.append(row)
    return datalist

Analysis: each page loads only 25 movies, so the URL must be built by splicing in a loop; ten iterations cover all the movies. The movie-theme field is checked, and if it is empty, "No data" is filled in. All the field lists are then concatenated into one movie row, and each row is appended to the master list.
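For reference, a minimal call sketch; the base URL below is an assumption (Douban's Top 250 list is paged with a start query parameter, which GetData extends with offsets 0, 25, ..., 225):

# Hypothetical base URL; adjust if Douban's paging scheme differs
movies = GetData('https://movie.douban.com/top250?start=')
print(len(movies))  # expect 250 rows
print(movies[0])    # [ranking, name, website, poster, info, score, evaluators, theme]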

2. Save the master list to a spreadsheet

def SaveData(datalist):
    Save = Workbook()  # create the workbook object
    sheet = Save.active  # get the active sheet object
    sheet.title = 'Douban film'  # worksheet title
    sheet.append(['ranking', 'Movie name', 'Movie website', 'poster', 'Movie information', 'score', 'Number of evaluators', 'Movie theme'])
    # Set the column widths to fit the data
    sheet.column_dimensions['A'].width = 5
    sheet.column_dimensions['B'].width = 14
    sheet.column_dimensions['E'].width = 75
    sheet.column_dimensions['F'].width = 5
    for data in datalist:
        sheet.append(data)
    # Center the data in columns A and F of the table
    colA = sheet['A']
    for cell in colA:
        cell.alignment = Alignment(horizontal='center', vertical='center')
    colF = sheet['F']
    for cell in colF:
        cell.alignment = Alignment(horizontal='center', vertical='center')
    Save.save('Douban film top250-2.0.xlsx')

Analysis: set the column widths to match the data lengths, then center columns A and F both horizontally and vertically. Since sheet['A'] returns a tuple of cells, the style must be applied by iterating over each cell in a loop.
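Inside SaveData, the two per-column loops could also be collapsed into one; an equivalent sketch, assuming the same workbook layout:

# Equivalent centering of columns A and F in a single loop
for col_letter in ('A', 'F'):
    for cell in sheet[col_letter]:
        cell.alignment = Alignment(horizontal='center', vertical='center')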

3. Read the data back and print it to the output window

def ReadData():
    Write = load_workbook('Douban film top250-2.0.xlsx')
    table = Write.active
    rows = table.max_row
    cols = table.max_column
    for row in range(rows):
        for col in range(cols):
            data = table.cell(row + 1, col + 1).value  # cell indices are 1-based
            print(data, end=' ')
        print('')

Analysis: load_workbook loads the saved file, and the active worksheet is taken as the table object. max_row and max_column give the maximum number of rows and columns, which bound the traversal indices; a nested loop then prints the table data in tabular form.
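Putting the three steps together, a minimal driver sketch (the base URL is the same assumption as above):

if __name__ == '__main__':
    datalist = GetData('https://movie.douban.com/top250?start=')  # crawl 10 pages
    SaveData(datalist)  # write Douban film top250-2.0.xlsx
    ReadData()          # print the table back to the output window

As an aside, openpyxl's worksheet.iter_rows(values_only=True) yields cell values directly and would avoid the manual max_row/max_column indexing.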

Effect:

Spreadsheet: (screenshot omitted)

Output window: (screenshot omitted)

Topics: Python, crawler