Crawling the Douban Movie Top250 and Analyzing the Data

Posted by Nommy on Sun, 02 Jan 2022 21:19:44 +0100

Catalogue

1. Crawling
    1. Crawl the main page
    2. Finding elements with BeautifulSoup
    3. Crawl each movie's information
2. Data analysis
    1. Analysis of release year
        Draw a histogram
        Draw a pie chart
        Draw a line chart
    2. Analysis of film genres
        Draw a word cloud
        Line chart of one genre over time
    3. Analyzing actors and directors
        Top ten actors
        Analysis of one actor's film scores
3. Complete code
    1. Crawler code
    2. Data analysis code

We crawl the Douban Movie Top250, write the results to Excel, and then analyze the data.

Libraries used: requests, BeautifulSoup, pandas

1. Crawling

The approach: first crawl the URL of each movie from the main pages (25 movies per page, 10 pages in total), then visit each movie's page in turn and crawl its details.

1. Crawl the main page

Before crawling, you need to prepare the headers and the URL.

The headers mainly consist of a User-Agent, which tells the HTTP server the name and version of the client's operating system and browser. For most ordinary websites, a User-Agent alone is enough.

The URL of the first page is https://movie.douban.com/top250?start=0&filter= ; each later page's URL is the previous one with the start parameter increased by 25.
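The ten page URLs follow directly from that pattern (a minimal sketch):

    for start in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start={}&filter='.format(start)
        print(url)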

The next step is to visit the website with requests. requests has two common methods, get and post; the browser's developer tools show which one a page uses. This page uses get.

Let's visit the website. If the output status code is 200, the access succeeded; otherwise, look up the cause by its status code in any list of HTTP status codes.

response = requests.get(url=url, headers=headers)
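For example (a small sketch), you can check the status code before parsing; requests also provides raise_for_status(), which raises an exception for error responses:

    print(response.status_code)     # 200 means the request succeeded
    response.raise_for_status()     # raises requests.HTTPError for 4xx/5xx responses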

The next step is to parse the page with BeautifulSoup.

soup = BeautifulSoup(response.text, 'html.parser')

html.parser used here is one of the four supported parsers and a common choice. The resulting soup object represents the parsed HTML, so we can use it to find the attributes we need.

Once you get this far, you have completed half the crawler's work.

2. Finding elements with BeautifulSoup

As the saying goes, sharpening the axe does not delay cutting the firewood. Before searching for elements, let's first go over BeautifulSoup's methods for finding them.

Below are the main ways to use soup. The powerful CSS selector can cover most of our needs. Let's focus on the following:

        1. find_all

The format is soup.find_all('tag', attrs={'attribute name': 'attribute value'}). This form covers most cases and is worth memorizing!

For example, to crawl the movie names on the home page: the tag is span, the attribute name is class, and the attribute value is title, so find_all applies directly.

    list_name = soup.find_all('span',attrs={'class':'title'})   # The result is a list
    for i in list_name:
        print(i.get_text())         # If the type is Tag, you need to use get_text()

Part of the output shows that every movie title on the page sits in a span tag whose class attribute is title. That is what find_all is for: finding all the matching content in the page.

        2. find

find_all finds all matching content, so, as you might guess, find returns only the first match.

    name = soup.find('span',attrs={'class':'title'})
    print(name.get_text())
    print(name.string)      # .string gives the same result as get_text() here

You can also pass a string argument to find or find_all, which matches only elements whose text is exactly that string, for example:

    name = soup.find('span',attrs={'class':'title'},string='The Shawshank Redemption')
    print(name.get_text())
    print(name.string)      # .string gives the same result as get_text() here

        3. select

select offers two general ways to locate content: copy the element's selector from the browser, or write the tag yourself.

For the first way, right-click the content you want in the browser's developer tools and copy its selector, as below:

    names = soup.select('#content > div > div.article > ol > li:nth-child(1) > div > div.info > div.hd > a > span:nth-child(1)')
    for name in names:
        print(name.get_text())

The result is only the first title of the first movie on each page. A copied selector pinpoints exactly one element, so to crawl everything we must delete part of the location: removing :nth-child(1) makes the selector match all the movies. After the deletion, the output contains every title.

The second way to use select is similar to find. The key point is that a class is written with a dot (.) and an id with a hash (#). We don't need the full path either; identifying the right tag is enough.

All the movie titles on the page are "title" spans, so the following finds them all:

    names = soup.select('span.title')
    for name in names:
        print(name.get_text())

Summary: find_all and select are the two main ways of finding elements in BeautifulSoup. Pay attention to the tag of the content you want to crawl, and to the strengths and weaknesses of each method.

Now we officially start crawling the element information.

First, crawl the URL of each movie from every main page; afterwards we can analyze each movie on its own page.

Find the a tag whose href holds the movie URL. Here we copy the selector and crawl with it. Analysis: find_all would return many unnecessary hrefs, which is troublesome, so we locate the elements precisely instead:

    url_mv_list = soup.select('#content > div > div.article > ol > li > div > div.info > div.hd > a')
    print(url_mv_list)
    

Printing url_mv_list gives a list in which each element holds one movie's link, so we just need to read each element's href:

    for index_url in range(len(url_mv_list)):
        url_mv = url_mv_list[index_url]['href']
        list_url_mv.append(url_mv)
        print(url_mv)

The result is each movie's URL.

3. Crawl each movie's information

Next, visit each movie's page, analyze it, and crawl the elements. The method is the same as for the main page, so the code can be pasted directly.

Before that, we need to settle on an output format, because the results will finally be written to Excel. The approach is to turn the data into a DataFrame and write that to Excel. A small example shows the conversion well:

import pandas as pd

a = [['a', '1', '2'], ['b', '3', '4'], ['c', '5', '6']]
df = pd.DataFrame(a, columns=['pan', 'panda', 'fan'])
print(df)

The printed DataFrame looks like:

  pan panda fan
0   a     1   2
1   b     3   4
2   c     5   6

Following this example, we make each movie's information a list returned from a function, append each return value to one big list, and that list of lists converts straight into a DataFrame.

# Process each movie
def loading_mv(url,number):
    list_mv = []        # each movie's crawled information is collected here
    print('-----Processing movie {}-----'.format(number+1))
    list_mv.append(number+1)        # ranking
    
    # Parsing web pages
    response_mv = requests.get(url=url,headers=headers)
    soup_mv = BeautifulSoup(response_mv.text,'html.parser')

    # Crawling movie title
    mv_name = soup_mv.find_all('span',attrs={'property':'v:itemreviewed'})      # Movie name
    mv_name = mv_name[0].get_text()
    list_mv.append(mv_name)
    # print(mv_name)
    
    # Crawl the movie's release year
    mv_year = soup_mv.select('span.year')       # release year, e.g. '(1994)'
    mv_year = mv_year[0].get_text()[1:5]        # strip the surrounding parentheses
    list_mv.append(mv_year)
    # print(mv_year)
    
    # Crawling director information
    list_mv_director = []       # director
    mv_director = soup_mv.find_all('a',attrs={'rel':"v:directedBy"})
    for director in mv_director:
        list_mv_director.append(director.get_text())
    string_director = '/'.join(list_mv_director)        # Redefine format
    list_mv.append(string_director)
    # print(list_mv_director)
    
    # Crawl the starring actors
    list_mv_star = []           # starring
    mv_star = soup_mv.find_all('span',attrs={'class':'actor'})
    if mv_star == []:           # movie No. 210 has no starring credit
        list_mv.append(None)
    else :
        mv_star = mv_star[0].get_text().strip().split('/')
        mv_first_star = mv_star[0].split(':')
        list_mv_star.append(mv_first_star[-1].strip())
        del mv_star[0]           # remove the 'starring:' label
        for star in  mv_star:
            list_mv_star.append(star.strip())
        string = '/'.join(list_mv_star)          # Redefine format
        list_mv.append(string)

    # Crawl the movie genres
    list_mv_type = []       # genres
    mv_type = soup_mv.find_all('span',attrs={'property':'v:genre'})
    for type in mv_type:
        list_mv_type.append(type.get_text())
    string_type = '/'.join(list_mv_type)
    list_mv.append(string_type)
    # print(list_mv_type)

    # Crawl movie ratings
    mv_score = soup_mv.select('strong.ll.rating_num')       # score
    mv_score = mv_score[0].get_text()
    list_mv.append(mv_score)
    
    # Crawl the number of raters
    mv_evaluation_num = soup_mv.select('a.rating_people')       # number of raters
    mv_evaluation_num = mv_evaluation_num[0].get_text().strip()
    list_mv.append(mv_evaluation_num)

    # Crawl the plot synopsis
    mv_plot = soup_mv.find_all('span',attrs={"class":"all hidden"})     # plot synopsis
    if mv_plot == []:
         list_mv.append(None)
    else:
        string_plot = mv_plot[0].get_text().strip().split()
        new_string_plot = ' '.join(string_plot)
        list_mv.append(new_string_plot)

    # Append the movie's URL
    list_mv.append(url)

    return list_mv

With the per-movie crawling function defined, we start calling it.

First create a list called list_all_mv to store the function's return values, i.e. each movie's information:

list_all_mv = []

for number in range(len(list_url_mv)):
    mv_info = loading_mv(list_url_mv[number],number)
    list_all_mv.append(mv_info)
print('-----End of operation-----')

df_mv = DataFrame(list_all_mv,columns=['Film ranking','Movie name','Release time','director','to star','Film type','Film rating','Number of evaluators','Film introduction','Movie link'])
# print(df_mv)

df_mv.to_excel(r'C:\Users\86178\Desktop\Douban film Top250.xlsx')
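Note that to_excel also writes the DataFrame's integer index as the first column of the sheet; pass index=False to to_excel if you do not want it there.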

Finally we obtain the Douban Top250 Excel table.

Note: when too many requests come from the same IP, the website may block that IP, producing an error similar to:

HTTPSConnectionPool(host='movie.douban.com', port=443): Max retries exceeded with url: /subject/1292052/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002B3FBAB05C0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

Solutions:

1. Switch to a different WiFi network. Different networks have different external IPs, so when your IP is blocked, changing networks gives you a new one.

2. Build a proxy IP pool. The principle is to find some usable IPs and pass one to requests, i.e. specify which IP visits the website.

The proxy format is {'http': 'http://IP:port'}, for example {'http': 'http://119.14.253.128:8088'}.

response = requests.get(url=url,headers=headers,proxies=ip,timeout=3)      # give up if the server does not respond within 3 seconds
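A minimal sketch of rotating through a small proxy pool with retries (the helper name get_with_proxy is illustrative, and the pool entries are placeholders that must be replaced with live proxies):

    import random
    import requests

    proxy_pool = [
        {'http': 'http://119.14.253.128:8088'},     # placeholder proxies; replace with live ones
        {'http': 'http://119.14.253.129:8088'},
    ]

    def get_with_proxy(url, headers, retries=3):
        for _ in range(retries):                    # try a different random proxy each time
            ip = random.choice(proxy_pool)
            try:
                return requests.get(url=url, headers=headers, proxies=ip, timeout=3)
            except requests.exceptions.RequestException:
                continue                            # this proxy failed, try another
        return None                                 # all retries failed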

There are many websites offering free proxy IPs:

Free Proxy IP http://ip.yqie.com/ipproxy.htm
66 Free Proxy http://www.66ip.cn/
89 Free Proxy http://www.89ip.cn/
Data5U Proxy http://www.data5u.com/
Cloud Proxy http://www.ip3366.net/
Kuaidaili Free Proxy https://www.kuaidaili.com/free/
SuperFast Proxy http://www.superfastip.com/
Xici Proxy IP https://www.xicidaili.com/wt/
Xiaoshu Proxy http://www.xsdaili.com
Xila Free Proxy IP http://www.xiladaili.com/
Xiaohuan HTTP Proxy https://ip.ihuan.me/
Goubanjia Proxy IP http://www.goubanjia.com/
Feilong Proxy IP http://www.feilongip.com/

Building a proxy IP pool means crawling the IPs and ports from these sites and putting them into the standard format above. I will publish a separate blog post on building a proxy IP pool.

2. Data analysis

As the saying goes, data is money. Once we have the data, further analysis makes it far more useful. Reading Excel works like writing it: the result is a DataFrame.

import pandas as pd

def excel_to_dataframe(excel_path):
    df = pd.read_excel(excel_path,keep_default_na=False)        # keep_default_na=False yields '' instead of NaN for empty cells
    return df
excel_path = r'C:\Users\86178\Desktop\Douban film Top250.xlsx'
data_mv = excel_to_dataframe(excel_path)
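A quick sanity check of the import (a small sketch):

    print(data_mv.shape)                    # expect 250 rows
    print(data_mv.columns.tolist())         # the column names written above
    print(data_mv.head())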

Below we process the crawled fields in turn: release year, film genre, and starring actor or director.

1. Analysis of release year
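The yearly counts behind the three charts below can also be computed directly from the DataFrame (a sketch using collections.Counter; the complete code below builds the same counts with a plain dict):

    from collections import Counter

    year_counts = Counter(data_mv['Release time'])
    print(year_counts.most_common(5))       # the years with the most listed films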

Draw a histogram

Draw a pie chart

Draw a line chart

From the three charts above:

1. The listed films are concentrated mainly around 1993-2016.

2. 1994, 2004 and 2010 each saw 12 or more listed films released.

3. It cannot be concluded that the more recent the year, the more films make the list.

2. Analysis of film genres

Draw a word cloud

Count all the movie genres and draw a word cloud from the counts.

Line chart of one genre over time

Analysis of the 'Drama' genre

Analysis of the 'Sci-Fi' genre

From the charts above we can conclude:

1. The 'Drama' genre has always been loved, peaking at 12 films in 1994. Combined with the time analysis in the previous step, the listed films released in 1994 appear to be all 'Drama', and they remain classics today.

2. 'Sci-Fi' films could not achieve such results in the early years, when technology was less developed; as time passed and technology advanced, 'Sci-Fi' films flourished.

3. Analyzing actors and directors

Actors and directors are ranked mainly by film score: a film on the Top250 list is publicly recognized, so we roughly rank each actor by the total score of their listed films.
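The same ranking idea in a few lines of pandas (a hedged sketch, not the method used below; the complete code builds the ranking with plain dicts, and the column names match the Excel written earlier):

    stars = data_mv.assign(actor=data_mv['to star'].str.split('/')).explode('actor')
    stars = stars[stars['actor'] != '']                         # skip movies with no starring credit
    stars['Film rating'] = stars['Film rating'].astype(float)   # scores were crawled as text
    top10 = stars.groupby('actor')['Film rating'].sum().sort_values(ascending=False).head(10)
    print(top10)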

1. Top ten actors

2. Analysis of one actor's film scores

From these results we can conclude:

1. The top ten actors include Zhang Guorong and Liang Chaowei, among others.

2. The scores of any single actor's films can be queried and charted.

3. Complete code

1. Crawler code

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame

'''
    Successfully extracted:
    'Film ranking','Movie name','Release time','director','to star','Film type','Film rating','Number of evaluators','Movie link'
    The results are written to Douban film Top250.xlsx
    Remaining problem: the language and the country/region of production
    have no usable selector on the page; solving this may require xpath.
'''

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'}

start_num = [i for i in range(0,226,25)]        # 0, 25, ..., 225: one value per page

list_url_mv = []        # URL of all movies

for start in start_num:
    url = 'https://movie.douban.com/top250?start={}&filter='.format(start)
    print('Processing url: ',url)

    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    url_mv_list = soup.select('#content > div > div.article > ol > li > div > div.info > div.hd > a')
    # print(url_mv_list)
    for index_url in range(len(url_mv_list)):
        url_mv = url_mv_list[index_url]['href']
        list_url_mv.append(url_mv)
        # print(url_mv)


# Process each movie
def loading_mv(url,number):
    list_mv = []        # each movie's crawled information is collected here
    print('-----Processing movie {}-----'.format(number+1))
    list_mv.append(number+1)        # ranking

    # Parsing web pages
    response_mv = requests.get(url=url,headers=headers)
    soup_mv = BeautifulSoup(response_mv.text,'html.parser')

    # Crawling movie title
    mv_name = soup_mv.find_all('span',attrs={'property':'v:itemreviewed'})      # Movie name
    mv_name = mv_name[0].get_text()
    list_mv.append(mv_name)
    # print(mv_name)

    # Crawl the movie's release year
    mv_year = soup_mv.select('span.year')       # release year, e.g. '(1994)'
    mv_year = mv_year[0].get_text()[1:5]        # strip the surrounding parentheses
    list_mv.append(mv_year)
    # print(mv_year)

    # Crawling director information
    list_mv_director = []       # director
    mv_director = soup_mv.find_all('a',attrs={'rel':"v:directedBy"})
    for director in mv_director:
        list_mv_director.append(director.get_text())
    string_director = '/'.join(list_mv_director)        # Redefine format
    list_mv.append(string_director)
    # print(list_mv_director)

    # Crawl the starring actors
    list_mv_star = []           # starring
    mv_star = soup_mv.find_all('span',attrs={'class':'actor'})
    if mv_star == []:           # movie No. 210 has no starring credit
        list_mv.append(None)
    else :
        mv_star = mv_star[0].get_text().strip().split('/')
        mv_first_star = mv_star[0].split(':')
        list_mv_star.append(mv_first_star[-1].strip())
        del mv_star[0]           # remove the 'starring:' label
        for star in  mv_star:
            list_mv_star.append(star.strip())
        string = '/'.join(list_mv_star)          # Redefine format
        list_mv.append(string)

    # Crawl the movie genres
    list_mv_type = []       # genres
    mv_type = soup_mv.find_all('span',attrs={'property':'v:genre'})
    for type in mv_type:
        list_mv_type.append(type.get_text())
    string_type = '/'.join(list_mv_type)
    list_mv.append(string_type)
    # print(list_mv_type)

    # Crawl movie ratings
    mv_score = soup_mv.select('strong.ll.rating_num')       # score
    mv_score = mv_score[0].get_text()
    list_mv.append(mv_score)

    # Crawl the number of raters
    mv_evaluation_num = soup_mv.select('a.rating_people')       # number of raters
    mv_evaluation_num = mv_evaluation_num[0].get_text().strip()
    list_mv.append(mv_evaluation_num)

    # Crawl the plot synopsis
    mv_plot = soup_mv.find_all('span',attrs={"class":"all hidden"})     # plot synopsis
    if mv_plot == []:
         list_mv.append(None)
    else:
        string_plot = mv_plot[0].get_text().strip().split()
        new_string_plot = ' '.join(string_plot)
        list_mv.append(new_string_plot)

    # Append the movie's URL
    list_mv.append(url)

    return list_mv

# url1 = 'https://movie.douban.com/subject/1292052/'
# url2 = 'https://movie.douban.com/subject/26430107/'       # No. 210
# a = loading_mv(url1,1)
# b = loading_mv(url2,210)
# list_all_mv.append(a)
# list_all_mv.append(b)


list_all_mv = []

for number in range(len(list_url_mv)):
    mv_info = loading_mv(list_url_mv[number],number)
    list_all_mv.append(mv_info)
print('-----End of operation-----')

df_mv = DataFrame(list_all_mv,columns=['Film ranking','Movie name','Release time','director','to star','Film type','Film rating','Number of evaluators','Film introduction','Movie link'])
# print(df_mv)

df_mv.to_excel(r'C:\Users\86178\Desktop\Douban film Top250.xlsx')

2. Data analysis code

 

'''
Data analysis of the crawled Douban Movie Top250
Analysis contents:
1. Time: analysis of release years
            histogram
            pie chart
            line chart
2. Genre: how movie genres change over time
            line chart of one genre over time
            word cloud of movie genres
3. Starring actor or director: analysis based on film scores
            top ten actors
            query one actor's or director's filmography
            information on all actors and directors
'''

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import wordcloud
import imageio


# csv_path = 'Douban film Top250.csv'     # processing the file as csv may cause errors

# Read excel and convert it into dataframe for easy reading
def excel_to_dataframe(excel_path):
    df = pd.read_excel(excel_path,keep_default_na=False)        # keep_default_na=False results in '', not nan
    return df
excel_path = r'C:\Users\86178\Desktop\Douban film Top250.xlsx'
data_mv = excel_to_dataframe(excel_path)


dict_time = {}
for time in data_mv['Release time']:
    dict_time[time] = dict_time.get(time,0)+1

list_time = list(dict_time.items())
list_time.sort(key=lambda x:x[1],reverse=True)
list_year = []  # years
list_times = []     # number of films per year
for t in list_time:
    list_year.append(t[0])
    list_times.append(t[1])


# Draw histogram
def make_Histogram(list_x,list_y,color):
    # Solve the problem of Chinese display
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    plt.bar(list_x,list_y,width=1,color=color)
    plt.title('Histogram of film release time and number of films produced')
    plt.xlabel('Film release time')
    plt.ylabel('Number of films per year')
    plt.show()

# make_Histogram(list_year,list_times,color=['g','y','m'])     # Draw a histogram of movie year occurrences

# Draw pie chart
def make_Pie(list_times,list_year):
    mpl.rcParams['font.sans-serif'] = ['KaiTi', 'SimHei', 'FangSong']  # For Chinese font, regular script is preferred. If regular script cannot be found, bold script is used
    mpl.rcParams['font.size'] = 12  # font size
    mpl.rcParams['axes.unicode_minus'] = False  # The negative sign is displayed normally

    plt.figure(figsize=(10,10),dpi=100)     # The size of the view
    plt.pie(list_times,                     # Specify drawing data
            labels=list_year,               # Add labels outside the pie chart circle
            autopct='%1.2f%%',              # Format percentage
            textprops={'fontsize':10},      # Set the attribute font size and color in the pie chart
            labeldistance=1.05)             # Set the distance between each sector label (legend) and the center of the circle
    # plt.legend(fontsize=7)                 # Set pie chart indication
    plt.title('Proportion of films per year')
    plt.show()

pie_other = len([i for i in list_time if i[1]==1])      # years with only one film are grouped into 'Other'
list_pie_year = []
list_pie_times = []

for i in list_time:
    if i[1] == 1:       # list_time is sorted by count, so the rest are all 1
        break
    else :
        list_pie_year.append(i[0])
        list_pie_times.append(i[1])
list_pie_year.append('Other (years with 1 film)')
list_pie_times.append(pie_other)

# make_Pie(list_pie_times,list_pie_year)
# make_Pie(list_times,list_year)

# Draw line chart
def make_Plot(list_year,list_times):
    # Solve the problem of Chinese display
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    plt.title('Line chart of the number of films per year')
    plt.xlabel('Film release time')
    plt.ylabel('Number of films per year')
    plt.plot(list_year, list_times)
    plt.show()

list_plot_year = []
list_plot_times = []
list_time.sort(key=lambda x:int(x[0]))
for t in list_time:
    list_plot_year.append(t[0])
    list_plot_times.append(t[1])
# make_Plot(list_plot_year,list_plot_times)

mv_type = data_mv['Film type']
dict_type = {}
for type in mv_type:
    line = type.split('/')
    for t in line:
        dict_type[t] = dict_type.get(t,0) + 1
list_type = list(dict_type.items())
list_type.sort(key=lambda x:x[1],reverse=True)


# Draw word cloud
def c_wordcloud(ls):
    # string1 = ' '.join(ls)
    gpc=[]
    for i in ls:
        gpc.append(i[0])
    string1=" ".join('%s' % i for i in gpc)
    color_mask=imageio.imread(r"logo.jpg")
    wc = wordcloud.WordCloud(random_state=30,
                             width=600,
                             height=600,
                             max_words=30,
                             background_color='white',
                             font_path=r'msyh.ttc',
                             mask=color_mask
                             )
    wc.generate(string1)
    plt.imshow(wc)
    plt.show()
    # wc.to_file(path)
# c_wordcloud(list_type)


# [year, movie type]
list_time_type = []
for i in range(250):
    line = data_mv['Film type'][i].split('/')
    for j in line:
        time_type = []
        time_type.append(data_mv['Release time'][i])
        time_type.append(j)
        list_time_type.append(time_type)

dict_time_type = {}
for i in list_time_type:
    dict_time_type[tuple(i)] = dict_time_type.get(tuple(i),0) + 1
list_num_time_type = list(dict_time_type.items())
list_num_time_type.sort(key=lambda x:x[1],reverse=True)


# Plot how one film genre develops over time
def mv_time_type(type_name):
    list_mv_type = []
    for num in list_num_time_type:
        if num[0][1] == type_name:
            list_mv_type.append(num)
    list_mv_type.sort(key=lambda x:x[0][0],reverse=False)
    list_year = []
    list_times = []
    for t in list_mv_type:
        list_year.append(t[0][0])
        list_times.append(t[1])

    # Solve the problem of Chinese display
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    plt.title('Development of the film genre "{}"'.format(type_name))
    plt.xlabel('particular year')
    plt.ylabel('Number of occurrences per year')
    plt.plot(list_year,list_times)
    plt.show()

# mv_time_type('Drama')
# mv_time_type('Sci-Fi')      # mainly after 2000



# Collect the score, rank and title of each work a person directed or starred in
def people_score(peo_dir_star):
    records = []
    for num in range(250):
        if data_mv[peo_dir_star][num] == '':
            continue
        else:
            peoples = data_mv[peo_dir_star][num].split('/')
        for people in peoples:
            list_p_s = []
            list_p_s.append(people)
            list_p_s.append(data_mv['Film rating'][num])
            list_p_s.append(data_mv['Film ranking'][num])
            list_p_s.append(data_mv['Movie name'][num])
            records.append(list_p_s)
    return records

list_director = people_score('director')
list_star = people_score('to star')


# Best director or actor - based on the total score
def best_people(list_people):

    dict_people = {}
    for i in list_people:
        dict_people[i[0]] = dict_people.get(i[0],[]) + [(i[1],i[2],i[3])]       # person -> [(score, rank, title), ...]

    for i in dict_people.items():
        i[1].append(float('{:.2f}'.format(sum([j[0] for j in i[1]]))))      # append each person's total score
    # e.g. ('Gong Li', [(9.6, 2, 'Farewell My Concubine'), (9.3, 30, 'To Live'), (8.7, 109, 'Flirting Scholar'), 27.6])

    list_new_people = list(dict_people.items())
    list_new_people.sort(key=lambda x:x[1][-1],reverse=True)

    print('The search is done, please start your operation (enter a number)!\n---Enter 1: top ten actors---\n---Enter 2: search an actor\'s filmography---\n---Enter 3: print all actors---')
    print('-----Press Enter to exit-----')

    select_number = input('Start input operation:')
    while select_number != '':

        if select_number == '1':
            print('Filmography of the top ten actors:')
            list_all_score = []     # total scores
            list_people_name = []
            for i in list_new_people[0:10]:
                print(i)

                list_people_name.append(i[0])
                list_all_score.append(i[1][-1])

            # Solve the problem of Chinese display
            plt.rcParams['font.sans-serif'] = ['SimHei']
            plt.rcParams['axes.unicode_minus'] = False

            # plt.figure(figsize=(10, 10), dpi=100)  # The size of the view
            plt.title('Total score of top ten actors')
            plt.xlabel('performer')
            plt.ylabel('Total score')
            plt.bar(list_people_name,list_all_score,width=0.5)
            plt.show()

        elif select_number == '2':
            # star_name = input('enter the actor name you want to know: ')
            star_name = ' '
            while star_name != '':
                star_name = input('Enter the actor name you want to know:')
                list_mv_name = []       # Movie name
                list_mv_score = []      # Film rating
                for number,i in enumerate(list_new_people):
                    if star_name == i[0]:
                        all_score = i[1][-1]      # total score
                        works = i[1][:-1]         # copy the works list so repeated queries keep the total score intact
                        for j in works:
                            list_mv_name.append(j[2])
                            list_mv_score.append(j[0])
                            print('{} starred in the film ranked {} in the Douban Top250, <{}>, with a score of {}'.format(star_name,j[1],j[2],j[0]))
                        print('{} starred in {} listed films with a total score of {}, ranking {} among all actors'.format(star_name,len(works),all_score,number+1))
                        print('End of query!')

                        # Calculation pie chart
                        def pie_mv_score():
                            mpl.rcParams['font.sans-serif'] = ['KaiTi', 'SimHei','FangSong']  # For Chinese font, regular script is preferred. If regular script cannot be found, bold script is used
                            mpl.rcParams['font.size'] = 12  # font size
                            mpl.rcParams['axes.unicode_minus'] = False  # The negative sign is displayed normally

                            plt.figure(figsize=(10,10))
                            plt.pie(list_mv_score,
                                    labels=list_mv_name,
                                    autopct='%1.2f%%',      # Calculate percentage, set and format
                                    textprops={'fontsize': 10})
                            plt.title('Score share of the films starring {} (overall rank {})'.format(star_name,number+1))
                            plt.show()
                        pie_mv_score()

                        break

                else:
                    print('There is no such person!')
                    break       # for/else: reached only when no actor matched

        elif select_number == '3':
            for i in list_new_people:
                print(i)

        else :
            print('No such operation!')

        select_number = input('After the query, you can continue to enter the query serial number:')

    print('-----End of query-----')

best_people(list_star)


Topics: Python