Hello, I'm Ou K.
The May Day holiday is coming, and this year it brings a full five days off. How do you plan to spend such a long break?
Like this?
Or like this?
In this issue, we will use ticket sales data from qunar.com across all of China's provinces to briefly analyze the distribution of popular scenic spots and national travel, and see which attractions are the most popular. Hopefully it helps.
Modules involved:
requests + json – web page data crawling
openpyxl – save data to Excel
pandas – table data processing
pyecharts – data visualization
1. Web page analysis
Open the Qunar ticket page https://piao.qunar.com/ and take Beijing as an example:
You can see some recommended scenic spot information. Press F12 to open the browser's developer tools, find the URL that loads the data, then click through the pages and observe how the URL changes:
The URLs:
https://piao.qunar.com/ticket/list.json?keyword=%E5%8C%97%E4%BA%AC&region=&from=mpl_search_suggest&page=1
https://piao.qunar.com/ticket/list.json?keyword=%E5%8C%97%E4%BA%AC&region=&from=mpl_search_suggest&page=2
https://piao.qunar.com/ticket/list.json?keyword=%E5%8C%97%E4%BA%AC&region=&from=mpl_search_suggest&page=3
https://piao.qunar.com/ticket/list.json?keyword=%E5%8C%97%E4%BA%AC&region=&from=mpl_search_suggest&page=4
https://piao.qunar.com/ticket/list.json?keyword=%E5%8C%97%E4%BA%AC&region=&from=mpl_search_suggest&page=5
Here %E5%8C%97%E4%BA%AC is the URL-encoded form of Beijing (北京). As you can see, only the page parameter changes from page to page, and the keyword parameter can also be written in Chinese directly.
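If you want to reproduce the encoding or generate these URLs yourself, the standard library's urllib.parse.quote handles it; a minimal sketch (the five-page range just mirrors the capture above):

from urllib.parse import quote

city = '北京'
print(quote(city))  # %E5%8C%97%E4%BA%AC
for page in range(1, 6):
    print(f'https://piao.qunar.com/ticket/list.json?keyword={quote(city)}&region=&from=mpl_search_suggest&page={page}')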
Take a look at what the interface returns:
Nice! A clean JSON string comes back; it could hardly be more convenient!
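For orientation, the fields parsed in section 2.3 suggest the response is shaped roughly like this; the structure below is an abridged reconstruction with invented values, not an actual capture:

response_example = {
    'data': {
        'sightList': [
            {
                'sightName': 'Example Scenic Spot',  # invented value
                'star': '5A',
                'score': 4.8,
                'qunarPrice': 60,
                'saleCount': 12345,
                'districts': 'Beijing·Dongcheng',
                'point': '116.39,39.91',
                'intro': '...',
                'free': False,
                'address': '...',
            },
            # ...one dict per scenic spot
        ]
    }
}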
2. Crawl data
As preparation, import the following modules:
import os
import time
import json
import random
import requests
from fake_useragent import UserAgent
from openpyxl import Workbook, load_workbook
If a module is missing, install it directly with pip (for example, pip install fake-useragent).
2.1 The 34 administrative regions
cities = ['Guangdong', 'Beijing', 'Shanghai', 'Tianjin', 'Chongqing', 'Yunnan',
          'Heilongjiang', 'Inner Mongolia', 'Jilin', 'Ningxia', 'Anhui', 'Shandong',
          'Shanxi', 'Sichuan', 'Guangxi', 'Xinjiang', 'Jiangsu', 'Jiangxi', 'Hebei',
          'Henan', 'Zhejiang', 'Hainan', 'Hubei', 'Hunan', 'Macao', 'Gansu', 'Fujian',
          'Tibet', 'Guizhou', 'Liaoning', 'Shaanxi', 'Qinghai', 'Hong Kong', 'Taiwan']
2.2 Crawl data for each administrative region
code:
def get_city_scenic(city, page):
    ua = UserAgent(verify_ssl=False)
    headers = {'User-Agent': ua.random}  # random User-Agent for each request
    url = f'https://piao.qunar.com/ticket/list.json?keyword={city}&region=&from=mpl_search_suggest&sort=pp&page={page}'
    result = requests.get(url, headers=headers, timeout=10)
    result.raise_for_status()  # raise an exception on HTTP errors
    return get_scenic_info(city, result.text)
Click "f-string" to refer to the specific usage of this article:
Tips | 5000 words super full parsing Python three formatting output methods [% / format / f-string]
2.3 Extract the data from each page
code:
def get_scenic_info(city, response):
    response_info = json.loads(response)
    sight_list = response_info['data']['sightList']
    one_city_scenic = []
    for sight in sight_list:
        name = sight['sightName']                 # name of scenic spot
        star = sight.get('star', None)            # star rating
        score = sight.get('score', 0)             # score
        price = sight.get('qunarPrice', 0)        # price
        sale = sight.get('saleCount', 0)          # sales volume
        districts = sight.get('districts', None)  # province, city, district
        point = sight.get('point', None)          # coordinates
        intro = sight.get('intro', None)          # brief introduction
        free = sight.get('free', True)            # whether it is free
        address = sight.get('address', None)      # specific address
        scenic = [city, name, star, score, price, sale,
                  districts, point, intro, free, address]
        one_city_scenic.append(scenic)
    return one_city_scenic
You can adjust which fields to crawl according to your needs.
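As a quick sanity check, a hypothetical call (assuming the endpoint responds as captured above):

rows = get_city_scenic('Beijing', 1)  # first page of Beijing results
print(len(rows))
print(rows[0] if rows else 'no data')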
2.4 Loop over every page of every administrative region
code:
def get_city_info(cities, pages):
    for city in cities:
        one_city_info = []
        for page in range(1, pages + 1):
            try:
                print(f'Crawling {city}, page {page} of scenic spot data...')
                time.sleep(random.uniform(0.8, 1.5))  # pause briefly between requests
                one_page_info = get_city_scenic(city, page)
            except Exception:
                continue
            if one_page_info:
                one_city_info += one_page_info
        # save each region to its own Excel file (path assumed; insert2excel is defined in 2.5)
        insert2excel(f'./scenic/{city}.xlsx', one_city_info)
The print call outputs some progress information so you can easily see how far the crawl has got:
2.5 Save the data
Here we use openpyxl to save the data to Excel; you can also try saving it to other file formats or to a database:
def insert2excel(filepath, allinfo):
    try:
        if not os.path.exists(filepath):
            tableTitle = ['city', 'name', 'Stars', 'score', 'Price', 'sales volume',
                          'province/city/area', 'coordinate', 'brief introduction',
                          'Is it free', 'Specific address']
            wb = Workbook()
            ws = wb.active
            ws.title = 'sheet1'
            ws.append(tableTitle)
            wb.save(filepath)
            time.sleep(3)
        wb = load_workbook(filepath)
        ws = wb.active
        ws.title = 'sheet1'
        for info in allinfo:
            ws.append(info)
        wb.save(filepath)
        return True
    except Exception:
        return False
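To tie sections 2.1 to 2.5 together, a minimal driver might look like the sketch below; the ./scenic output folder matches what section 3.1 reads, while the 10-page depth is an assumption of mine, not from the original:

if __name__ == '__main__':
    os.makedirs('./scenic', exist_ok=True)  # assumed output folder, matching section 3.1
    get_city_info(cities, pages=10)         # 10 pages per region is an assumed depth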
result:
3. Data visualization
3.1 Read the data
Import the following modules:
import os
import pandas as pd
from pyecharts.charts import Bar
from pyecharts.charts import Map
from pyecharts import options as opts
Use os.walk to traverse the scenic folder and the pandas module to read the data:
# Traverse the files in the scenic folder and merge them into one table
def get_datas():
    df_allinfo = pd.DataFrame()
    for root, dirs, files in os.walk('./scenic'):
        for filename in files:
            try:
                df = pd.read_excel(f'./scenic/{filename}')
                # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
                df_allinfo = pd.concat([df_allinfo, df], ignore_index=True)
            except Exception:
                continue
    # remove duplicate scenic spots by name
    df_allinfo.drop_duplicates(subset=['name'], keep='first', inplace=True)
    return df_allinfo
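Load everything once and reuse it for all the charts below; a quick peek confirms the merge and de-duplication worked:

data = get_datas()
print(data.shape)    # (number of scenic spots, number of columns)
print(data.head(3))  # first few rows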
3.2 Sales chart of popular scenic spots
Take the top 20 ticket sales as an example:
def get_sales_bar(data):
    sort_info = data.sort_values(by='sales volume', ascending=True)
    c = (
        Bar()
        .add_xaxis(list(sort_info['name'])[-20:])
        .add_yaxis('Sales volume of popular scenic spots', sort_info['sales volume'].values.tolist()[-20:])
        .reversal_axis()
        .set_global_opts(
            title_opts=opts.TitleOpts(title='Sales data of popular scenic spots'),
            yaxis_opts=opts.AxisOpts(name='Name of scenic spot'),
            xaxis_opts=opts.AxisOpts(name='sales volume'),
        )
        .set_series_opts(label_opts=opts.LabelOpts(position='right'))
        .render('1-Popular scenic spot data.html')
    )
effect:
3.3 Map of holiday travel distribution
code:
def get_sales_geo(data):
    df = data[['city', 'sales volume']]
    # sum sales per province; selecting the column first gives a flat Series
    df_counts = df.groupby('city')['sales volume'].sum()
    c = (
        Map()
        .add('Holiday travel distribution',
             [list(z) for z in zip(df_counts.index.values.tolist(), df_counts.values.tolist())],
             'china')
        .set_global_opts(
            title_opts=opts.TitleOpts(title='Holiday travel data map distribution'),
            visualmap_opts=opts.VisualMapOpts(max_=100000, is_piecewise=True),
        )
        .render('2-Holiday travel data map distribution.html')
    )
effect:
It can be seen that the number of people traveling in Beijing, Shanghai, Jiangsu, Guangdong, Sichuan, Shaanxi and other places is relatively large.
3.4 Number of 4A-5A scenic spots in each province
code:
def get_level_counts(data):
    df = data[data['Stars'].isin(['4A', '5A'])]
    df_counts = df.groupby('city').count()['Stars']
    c = (
        Bar()
        .add_xaxis(df_counts.index.values.tolist())
        .add_yaxis('4A-5A Number of scenic spots', df_counts.values.tolist())
        .set_global_opts(
            title_opts=opts.TitleOpts(title='Provinces and cities 4A-5A number of scenic spots'),
            datazoom_opts=[opts.DataZoomOpts(), opts.DataZoomOpts(type_='inside')],
        )
        .render('3-Provinces and cities 4A-5A number of scenic spots.html')
    )
effect:
3.5 Map distribution of 4A-5A scenic spots
code:
def get_level_geo(data):
    df = data[data['Stars'].isin(['4A', '5A'])]
    df_counts = df.groupby('city').count()['Stars']
    c = (
        Map()
        .add('4A-5A Scenic spot distribution',
             [list(z) for z in zip(df_counts.index.values.tolist(), df_counts.values.tolist())],
             'china')
        .set_global_opts(
            title_opts=opts.TitleOpts(title='Map data distribution'),
            visualmap_opts=opts.VisualMapOpts(max_=50, is_piecewise=True),
        )
        .render('4-4A-5A Scenic spot data map distribution.html')
    )
effect:
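Putting section 3 together, all four charts can be rendered in one go:

if __name__ == '__main__':
    data = get_datas()
    get_sales_bar(data)     # 1 - top-20 sales bar chart
    get_sales_geo(data)     # 2 - holiday travel distribution map
    get_level_counts(data)  # 3 - 4A-5A counts per province
    get_level_geo(data)     # 4 - 4A-5A distribution map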
End.
That's all the content for this issue; go and practice it! Original writing is not easy, so if you like this article, please give it a like, save it, and share it so more people can see it.
Recommended reading
Basics | A detailed explanation of object-oriented Python
Basics | Python functions
Tips | The 20 most practical and efficient PyCharm shortcuts (animated demos)
Tips | Making cool nine-grid pictures for WeChat Moments with Python
Crawler | Grab the full set of skins from the Honor of Kings official website with Python
Crawler | One-click download of N novels: everything you want is free here
Visualization | Making the most dazzling 3D visual map with Python
Visualization | An animated ranking of Chinese universities: see where your alma mater stands
Crawler | Python crawls Douban Movie Top 250 + data visualization
The WeChat official account "Python当打之年" pushes Python programming tips every day; I hope you'll enjoy it.