Python crawler in practice + data analysis + data visualization (Autohome)

Posted by Base on Mon, 07 Mar 2022 21:07:56 +0100

With economic growth and advances in technology, the car has become an everyday means of transportation for most families. Owning a car and a house is also widely treated as a prerequisite for marriage, which adds to the pressure on young men and makes buying a car feel urgent. The second-hand car market has been very active in recent years and offers another way to buy one. This post therefore crawls second-hand car listings for Jiangsu Province and analyzes them visually, as a reference for anyone shopping for a used car.

1, Crawler part
Crawler description:
1. The crawler uses an object-oriented code structure
2. The crawled data is stored in a MongoDB database (a converted .xlsx file is also provided)
3. The crawler code is commented in detail
4. The crawler takes used cars in Jiangsu Province as its example

Code display

import re
from pymongo import MongoClient
import requests
from lxml import html

class CarHomeSpider(object):
    def __init__(self):
        self.start_url = 'http://www.che168.com/jiangsu/list/'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
        }
        self.url_temp = 'http://www.che168.com/jiangsu/{}/a0_0msdgscncgpi1ltocsp{}exx0/?pvareaid=102179#currengpostion'
        self.client = MongoClient()
        self.collection = self.client['test']['car_home']

    def get_url_list(self,sign,total_count):
        url_list = [self.url_temp.format(sign,i) for i in range(1,int(total_count)+1)]
        return url_list

    def parse(self,url):
        resp = requests.get(url,headers=self.headers)
        return resp.text

    def get_content_list(self,raw_html):
        resp_html = html.etree.HTML(raw_html)
        car_list = resp_html.xpath('//ul[@class="viewlist_ul"]/li')
        for car in car_list:
            item = {}
            # Get the title information of the car
            card_name = car.xpath('.//h4[@class="card-name"]/text()')
            card_name = card_name[0] if len(card_name)>0 else ''
            # Titles are expected to look like "<series> <year>款 <trim>"; 款 marks the model year
            car_series = re.findall(r'(.*?) \d{4}款',card_name)
            item['car_series'] = car_series[0].replace(' ','') if len(car_series)>0 else ''
            car_time_style = re.findall(r'.*? (\d{4})款',card_name)
            item['car_time_style'] = car_time_style[0] if len(car_time_style)>0 else ''
            car_detail = re.findall(r'\d{4}款 (.*)',card_name)
            item['car_detail'] = car_detail[0].replace(' ','') if len(car_detail)>0 else ''

            # Get car details
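            card_unit = car.xpath('.//p[@class="cards-unit"]/text()')
            # The text is expected to hold four '/'-separated parts; pad with '' so a short or missing value does not raise IndexError
            card_unit = card_unit[0].split('/') if len(card_unit)>0 else []
            card_unit += [''] * (4 - len(card_unit))
            item['car_run'] = card_unit[0]
            item['car_push'] = card_unit[1]
            item['car_place'] = card_unit[2]
            item['car_rank'] = card_unit[3]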

            # Get the price of the car
            car_price = car.xpath('./@price')
            item['car_price'] = car_price[0] if len(car_price)>0 else ''
            print(item)
            self.save(item)

    def save(self,item):
        # Collection.insert() was removed in PyMongo 4; insert_one() is the supported call
        self.collection.insert_one(item)

    def run(self):
        # First, request the home page to obtain page classification data
        rest = self.parse(self.start_url)
        rest_html = html.etree.HTML(rest)
        # The listings are classified by price, for example: below 30000, 30000-50000, 50000-80000, 80000-100000, 100000-150000, 150000-200000, 200000-300000, 300000-500000, and above 500000
        price_area_list = rest_html.xpath('//div[contains(@class,"condition-price")]//div[contains(@class,"screening-base")]/a')
        if price_area_list:
            for price_area in price_area_list:
                price_area_text = price_area.xpath('./text()')[0]
                price_area_link = 'http://www.che168.com'+price_area.xpath('./@href')[0]
                # Get the url of each category and make a request to get the total number of pages under each category
                rest_ = self.parse(price_area_link)
                rest_html_ = html.etree.HTML(rest_)
                total_count = rest_html_.xpath('//div[@id="listpagination"]/a[last()-1]/text()')[0]
                # Get the unique identification of each classification url
                sign = re.findall(r'jiangsu/(.*?)/#pvareaid',price_area_link)[0]
                # Generate the url addresses of all pages under each category
                url_list = self.get_url_list(sign,total_count)
                for url in url_list:
                    raw_html = self.parse(url)
                    self.get_content_list(raw_html)

if __name__ == '__main__':
    car_home = CarHomeSpider()
    car_home.run()
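
The two parsing steps inside get_content_list can be tried in isolation, without the network or MongoDB. The sketch below is not the crawler's own code: it assumes titles shaped like "<series> <year>款 <trim>" (款 marks the model year) and a unit line with four "/"-separated parts. The names `parse_card`, `TITLE_RE` and `UNIT_FIELDS`, and the sample strings, are made up for illustration.

```python
import re

# Assumed title shape: "<series> <year>款 <trim>"; 款 marks the model year.
TITLE_RE = re.compile(r'^(?P<series>.*?) (?P<year>\d{4})款 (?P<detail>.*)$')
# Field names in the order the crawler reads them from the "cards-unit" text.
UNIT_FIELDS = ('car_run', 'car_push', 'car_place', 'car_rank')


def parse_card(card_name, card_unit):
    """Split one listing's title and unit line into the crawler's item fields."""
    item = {'car_series': '', 'car_time_style': '', 'car_detail': ''}
    match = TITLE_RE.match(card_name)
    if match:
        item['car_series'] = match.group('series').replace(' ', '')
        item['car_time_style'] = match.group('year')
        item['car_detail'] = match.group('detail').replace(' ', '')

    # Pad with empty strings so a short or missing unit line never raises IndexError.
    parts = [part.strip() for part in card_unit.split('/')] if card_unit else []
    parts += [''] * (len(UNIT_FIELDS) - len(parts))
    item.update(zip(UNIT_FIELDS, parts))
    return item


# Hypothetical sample strings, not taken from the crawled data.
sample = parse_card('宝马5系 2018款 530Li 领先型', '3.2万公里/2018-06/苏州/商家')
empty = parse_card('', '')
```

Returning empty strings for anything that cannot be parsed keeps every stored document on the same set of keys, which makes the later conversion to a DataFrame more predictable.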

2, Data analysis and data visualization

Description of data analysis and data visualization:
1. This post does the data analysis with pandas and serves the visualizations through the Flask framework
2. The architecture diagram of the project is as follows: [figure: project architecture diagram]

Code display

  • Data analysis code display (analysis.py)
import re

from pymongo import MongoClient
import pandas as pd
import numpy as np
import pymysql

def pre_process(df):
    """
    Data preprocessing function
    :param df: dataFrame
    :return: df
    """

    # Strip the "万公里" (10,000 km) unit from the mileage so it can be used in later calculations, e.g. "1.2万公里" -> "1.2"
    df['car_run'] = df['car_run'].apply(lambda x:x.split('万公里')[0])

    # Mark rows whose car_push field is "未上牌" (unlicensed) as NaN so they are dropped below
    df['car_push'] = df['car_push'].apply(lambda x:x if not x=="未上牌" else np.nan)

    # Delete data with NAN in the field
    df.dropna(inplace=True)

    return df



def car_brand_count_top10(df):
    """
    Calculate the top ten in the number of different brands
    :param df: dataFrame
    :return: df
    """
    # Classify according to the brand of the car
    grouped = df.groupby('car_series')['car_run'].count().reset_index().sort_values(by="car_run",ascending=False)[:10]
    data = [[i['car_series'],i['car_run']] for i in grouped.to_dict(orient="records")]
    print(data)
    return data

def car_use_year_count(df):
    """
    Calculate the service time of used cars
    :param df: dataFrame
    :return: df
    """
    # Parse the car_push date and extract its year
    df['car_push_year'] = pd.to_datetime(df['car_push']).dt.year
    # Convert data type to int (np.int was removed in NumPy 1.24; the builtin int works everywhere)
    df['car_time_style'] = df['car_time_style'].astype(int)
    df['car_push_year'] = df['car_push_year'].astype(int)
    df['car_use_year'] = df['car_push_year']-df['car_time_style']

    # Map a service life in years to one of three buckets: <1 year, 1~3 years, >=3 years
    def bucket(years):
        if years == 0:
            return '<1 year'
        return '1~3 years' if years < 3 else '>=3 years'

    # Drop rows with a negative service life, then label each remaining row with its bucket
    used = df[df['car_use_year']>=0].copy()
    used['car_use_year'] = used['car_use_year'].apply(bucket)
    # Count the cars in each bucket
    grouped_use_year = used.groupby('car_use_year')['car_series'].count().reset_index()
    data = [[i['car_use_year'],i['car_series']] for i in grouped_use_year.to_dict(orient="records")]
    print(data)
    return data

def car_place_count(df):
    """
    Number of used cars in different regions
    :param df: dataFrame
    :return: df
    """
    grouped =  df.groupby('car_place')['car_series'].count().reset_index()
    data = [[i['car_place'],i['car_series']] for i in grouped.to_dict(orient="records")]
    print(data)
    return data

def car_month_count(df):
    """
    Calculate the number of used cars per month
    :param df: dataFrame
    :return: df
    """
    # Deal with the sale time of the car
    date = pd.to_datetime(df['car_push'])
    date_value = pd.DatetimeIndex(date)
    month = date_value.month
    df['car_push_month'] = month

    # Group the months of car sale
    grouped = df.groupby('car_push_month')['car_series'].count().reset_index()
    data = [[i['car_push_month'],i['car_series']] for i in grouped.to_dict(orient="records")]
    print(data)
    return data

def save(cursor,sql,data):
    result = cursor.executemany(sql,data)
    if result:
        print('Insert successful')

if __name__ == '__main__':
    # 1 get data from MongoDB
    # Initialize MongoDB data connection
    # client = MongoClient()
    # collections = client['test']['car_home']
    # Get MongoDB data
    # cars = collections.find({},{'_id':0})

    # 2. Read xlsx file data (the data in MongoDB has been converted to xlsx format)
    cars = pd.read_excel('./carhome.xlsx',engine='openpyxl')

    # Convert data to dataFrame type
    df = pd.DataFrame(cars)
    print(df.info())
    print(df.head())

    # Preprocess data
    df = pre_process(df)

    # Calculate the top ten in the number of different brands
    data1 = car_brand_count_top10(df)

    # Calculate the service time of used cars
    data2 = car_use_year_count(df)

    # Calculate the number of used cars in different regions
    data3 = car_place_count(df)

    # Calculate the number of used cars per month
    data4 = car_month_count(df)

    # Create mysql connection
    conn = pymysql.connect(user='root',password='123456',host='localhost',port=3306,database='car_home',charset='utf8')
    try:
        with conn.cursor() as cursor:
            # Calculate the top ten in the number of different brands
            sql1 = 'insert into db_car_brand_top10(brand,count) values(%s,%s)'
            save(cursor,sql1,data1)

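            # Service time of used cars. Note: data2 is written to db_car_area, so the CarArea model and showAreaBar.html end up showing usage time
            sql2 = 'insert into db_car_area(area,count) values(%s,%s)'
            save(cursor,sql2,data2)

            # Number of used cars in different regions. Note: data3 is written to db_car_use_year, so the CarUseYear model and showPie.html end up showing regions
            sql3 = 'insert into db_car_use_year(year_area,count) values(%s,%s)'
            save(cursor, sql3, data3)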

            # Calculate the number of used cars per month
            sql4 = 'insert into db_car_month(month,count) values(%s,%s)'
            save(cursor,sql4,data4)

            conn.commit()
    except pymysql.MySQLError as error:
        print(error)
        conn.rollback()
    finally:
        # Always release the connection, whether the inserts succeeded or not
        conn.close()
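
The service-life grouping in car_use_year_count can also be read as a plain counting problem. The following is a minimal sketch of that idea using only the standard library, not the pandas code above; the function names, the label strings and the sample pairs are assumptions made for illustration.

```python
from collections import Counter


def use_year_bucket(years):
    """Map a service life in whole years to a chart label."""
    if years == 0:
        return '<1 year'
    if years < 3:
        return '1~3 years'
    return '>=3 years'


def count_use_years(records):
    """Count cars per service-life bucket, skipping negative service lives."""
    counts = Counter()
    for push_year, model_year in records:
        years = push_year - model_year
        if years >= 0:
            counts[use_year_bucket(years)] += 1
    # Same [label, count] shape that the analysis functions return.
    return [[label, counts[label]] for label in sorted(counts)]


# Made-up (car_push year, model year) pairs for illustration only.
records = [(2020, 2020), (2021, 2019), (2019, 2020), (2022, 2018), (2021, 2020)]
data = count_use_years(records)
```

The pair (2019, 2020) has a negative service life and is skipped, which mirrors the filter on non-negative values in the pandas version.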
  • Data conversion file MongoDB to xlsx (to_excle.py)
import pandas as pd
import numpy as np
from pymongo import MongoClient

def export_excel(export):
    # Convert dictionary list to DataFrame
    df = pd.DataFrame(list(export))
    # Write the DataFrame to carhome.xlsx; the context manager saves and closes the file
    # (ExcelWriter.save() and the encoding argument of to_excel were removed in pandas 2.0)
    with pd.ExcelWriter('carhome.xlsx') as writer:
        df.to_excel(writer, index=False)


if __name__ == '__main__':
    # Convert MongoDB data to xlsx file
    client = MongoClient()
    connection = client['test']['car_home']
    ret = connection.find({}, {'_id': 0})
    data_list = list(ret)
    export_excel(data_list)

  • Database model file display (models.py)
from . import db


class BaseModel(object):
    id = db.Column(db.Integer, autoincrement=True, primary_key=True)
    count = db.Column(db.Integer)

# Calculate the top ten in the number of different brands
class CarBrandTop10(BaseModel,db.Model):
    __tablename__ = 'db_car_brand_top10'
    brand = db.Column(db.String(32))

# Calculate the service time of used cars
class CarUseYear(BaseModel,db.Model):
    __tablename__ = 'db_car_use_year'
    year_area = db.Column(db.String(32))

# Calculate the number of used cars in different regions
class CarArea(BaseModel,db.Model):
    __tablename__='db_car_area'
    area = db.Column(db.String(32))

# Number of used cars per month
class CarMonth(BaseModel,db.Model):
    __tablename__='db_car_month'
    month = db.Column(db.Integer)
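
To make the table layout concrete, here is a minimal sketch that mirrors two of the four models as plain SQL tables and fills them with rows in the [label, count] shape produced by analysis.py. It uses Python's built-in sqlite3 with an in-memory database purely as a stand-in for MySQL, so the placeholder style is `?` rather than `%s`; the sample rows are made up.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (an assumption for this sketch).
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE db_car_brand_top10 (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        count INTEGER,
        brand VARCHAR(32)
    );
    CREATE TABLE db_car_month (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        count INTEGER,
        month INTEGER
    );
''')

# Made-up rows in the [label, count] shape returned by the analysis functions.
brand_rows = [['brand-a', 12], ['brand-b', 7]]
month_rows = [[1, 30], [2, 5]]

try:
    conn.executemany('INSERT INTO db_car_brand_top10(brand, count) VALUES (?, ?)', brand_rows)
    conn.executemany('INSERT INTO db_car_month(month, count) VALUES (?, ?)', month_rows)
    conn.commit()
except sqlite3.Error:
    # Undo the partial insert, then let the error surface.
    conn.rollback()
    raise

brands = conn.execute('SELECT brand, count FROM db_car_brand_top10 ORDER BY id').fetchall()
months = conn.execute('SELECT month, count FROM db_car_month ORDER BY id').fetchall()
conn.close()
```

The shared `id` and `count` columns are what BaseModel contributes to every table; each concrete model only adds its own label column.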

  • Configuration file code display (config.py)
# Basic configuration
class Config(object):
    SECRET_KEY = 'msqaidyq1314'
    SQLALCHEMY_DATABASE_URI = "mysql://root:123456@localhost:3306/car_home"
    SQLALCHEMY_TRACK_MODIFICATIONS = True


class DevelopmentConfig(Config):
    DEBUG = True

class ProductConfig(Config):
    pass

# Create configuration class mapping
config_map = {
    'develop':DevelopmentConfig,
    'product':ProductConfig
}
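
One common way to use such a mapping is to pick the configuration by name at start-up. The sketch below assumes a hypothetical environment variable `CAR_HOME_ENV` and a helper `select_config`; neither is part of the original project, and the config classes are trimmed to the single DEBUG flag.

```python
import os


class Config(object):
    DEBUG = False


class DevelopmentConfig(Config):
    DEBUG = True


class ProductConfig(Config):
    pass


config_map = {'develop': DevelopmentConfig, 'product': ProductConfig}


def select_config(env=None):
    """Pick a config class by name, falling back to 'develop'."""
    # CAR_HOME_ENV is a hypothetical variable name used only in this sketch.
    name = env or os.environ.get('CAR_HOME_ENV', 'develop')
    try:
        return config_map[name]
    except KeyError:
        raise ValueError('unknown config name: %r' % name)
```

Failing loudly on an unknown name avoids silently running with development settings in production.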
  • Main project directory code display (api_1_0/__init__.py)
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
import pymysql
from config import config_map
pymysql.install_as_MySQLdb()

db = SQLAlchemy()
def create_app(config_name='develop'):
    # Initialize app object
    app = Flask(__name__)
    config = config_map[config_name]
    app.config.from_object(config)

    # Load database
    db.init_app(app)

    # Registration blueprint
    from . import api_1_0
    app.register_blueprint(api_1_0.api,url_prefix="/show")

    return app
  • Main program file code display (manager.py)
from car_home import create_app,db
from flask_migrate import Migrate,MigrateCommand
from flask_script import Manager
from flask import render_template

app = create_app()

manager = Manager(app)

Migrate(app,db)

manager.add_command('db',MigrateCommand)

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == '__main__':
    manager.run()
  • View file code display (api_1_0/views/__init__.py, show.py)

__init__.py

from flask import Blueprint
from car_home import models
api = Blueprint('api_1_0',__name__)

from . import show

show.py

from . import api
from car_home.models import CarArea,CarUseYear,CarBrandTop10,CarMonth
from flask import render_template
# Calculate the top ten in the number of different brands
@api.route('/showBrandBar')
def showBrandBar():
    car_brand_top10 = CarBrandTop10.query.all()
    brand = [i.brand for i in car_brand_top10]
    count = [i.count for i  in car_brand_top10]
    print(brand)
    print(count)
    return render_template('showBrandBar.html', **locals())


# Pie chart backed by db_car_use_year (analysis.py writes the per-region counts into this table)
@api.route('/showPie')
def showPie():
    car_use_year = CarUseYear.query.all()
    data = [{'name':i.year_area,'value':i.count} for i in car_use_year]
    return render_template('showPie.html',**locals())


# Bar chart backed by db_car_area (analysis.py writes the service-time counts into this table)
@api.route('/showAreaBar')
def showAreaBar():
    car_area = CarArea.query.all()
    area = [i.area for i in car_area]
    count = [i.count for i in car_area]
    return render_template('showAreaBar.html',**locals())

# Calculate the number of used cars per month
@api.route('/showLine')
def showLine():
    car_month = CarMonth.query.all()
    month = [i.month for i in car_month]
    count = [i.count for i in car_month]
    return render_template('showLine.html',**locals())
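
The views hand the templates two shapes of data: parallel lists for the bar and line charts, and name/value objects for the pie chart. The sketch below shows both shapes without Flask or a database; `Row` and the helper names are hypothetical stand-ins for the model instances, and `json.dumps` stands in for the template's `tojson` filter.

```python
import json
from collections import namedtuple

# Stand-in for a database row; the real views read SQLAlchemy model instances.
Row = namedtuple('Row', ['name', 'value'])


def to_axis_lists(rows):
    """Bar and line charts: one list for the category axis, one for the values."""
    return [row.name for row in rows], [row.value for row in rows]


def to_pie_data(rows):
    """Pie charts: a list of {'name': ..., 'value': ...} objects."""
    return [{'name': row.name, 'value': row.value} for row in rows]


# Made-up rows for illustration.
rows = [Row('a', 3), Row('b', 1)]
labels, counts = to_axis_lists(rows)
pie_json = json.dumps(to_pie_data(rows))
```

Passing these values explicitly, for example `render_template('showPie.html', data=data)`, is a possible alternative to `**locals()` that keeps the template's inputs visible at the call site.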


  • Homepage display (index.html)

The home page simply creates four hyperlinks to the corresponding charts

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Visual analysis of automobile home</title>
    <style>
        ul{
            width: 800px;
            height: 600px;
            {#list-style: none;#}
            line-height: 60px;
            padding: 40px;
            margin: auto;
        }
        ul li{
            margin-bottom: 20px;
        }
    </style>
</head>
<body>
<ul>
    <li><a href="{{ url_for('api_1_0.showBrandBar') }}"><h3>Calculate the top ten in the number of different brands</h3></a></li>
    <li><a href="{{ url_for('api_1_0.showPie') }}"><h3>Calculate the service time of used cars</h3></a></li>
    <li><a href="{{ url_for('api_1_0.showAreaBar') }}"><h3>Calculate the number of used cars in different regions</h3></a></li>
    <li><a href="{{ url_for('api_1_0.showLine') }}"><h3>Calculate the number of used cars per month</h3></a></li>
</ul>
</body>
</html>

  • Template file code display (showAreaBar.html, showBrandBar.html, showLine.html, showPie.html)

showPie.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Calculate the number of used cars in different regions</title>
    <script src="../static/js/echarts.min.js"></script>
    <script src="../static/js/vintage.js"></script>
</head>
<body>
<div class="cart" style="width: 800px;height: 600px;margin: auto"></div>
<script>
    var MyCharts = echarts.init(document.querySelector('.cart'),'vintage')
    var data = {{ data|tojson }}
    var option = {
        title:{
            text:'Number of used cars in different regions',
            textStyle:{
                fontSize:21,
                fontFamily:'KaiTi'
            },
            left:10,
            top:10
        },
        legend:{
            name:['region'],
            left:10,
            bottom:10,
            orient:'vertical'
        },
        tooltip:{
            trigger:'item',
            triggerOn:'mousemove',
            formatter:function (arg)
            {
                return 'Region:'+arg.name+"<br>"+"number:"+arg.value+"<br>"+"Proportion:"+arg.percent+"%"
            }
        },
        series:[
            {
                type:'pie',
                data:data,
                name:'Usage time',
                label:{
                    show:true
                },
                radius:['50%','80%'],
                {#roseType:'radius'#}
                itemStyle:{
                    borderWidth:2,
                    borderRadius:10,
                    borderColor:'#fff'
                },
                selectedMode:'multiple',
                selectedOffset:20
            }
        ]
    }
    MyCharts.setOption(option)
</script>
</body>
</html>

Conclusion: the pie chart shows that Suzhou has the most second-hand car listings in Jiangsu Province, followed by Nanjing. This suggests that the more economically developed a city is, the larger its second-hand car market.

showBrandBar.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Calculate the top ten in the number of different brands</title>
    <script src="../static/js/echarts.min.js"></script>
    <script src="../static/js/vintage.js"></script>
</head>
<body>
<div class="cart" style="height: 600px;width: 800px;margin: auto"></div>
<script>
    var MyCharts = echarts.init(document.querySelector('.cart'),'vintage')
    var brand = {{ brand|tojson }}
    var count = {{ count|tojson }}

    var option = {
        title:{
            text:'Top ten in the number of different brands',
            textStyle:{
                fontSize:21,
                fontFamily:'KaiTi'
            },
            left:10,
            top:10
        },
        xAxis:{
            type:'category',
            data:brand,
            axisLabel:{
                interval:0,
                rotate:30,
                margin:20
            }
        },
        legend:{
            name:['Automobile brand']
        },
        yAxis:{
            type:'value',
            scale:true
        },
        tooltip:{
            trigger:'item',
            triggerOn: 'mousemove',
            formatter:function(arg)
            {
                return 'Brand:'+arg.name+'<br>'+'number:'+arg.value
            }
        },
        series:[
            {
                type:'bar',
                data:count,
                name:'Automobile brand',
                label:{
                    show:true,
                    position:'top',
                    rotate: true
                },
                showBackground:true,
                backgroundStyle: {
                    color:'rgba(180,180,180,0.2)'
                }
            }
        ]
    }
    MyCharts.setOption(option)

</script>
</body>
</html>

Conclusion: the bar chart shows that used cars in Jiangsu Province are mainly BMW, Mercedes-Benz and Audi. BMW has the most listings, with the BMW 5 Series and BMW 3 Series in first and second place.

showLine.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Calculate the number of used car releases per month</title>
    <script src="../static/js/echarts.min.js"></script>
    <script src="../static/js/vintage.js"></script>
</head>
<body>
<div class="cart" style="width: 800px;height: 600px;margin: auto"></div>
<script>
    var MyCharts = echarts.init(document.querySelector('.cart'),'vintage')
    var month = {{ month|tojson }}
    var count = {{ count|tojson }}
    var option = {
        title:{
            text:'Number of used car releases per month',
            textStyle:{
                fontSize:21,
                fontFamily:'KaiTi'
            },
            left:10,
            top:10
        },
        xAxis:{
            type:'category',
            data:month,
            axisLabel:{
                interval:0,
                rotate:30,
                margin:20
            }
        },
        legend:{
            name:['quantity']
        },
        tooltip:{
            trigger:'axis',
            triggerOn:'mousemove',
            formatter:function(arg){
                return 'month:'+arg[0].name+'month'+"<br>"+'number:'+arg[0].value
            }
        },
        yAxis:{
            type:'value',
            scale:true
        },
        series:[
            {
                type:'line',
                name:'quantity',
                data:count,
                label:{
                    show:true
                },
                showBackground:true,
                backgroundStyle:{
                    color:'rgba(180,180,180,0.2)'
                },
                markPoint:{
                    data:[
                        {
                            name:'Maximum',
                            type:'max',
                            symbolSize:[40,40],
                            symbolOffset:[0,-20],
                            label:{
                                show: true,
                                formatter:function (arg)
                                {
                                    return arg.name
                                }
                            }
                        },
                        {
                            name:'minimum value',
                            type:'min',
                            symbolSize:[40,40],
                            symbolOffset:[0,-20],
                            label:{
                                show: true,
                                formatter:function (arg)
                                {
                                    return arg.name
                                }
                            }
                        }
                    ]
                },
                markLine:{
                    data:[
                        {
                            type:"average",
                            name:'average value',
                            label:{
                                show:true,
                                formatter:function(arg)
                                {
                                    return arg.name+':\n'+arg.value
                                }
                            }
                        }
                    ]
                }
            }
        ]
    }
    MyCharts.setOption(option)
</script>
</body>
</html>

Conclusion: the line chart shows that January has the most used-car releases and February the fewest, and that most months fall below the average release level.
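
A single large peak can pull the average above most of the other months, which is one way "most months are below average" can happen. The sketch below illustrates that arithmetic; the monthly counts are made up for the example and are not the crawled results, and `months_below_average` is a hypothetical helper.

```python
from statistics import mean


def months_below_average(counts):
    """Return the average and the 1-based months whose count is below it."""
    average = mean(counts)
    below = [month for month, count in enumerate(counts, start=1) if count < average]
    return average, below


# Made-up monthly counts with one large spike in January, for illustration only.
counts = [90, 5, 20, 22, 25, 21, 24, 23, 26, 22, 20, 26]
average, below = months_below_average(counts)
```

With these numbers the January spike lifts the mean to 27, so all eleven remaining months fall below it.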

showAreaBar.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Calculate the service time of used cars</title>
    <script src="../static/js/echarts.min.js"></script>
    <script src="../static/js/vintage.js"></script>
</head>
<body>
<div class="cart" style="width: 800px;height: 600px;margin: auto"></div>

<script>
    var MyCharts = echarts.init(document.querySelector('.cart'),'vintage')
    var area = {{ area|tojson }}
    var count = {{ count|tojson }}

    var option = {
        title:{
            text:'Used car usage time',
            textStyle:{
                fontSize:21,
                fontFamily:'KaiTi'
            }
        },
        xAxis:{
            type:'category',
            data:area,
            axisLabel:{
                interval:0,
                rotate:30,
                margin:10
            }
        },
        legend:{
            name:['Automobile brand']
        },
        yAxis:{
            type:'value',
            scale:true
        },
        tooltip:{
            trigger:'item',
            triggerOn:'mousemove',
            formatter:function(arg)
            {
                return 'Years:'+arg.name+"<br>"+'number:'+arg.value
            }
        },
        series:[
            {
                type:'bar',
                data:count,
                name:'Automobile brand',
                label:{
                    show:true,
                    position:'top',
                    rotate: 30,
                    distance:15
                },
                barWidth:'40%',
                showBackground:true,
                backgroundStyle: {
                    color:'rgba(180,180,180,0.2)'
                }
            }
        ]
    }
    MyCharts.setOption(option)

</script>
</body>
</html>

Conclusion: the bar chart shows that most second-hand cars in Jiangsu Province have been used for less than one year, while fewer have been used for more than three years. If you find that you don't like the car you bought, sell it early.

The source code of the project is linked below. I hope it helps. If you have any questions, please leave a comment.
Flask project code link
