cninfo (Juchao Information) annual report analysis

Posted by md_grnahid on Sat, 15 Jan 2022 07:17:48 +0100

(1) Crawl the cninfo page for Vanke A and download the PDFs
(2) Extract the specified fields from the PDFs
(3) Visualize the data with Python

preface

The goal is to crawl cninfo for the annual-report PDFs of a specified company, download them, extract the specified fields from the PDFs, and then perform a visual analysis.

To keep things simple, only a few basic chart types are used for the visual analysis: scatter charts, line charts, and bar charts.

1. Analyze the cninfo web page

The struggle between crawlers and anti-crawler measures never ends. I ran into two obstacles on this page:
1. The URL does not change when the page is turned

In the two screenshots (omitted here), the URL stays exactly the same after turning the page.
Given this, there are two options: one is to obtain the page data with a POST request; the other is to drive the page with Python's selenium library.
2. The data does not refresh immediately each time the page is reloaded
This post uses the second method (selenium); a rough sketch of the first (POST) approach is given below for reference, but it is not covered in detail.
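A minimal sketch of the first approach is below. The endpoint, form fields, and category code are assumptions for illustration (inspect the browser's network panel for the real request the page sends when turning pages), not the site's documented API:

import requests

# Hypothetical announcement-query endpoint and form fields; inspect the
# network panel to find the real ones used when the page turns.
query_url = 'http://www.cninfo.com.cn/new/hisAnnouncement/query'
form_data = {
    'stock': '000002,gssz0000002',     # stock code and org id (assumed format)
    'pageNum': 1,                      # only the page number changes between pages
    'pageSize': 30,
    'category': 'category_ndbg_szsh',  # assumed category code for annual reports
}
resp = requests.post(query_url, data=form_data)
print(resp.json())                     # announcement titles and pdf paths, if the request succeeds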

(1) Preparation

1. Import the libraries

The code is as follows (example):

import re
import time
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

If you use selenium, you need to install the driver for your browser (e.g. chromedriver for Chrome) on your machine. Libraries that do not ship with Python need to be installed with pip (pip install selenium requests).
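A quick way to check that selenium and the driver are wired up correctly (this sketch assumes a chromedriver matching your Chrome version is on the PATH):

import time
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # no visible browser window
browser = webdriver.Chrome(options=options)
browser.get('http://www.cninfo.com.cn')
time.sleep(1)
print(browser.title)                        # prints the page title if everything works
browser.quit()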

2. Crawling and downloading code

The code is as follows (example):

pdf_links = []  # announcement detail-page URLs collected so far
def page_turn():
    url = 'http://www.cninfo.com.cn/new/disclosure/stock?orgId=gssz0000002&stockCode=000002#latestAnnouncement'
    browser.get(url)
    time.sleep(1)
    browser.find_element(By.XPATH,'//*[@id="main"]/div[3]/div/div[2]/div/div/div[2]/div[1]/div[1]/form/div[2]/div/span/button').click()
    browser.find_element(By.CLASS_NAME, 'cate-checkbox-txt').click()
    browser.find_element(By.XPATH,'//*[@id="main"]/div[3]/div/div[2]/div/div/div[2]/div[1]/div[4]/div/div/button[2]').click()
    time.sleep(1)
    data = browser.page_source
    print("Perform page turning operation")
    return data

def search(data):
    # data: the page source returned by main() or page_turn()
    # Get the announcement titles
    p_title = '<span class="r-title">(.*?)</span>'
    title = re.findall(p_title, data)
    # Get web address
    p_href = '<div class="cell"><a target="_blank" href="(.*?)" data-id='
    href = re.findall(p_href, data)
    for index in range(len(href)):
        # URL cleanup
        href[index] = 'http://www.cninfo.com.cn' + href[index]
        href[index] = re.sub('amp;', '', href[index])
        # pdf file download (note: this GET response is not used; the actual
        # download is triggered through selenium further below)
        res = requests.get(url=href[index])
        path = "Inquiry letter//" + title[index] + ".pdf"
        # print(path)
        print(href[index])
        pdf_links.append(href[index])
        try:
            for i in pdf_links:
                browser.get(i)
                browser.find_element(By.XPATH,'//*[@id="noticeDetail"]/div/div[1]/div[3]/div[1]/button').click()
            print("The first" + str(index + 1) + "File No. crawls successfully!")
        except:
            print("no pdf edition")
def main():
        url = 'http://www.cninfo.com.cn/new/disclosure/stock?orgId=gssz0000002&stockCode=000002#latestAnnouncement'
        browser.get(url)
        time.sleep(1)
        browser.find_element(By.XPATH,'//*[@id="main"]/div[3]/div/div[2]/div/div/div[2]/div[1]/div[1]/form/div[2]/div/span/button').click()
        browser.find_element(By.CLASS_NAME,'cate-checkbox-txt').click()

        time.sleep(1)
        data = browser.page_source
        return data
if __name__ == "__main__":
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    #Set download path
    prefs = {'profile.default_content_settings.popups': 0 ,'download.default_directory':r'c:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf'}
    chrome_options.add_experimental_option('prefs',prefs)
    browser = webdriver.Chrome(options=chrome_options)
    # crawl the first page, then turn the page and crawl again
    data = main()
    search(data)
    data = page_turn()
    search(data)

Here selenium does the downloading by clicking the download button on each announcement page.
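The fixed time.sleep(1) calls above are fragile on a slow connection. A sketch of an alternative (not used in this post) is an explicit wait, which blocks only until the element is actually clickable; the class name below is the same one used in the code above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('http://www.cninfo.com.cn/new/disclosure/stock?orgId=gssz0000002&stockCode=000002#latestAnnouncement')
wait = WebDriverWait(browser, 10)    # give up after 10 seconds
# wait until the category checkbox is clickable, then click it
element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'cate-checkbox-txt')))
element.click()
browser.quit()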

2. Extract the specified fields and write them to CSV

Import the libraries

import pdfplumber
import os
import csv

The code is as follows

import pdfplumber
import os
import csv
def parase_pdf(table_keyword,inside_keyword,outside_keyword):
    # global variables shared with the setup code below
    global pdffile_list
    global parase_out_writer
    global parase_out
    global OUT_DIR
    global file_names
    while file_names:   # stop once every downloaded file has been processed
        #rLock2.acquire()
        if len(file_names):
            print('--------{}---------'.format(len(file_names)))
            file_name=file_names[0]
            file_names.remove(file_name)
            if file_name.endswith('.PDF') or file_name.endswith('.pdf'):
                path =os.path.join(OUT_DIR,file_name)
                print('get pdf address')

                try:
                    pdf = pdfplumber.open(path, password='')
                except:
                    print("*************open pdf error*******************")
                    continue   # skip files that cannot be opened
                print("*************open pdf*******************")

                find_table=0
                find_pre_table=0
                find_keyword=0
                find_keyword_outside=0
                name_find=[]
                value_find=[]
                page_find=[]
                #for page in pdf.pages:
                    #print(page.extract_text())
                begin_index=int(len(pdf.pages)/2)
                for i in range(begin_index,len(pdf.pages)):
                    if find_table:
                        find_pre_table=1
                    else:
                        find_pre_table=0
                    find_table=0
                    page=pdf.pages[i]
                    #print(page.extract_text())
                    data=page.extract_text()
                    if len(table_keyword):
                        for keyword in table_keyword:
                            if keyword in data:
                                find_table=1
                            else:
                                find_table=0
                                break
                    else:
                        find_table=1

                    if find_table or find_pre_table:
                        data_list=data.strip().split()
                        for j in range(len(data_list)):
                            if len(inside_keyword):
                                for keyword in inside_keyword:
                                    if keyword in data_list[j]:
                                        find_keyword=1
                            else:
                                find_keyword=1

                            if find_keyword:
                                find_keyword=0
                                print('run here')
                                if len(outside_keyword):
                                    for keyword in outside_keyword:
                                        if keyword not in data_list[j]:
                                            find_keyword_outside=1
                                        else:
                                            find_keyword_outside=0
                                            break
                                else:
                                    find_keyword_outside=1

                                if find_keyword_outside:
                                    find_keyword_outside=0
                                    try:
                                        temp_value=data_list[j+1]
                                        temp_value=temp_value.replace(',','')
                                        temp_value=float(temp_value)
                                        name_find.append(data_list[j])
                                        value_find.append(temp_value)
                                        page_find.append(i)
                                        try:
                                            parase_out_writer.writerows([[file_name,data_list[j],str(temp_value),data_list[j+1],str(i)]])
                                        except:
                                            pass
                                        parase_out.flush()
                                        print("*****find******{} value is {} and {}".format(data_list[j],data_list[j+1],temp_value))
                                        print("*************find in page {}*******************".format(i))
                                        print("*************find in {}*******************".format(path))
                                        break        # only find one result
                                    except:
                                        continue
                pdf.close()
                # os.remove(path)    # optionally delete the file after pdf.close() so downloads do not pile up
                print('****finished processing this PDF file')
            else:
                path =os.path.join(OUT_DIR,file_name)
                # os.remove(path)
    return name_find,value_find,page_find    # keep the return outside the while loop; returning inside it would end the function after the first file
OUT_DIR = r'Notice pdf'
table_keyword=['Income statement']    # keywords identifying the page/table of interest
inside_keyword=['business income']    # the field whose value we want (must match the report's wording)
outside_keyword=['received']          # words that must NOT appear in a matched token
# other candidate fields: 'operating profit', 'asset liability ratio'
file_names=os.listdir(OUT_DIR)
parase_out_file_path=OUT_DIR+'/parase_out_file2.csv'
parase_out=open(parase_out_file_path, 'w', newline='', encoding='utf-8')
parase_out_writer = csv.writer(parase_out)
parase_pdf(table_keyword,inside_keyword,outside_keyword)
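The same function can be pointed at other fields by swapping the keyword lists and output file. A sketch (the keyword strings and file name are illustrative and must match the wording actually used in the reports):

# Re-run the parser for another field, e.g. operating profit, writing to a second CSV.
file_names = os.listdir(OUT_DIR)    # rebuild the work list, since parase_pdf empties it
parase_out = open(OUT_DIR + '/parase_out_file3.csv', 'w', newline='', encoding='utf-8')
parase_out_writer = csv.writer(parase_out)
parase_pdf(['Income statement'], ['operating profit'], ['received'])

The asset-liability ratio is a percentage rather than a plain number, so it gets its own, slightly different script: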
import pdfplumber
import os
import csv
inside_keyword = 'Asset liability ratio'
def parase_pdf(inside_keyword):

    global pdffile_list
    global parase_out_writer
    global parase_out
    global OUT_DIR
    global file_names
    while file_names:   # stop once every downloaded file has been processed
        # rLock2.acquire()
        if len(file_names):
            print('--------{}---------'.format(len(file_names)))
            file_name = file_names[0]
            file_names.remove(file_name)
            if file_name.endswith('.PDF') or file_name.endswith('.pdf'):
                path = os.path.join(OUT_DIR, file_name)
                print('get pdf address')
                try:
                    pdf = pdfplumber.open(path, password='')
                except:
                    print("*************open pdf error*******************")
                    continue   # skip files that cannot be opened
                print("*************open pdf*******************")
                for page in pdf.pages:
                    data = page.extract_text()
                    if inside_keyword in page.extract_text():
                        # print(page.extract_text())
                        data_list = data.strip().split()
                        for j in range(len(data_list)):
                                    if inside_keyword in data_list[j]:
                                        # print(data_list)
                                        print(f'extracting {inside_keyword}')
                                        if len(data_list[j])<7:
                                            print(data_list[j],data_list[j+1])
                                            # print(len(data_list[j]))
                                            try:
                                                parase_out_writer.writerows([[file_name, data_list[j],data_list[j + 1]]])
                                            except:
                                                pass
                                            parase_out.flush()
# print(dict(data_list))
OUT_DIR = r'Notice pdf'
file_names=os.listdir(OUT_DIR)
parase_out_file_path=OUT_DIR+'/parase_out_file5.csv'
parase_out=open(parase_out_file_path, 'w', newline='', encoding='utf-8')
parase_out_writer = csv.writer(parase_out)
parase_pdf(inside_keyword)

summary

Tip: a few notes on the PDF extraction:
Because the asset-liability ratio is a percentage, unlike operating revenue and operating profit, it is handled by a separate script. If you want to reuse the code, change the paths to your own file paths.
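Since the ratio comes out of the PDF as a percentage string, a small helper like the one below (a sketch, assuming the values land in the CSV as strings such as '80.5%') can convert it to a plain float before plotting:

def percent_to_float(value):
    # turn strings like '80.5%' (or '80.5') into a float
    value = str(value).replace(',', '').rstrip('%')
    return float(value)

print(percent_to_float('80.5%'))   # 80.5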

3. Visual analysis of the CSV data

(1)
The data extracted from the PDFs is still irregular, so the first step is to clean and organize it.
Straight to the code.

Step 1

import pandas as pd
import re
df = pd.read_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file2.csv', header=None, names=['id','earning','number1','number2'])  # add a header

Step 2

df['year'] = df['name'].apply(lambda x: re.findall(r'\d{4}', x)[0])  # add a year column extracted from the announcement name
df.to_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file2.csv',index=False)
df


That looks much tidier. Out of habit, here is a quick visual check of the result:

df = pd.read_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file2.csv')
df.sort_values("year",ascending=True,inplace=True)#Sort by year
df.to_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file2.csv',index=False)
df1 = pd.read_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file3.csv')
df1.sort_values("year",ascending=True,inplace=True)#Sort by year
df1.to_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file3.csv',index=False)
df1

df3 = pd.read_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file4.csv')
df3['year'] = df3['name'].apply(lambda x: re.findall(r'\d{4}',x)[-1])#Extraction year
df3.sort_values("year",ascending=True,inplace=True)#Sort by year
df3.to_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file4.csv',index=False)
df3
df3.drop_duplicates('data',keep='first',inplace=True)
df3.to_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file4.csv',index=False)
df3 #This principle applies to all three tables

Although there is not much data, it is a little messy: it needs to be sorted, deduplicated, and have a couple of columns added.
(2)
The second step is to draw the charts.
Import the libraries we need:

import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator

Since every chart follows the same pattern, not every field is plotted one by one.

Drawing the line chart

df = pd.read_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file2.csv')
plt.figure(figsize=(16,6))
ax=plt.gca()
x_major_locator=MultipleLocator(1)#Set interval
ax.xaxis.set_major_locator(x_major_locator)
plt.plot(df['year'],df['number1'],color='#A0522D',marker='o',label = "operating revenue", linewidth=2,linestyle = "--")
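The snippets in this post appear to have been run in a notebook; run as an ordinary script, you would normally finish the figure along these lines (continuing the snippet above; the same applies to the scatter and bar charts below):

plt.xlabel('year')
plt.ylabel('operating revenue')
plt.legend()    # shows the label passed to plt.plot above
plt.show()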

Drawing the scatter chart

Import the libraries we need:

import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator
df = pd.read_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file2.csv')
plt.figure(figsize=(16,6))
x=df['year']
y=df['number1']
ax=plt.gca()
x_major_locator=MultipleLocator(1)
ax.xaxis.set_major_locator(x_major_locator)
plt.scatter(x,y,alpha=0.5,marker='*',c='r',label="Pentagonal")

The resulting figure is omitted here.

Drawing the bar chart

Import the libraries

import pandas as pd
import matplotlib.pyplot as plt
from numpy import arange
from matplotlib.font_manager import FontProperties
plt.style.use('fivethirtyeight')
df = pd.read_csv(r'C:\Users\13252\PycharmProjects\pythonProject\Tide information\Notice pdf\parase_out_file2.csv')
# df
date = df['year']
data=df['number1']
font_set = FontProperties(size=14)   # title font; point fname at a .ttf file if you need Chinese characters
plt.bar(date, data, width=0.5, alpha=0.5)
plt.title("Main business income and operating income", fontproperties=font_set)
plt.show()


summary

For each chart type I only show one example drawn from the CSV data; the rest follow the same pattern.
As a beginner I may have left parts of the code uncommented or clumsily written; please point out anything that looks off.
If you have questions, feel free to message me privately. Corrections are very welcome.

Topics: Python crawler Data Analysis Data Mining