Wow! Pandas can crawl, too!

Posted by gkostenarov on Thu, 16 Dec 2021 07:34:34 +0100

Hello, I'm Chen Cheng.

Speaking of Pandas's read_xxx family of functions, the first ones that come to mind are probably the commonly used pd.read_csv() and pd.read_excel(); most people have likely never used pd.read_html(). Although it keeps a low profile, it is very powerful, and it is a real gem for grabbing Table data.

Yes, this gem can be used as a web crawler!

01 definition

pd.read_html() is a powerful function: without mastering tools such as regular expressions or XPath, you can grab Table data from a web page in just a few lines of code.

02 principle

I Structure of a Table web page

To understand the structure of a Table web page, let's take a simple example.

Yes, simple!

Another example:

Sina Finance

The pattern: such data is rendered with a Table structure, and the page source looks like this:

<table class="..." id="...">
     <thead>
     <tr>
     <th>...</th>
     </tr>
     </thead>
     <tbody>
        <tr>
            <td>...</td>
        </tr>
        <tr>...</tr>
        <tr>...</tr>
       ...
        <tr>...</tr>
        <tr>...</tr>        
    </tbody>
</table>
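A page in this shape can be parsed directly. Here is a minimal sketch using a made-up HTML fragment (the table contents are invented for illustration; real pages are fetched by URL as shown later):

```python
from io import StringIO

import pandas as pd

# A hypothetical fragment in the <table>/<thead>/<tbody> shape shown above
html = """
<table>
  <thead>
    <tr><th>Name</th><th>Score</th></tr>
  </thead>
  <tbody>
    <tr><td>A</td><td>90</td></tr>
    <tr><td>B</td><td>85</td></tr>
  </tbody>
</table>
"""

# pd.read_html returns one DataFrame per <table> found
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df.shape)  # (2, 2)
```

The `<thead>` row becomes the column names, and each `<tbody>` row becomes one row of the DataFrame.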

II How pandas fetches table data

Basic process

In short, pd.read_html() captures all the Table data on a web page and returns it as a list of DataFrames.

III pd.read_html syntax and parameters

Basic syntax:

pandas.read_html(io, match='.+', flavor=None, header=None,
                 index_col=None, skiprows=None, attrs=None,
                 parse_dates=False, thousands=',', encoding=None,
                 decimal='.', converters=None, na_values=None,
                 keep_default_na=True, displayed_only=True)

Main parameters:

io : receives a URL, file path, or HTML string;

parse_dates: whether to parse dates;

flavor: the parser to use (e.g. 'lxml' or 'bs4');

header: the row to use as the column header;

skiprows: rows to skip;

attrs: HTML attributes used to identify the table, e.g. attrs={'id': 'table'}
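When a page contains several tables, attrs can pin down the one you want. A hedged sketch with two made-up tables (the ids and contents here are invented):

```python
from io import StringIO

import pandas as pd

# Two hypothetical tables on one "page"; attrs selects the one with id="stock"
html = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Tea</td><td>3</td></tr>
</table>
<table id="stock">
  <tr><th>Item</th><th>Qty</th></tr>
  <tr><td>Tea</td><td>12</td></tr>
</table>
"""

# header=0 uses the first row as column names
stock = pd.read_html(StringIO(html), attrs={'id': 'stock'}, header=0)[0]
print(list(stock.columns))  # ['Item', 'Qty']
```

Without attrs, read_html would return both tables and you would pick one by list index.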

03 Hands-on practice

I Case 1: capturing the world university rankings (1 page of data)

import pandas as pd

url1 = 'http://www.compassedu.hk/qs'
df1 = pd.read_html(url1)[0]  # [0] takes the first Table on the page
df1.to_csv('Comprehensive ranking of World Universities.csv', index=False)

A few lines of code, just a few seconds. A preview of the data:

II Case 2: grab Sina Finance's fund heavy-holdings data (6 pages of data)

import pandas as pd

df2 = pd.DataFrame()
for i in range(6):
    url2 = 'http://vip.stock.finance.sina.com.cn/q/go.php/vComStockHold/kind/jjzc/index.phtml?p={page}'.format(page=i+1)
    df2 = pd.concat([df2, pd.read_html(url2)[0]])
    print('Page {page} captured'.format(page=i+1))
df2.to_csv('./Sina Financial Data.csv', encoding='utf-8', index=False)

A handful of lines of code, still that simple.

Let's preview the data crawled down:

III Case 3: capture the IPO data disclosed by the CSRC (217 pages of data)

import time

import pandas as pd

start = time.time()  # start the timer
# Empty frame with the target column names
df3 = pd.DataFrame(data=None, columns=['corporate name', 'Disclosure date', 'Listing places and sectors', 'Disclosure type', 'see PDF data'])
for i in range(1, 218):
    url3 = 'http://eid.csrc.gov.cn/ipo/infoDisplay.action?pageNo=%s&temp=&temp1=&blockType=byTime' % str(i)
    df3_1 = pd.read_html(url3, encoding='utf-8')[2]  # encoding='utf-8' is required, otherwise the text is garbled
    df3_2 = df3_1.iloc[1:len(df3_1) - 1, 0:-1]  # drop the first row, the last row, and the last (NaN) column
    df3_2.columns = ['corporate name', 'Disclosure date', 'Listing places and sectors', 'Disclosure type', 'see PDF data']  # set column names on the new frame
    df3 = pd.concat([df3, df3_2])  # merge the pages
    print('Page {page} captured'.format(page=i))
df3.to_csv('./listed company IPO information.csv', encoding='utf-8', index=False)  # save to a csv file
end = time.time()
print('Captured', len(df3), 'companies in', round((end - start) / 60, 2), 'minutes')

Note that the captured Table data needs filtering, which is done here with the iloc method. I also added simple timing to the program to check the crawl speed.
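The iloc trim used in the loop works like this on any DataFrame. A toy frame stands in for one captured page (an in-table header row, data rows, a footer row, and a trailing all-NaN column; the values are invented):

```python
import pandas as pd

# Toy stand-in for one captured page
raw = pd.DataFrame([
    ['name',      'date',       None],
    ['A Co',      '2021-01-01', None],
    ['B Co',      '2021-01-02', None],
    ['next page', '',           None],
])

# Same trim as the capture loop: drop the first row, the last row,
# and the last (all-NaN) column
trimmed = raw.iloc[1:len(raw) - 1, 0:-1]
print(trimmed.shape)  # (2, 2)
```

Only the two real data rows and two real columns survive the slice.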

About 2 minutes to crawl 4,334 records across 217 pages, quite nice. Let's preview the crawled data:

IPO data of listed companies:

Note that not all tables can be crawled with pd.read_html. Some pages look like tables on the surface, but in the page source the data is not in table format at all, for example it is laid out as a list. read_html does not apply there; you have to use other methods, such as selenium.
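A quick way to detect that situation: pd.read_html raises ValueError when it finds no tables on the page, so you can guard with try/except (the HTML fragment below is made up for illustration):

```python
from io import StringIO

import pandas as pd

# A made-up page with no real <table>: list markup instead
html = "<div><ul><li>row-like, but not a table</li></ul></div>"

try:
    tables = pd.read_html(StringIO(html))
except ValueError:  # read_html raises ValueError when no tables are found
    tables = []  # fall back to another tool, e.g. selenium
print(len(tables))  # 0
```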

Topics: Python