How to obtain point in time data from tushare

Posted by tblade on Tue, 04 Jan 2022 17:02:20 +0100

tushare ID: 456497

In the research of asset pricing, the data we use should follow the principle of point in time, that is, to obtain the latest data that can be obtained corresponding to the call date. It sounds simple, but it's not. In general, There are five methods to construct the factor (cross section, not timing) from the original report: 1. Obtain the latest (actual reporting period) data of the latest reporting period that can be obtained on the corresponding call date. 2. Obtain the single quarter degree data of the latest reporting period that can be obtained on the corresponding call date (for example, most of the data in the income statement are accumulated quarterly, and the difference is required, but the difference will also change with the change of the call date). 3. Obtain the year-on-year growth rate of the corresponding latest reporting period that can be obtained on the corresponding call date. 4. Obtain the ttm data of the latest reporting period that can be obtained in the corresponding reporting period (the sum or average of the corresponding data of the past year). 5. Obtain the ttm data of the latest reporting period available in the corresponding reporting period (Synthesis of 2 and 4). Obviously, the first case is the simplest, and the latter is more complex because it involves cross quarter situations. The reason for the complexity is that when calling the previous data, the called data will change with the change of the call date. For example, a company published its 2016 annual report in April 2017, with a total assets of 100 million, and published its adjusted annual report in August 2017, or in April 2018 The benchmark statement will change the total assets of the 2016 annual report, assuming that it is changed to 150 million. If the total assets are transferred in April 2017, 100 million should be transferred. If the year-on-year growth rate of total assets is calculated after April 2018, the annual report data of 2016 will be required, then 150 million should be used.

First, get all the reports of tushare:

import tushare as ts
import pandas as pd
import numpy as np
from time import time
path='D://data//data_test'
ts.set_token(own token)
pro = ts.pro_api()
code = pro.stock_basic()

def get_balancesheet(codelist,starttime,endtime,report_type=1):
    try:
        startdate=str(starttime)
        enddate=str(endtime)
        df = pro.balancesheet(ts_code=codelist,start_date=startdate,end_date=enddate,report_type=report_type)
        return df.set_index('ts_code')
    except:
        df=get_balancesheet(codelist,starttime,endtime,report_type)
        return df.set_index('ts_code')

def get_income(codelist,starttime,endtime,report_type=1):
    try:
        startdate=str(starttime)
        enddate=str(endtime)
        df = pro.income(ts_code=codelist,start_date=startdate,end_date=enddate,report_type=report_type)
        return df.set_index('ts_code')
    except:
        df=get_income(codelist,starttime,endtime,report_type)
        return df.set_index('ts_code')

def get_cashflow(codelist,starttime,endtime,report_type=1):
    try:
        startdate=str(starttime)
        enddate=str(endtime)
        df = pro.cashflow(ts_code=codelist,start_date=startdate,end_date=enddate,report_type=report_type)
        return df.set_index('ts_code')
    except:
        df=get_cashflow(codelist,starttime,endtime,report_type)
        return df.set_index('ts_code')

def get_bar(codelist,starttime,endtime,typ=None):
    try:
        startdate=str(starttime)
        enddate=str(endtime)
        df = ts.pro_bar(ts_code=codelist,start_date=startdate,end_date=enddate,adj=typ)
        return df.set_index('ts_code')
    except:
        df=get_bar(codelist,starttime,endtime)
        return df.set_index('ts_code')

def get_name(codelist):
    try:
        data = pro.namechange(ts_code=codelist)
        return data
    except:
        data=get_name(codelist)
        return data


for i in range(0,code.shape[0]):
    try:
        bfq1=get_bar(code['ts_code'].iloc[i],20100101,20201231)
        bfq2=get_bar(code['ts_code'].iloc[i],20000101,20091231)
        bfq=bfq1.append(bfq2).sort_values('trade_date',ascending=False)
    except:
        1
    try:
        hfq1=get_bar(code['ts_code'].iloc[i],20100101,20201231,'hfq')
        hfq2=get_bar(code['ts_code'].iloc[i],20000101,20091231,'hfq')
        hfq=hfq1.append(hfq2).sort_values('trade_date',ascending=False)
    except:
        1    
    try:
        balancesheet1=get_balancesheet(code['ts_code'].iloc[i],20000101,20201231,1)
    except:
        1
    try:
        balancesheet4=get_balancesheet(code['ts_code'].iloc[i],20000101,20201231,4)
    except:
        1
    try:
        balancesheet5=get_balancesheet(code['ts_code'].iloc[i],20000101,20201231,5)
    except:
        1
    try:
        income1=get_income(code['ts_code'].iloc[i],20000101,20201231,1)
    except:
        1
    try:
        income4=get_income(code['ts_code'].iloc[i],20000101,20201231,4)
    except:
        1
    try:
        income5=get_income(code['ts_code'].iloc[i],20000101,20201231,5)
    except:
        1
    try:
        cashflow1=get_cashflow(code['ts_code'].iloc[i],20000101,20201231,1)
    except:
        1
    try:
        cashflow4=get_cashflow(code['ts_code'].iloc[i],20000101,20201231,4)
    except:
        1
    try:
        cashflow5=get_cashflow(code['ts_code'].iloc[i],20000101,20201231,5)
    except:
        1
    try:
        name=get_name(code['ts_code'].iloc[i])
    except:
        1

It is worth mentioning that there are many types of reports. We need to call the initial report after adjustment, the benchmark report after adjustment and the initial report before adjustment, and the benchmark report before adjustment. The relevant data is in the report of tushare function_ Returns when the type is 1, 4, and 5 respectively. Some companies report_ When type = 4 or 5, the related report may be null. What we need to do is try to download the three reports and merge them, and then according to f_ann_date，end_date and update_flag three data to select the data you need. Where, f_ann_date is the actual announcement date, end_date is the reporting period, update_flag is the adjustment flag. If you want to observe whether a sample is an original report or an adjusted or revised report, you need to compare f_ann_date and end_date. If there are two samples_ Date is the same, while f_ann_date is different, there is no doubt f_ ann_ Samples with large date are adjusted or corrected. Of course, there are many cases. For example, there are two samples and one end_date large, f_ann_date is small, and the other end_date small, f_ ann_ If the date is large, the second sample must be the corrected data after the report corresponding to the first sample has been published. In addition, there is an extreme case: the corrected data and the initial data appear on the same day, and only update is kept at this time_ Flag = 1. The following is a function to extract point in time data:

def get_pit_data(codelist,table='balancesheet',req_type=1,varlist=None):
    data1=pd.read_excel(path+'//0//'+table+'//1//'+codelist+'.xlsx')
    try:
        data4=pd.read_excel(path+'//0//'+table+'//4//'+codelist+'.xlsx')
    except:
        1
    try:
        data5=pd.read_excel(path+'//0//'+table+'//5//'+codelist+'.xlsx')
    except:
        1
    data=1*data1
    data_tool=1*data1
    try:
        data=data.append(data4)
        data_tool=data_tool.append(data4)
    except:
        1
    try:
        data=data.append(data5)
        data_tool=data_tool.append(data5)
    except:
        1
    data=data.sort_values(['end_date','f_ann_date','update_flag'])
    if varlist!=None:
        try:
            var1=['f_ann_date','end_date']
            varlist=var1+varlist
            varlist.append('update_flag')
            data=data[varlist]
        except:
            var1=['f_ann_date','end_date']
            var1.append(varlist)
            var1.append('update_flag')
            data=data[var1]

Above is the first paragraph of the function. Where codelist such as' 00000 1 SZ ', table, optional' balancesheet','income 'or' cashflow ', req_type can be 1-5, corresponding to the first five cases respectively. varlist can select the string composed of variables in the report or not. Next, we should first deal with the situation that the corrected data and the initial data are published on the same day, such as stock 002157 SZ's income statement:

f_ann_date  end_date  update_flag
6    20191024  20190930            0
0    20201028  20190930            0
5    20200421  20191231            0
4    20200421  20191231            1
2    20200421  20200331            0
3    20200421  20200331            1
1    20200828  20200630            1
0    20201028  20200930            1

Among them, multiple data were published continuously on April 21, 2020, and only update should be kept at this time_ Sample with flag = 1.

    data.index=range(0,data.shape[0])
    data=data.loc[data[['f_ann_date','end_date']].drop_duplicates(keep='last').index]
    data_tool=data_tool.sort_values(['end_date','f_ann_date','update_flag'])

Note that data is the data in which the above situation is deleted, and data_tool has always been full data. Next, deal with the first case. The key to the first case is to delete samples similar to the following:

    f_ann_date  end_date  update_flag    tool
4     20070816  20060630            0   210.0
6     20071026  20060930            0  -300.0
7     20070726  20061231            0  9692.0
8     20080418  20061231            0     8.0
10    20080426  20070331            0 -9610.0
     f_ann_date  end_date  update_flag   tool
113    20201028  20190930            0 -607.0
115    20200421  20191231            1    0.0
117    20200421  20200331            1  407.0
118    20200828  20200630            1  200.0

20171026 of the reports and 20200421 of the 2019 annual reports need to be deleted (to retain the latest actual reporting period data for the latest reporting period of the current call date).

    if req_type==1:
        for i in range(0,100):
            data['tool']=data['f_ann_date'].diff().shift(-1).fillna(1)
            data=data[data['tool']>0]
            data['tool']=data['f_ann_date'].diff().shift(-1).fillna(1)
            if data['tool'].min()>0:
                break
        del data['tool']
        return data

Next, we deal with the second case. The key to the second case is for each f_ann_date, find the latest reporting period. If the reporting period is not in the first quarter, find the previous reporting period and the actual reporting period is in the current F_ ann_ All data before date. If end_ If date is repeated, keep f_ann_date is the largest data. If the previous reporting period is missing, the data of the current quarter should also be missing. If the current reporting period is the first quarter, the data of the current quarter is the original data.

    if req_type==2:
        result=data*1
        for i in range(0,100):
            result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1)
            result=result[result['tool']>0]
            result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1)
            if result['tool'].min()>0:
                break
        del result['tool']
        result['year']=(result['end_date']/10000).astype(int)
        result['season']=((result['end_date']-10000*result['year'])/400+1).astype(int)
        data_tool['year']=(data_tool['end_date']/10000).astype(int)
        data_tool['season']=((data_tool['end_date']-10000*data_tool['year'])/400+1).astype(int)
        droping_code=['ts_code', 'ann_date', 'f_ann_date', 'end_date', 'report_type','comp_type', 'end_type','update_flag']
        for i in range(0,result.shape[0]):
            #Select an actual announcement period in all data and filter the previous announcement period
            tool3=data_tool[data_tool['f_ann_date']<=result['f_ann_date'].iloc[i]]
            #tool3 is the data of the last quarter of the same year corresponding to an actual announcement date and the actual announcement period is less than or equal to the current actual announcement period (if it is a quarter, it is an empty matrix)
            tool3=tool3[tool3['year']==result['year'].iloc[i]]
            tool3=tool3[tool3['season']==result['season'].iloc[i]-1]
            tool3=tool3[tool3['f_ann_date']<=result['f_ann_date'].iloc[i]]
            #Select the latest result of tool3 as the data of the same quarter of the previous year known in the current actual announcement period
            try:
                tool3=tool3.iloc[tool3.shape[0]-1]
                for j in data.columns:
                    if j not in droping_code:
                        result[j].iloc[i]=result[j].iloc[i]-tool3[j]
            #If an error is reported, it must be because tool3 in the first sentence above is an empty matrix. There may be no previous one
            #Quarterly data or this quarter is the first quarter
            except:
                for j in data.columns:
                    if j not in droping_code:
                        if result['season'].iloc[i]==1:
                            result[j].iloc[i]=result[j].iloc[i]
                        else:
                            result[j].iloc[i]=np.nan
        del result['year']
        del result['season']
        return result

The later ones are too lazy to write for the time being. Let's look at the code ourselves (cover your face)

    if req_type==3:
        tool1=1*data
        tool1['year']=(tool1['end_date']/10000).astype(int)
        tool1['season']=((tool1['end_date']-10000*tool1['year'])/400+1).astype(int)
        #tool2 is all data
        tool2=1*data_tool
        tool2['year']=(tool2['end_date']/10000).astype(int)
        tool2['season']=((tool2['end_date']-10000*tool2['year'])/400+1).astype(int)
        droping_code=['ts_code', 'ann_date', 'f_ann_date', 'end_date', 'report_type','comp_type', 'end_type','update_flag']
        for i in range(0,100):
            tool1['tool']=tool1['f_ann_date'].diff().shift(-1).fillna(1)
            tool1=tool1[tool1['tool']>0]
            tool1['tool']=tool1['f_ann_date'].diff().shift(-1).fillna(1)
            if tool1['tool'].min()>0:
                break
        del tool1['tool']
        for i in range(0,tool1.shape[0]):
            #tool3 is the filtered data of an actual announcement period selected from all data
            tool3=tool2[tool2['f_ann_date']<=tool1['f_ann_date'].iloc[i]]
            #tool3 is the data of the same quarter of the previous year corresponding to an actual announcement date
            tool3=tool3[tool3['year']==tool1['year'].iloc[i]-1]
            tool3=tool3[tool3['season']==tool1['season'].iloc[i]]
            tool3=tool3[tool3['f_ann_date']<=tool1['f_ann_date'].iloc[i]]
            #Select the latest result of tool3 as the data of the same quarter of the previous year known in the current actual announcement period
            try:
                tool3=tool3.iloc[tool3.shape[0]-1]
                for j in data.columns:
                    if j not in droping_code:
                        tool1[j].iloc[i]=100*(tool1[j].iloc[i]-tool3[j])/tool3[j]
            except:
                for j in data.columns:
                    if j not in droping_code:
                        tool1[j].iloc[i]=np.nan
        del tool1['year']
        del tool1['season']
        return tool1
    if req_type==4:
        result=data*1
        for i in range(0,100):
            result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1)
            result=result[result['tool']>0]
            result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1)
            if result['tool'].min()>0:
                break
        del result['tool']
        result['year']=(result['end_date']/10000).astype(int)
        result['season']=((result['end_date']-10000*result['year'])/400+1).astype(int)
        data_tool['year']=(data_tool['end_date']/10000).astype(int)
        data_tool['season']=((data_tool['end_date']-10000*data_tool['year'])/400+1).astype(int)
        droping_code=['ts_code', 'ann_date', 'f_ann_date', 'end_date', 'report_type','comp_type', 'end_type','update_flag']

        for i in range(0,result.shape[0]):
            tool3=data_tool[data_tool['f_ann_date']<=result['f_ann_date'].iloc[i]]
            tool3=tool3[tool3['year']>=result['year'].iloc[i]-1]
            tool3=tool3[tool3['year']<=result['year'].iloc[i]]
            tool3.index=range(0,tool3.shape[0])
            tool3.loc[tool3[tool3['year']==result['year'].iloc[i]-1][tool3[tool3['year']==result['year'].iloc[i]-1]['season']<=result['season'].iloc[i]].index]=np.nan
            tool3=tool3.dropna(axis=0,how='all')
            tool3.index=range(0,tool3.shape[0])
            tool3.loc[tool3[tool3['year']==result['year'].iloc[i]][tool3[tool3['year']==result['year'].iloc[i]]['season']>result['season'].iloc[i]].index]=np.nan
            tool3=tool3.dropna(axis=0,how='all')
            tool3.index=range(0,tool3.shape[0])
            tool3=tool3.loc[tool3['end_date'].drop_duplicates(keep='last').index]
            if tool3.shape[0]==4:
                tool3=tool3.sum(skipna=False)
                for j in data.columns:
                    if j not in droping_code:
                        result[j].iloc[i]=tool3[j]
            else:
                for j in data.columns:
                    if j not in droping_code:
                        result[j].iloc[i]=np.nan
        del result['year']
        del result['season']
        return result
    if req_type==5:
        result=data*1
        for i in range(0,100):
            result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1)
            result=result[result['tool']>0]
            result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1)
            if result['tool'].min()>0:
                break
        del result['tool']
        result['year']=(result['end_date']/10000).astype(int)
        result['season']=((result['end_date']-10000*result['year'])/400+1).astype(int)
        data_tool['year']=(data_tool['end_date']/10000).astype(int)
        data_tool['season']=((data_tool['end_date']-10000*data_tool['year'])/400+1).astype(int)
        droping_code=['ts_code', 'ann_date', 'f_ann_date', 'end_date', 'report_type','comp_type', 'end_type','update_flag']
        for i in range(0,result.shape[0]):
            tool3=data_tool[data_tool['f_ann_date']<=result['f_ann_date'].iloc[i]]
            tool3=tool3[tool3['year']>=result['year'].iloc[i]-1]
            tool3=tool3[tool3['year']<=result['year'].iloc[i]]
            tool3.index=range(0,tool3.shape[0])
            tool3.loc[tool3[tool3['year']==result['year'].iloc[i]-1][tool3[tool3['year']==result['year'].iloc[i]-1]['season']<result['season'].iloc[i]].index]=np.nan
            tool3=tool3.dropna(axis=0,how='all')
            tool3.index=range(0,tool3.shape[0])
            tool3.loc[tool3[tool3['year']==result['year'].iloc[i]][tool3[tool3['year']==result['year'].iloc[i]]['season']>result['season'].iloc[i]].index]=np.nan
            tool3=tool3.dropna(axis=0,how='all')
            tool3.index=range(0,tool3.shape[0])
            tool3=tool3.loc[tool3['end_date'].drop_duplicates(keep='last').index]
            try:
                tool4=tool3[tool3['year']==tool3['year'].iloc[tool3.shape[0]-1]-1]
                tool5=tool4[tool4['season']==tool3['season'].iloc[tool3.shape[0]-1]]
                tool6=tool4[tool4['season']==4]
                for j in data.columns:
                    if j not in droping_code:
                        result[j].iloc[i]=np.nan
                        result[j].iloc[i]=tool3[j].iloc[tool3.shape[0]-1]+tool6[j].iloc[0]-tool5[j].iloc[0]
            except:
                for j in data.columns:
                    if j not in droping_code:
                        result[j].iloc[i]=np.nan
        del result['year']
        del result['season']
        return result

Good luck

Topics: Python

Programmer Think

How to obtain point in time data from tushare

Hot Topics