tushare ID: 456497
In the research of asset pricing, the data we use should follow the principle of point in time, that is, to obtain the latest data that can be obtained corresponding to the call date. It sounds simple, but it's not. In general, There are five methods to construct the factor (cross section, not timing) from the original report: 1. Obtain the latest (actual reporting period) data of the latest reporting period that can be obtained on the corresponding call date. 2. Obtain the single quarter degree data of the latest reporting period that can be obtained on the corresponding call date (for example, most of the data in the income statement are accumulated quarterly, and the difference is required, but the difference will also change with the change of the call date). 3. Obtain the year-on-year growth rate of the corresponding latest reporting period that can be obtained on the corresponding call date. 4. Obtain the ttm data of the latest reporting period that can be obtained in the corresponding reporting period (the sum or average of the corresponding data of the past year). 5. Obtain the ttm data of the latest reporting period available in the corresponding reporting period (Synthesis of 2 and 4). Obviously, the first case is the simplest, and the latter is more complex because it involves cross quarter situations. The reason for the complexity is that when calling the previous data, the called data will change with the change of the call date. For example, a company published its 2016 annual report in April 2017, with a total assets of 100 million, and published its adjusted annual report in August 2017, or in April 2018 The benchmark statement will change the total assets of the 2016 annual report, assuming that it is changed to 150 million. If the total assets are transferred in April 2017, 100 million should be transferred. If the year-on-year growth rate of total assets is calculated after April 2018, the annual report data of 2016 will be required, then 150 million should be used.
First, get all the reports of tushare:
import tushare as ts import pandas as pd import numpy as np from time import time path='D://data//data_test' ts.set_token(own token) pro = ts.pro_api() code = pro.stock_basic() def get_balancesheet(codelist,starttime,endtime,report_type=1): try: startdate=str(starttime) enddate=str(endtime) df = pro.balancesheet(ts_code=codelist,start_date=startdate,end_date=enddate,report_type=report_type) return df.set_index('ts_code') except: df=get_balancesheet(codelist,starttime,endtime,report_type) return df.set_index('ts_code') def get_income(codelist,starttime,endtime,report_type=1): try: startdate=str(starttime) enddate=str(endtime) df = pro.income(ts_code=codelist,start_date=startdate,end_date=enddate,report_type=report_type) return df.set_index('ts_code') except: df=get_income(codelist,starttime,endtime,report_type) return df.set_index('ts_code') def get_cashflow(codelist,starttime,endtime,report_type=1): try: startdate=str(starttime) enddate=str(endtime) df = pro.cashflow(ts_code=codelist,start_date=startdate,end_date=enddate,report_type=report_type) return df.set_index('ts_code') except: df=get_cashflow(codelist,starttime,endtime,report_type) return df.set_index('ts_code') def get_bar(codelist,starttime,endtime,typ=None): try: startdate=str(starttime) enddate=str(endtime) df = ts.pro_bar(ts_code=codelist,start_date=startdate,end_date=enddate,adj=typ) return df.set_index('ts_code') except: df=get_bar(codelist,starttime,endtime) return df.set_index('ts_code') def get_name(codelist): try: data = pro.namechange(ts_code=codelist) return data except: data=get_name(codelist) return data for i in range(0,code.shape[0]): try: bfq1=get_bar(code['ts_code'].iloc[i],20100101,20201231) bfq2=get_bar(code['ts_code'].iloc[i],20000101,20091231) bfq=bfq1.append(bfq2).sort_values('trade_date',ascending=False) except: 1 try: hfq1=get_bar(code['ts_code'].iloc[i],20100101,20201231,'hfq') hfq2=get_bar(code['ts_code'].iloc[i],20000101,20091231,'hfq') hfq=hfq1.append(hfq2).sort_values('trade_date',ascending=False) except: 1 try: balancesheet1=get_balancesheet(code['ts_code'].iloc[i],20000101,20201231,1) except: 1 try: balancesheet4=get_balancesheet(code['ts_code'].iloc[i],20000101,20201231,4) except: 1 try: balancesheet5=get_balancesheet(code['ts_code'].iloc[i],20000101,20201231,5) except: 1 try: income1=get_income(code['ts_code'].iloc[i],20000101,20201231,1) except: 1 try: income4=get_income(code['ts_code'].iloc[i],20000101,20201231,4) except: 1 try: income5=get_income(code['ts_code'].iloc[i],20000101,20201231,5) except: 1 try: cashflow1=get_cashflow(code['ts_code'].iloc[i],20000101,20201231,1) except: 1 try: cashflow4=get_cashflow(code['ts_code'].iloc[i],20000101,20201231,4) except: 1 try: cashflow5=get_cashflow(code['ts_code'].iloc[i],20000101,20201231,5) except: 1 try: name=get_name(code['ts_code'].iloc[i]) except: 1
It is worth mentioning that there are many types of reports. We need to call the initial report after adjustment, the benchmark report after adjustment and the initial report before adjustment, and the benchmark report before adjustment. The relevant data is in the report of tushare function_ Returns when the type is 1, 4, and 5 respectively. Some companies report_ When type = 4 or 5, the related report may be null. What we need to do is try to download the three reports and merge them, and then according to f_ann_date,end_date and update_flag three data to select the data you need. Where, f_ann_date is the actual announcement date, end_date is the reporting period, update_flag is the adjustment flag. If you want to observe whether a sample is an original report or an adjusted or revised report, you need to compare f_ann_date and end_date. If there are two samples_ Date is the same, while f_ann_date is different, there is no doubt f_ ann_ Samples with large date are adjusted or corrected. Of course, there are many cases. For example, there are two samples and one end_date large, f_ann_date is small, and the other end_date small, f_ ann_ If the date is large, the second sample must be the corrected data after the report corresponding to the first sample has been published. In addition, there is an extreme case: the corrected data and the initial data appear on the same day, and only update is kept at this time_ Flag = 1. The following is a function to extract point in time data:
def get_pit_data(codelist,table='balancesheet',req_type=1,varlist=None): data1=pd.read_excel(path+'//0//'+table+'//1//'+codelist+'.xlsx') try: data4=pd.read_excel(path+'//0//'+table+'//4//'+codelist+'.xlsx') except: 1 try: data5=pd.read_excel(path+'//0//'+table+'//5//'+codelist+'.xlsx') except: 1 data=1*data1 data_tool=1*data1 try: data=data.append(data4) data_tool=data_tool.append(data4) except: 1 try: data=data.append(data5) data_tool=data_tool.append(data5) except: 1 data=data.sort_values(['end_date','f_ann_date','update_flag']) if varlist!=None: try: var1=['f_ann_date','end_date'] varlist=var1+varlist varlist.append('update_flag') data=data[varlist] except: var1=['f_ann_date','end_date'] var1.append(varlist) var1.append('update_flag') data=data[var1]
Above is the first paragraph of the function. Where codelist such as' 00000 1 SZ ', table, optional' balancesheet','income 'or' cashflow ', req_type can be 1-5, corresponding to the first five cases respectively. varlist can select the string composed of variables in the report or not. Next, we should first deal with the situation that the corrected data and the initial data are published on the same day, such as stock 002157 SZ's income statement:
f_ann_date end_date update_flag 6 20191024 20190930 0 0 20201028 20190930 0 5 20200421 20191231 0 4 20200421 20191231 1 2 20200421 20200331 0 3 20200421 20200331 1 1 20200828 20200630 1 0 20201028 20200930 1
Among them, multiple data were published continuously on April 21, 2020, and only update should be kept at this time_ Sample with flag = 1.
data.index=range(0,data.shape[0]) data=data.loc[data[['f_ann_date','end_date']].drop_duplicates(keep='last').index] data_tool=data_tool.sort_values(['end_date','f_ann_date','update_flag'])
Note that data is the data in which the above situation is deleted, and data_tool has always been full data. Next, deal with the first case. The key to the first case is to delete samples similar to the following:
f_ann_date end_date update_flag tool 4 20070816 20060630 0 210.0 6 20071026 20060930 0 -300.0 7 20070726 20061231 0 9692.0 8 20080418 20061231 0 8.0 10 20080426 20070331 0 -9610.0 f_ann_date end_date update_flag tool 113 20201028 20190930 0 -607.0 115 20200421 20191231 1 0.0 117 20200421 20200331 1 407.0 118 20200828 20200630 1 200.0
20171026 of the reports and 20200421 of the 2019 annual reports need to be deleted (to retain the latest actual reporting period data for the latest reporting period of the current call date).
if req_type==1: for i in range(0,100): data['tool']=data['f_ann_date'].diff().shift(-1).fillna(1) data=data[data['tool']>0] data['tool']=data['f_ann_date'].diff().shift(-1).fillna(1) if data['tool'].min()>0: break del data['tool'] return data
Next, we deal with the second case. The key to the second case is for each f_ann_date, find the latest reporting period. If the reporting period is not in the first quarter, find the previous reporting period and the actual reporting period is in the current F_ ann_ All data before date. If end_ If date is repeated, keep f_ann_date is the largest data. If the previous reporting period is missing, the data of the current quarter should also be missing. If the current reporting period is the first quarter, the data of the current quarter is the original data.
if req_type==2: result=data*1 for i in range(0,100): result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1) result=result[result['tool']>0] result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1) if result['tool'].min()>0: break del result['tool'] result['year']=(result['end_date']/10000).astype(int) result['season']=((result['end_date']-10000*result['year'])/400+1).astype(int) data_tool['year']=(data_tool['end_date']/10000).astype(int) data_tool['season']=((data_tool['end_date']-10000*data_tool['year'])/400+1).astype(int) droping_code=['ts_code', 'ann_date', 'f_ann_date', 'end_date', 'report_type','comp_type', 'end_type','update_flag'] for i in range(0,result.shape[0]): #Select an actual announcement period in all data and filter the previous announcement period tool3=data_tool[data_tool['f_ann_date']<=result['f_ann_date'].iloc[i]] #tool3 is the data of the last quarter of the same year corresponding to an actual announcement date and the actual announcement period is less than or equal to the current actual announcement period (if it is a quarter, it is an empty matrix) tool3=tool3[tool3['year']==result['year'].iloc[i]] tool3=tool3[tool3['season']==result['season'].iloc[i]-1] tool3=tool3[tool3['f_ann_date']<=result['f_ann_date'].iloc[i]] #Select the latest result of tool3 as the data of the same quarter of the previous year known in the current actual announcement period try: tool3=tool3.iloc[tool3.shape[0]-1] for j in data.columns: if j not in droping_code: result[j].iloc[i]=result[j].iloc[i]-tool3[j] #If an error is reported, it must be because tool3 in the first sentence above is an empty matrix. There may be no previous one #Quarterly data or this quarter is the first quarter except: for j in data.columns: if j not in droping_code: if result['season'].iloc[i]==1: result[j].iloc[i]=result[j].iloc[i] else: result[j].iloc[i]=np.nan del result['year'] del result['season'] return result
The later ones are too lazy to write for the time being. Let's look at the code ourselves (cover your face)
if req_type==3: tool1=1*data tool1['year']=(tool1['end_date']/10000).astype(int) tool1['season']=((tool1['end_date']-10000*tool1['year'])/400+1).astype(int) #tool2 is all data tool2=1*data_tool tool2['year']=(tool2['end_date']/10000).astype(int) tool2['season']=((tool2['end_date']-10000*tool2['year'])/400+1).astype(int) droping_code=['ts_code', 'ann_date', 'f_ann_date', 'end_date', 'report_type','comp_type', 'end_type','update_flag'] for i in range(0,100): tool1['tool']=tool1['f_ann_date'].diff().shift(-1).fillna(1) tool1=tool1[tool1['tool']>0] tool1['tool']=tool1['f_ann_date'].diff().shift(-1).fillna(1) if tool1['tool'].min()>0: break del tool1['tool'] for i in range(0,tool1.shape[0]): #tool3 is the filtered data of an actual announcement period selected from all data tool3=tool2[tool2['f_ann_date']<=tool1['f_ann_date'].iloc[i]] #tool3 is the data of the same quarter of the previous year corresponding to an actual announcement date tool3=tool3[tool3['year']==tool1['year'].iloc[i]-1] tool3=tool3[tool3['season']==tool1['season'].iloc[i]] tool3=tool3[tool3['f_ann_date']<=tool1['f_ann_date'].iloc[i]] #Select the latest result of tool3 as the data of the same quarter of the previous year known in the current actual announcement period try: tool3=tool3.iloc[tool3.shape[0]-1] for j in data.columns: if j not in droping_code: tool1[j].iloc[i]=100*(tool1[j].iloc[i]-tool3[j])/tool3[j] except: for j in data.columns: if j not in droping_code: tool1[j].iloc[i]=np.nan del tool1['year'] del tool1['season'] return tool1 if req_type==4: result=data*1 for i in range(0,100): result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1) result=result[result['tool']>0] result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1) if result['tool'].min()>0: break del result['tool'] result['year']=(result['end_date']/10000).astype(int) result['season']=((result['end_date']-10000*result['year'])/400+1).astype(int) data_tool['year']=(data_tool['end_date']/10000).astype(int) data_tool['season']=((data_tool['end_date']-10000*data_tool['year'])/400+1).astype(int) droping_code=['ts_code', 'ann_date', 'f_ann_date', 'end_date', 'report_type','comp_type', 'end_type','update_flag'] for i in range(0,result.shape[0]): tool3=data_tool[data_tool['f_ann_date']<=result['f_ann_date'].iloc[i]] tool3=tool3[tool3['year']>=result['year'].iloc[i]-1] tool3=tool3[tool3['year']<=result['year'].iloc[i]] tool3.index=range(0,tool3.shape[0]) tool3.loc[tool3[tool3['year']==result['year'].iloc[i]-1][tool3[tool3['year']==result['year'].iloc[i]-1]['season']<=result['season'].iloc[i]].index]=np.nan tool3=tool3.dropna(axis=0,how='all') tool3.index=range(0,tool3.shape[0]) tool3.loc[tool3[tool3['year']==result['year'].iloc[i]][tool3[tool3['year']==result['year'].iloc[i]]['season']>result['season'].iloc[i]].index]=np.nan tool3=tool3.dropna(axis=0,how='all') tool3.index=range(0,tool3.shape[0]) tool3=tool3.loc[tool3['end_date'].drop_duplicates(keep='last').index] if tool3.shape[0]==4: tool3=tool3.sum(skipna=False) for j in data.columns: if j not in droping_code: result[j].iloc[i]=tool3[j] else: for j in data.columns: if j not in droping_code: result[j].iloc[i]=np.nan del result['year'] del result['season'] return result if req_type==5: result=data*1 for i in range(0,100): result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1) result=result[result['tool']>0] result['tool']=result['f_ann_date'].diff().shift(-1).fillna(1) if result['tool'].min()>0: break del result['tool'] result['year']=(result['end_date']/10000).astype(int) result['season']=((result['end_date']-10000*result['year'])/400+1).astype(int) data_tool['year']=(data_tool['end_date']/10000).astype(int) data_tool['season']=((data_tool['end_date']-10000*data_tool['year'])/400+1).astype(int) droping_code=['ts_code', 'ann_date', 'f_ann_date', 'end_date', 'report_type','comp_type', 'end_type','update_flag'] for i in range(0,result.shape[0]): tool3=data_tool[data_tool['f_ann_date']<=result['f_ann_date'].iloc[i]] tool3=tool3[tool3['year']>=result['year'].iloc[i]-1] tool3=tool3[tool3['year']<=result['year'].iloc[i]] tool3.index=range(0,tool3.shape[0]) tool3.loc[tool3[tool3['year']==result['year'].iloc[i]-1][tool3[tool3['year']==result['year'].iloc[i]-1]['season']<result['season'].iloc[i]].index]=np.nan tool3=tool3.dropna(axis=0,how='all') tool3.index=range(0,tool3.shape[0]) tool3.loc[tool3[tool3['year']==result['year'].iloc[i]][tool3[tool3['year']==result['year'].iloc[i]]['season']>result['season'].iloc[i]].index]=np.nan tool3=tool3.dropna(axis=0,how='all') tool3.index=range(0,tool3.shape[0]) tool3=tool3.loc[tool3['end_date'].drop_duplicates(keep='last').index] try: tool4=tool3[tool3['year']==tool3['year'].iloc[tool3.shape[0]-1]-1] tool5=tool4[tool4['season']==tool3['season'].iloc[tool3.shape[0]-1]] tool6=tool4[tool4['season']==4] for j in data.columns: if j not in droping_code: result[j].iloc[i]=np.nan result[j].iloc[i]=tool3[j].iloc[tool3.shape[0]-1]+tool6[j].iloc[0]-tool5[j].iloc[0] except: for j in data.columns: if j not in droping_code: result[j].iloc[i]=np.nan del result['year'] del result['season'] return result
Good luck