1, Write in front
It comes from the study / work needs of two readers. Coincidentally, the needs of these two readers are about batch merging sheet s (information needed for retrieval).
The data style used in this paper can be disclosed with the consent of the reader. The data in the file is generated by the Rand function in Excel to ensure that everyone will not encounter obstacles in the process of learning and using the code. See the end of the article for the acquisition methods of data and code.
2, Basic knowledge summary
- pandas creates a DataFrame object
pd.DataFrame()
- pandas datafrmae index
Index by column name: dataframe [column name]
Index by column value: dataframe[dataframe [column name] = = column value]
- pandas reads and stores excel files and csv files
read_excel,to_excel,to_csv
- pandas datafrmae fetches the specified row of data according to the thumbnail
dataframe.loc[list]
- pandas datafrmae modifying column names
dataframe.rename(columns={'column_name_old':'column_name_new'})
- pandas datafrmae inserts data into the specified column
dataframe.insert(loc = column serial number, column = column name, value = column value)
- pandas datafrmae deletes the specified column based on the column name
dataframe.drop([column name], axis=1)
- pandas connects multiple datafrmae s
pd.concat([df_1, df_2])
3, Start thinking
3.1 needs of the first reader
First, let's look at the needs of the first reader: the original data has 18 Excel files, and there are 34 sheets (relevant data of 34 provinces) in each excel file. We need to take out several rows of data specified in each sheet, then merge them all and store them in a new file named 2000_ Carbon emission data of all provinces in 2017.
After communication, I have determined the style of the final output file. The following data are generated by random functions in Excel:
To complete this requirement, if it is manual operation, we need to complete the following steps:
0. Create a new Excel
1. Open the first Excel
2. Copy several rows of data required in each sheet
3. Paste the copied data into the new Excel
4. Repeat 1-3 until all the data in Excel is taken out
5. Save new Excel
If there are only 1-2 documents, it is acceptable to start with hands, but if there are dozens or hundreds, it will be big if you rely on hands.
Now let's take a look at the steps required to replace the above manual operation with code operation:
0. Create a new data storage object (we use Dataframe in pandas)
1. Read target Excel file
2. Traverse and take out several rows of data required in each sheet and store them in the new Dataframe
3. for loop traversal, read all target Excel data and store it in the new Dataframe
4. Save the newly created Dataframe data as an Excel file
With this in mind, let's start a pleasant journey of code:
0. Create a new data storage object (we use Dataframe in pandas)
df_concat = pd.DataFrame() Copy code
1. Read target Excel file
There are 18 files in total, and the file names are regular.
file_path = 'data/2000 year-2017 Annual carbon emission inventory/2000 Emission inventory of 30 provinces in.xlsx' data = pd.read_excel(file_path, sheet_name=None) Copy code
2. Traverse and take out several rows of data required in each sheet and store them in the new Dataframe
For the readability of the code, a function get is written here_ sheet_ Data to get the data needed in a single sheet, and then the for loop traverses all sheets.
''' Take out a single sheet Data required in ''' def get_sheet_data(data, sheet_name, year): # Get the required rows of data df_concat = data[sheet_name].loc[[2,3,48,49]] # Rename Unnamed: 0 column df_concat = df_concat.rename(columns={'Unnamed: 0':'category'}) # Insert two columns of data particular year df_concat.insert(loc=0,column='province',value=sheet_name) df_concat.insert(loc=1,column='particular year',value=i) # Move the Total column to the fourth column df_temp = df_concat['Total'] df_concat = df_concat.drop(['Total'],axis=1) # Delete the column first df_concat.insert(loc=3,column='Total',value=df_temp) # Then insert into the fourth column return df_concat for sheet_name in list(data.keys()): if sheet_name == 'Sum': continue df_temp = get_sheet_data(data, sheet_name, year) df_concat = pd.concat([df_concat, df_temp]) Copy code
3. for loop traversal, read all target Excel data and store it in the new Dataframe
In the previous step, all sheet s in a single excel have been read out. Now use the for loop to read all data in Excel.
''' Take out a single Excel Data required in ''' def get_excel_data(data, year): df_concat = pd.DataFrame() for sheet_name in list(data.keys()): if sheet_name == 'Sum': continue df_temp = get_sheet_data(data, sheet_name, year) df_concat = pd.concat([df_concat, df_temp]) return df_concat # Generate a list and store the time date_year = [str(i) for i in range(2000, 2018)] for i in date_year: file_path = 'data/2000 year-2017 Annual carbon emission inventory/%s Emission inventory of 30 provinces in.xlsx'%i data = pd.read_excel(file_path, sheet_name=None) df_temp = get_excel_data(data, i) df_concat = pd.concat([df_concat, df_temp]) Copy code
4. Save the newly created Dataframe data as an Excel file
Here, the built-in to of pandas is called directly_ Excel function, the first parameter is the file storage directory, and the second parameter is sheet_name, the third parameter is the encoding format, which is specified here as utf-8.
print("Start storing data") df_concat.to_excel("data/2000_2017 Provincial carbon emission data in.xlsx", "2000_2017", index=None, encoding="utf-8") print("Data saved successfully") Copy code
The complete code is as follows:
import pandas as pd import time ''' Take out a single sheet Data required in ''' def get_sheet_data(data, sheet_name, year): # Get the required rows of data df_concat = data[sheet_name].loc[[2,3,48,49]] # Rename Unnamed: 0 column df_concat = df_concat.rename(columns={'Unnamed: 0':'category'}) # Insert two columns of data particular year df_concat.insert(loc=0,column='province',value=sheet_name) df_concat.insert(loc=1,column='particular year',value=i) # Move the Total column to the fourth column df_temp = df_concat['Total'] df_concat = df_concat.drop(['Total'],axis=1) # Delete the column first df_concat.insert(loc=3,column='Total',value=df_temp) # Then insert into the fourth column return df_concat ''' Take out a single Excel Data required in ''' def get_excel_data(data, year): df_concat = pd.DataFrame() for sheet_name in list(data.keys()): if sheet_name == 'Sum': continue df_temp = get_sheet_data(data, sheet_name, year) df_concat = pd.concat([df_concat, df_temp]) return df_concat # 0. Create a new data storage object (we use Dataframe in pandas) df_concat = pd.DataFrame() # Generate a list and store the time date_year = [str(i) for i in range(2000, 2018)] # 1. Traverse and take out several rows of data required in each sheet in each Excel and store them in the new Dataframe for i in date_year: file_path = 'data/2000 year-2017 Annual carbon emission inventory/%s Emission inventory of 30 provinces in.xlsx'%i data = pd.read_excel(file_path, sheet_name=None) df_temp = get_excel_data(data, i) df_concat = pd.concat([df_concat, df_temp]) # 2. Write data print("Start storing data") df_concat.to_excel("data/2000_2017 Provincial carbon emission data in.xlsx", "2000_2017", index=None, encoding="utf-8") print("Data saved successfully") Copy code
3.2 second reader needs
Let's look at the needs of the second reader: the original data has only one file, which has eight sheets. You need to take out several columns in each sheet, and then store them as csv files one by one according to the date.
To complete this requirement, if it is manual operation, we need to complete the following steps:
0. Open Excel file
1. Copy several rows of data required in each sheet
2. Sort by date
3. Save different data into different csv files by date
It seems very simple, but it is actually complex. For example, if you want to manually create and save 365 csv files, the file names are still different. You have a big head to think about!
Now let's take a look at the steps required to replace the above manual operation with code operation:
0. Create a new data storage object (we use Dataframe in pandas)
1. Read target Excel file
2. Traverse and take out several rows of data required in each sheet and store them in the new Dataframe
3. Group according to the date and store the data of different dates in the corresponding file
After knowing this, let's start a happy journey of code: 0. Create a new data storage object (we use the Dataframe in pandas)
df_concat = pd.DataFrame() Copy code
1. Read target Excel file
file_path = 'data/meteo_china_tmin_2018.xlsx' data = pd.read_excel(file_path, sheet_name=None) Copy code
2. Traverse and take out several rows of data required in each sheet and store them in the new Dataframe
for sheet_name in list(data.keys()): if sheet_name == 'meteo_china_tmin_2018': continue df_temp = data[sheet_name][['ymd', 'lat', 'lon', 'tmin']] df_concat = pd.concat([df_concat, df_temp]) Copy code
3. Group according to the date and store the data of different dates in the corresponding file
Here, the corresponding data is retrieved according to the date, and to is called_ The CSV function stores data. The first parameter is the stored directory, the second parameter columns is the stored data column, the third parameter header=None indicates that the header is not required during storage, and the fourth parameter index=False indicates that the index is removed.
''' Group by time and save as csv file File format: hetao-ymd_tmin ''' # Get all dates ymd_set = set(df_concat['ymd']) # Cycle all data for ymd in ymd_set: ymd_data = df_concat[df_concat['ymd']==ymd] ymd_data.to_csv('./data/hetao/hetao-%d_tmin.csv'%ymd, columns=['lat', 'lon', 'tmin'], header=None, index=False) Copy code
Full code:
import pandas as pd ''' Read, take out and merge the required data ''' file_path = './data/meteo_china_tmin_2018.xlsx' data = pd.read_excel(file_path, sheet_name=None) df_concat = pd.DataFrame() for sheet_name in list(data.keys()): if sheet_name == 'meteo_china_tmin_2018': continue df_temp = data[sheet_name][['ymd', 'lat', 'lon', 'tmin']] df_concat = pd.concat([df_concat, df_temp]) ''' Group by time and save as csv file File format: hetao-ymd_tmin ''' # Get all dates ymd_set = set(df_concat['ymd']) # Cycle all data for ymd in ymd_set: ymd_data = df_concat[df_concat['ymd']==ymd] # Specify the stored column and remove the header ymd_data.to_csv('./data/hetao/hetao-%d_tmin.csv'%ymd, columns=['lat', 'lon', 'tmin'], header=None, index=False) Copy code
4, Just say it
If you have any similar needs, you can say your needs. It's best to divide them into 1, 2 and 3 according to the function points, and then attach the example data. Welcome to learn and communicate.