Share 2 scripts for Python processing Excel

Posted by astribuncio on Wed, 02 Feb 2022 17:00:12 +0100

1, Write in front

It comes from the study / work needs of two readers. Coincidentally, the needs of these two readers are about batch merging sheet s (information needed for retrieval).

The data style used in this paper can be disclosed with the consent of the reader. The data in the file is generated by the Rand function in Excel to ensure that everyone will not encounter obstacles in the process of learning and using the code. See the end of the article for the acquisition methods of data and code.

2, Basic knowledge summary

  • pandas creates a DataFrame object

pd.DataFrame()

  • pandas datafrmae index

Index by column name: dataframe [column name]
Index by column value: dataframe[dataframe [column name] = = column value]

  • pandas reads and stores excel files and csv files

read_excel,to_excel,to_csv

  • pandas datafrmae fetches the specified row of data according to the thumbnail

dataframe.loc[list]

  • pandas datafrmae modifying column names

dataframe.rename(columns={'column_name_old':'column_name_new'})

  • pandas datafrmae inserts data into the specified column

dataframe.insert(loc = column serial number, column = column name, value = column value)

  • pandas datafrmae deletes the specified column based on the column name

dataframe.drop([column name], axis=1)

  • pandas connects multiple datafrmae s

pd.concat([df_1, df_2])

3, Start thinking

3.1 needs of the first reader

First, let's look at the needs of the first reader: the original data has 18 Excel files, and there are 34 sheets (relevant data of 34 provinces) in each excel file. We need to take out several rows of data specified in each sheet, then merge them all and store them in a new file named 2000_ Carbon emission data of all provinces in 2017.

After communication, I have determined the style of the final output file. The following data are generated by random functions in Excel:

To complete this requirement, if it is manual operation, we need to complete the following steps:

0. Create a new Excel

1. Open the first Excel
2. Copy several rows of data required in each sheet
3. Paste the copied data into the new Excel
4. Repeat 1-3 until all the data in Excel is taken out
5. Save new Excel

If there are only 1-2 documents, it is acceptable to start with hands, but if there are dozens or hundreds, it will be big if you rely on hands.
Now let's take a look at the steps required to replace the above manual operation with code operation:

0. Create a new data storage object (we use Dataframe in pandas)

1. Read target Excel file
2. Traverse and take out several rows of data required in each sheet and store them in the new Dataframe
3. for loop traversal, read all target Excel data and store it in the new Dataframe
4. Save the newly created Dataframe data as an Excel file

With this in mind, let's start a pleasant journey of code:

0. Create a new data storage object (we use Dataframe in pandas)

df_concat = pd.DataFrame()
Copy code

1. Read target Excel file
There are 18 files in total, and the file names are regular.

file_path = 'data/2000 year-2017 Annual carbon emission inventory/2000 Emission inventory of 30 provinces in.xlsx'
data = pd.read_excel(file_path, sheet_name=None)
Copy code

2. Traverse and take out several rows of data required in each sheet and store them in the new Dataframe
For the readability of the code, a function get is written here_ sheet_ Data to get the data needed in a single sheet, and then the for loop traverses all sheets.

'''
Take out a single sheet Data required in
'''
def get_sheet_data(data, sheet_name, year):
    # Get the required rows of data
    df_concat = data[sheet_name].loc[[2,3,48,49]]
    # Rename Unnamed: 0 column
    df_concat = df_concat.rename(columns={'Unnamed: 0':'category'})
    # Insert two columns of data 	 particular year
    df_concat.insert(loc=0,column='province',value=sheet_name)
    df_concat.insert(loc=1,column='particular year',value=i)
    # Move the Total column to the fourth column
    df_temp = df_concat['Total']
    df_concat = df_concat.drop(['Total'],axis=1)  # Delete the column first
    df_concat.insert(loc=3,column='Total',value=df_temp)  # Then insert into the fourth column
    return df_concat

for sheet_name in list(data.keys()):
        if sheet_name == 'Sum':
            continue
        df_temp = get_sheet_data(data, sheet_name, year)
        df_concat = pd.concat([df_concat, df_temp])
Copy code

3. for loop traversal, read all target Excel data and store it in the new Dataframe
In the previous step, all sheet s in a single excel have been read out. Now use the for loop to read all data in Excel.

'''
Take out a single Excel Data required in
'''
def get_excel_data(data, year):
    df_concat = pd.DataFrame()
    for sheet_name in list(data.keys()):
        if sheet_name == 'Sum':
            continue
        df_temp = get_sheet_data(data, sheet_name, year)
        df_concat = pd.concat([df_concat, df_temp])
    return df_concat

# Generate a list and store the time
date_year = [str(i) for i in range(2000, 2018)]
for i in date_year:
    file_path = 'data/2000 year-2017 Annual carbon emission inventory/%s Emission inventory of 30 provinces in.xlsx'%i
    data = pd.read_excel(file_path, sheet_name=None)
    df_temp = get_excel_data(data, i)
    df_concat = pd.concat([df_concat, df_temp])
Copy code

4. Save the newly created Dataframe data as an Excel file
Here, the built-in to of pandas is called directly_ Excel function, the first parameter is the file storage directory, and the second parameter is sheet_name, the third parameter is the encoding format, which is specified here as utf-8.

print("Start storing data")
df_concat.to_excel("data/2000_2017 Provincial carbon emission data in.xlsx", "2000_2017", index=None, encoding="utf-8")
print("Data saved successfully")
Copy code

The complete code is as follows:

import pandas as pd
import time

'''
Take out a single sheet Data required in
'''
def get_sheet_data(data, sheet_name, year):
    # Get the required rows of data
    df_concat = data[sheet_name].loc[[2,3,48,49]]
    # Rename Unnamed: 0 column
    df_concat = df_concat.rename(columns={'Unnamed: 0':'category'})
    # Insert two columns of data 	 particular year
    df_concat.insert(loc=0,column='province',value=sheet_name)
    df_concat.insert(loc=1,column='particular year',value=i)
    # Move the Total column to the fourth column
    df_temp = df_concat['Total']
    df_concat = df_concat.drop(['Total'],axis=1)  # Delete the column first
    df_concat.insert(loc=3,column='Total',value=df_temp)  # Then insert into the fourth column
    return df_concat

'''
Take out a single Excel Data required in
'''
def get_excel_data(data, year):
    df_concat = pd.DataFrame()
    for sheet_name in list(data.keys()):
        if sheet_name == 'Sum':
            continue
        df_temp = get_sheet_data(data, sheet_name, year)
        df_concat = pd.concat([df_concat, df_temp])
    return df_concat

# 0. Create a new data storage object (we use Dataframe in pandas)
df_concat = pd.DataFrame()

# Generate a list and store the time
date_year = [str(i) for i in range(2000, 2018)]

# 1. Traverse and take out several rows of data required in each sheet in each Excel and store them in the new Dataframe
for i in date_year:
    file_path = 'data/2000 year-2017 Annual carbon emission inventory/%s Emission inventory of 30 provinces in.xlsx'%i
    data = pd.read_excel(file_path, sheet_name=None)
    df_temp = get_excel_data(data, i)
    df_concat = pd.concat([df_concat, df_temp])

# 2. Write data
print("Start storing data")
df_concat.to_excel("data/2000_2017 Provincial carbon emission data in.xlsx", "2000_2017", index=None, encoding="utf-8")
print("Data saved successfully")
Copy code

3.2 second reader needs

Let's look at the needs of the second reader: the original data has only one file, which has eight sheets. You need to take out several columns in each sheet, and then store them as csv files one by one according to the date.

To complete this requirement, if it is manual operation, we need to complete the following steps:

0. Open Excel file

1. Copy several rows of data required in each sheet
2. Sort by date
3. Save different data into different csv files by date

It seems very simple, but it is actually complex. For example, if you want to manually create and save 365 csv files, the file names are still different. You have a big head to think about!

Now let's take a look at the steps required to replace the above manual operation with code operation:

0. Create a new data storage object (we use Dataframe in pandas)

1. Read target Excel file
2. Traverse and take out several rows of data required in each sheet and store them in the new Dataframe
3. Group according to the date and store the data of different dates in the corresponding file

After knowing this, let's start a happy journey of code: 0. Create a new data storage object (we use the Dataframe in pandas)

df_concat = pd.DataFrame()
Copy code

1. Read target Excel file

file_path = 'data/meteo_china_tmin_2018.xlsx'
data = pd.read_excel(file_path, sheet_name=None)
Copy code

2. Traverse and take out several rows of data required in each sheet and store them in the new Dataframe

for sheet_name in list(data.keys()):
    if sheet_name == 'meteo_china_tmin_2018':
        continue
    df_temp = data[sheet_name][['ymd', 'lat', 'lon', 'tmin']]
    df_concat = pd.concat([df_concat, df_temp])
Copy code

3. Group according to the date and store the data of different dates in the corresponding file
Here, the corresponding data is retrieved according to the date, and to is called_ The CSV function stores data. The first parameter is the stored directory, the second parameter columns is the stored data column, the third parameter header=None indicates that the header is not required during storage, and the fourth parameter index=False indicates that the index is removed.

'''
Group by time and save as csv file
 File format: hetao-ymd_tmin
'''
# Get all dates
ymd_set = set(df_concat['ymd'])
# Cycle all data
for ymd in ymd_set:
    ymd_data = df_concat[df_concat['ymd']==ymd]
    ymd_data.to_csv('./data/hetao/hetao-%d_tmin.csv'%ymd, columns=['lat', 'lon', 'tmin'], header=None, index=False)
Copy code

Full code:

import pandas as pd

'''
Read, take out and merge the required data
'''
file_path = './data/meteo_china_tmin_2018.xlsx'
data = pd.read_excel(file_path, sheet_name=None)
df_concat = pd.DataFrame()

for sheet_name in list(data.keys()):
    if sheet_name == 'meteo_china_tmin_2018':
        continue
    df_temp = data[sheet_name][['ymd', 'lat', 'lon', 'tmin']]
    df_concat = pd.concat([df_concat, df_temp])

'''
Group by time and save as csv file
 File format: hetao-ymd_tmin
'''
# Get all dates
ymd_set = set(df_concat['ymd'])
# Cycle all data
for ymd in ymd_set:
    ymd_data = df_concat[df_concat['ymd']==ymd]
    # Specify the stored column and remove the header
    ymd_data.to_csv('./data/hetao/hetao-%d_tmin.csv'%ymd, columns=['lat', 'lon', 'tmin'], header=None, index=False)
Copy code

4, Just say it

If you have any similar needs, you can say your needs. It's best to divide them into 1, 2 and 3 according to the function points, and then attach the example data. Welcome to learn and communicate.

Topics: Python Data Analysis Data Mining