pandas practical method

Posted by phpete0812 on Sun, 19 Dec 2021 19:54:36 +0100


Contents

0. View Pandas version information

1. DataFrame file operation

1.1 CSV file read data - read_csv()

1.2 CSV file write data - to_csv()

1.3 Excel file read data - read_excel()

1.4 Excel file write data - to_excel()

2. View the first or last rows of data

2.1 viewing the first 5 rows of data - head()

2.2 viewing the last 3 rows of data - tail()

3. View data dimensions, feature names, and feature types

3.1 view data dimension - shape

3.2 view data feature name - columns

3.3 view some general information of DataFrame - info()

3.4 viewing basic statistical values of DataFrame - describe()

3.4.1 View the basic statistical values of numerical features in DataFrame

3.4.2 View statistics of non-numerical features

3.5 viewing data types - dtypes

3.6 view index

3.7 view values

3.8 rename column name - rename()

4. Change the type of column - astype()

5. View the distribution of categorical and Boolean features - value_counts()

5.1 by occurrence frequency

5.2 proportion of occurrence

6. Sort - sort_values()

6.1 sorting according to the value of a single column

6.2 sorting by multi column values

7. Get data

7.1 get a separate column according to the column name - DataFrame['Name']

7.2 indexing with Boolean values

7.3 by name index - loc

7.4 by numerical index - iloc

7.5 first and last rows of dataframe

7.6 copy of dataframe - copy()

7.7 query by keyword - isin()

7.8 find the column with the smallest sum in the DataFrame - idxmin()/idxmax()

7.9 DataFrame grouping, and sum the three largest values in each group

7.10 list the distinct cities - unique()

7.11 delete duplicate data - drop_duplicates()

7.12 data filtering - query()

7.13 calculation of correlation coefficient - corr()

7.14 calculate YoY - pct_change()

7.15 add a prefix or suffix to column names - add_prefix() or add_suffix()

7.16 reset index - reset_index()

7.17 select column by data type - select_dtypes()

7.18 build a DataFrame from multiple files - glob module

8. Apply functions to cells, columns, rows - apply()

8.1 apply to columns

8.2 apply to each row

9. Replace value

9.1 replace the value in a column - map()

9.2 replace the value in a column - replace()

10. Grouping - groupby()

11. Summary / PivotTable

11.1 pivot table - pivot_table()

11.1.1 Creation of pivot table

11.1.2 Missing value processing of pivot table

11.2 crosstab()

11.2.1 Distribution according to occurrence frequency

11.2.2 Distribution according to occurrence proportion

12. Increase or decrease rows and columns of DataFrame

12.1 method of adding columns

12.1.1 insert()

12.1.2 add directly

12.2 deleting columns and rows - drop()

13. DataFrame missing value operation

13.1 check whether DataFrame elements are null - isnull()

13.2 fill in missing values - fillna(value)

13.3 delete rows with missing values - dropna(how="any")

13.4 missing value fitting - interpolate()

13.5 data column splitting

13.6 remove extra characters in column data - extract()

13.7 format specification

14. String operation

14.1 convert string to lowercase letter - lower()

14.2 converting strings to uppercase letters - upper()

14.3 initial capitalization - capitalize()

15. Multi table join of DataFrame

15.1 merge by specified column - merge(left, right, on="key")

15.2 concatenating multiple DataFrames using a list - concat()

15.3 joining columns - join()

16. Time series processing

16.1 establish a Series with time as index and random number as value

16.2 sum of the values falling on each Wednesday in s

16.3 monthly mean of the values in s

16.4 converting time in Series (seconds to minutes)

16.5 UTC world time standard

16.6 conversion of different time representations

17. DataFrame multiple index

17.1 creating a DataFrame based on multiple indexes

17.2 multi index setting column name

17.3 DataFrame row column name conversion - stack()

17.4 DataFrame index conversion - unstack()

18. Pandas drawing operation

18.1 DataFrame line chart

18.2 DataFrame scatter plot

18.3 DataFrame bar chart

18.4 setting the size of the drawing

18.5 fixing Chinese characters and minus signs not displaying

19. Reduce DataFrame space size

19.1 viewing space usage

19.2 reduce space size

0. View Pandas version information

import pandas as pd

print(pd.__version__)

1. DataFrame file operation

1.1 CSV file read data - read_csv()

pd.read_csv(filepath_or_buffer, header, parse_dates, index_col)
import numpy as np 
import pandas as pd 
import warnings 
warnings.filterwarnings('ignore') 
df = pd.read_csv( 'https://labfile.oss.aliyuncs.com/courses/1283/telecom_churn.csv') 
df

1.2 CSV file write data - to_csv()

# df3 is assumed to be an existing DataFrame (the animal data used in later examples)
df3.to_csv('animal.csv')
print("Write successful.")

 

1.3 Excel file read data - read_excel()

pd.read_excel('animal.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

 

1.4 Excel file write data - to_excel()

df3.to_excel('animal.xlsx', sheet_name='Sheet1') 
print("Write successful.")

2. View the first or last rows of data

2.1 viewing the first 5 rows of data - head()

df.head()

 

2.2 view the last 3 rows of data - tail()

df.tail(3)

3. View data dimensions, feature names, and feature types

3.1 view data dimension - shape

# View data dimensions 
df.shape

 

3.2 view data feature name - columns

# View data feature name 
df.columns

 

3.3 view some general information of DataFrame - info()

# View some general information about the DataFrame 
df.info()

 

3.4 viewing basic statistical values of DataFrame - describe()

3.4.1 View the basic statistical values of numerical features in DataFrame

# View the basic statistics of numerical features in the DataFrame, such as the count of non-missing values, mean, standard deviation, etc.
df.describe()

 

3.4.2 View statistics of non-numerical features

# You can view statistics for non-numerical features by explicitly specifying the data types to include with the include parameter.
df.describe(include=['object', 'bool'])

 

3.5 viewing data types - dtypes

df2.dtypes

 

3.6 view index

df.index

3.7 view values

df.values

3.8 rename column name - rename()

df.rename(columns={"category":"category-size"})

4. Change the type of column - astype()

df['Churn'] = df['Churn'].astype('int64')

5. View the distribution of categorical and Boolean features - value_counts()

5.1 by occurrence frequency

df['Churn'].value_counts()

 

5.2 proportion of occurrence

df['Churn'].value_counts(normalize=True)

6. Sort - sort_values()

6.1 sorting according to the value of a single column

df.sort_values(by='Total day charge', ascending=False).head()

6.2 sorting by multi column values

df.sort_values(by=['Churn', 'Total day charge'], ascending=[True, False]).head()

 

7. Get data

7.1 get a separate column according to the column name - DataFrame['Name']

df['Churn'].mean()

7.2 indexing with Boolean values

df[df['Churn'] == 1].mean() 

df[(df['Churn'] == 0) & (df['International plan'] == 'No') ]['Total intl minutes'].max()

 

7.3 by name index - loc

df.loc[0:5, 'State':'Area code']

 

7.4 by numerical index - iloc

df.iloc[0:5, 0:3]

 

7.5 first and last rows of dataframe

df[:1] 
df[-1:]

7.6 copy of dataframe - copy()

df = df2.copy()

7.7 query by keyword - isin()

df[df['animal'].isin(['cat', 'dog'])]

7.8 find the column with the smallest sum in the DataFrame - idxmin()/idxmax()

df = pd.DataFrame(np.random.random(size=(5, 5)), columns=list('abcde')) 
print(df) 
df.sum().idxmin()  # for a Series, idxmin()/idxmax() return the index label of the minimum/maximum value

7.9 DataFrame grouping, and sum the three largest values in each group

df = pd.DataFrame({'A': list('aaabbcaabcccbbc'), 'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]}) 
print(df) 
df.groupby('A')['B'].nlargest(3).groupby(level=0).sum()  # sum the three largest values per group; .sum(level=0) is deprecated in newer pandas

7.10 list the distinct cities - unique()

df.city.unique() 
set(df.city)

7.11 delete duplicate data - drop_duplicates()

# Normalize case, then drop duplicates
df.city.str.lower().drop_duplicates(keep="last")
df.city.str.lower().replace("sh", "shanghai").drop_duplicates()
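
The city examples in 7.11 and 7.12 refer to a frame that is not defined in this excerpt; a minimal hypothetical frame such as the following makes both runnable:

# Hypothetical frame with a city column for sections 7.11 and 7.12
df_city = pd.DataFrame({"city": ["Beijing", "SH", "beijing", "shanghai", "guangzhou"]})
df_city.city.str.lower().drop_duplicates(keep="last")
df_city.query('city == ["beijing", "shanghai"]')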

7.12 data filtering - query()

# Filter using the query function 
df.query('city==["beijing","shanghai"]')

 

7.13 calculation of correlation coefficient - corr()

# Correlation coefficient between m-point and price 
df_inner["price"].corr(df_inner["m-point"]) 

# Correlation coefficient of whole data and numerical variables 
df_inner.corr().round(2)
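
df_inner is not defined in this excerpt; a minimal sketch with a hypothetical frame shows the same calls:

# Hypothetical frame with price and m-point columns
df_inner = pd.DataFrame({"price": [1200, 3299, 5433, 6433],
                         "m-point": [10, 40, 55, 80]})
df_inner["price"].corr(df_inner["m-point"])  # correlation of one pair of columns
df_inner.corr().round(2)                     # full numeric correlation matrix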

 

7.14 calculate YoY - pct_change()

df1.pct_change()  # calculates the percentage change between the current and the previous element
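
df1 is not defined here; a minimal sketch, assuming a hypothetical series of yearly values:

# Hypothetical yearly revenue series
rev = pd.Series([100, 120, 150], index=[2019, 2020, 2021])
rev.pct_change()  # 2020: 0.20, 2021: 0.25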

 

7.15 add a prefix or suffix to column names - add_prefix() or add_suffix()

df.add_prefix("X_")

df.add_suffix("_X")  # add_suffix() requires the suffix string as an argument

 

7.16 reset index - reset_index()

df.reset_index(drop=True)

7.17 select column by data type - select_dtypes()

df.select_dtypes(include="number").head()

 

 

7.18 build a DataFrame from multiple files - glob module

from glob import glob
stock_files = sorted(glob("stock*.csv"))
stock_files

 

pd.concat((pd.read_csv(file) for file in stock_files))

 

8. Apply functions to cells, columns, rows - apply()

8.1 apply to columns

df.apply(np.max)

 

8.2 apply to each row

The apply() method can also apply a function to each row by specifying axis=1. In this case it is convenient to use a lambda function. For example, the following selects all states whose name starts with 'W'.

df[df['State'].apply(lambda state: state[0] == 'W')].head()

 

9. Replace value

9.1 replace the value in a column - map()

d = {'No': False, 'Yes': True} 
df['International plan'] = df['International plan'].map(d) 
df.head()

 

9.2 replace the value in a column - replace()

df = df.replace({'Voice mail plan': d}) 
df.head()

 

10. Grouping - groupby()

The general form of grouping data in pandas is:

df.groupby(by=grouping_columns)[columns_to_show].function()
  • The groupby() method groups the data by grouping_columns.
  • Next, select the columns_to_show of interest. If this is omitted, all non-groupby columns (i.e. all columns except grouping_columns) are selected.
  • Finally, apply one or more functions.
columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes'] 
df.groupby(['Churn'])[columns_to_show].describe(percentiles=[])

It is similar to the above example, except that this time some functions are passed to agg(), and the grouped data is aggregated through the agg() method.

columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes'] 
df.groupby(['Churn'])[columns_to_show].agg([np.mean, np.std, np.min, np.max])

 

11. Summary / PivotTable

11.1 pivot table - pivot_table()

The pivot table in Pandas is defined as follows:

Pivot table is a common data summary tool in spreadsheet programs and other data exploration software. It aggregates data according to one or more keys and distributes data into rectangular areas according to grouping on rows and columns.

The pivot_table() method creates a pivot table with the following parameters:

  • values represents the variable list of statistics to be calculated
  • index represents the variable list of grouped data
  • aggfunc indicates which statistics need to be calculated, such as sum, mean, maximum, minimum, etc.

11.1.1 Creation of pivot table

df.pivot_table(['Total day calls', 'Total eve calls', 'Total night calls'], ['Area code'], aggfunc='mean') 
pd.pivot_table(df, values=['D'], index=['A', 'B'], aggfunc=[np.sum, len])
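
The second call above refers to a frame with columns A, B, C and D that is not defined in this excerpt; a minimal hypothetical frame (also usable for 11.1.2 below) makes it concrete:

# Hypothetical frame for the generic pivot_table examples
df_toy = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                       'B': ['x', 'y', 'x', 'y'],
                       'C': ['a', 'b', 'a', 'b'],
                       'D': [1, 2, 3, 4]})
pd.pivot_table(df_toy, values=['D'], index=['A', 'B'], aggfunc=[np.sum, len])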

 

11.1.2 Missing value processing of pivot table

pd.pivot_table(df, values=['D'], index=['A', 'B'], columns=['C'], aggfunc=np.sum, fill_value=0)

 

11.2 crosstab()

Cross tabulation is a special pivot table used to calculate grouping frequencies. In Pandas, a cross table is generally constructed with the crosstab() method.

11.2.1 Distribution according to occurrence frequency

pd.crosstab(df['Churn'], df['International plan'])

 

11.2.2 Distribution according to occurrence proportion

pd.crosstab(df['Churn'], df['Voice mail plan'], normalize=True)

 

12. Increase or decrease rows and columns of DataFrame

12.1 method of adding columns

12.1.1 insert()

total_calls = df['Total day calls'] + df['Total eve calls'] + \
              df['Total night calls'] + df['Total intl calls']
# The loc parameter is the number of columns selected after the Series object is inserted 
# Set to len(df.columns) to paste the calculated Total calls to the last column 
df.insert(loc=len(df.columns), column='Total calls', value=total_calls) 
df.head()

 

12.1.2 add directly

df['Total charge'] = df['Total day charge'] + df['Total eve charge'] + \
                     df['Total night charge'] + df['Total intl charge']
df.head()

 

12.2 deleting columns and rows - drop()

# Remove previously created columns 
df.drop(['Total charge', 'Total calls'], axis=1, inplace=True) 
# Delete row 
df.drop([1, 2]).head()

 

13. DataFrame missing value operation

13.1 check whether DataFrame elements are null - isnull()

df.isnull()

 

13.2 fill in missing values - fillna(value)

df4 = df3.copy()
df4.fillna(value=3)

 

13.3 delete rows with missing values - dropna(how="any")

df5 = df3.copy()
df5.dropna(how='any')  # any row containing NaN will be deleted

 

13.4 missing value fitting - interpolate()

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int) 
df

 

13.5 data column splitting

temp = df.From_To.str.split('_', expand=True) 
temp.columns = ['From', 'To'] 
temp

 

13.6 remove extra characters in column data - extract()

df['Airline'] = df['Airline'].str.extract( '([a-zA-Z\s]+)', expand=False).str.strip() 
df

 

13.7 format specification

delays = df['RecentDelays'].apply(pd.Series) 
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns)+1)] 
df = df.drop('RecentDelays', axis=1).join(delays) 
df

 

14. String operation

14.1 convert string to lowercase letter - lower()

string = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat']) 
string.str.lower()

 

14.2 converting strings to uppercase letters - upper()

string.str.upper()

14.3 initial capitalization - capitalize()

print(temp)
temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize() 
temp

 

15. Multi table join of DataFrame

15.1 merge by specified column - merge(left, right, on="key")

The number of rows in the result does not increase, and the number of columns equals the sum of the columns of the two source tables minus the number of join keys.

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]}) 
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]}) 
# Align on the key column; only foo2 appears in both tables, so the result is a single row
pd.merge(left, right, on='key')

 

15.2 concatenating multiple DataFrames using a list - concat()

concat() stacks vertically by default (axis=0). A special parameter is ignore_index, which is False by default; if set to True, the merged result is re-indexed, otherwise the row index keeps the values from the two original tables.

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

temp_df1 = pd.DataFrame(np.random.randn(5, 4)) # Generate DataFrame 1 composed of random numbers 
temp_df2 = pd.DataFrame(np.random.randn(5, 4)) # Generate DataFrame 2 composed of random numbers 
temp_df3 = pd.DataFrame(np.random.randn(5, 4)) # Generate DataFrame 3 composed of random numbers 
pieces = [temp_df1, temp_df2, temp_df3] 
pd.concat(pieces)

 

15.3 joining columns - join()

join() splices columns, merging mainly based on row indexes.

If the column names of the two tables are all different, it can be called directly without extra parameters; if the two tables share column names, the lsuffix and rsuffix parameters must be specified. The parameters mean essentially the same as in merge(), except that join() defaults to a left join (how='left').

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
df1.join(df2, lsuffix="_left", rsuffix="_right")
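
df1 and df2 are not defined in this excerpt; a minimal hypothetical pair shows the index-based join and the suffix parameters:

# Hypothetical frames sharing a column name, joined on their row indexes
df1 = pd.DataFrame({"letter": ["a", "b", "c"]})
df2 = pd.DataFrame({"letter": ["x", "y"]})
df1.join(df2, lsuffix="_left", rsuffix="_right")  # left join by default; df2 has no index 2, so NaN there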

 

16. Time series processing

16.1 establish a Series with time as index and random number as value

dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D') 
s = pd.Series(np.random.rand(len(dti)), index=dti) 
s

 

16.2 sum of the values falling on each Wednesday in s

# Weekday numbering starts at 0 for Monday, so Wednesday is 2
s[s.index.weekday == 2].sum()

 

16.3 monthly mean of the values in s

s.resample('M').mean()

 

16.4 converting time in Series (seconds to minutes)

s = pd.date_range('today', periods=100, freq='S') 
ts = pd.Series(np.random.randint(0, 500, len(s)), index=s) 
ts.resample('Min').sum()

 

16.5 UTC world time standard

s = pd.date_range('today', periods=1, freq='D') # Get current time 
ts = pd.Series(np.random.randn(len(s)), s) # Random value 
ts_utc = ts.tz_localize('UTC') # Convert to UTC time 
ts_utc

 

Convert to Shanghai time zone

ts_utc.tz_convert('Asia/Shanghai')

16.6 conversion of different time representations

rng = pd.date_range('1/1/2018', periods=5, freq='M') 
ts = pd.Series(np.random.randn(len(rng)), index=rng) 
print(ts) 
ps = ts.to_period() 
print(ps) 
ps.to_timestamp()

 

17. DataFrame multiple index

17.1 creating a DataFrame based on multiple indexes

frame = pd.DataFrame(np.arange(12).reshape(6, 2), index=[list('AAABBB'), list('123123')], columns=['hello', 'shiyanlou']) 
frame

 

17.2 multi index setting column name

frame.index.names = ['first', 'second'] 
frame

 

17.3 DataFrame row column name conversion - stack()

stack() turns the original column index (A, B) into the innermost row index (ABAB)

print(frame) 
frame.stack()

 

17.4 DataFrame index conversion - unstack()

The unstack() method is very similar to the pivot() method. The main difference is that unstack() works on the index: it moves the innermost level of the row index up to become the column index. The pivot() method works on column values: one column's values become the row index, another column's values become the column index, and a further column supplies the corresponding values.

print(frame) 
frame.unstack()
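
For contrast, a minimal sketch of pivot() on a hypothetical long-format frame:

# Hypothetical long-format frame: pivot() uses column values as the new indexes
long_df = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                        'item': ['A', 'B', 'A', 'B'],
                        'value': [1, 2, 3, 4]})
long_df.pivot(index='date', columns='item', values='value')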

 

18. Pandas drawing operation

18.1 DataFrame line chart

df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]}) 
df = df.cumsum() 
df.plot()

18.2 DataFrame scatter plot

df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]}) 
df = df.cumsum() 
df.plot.scatter("xs", "ys", color='red', marker="*")

 

18.3 DataFrame bar chart

df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52], "advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9], "month": range(12) }) 
ax = df.plot.bar("month", "revenue", color="yellow") 
df.plot("month", "advertising", secondary_y=True, ax=ax)

 

18.4 setting the size of the drawing

Method 1:

%pylab inline
figsize(16, 6)

Method 2:

import matplotlib.pyplot as plt 
plt.style.use({'figure.figsize':(12, 6)})

18.5 fixing Chinese characters and minus signs not displaying

plt.rcParams['font.sans-serif'] = ['SimHei'] 
plt.rcParams['axes.unicode_minus'] = False

19. Reduce DataFrame space size

19.1 viewing space usage

drinks = pd.read_csv("./drinksbycountry.csv")  # assuming drinks is loaded from this file (used again below)
drinks.info(memory_usage="deep")

 

19.2 reduce space size

1) Read only the columns actually needed

cols= ["beer_servings","continent"] 
small_drinks = pd.read_csv("./drinksbycountry.csv",usecols=cols) 
small_drinks.info(memory_usage="deep")

 

2) Convert object columns that are really categorical variables to the category dtype

dtypes={"continent":"category"} 
smaller_drinks = pd.read_csv("./drinksbycountry.csv",usecols=cols,dtype=dtypes) 
smaller_drinks.info(memory_usage="deep")
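
A minimal self-contained sketch of the same idea, using a hypothetical frame, to show the memory drop from converting a repetitive object column to category:

# Hypothetical frame with a highly repetitive string column
big = pd.DataFrame({"continent": ["Asia", "Europe", "Africa"] * 10000})
print(big.memory_usage(deep=True).sum())                                    # object dtype
print(big.astype({"continent": "category"}).memory_usage(deep=True).sum())  # far smaller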

 
