There are three ways to apply user-defined or library functions to pandas objects:
- apply(): apply the function row by row or column by column
- agg() and transform(): aggregation and transformation
- applymap(): apply functions element by element
1, apply()
Set axis=1 to apply the function row by row; the default, axis=0, applies it column by column.
For common descriptive statistics you can pass the method name as a string instead of a function; for example, df.apply('mean') is equivalent to df.apply(np.mean).
```python
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_excel('./input/class.xlsx')
>>> df = df[['score_math','score_music']]
>>> df
   score_math  score_music
0          95           79
1          96           90
2          85           85
3          93           92
4          84           90
5          88           70
6          59           89
7          88           86
8          89           74
# Average the math and music scores column by column
>>> df.apply(np.mean)
score_math     86.333333
score_music    83.888889
dtype: float64
>>> type(df.apply(np.mean))
<class 'pandas.core.series.Series'>
>>> df['score_math'].apply('mean')
86.33333333333333
# On a Series, a callable is applied to each element, so the result is a Series
>>> type(df['score_math'].apply(np.mean))
<class 'pandas.core.series.Series'>
# Average score of each student, row by row
>>> df.apply(np.mean,axis=1)
0    87.0
1    93.0
2    85.0
3    92.5
4    87.0
5    79.0
6    74.0
7    87.0
8    81.5
dtype: float64
>>> type(df.apply(np.mean,axis=1))
<class 'pandas.core.series.Series'>
```
The type of object that apply() returns depends on the function passed in:
- A Series is returned when the function reduces each row or column to a single value, as the mean function does in the example above;
- A DataFrame of the same size is returned when the function returns a value for every element, as the lambda function below does.
```python
# x is the Series object for each course (one column at a time)
>>> df.apply(lambda x: x - 5)
   score_math  score_music
0          90           74
1          91           85
2          80           80
3          88           87
4          79           85
5          83           65
6          54           84
7          83           81
8          84           69
>>> type(df.apply(lambda x: x - 5))
<class 'pandas.core.frame.DataFrame'>
```
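The shape of the result follows from what the function returns for each column. As a further sketch, not taken from the original examples, a function that returns a short Series of statistics per column yields a DataFrame with one row per statistic. The data below re-types the scores shown above instead of reading the Excel file, and the helper name `score_range` is hypothetical.

```python
import pandas as pd

# Scores re-typed from the example above (the Excel file itself is not assumed here)
df = pd.DataFrame({
    "score_math": [95, 96, 85, 93, 84, 88, 59, 88, 89],
    "score_music": [79, 90, 85, 92, 90, 70, 89, 86, 74],
})

def score_range(col):
    # Hypothetical helper: two statistics per column, returned as a Series
    return pd.Series({"min": col.min(), "max": col.max()})

col_means = df.apply(lambda col: col.mean())  # one scalar per column -> Series
shifted = df.apply(lambda col: col - 5)       # same-length Series per column -> DataFrame of the same shape
ranges = df.apply(score_range)                # 2-element Series per column -> 2 x 2 DataFrame

print(type(col_means).__name__, type(shifted).__name__, ranges.shape)
```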
2, Data aggregation: agg()
- Data aggregation refers to any process that produces a scalar value from an array;
- agg() is essentially a special case of apply(): it processes a pandas object row by row or column by column;
- Wherever agg() can be used, apply() can usually be used instead.
Example:
1) Average each of the two courses, column by column
```python
>>> df.agg('mean')
score_math     86.333333
score_music    83.888889
dtype: float64
>>> df.apply('mean')
score_math     86.333333
score_music    83.888889
dtype: float64
```
2) To apply several functions at once, pass them in a list.
For example, get the highest and lowest score for each of the two courses:
```python
>>> df.agg(['max','min'])
     score_math  score_music
max          96           92
min          59           70
>>> df.apply([np.max,'min'])
      score_math  score_music
amax          96           92
min           59           70
```
3) Use a dictionary to apply one or more specific functions to specific columns.
For example, find the mean and minimum of the math scores and the maximum of the music scores:
```python
>>> df.agg({'score_math':['mean','min'],'score_music':'max'})
      score_math  score_music
max          NaN         92.0
mean   86.333333          NaN
min    59.000000          NaN
```
3, Data transformation: transform()
Feature: after the function is applied, transform() returns a pandas object of the same size as the input.
Difference from data aggregation with agg():
- agg() reduces the data in each group to a summary value;
- transform() returns a new, full-size set of data.
Note: df.transform(np.mean) raises an error, because a transformation cannot produce an aggregated result.
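The note above can be checked with a small sketch. It uses a three-row sample of the scores, and it assumes the error surfaces as a `ValueError`, which is what recent pandas releases raise; the exact message may differ between versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score_math": [95, 96, 85],
    "score_music": [79, 90, 85],
})

try:
    df.transform(np.mean)   # np.mean reduces each column to a single value
    transform_failed = False
except ValueError as exc:   # assumption: the rejected aggregation surfaces as ValueError
    transform_failed = True
    print("transform refused the aggregation:", exc)

# A function that returns one value per input row is accepted
centered = df.transform(lambda col: col - col.mean())
print(centered.shape == df.shape)
```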
```python
# Subtracting each course's mean from its scores works with apply, agg and transform alike;
# all three calls return the same result
>>> df.transform(lambda x:x-x.mean())
>>> df.apply(lambda x:x-x.mean())
>>> df.agg(lambda x:x-x.mean())
   score_math  score_music
0    8.666667    -4.888889
1    9.666667     6.111111
2   -1.333333     1.111111
3    6.666667     8.111111
4   -2.333333     6.111111
5    1.666667   -13.888889
6  -27.333333     5.111111
7    1.666667     2.111111
8    2.666667    -9.888889
```
When several functions are applied at once, the returned DataFrame has a different shape from the original. Its columns form a two-level index:
- The first level is the original column name
- The second level is the name of the transformation function
```python
>>> df.transform([lambda x:x-x.mean(),lambda x:x/10])
  score_math          score_music
    <lambda> <lambda>    <lambda> <lambda>
0   8.666667      9.5   -4.888889      7.9
1   9.666667      9.6    6.111111      9.0
2  -1.333333      8.5    1.111111      8.5
3   6.666667      9.3    8.111111      9.2
4  -2.333333      8.4    6.111111      9.0
5   1.666667      8.8  -13.888889      7.0
6 -27.333333      5.9    5.111111      8.9
7   1.666667      8.8    2.111111      8.6
8   2.666667      8.9   -9.888889      7.4
```
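Because the second level is taken from the function name, anonymous functions all show up as `<lambda>`. One possible way to get readable labels is to pass named functions instead. The sketch below uses a three-row sample of the scores; `center` and `tenth` are hypothetical helper names.

```python
import pandas as pd

df = pd.DataFrame({
    "score_math": [95, 96, 85],
    "score_music": [79, 90, 85],
})

def center(col):
    # Subtract the column mean from every score
    return col - col.mean()

def tenth(col):
    # Divide every score by 10
    return col / 10

result = df.transform([center, tenth])

# The column index pairs each original column with each function name
print(result.columns.tolist())
```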
4, applymap()
applymap() applies a function to every element of a pandas object, i.e. it is an element-wise function application. (In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map().)
Differences from map():
- applymap() is an instance method of DataFrame
- map() is an instance method of Series
Example: format the scores to two decimal places
```python
>>> df.applymap(lambda x:'%.2f'%x)
   score_math  score_music
0       95.00        79.00
1       96.00        90.00
2       85.00        85.00
3       93.00        92.00
4       84.00        90.00
5       88.00        70.00
6       59.00        89.00
7       88.00        86.00
8       89.00        74.00
>>> df['score_math'].map(lambda x:'%.2f'%x)
0    95.00
1    96.00
2    85.00
3    93.00
4    84.00
5    88.00
6    59.00
7    88.00
8    89.00
Name: score_math, dtype: object
```
As the example above shows, applymap() is in effect a map() operation on the Series object of each column.
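That equivalence can be checked directly. The sketch below, on a three-row sample of the scores, compares the element-wise call on the whole DataFrame with a per-column map(). It assumes `DataFrame.map` is available in pandas 2.1 and later and falls back to `applymap` on older versions; `fmt` is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({
    "score_math": [95, 96, 85],
    "score_music": [79, 90, 85],
})

def fmt(value):
    # Format one score with two decimal places
    return "%.2f" % value

# DataFrame.map is the newer name for applymap; fall back on older pandas versions
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap

whole_frame = elementwise(fmt)                   # element-wise over the whole DataFrame
per_column = df.apply(lambda col: col.map(fmt))  # Series.map on each column in turn

same = whole_frame.values.tolist() == per_column.values.tolist()
print(same)
```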
To summarize, apply, agg and transform can all run functions on grouped data, but each has its own characteristics:
- apply runs the user-defined function on each group separately and then combines the results; the function may return a scalar, a Series or a DataFrame; on grouped data, each apply call accepts only one function;
- agg accepts a dictionary that assigns different functions to different columns, and the function output for each column must be a scalar;
- transform on grouped data does not accept a dictionary of per-column functions, but it still operates on one column at a time. The function output for each column can be a scalar or a Series, and a scalar is broadcast to the size of the group.
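The grouped behaviour summarized above can be sketched as follows. The `class` column and its values are hypothetical, added here only so there is something to group by; the scores are the first six rows of the earlier example.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["A", "A", "A", "B", "B", "B"],  # hypothetical grouping column
    "score_math": [95, 96, 85, 93, 84, 88],
    "score_music": [79, 90, 85, 92, 90, 70],
})
grouped = df.groupby("class")

# apply: one function per call; here it returns a scalar for each group
spread = grouped["score_math"].apply(lambda s: s.max() - s.min())

# agg: a dictionary assigns different functions to different columns
summary = grouped.agg({"score_math": ["mean", "min"], "score_music": "max"})

# transform: the per-group mean is broadcast back to every row of its group
group_mean = grouped["score_math"].transform("mean")

print(spread, summary, group_mean, sep="\n\n")
```

Here `spread` and `summary` have one row per group, while `group_mean` keeps the original six rows.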