In-Depth Study of the Manual: Chapter II, Notes 2

Posted by AV1611 on Sun, 19 Dec 2021 23:09:30 +0100

Data consolidation

Common data consolidation methods in pandas:

  • pandas.merge connects rows in different DataFrames based on one or more keys; this is the join operation familiar from databases.
  • pandas.concat stacks multiple objects together along an axis.
  • The instance method combine_first splices overlapping data together, filling missing values in one object with values from the other.

pandas.merge

Merge is a database-style merge operation with the following common parameters:

pd.merge(left, right, on='key', how='inner',
         left_on='l_key', right_on='r_key',
         left_index=False, right_index=False,
         suffixes=('_left', '_right'))

left, right: the left and right tables to be merged
how: the join type, one of 'inner' (inner join, the default), 'outer' (outer join), 'left' (left join), 'right' (right join)
on: column name(s) to join on; must exist in both tables. If omitted, the intersection of the two tables' column names is used as the join key(s)
left_on, right_on: columns of the left/right table to use as join keys when the key columns have different names (mutually exclusive with on)
left_index, right_index: use the left/right table's row index as the join key; default False
suffixes: appended to the names of overlapping columns to distinguish them; defaults to ('_x', '_y')
DataFrame.join is used similarly to merge, but it is an instance method and joins on the index by default
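
A minimal sketch of key-based merging (the frames and column names here are illustrative, not from the original notes):

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [4, 5, 6]})

# Inner join keeps only the keys present in both tables ('a' and 'b').
pd.merge(left, right, on='key', how='inner')

# Outer join keeps the union of keys, filling the gaps with NaN.
pd.merge(left, right, on='key', how='outer')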

pandas.concat

concat also works on Series; it defaults to axis=0, stacking the inputs vertically (lengthening along the rows)

pd.concat(objs, axis=0, join='outer', join_axes=['a', 'b', 'c'],
          keys=['a', 'b'], names=['upper', 'lower'], ignore_index=True)

objs: the only required parameter; the sequence of objects to concatenate
axis: the axis to concatenate along; 0 (the default) stacks rows, 1 stacks columns
join: 'outer' keeps the union of labels on the other axis, 'inner' keeps the intersection
join_axes: explicit index labels to keep on the other axis (for example, a row index when axis=1); deprecated and removed in pandas 1.0, use reindex instead
keys: builds a hierarchical index on the concatenation axis, labeling the block from each input object
ignore_index: discard the index labels of the joined tables along the concatenation axis and generate a fresh 0..n-1 range index
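
A minimal sketch of keys and ignore_index in action (the Series here are illustrative):

import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['c', 'd'])

# keys labels the block coming from each input, creating a hierarchical index.
pd.concat([s1, s2], keys=['one', 'two'])

# ignore_index=True discards the original labels and renumbers 0..n-1.
pd.concat([s1, s2], ignore_index=True)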

combine_first()

Two objects may have the same index but different values, in which case merge or concat alone will not do. The combine_first() method, available on both Series and DataFrame, handles this: it can be seen as patching the calling object, filling its missing values with values from the other object.

In [115]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
   .....:                     'b': [np.nan, 2., np.nan, 6.],
   .....:                     'c': range(2, 18, 4)})

In [116]: df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
   .....:                     'b': [np.nan, 3., 4., 6., 8.]})

In [117]: df1
Out[117]: 
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

In [118]: df2
Out[118]: 
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

In [119]: df1.combine_first(df2)
Out[119]: 
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
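
The same patching works for Series (a small sketch with made-up values):

import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])
s2 = pd.Series([9.0, 2.0, 9.0])

# Missing entries of s1 are filled from s2, giving [1.0, 2.0, 3.0].
s1.combine_first(s2)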

Data Aggregation

This section uses groupby, pandas' counterpart to the GROUP BY query mechanism in databases.

The data are first split into groups by the provided keys, and each group is then aggregated. After grouping, you can select the column(s) to aggregate. (The examples below assume a DataFrame df holding the Titanic passenger data, with columns such as Sex, Fare, and Survived.)

b=df.groupby('Sex')[['Fare']].mean()
b

             Fare
Sex
female  44.479818
male    25.523893

A grouped computation can also be written by passing a Series as the grouping key, here counting survival outcomes per sex:

a=df['Survived'].groupby(df['Sex']).value_counts()
a

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64

The agg() method can also be applied, computing several aggregations at once:

df.groupby('Sex')['Fare'].agg(['mean','count'])

Custom functions can also be passed in (grouped below is a DataFrameGroupBy object created earlier, e.g. via df.groupby('key1')):

In [54]: def peak_to_peak(arr):
   ....:     return arr.max() - arr.min()

In [55]: grouped.agg(peak_to_peak)
Out[55]: 
         data1     data2
key1                    
a     2.170488  1.300498
b     0.036292  0.487276

However, custom functions passed in this way can be much slower than the built-in optimized aggregations, because of the overhead (function calls, rearranging data, and so on) of constructing the intermediate per-group data blocks.
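
For instance (a sketch, assuming grouped is the DataFrameGroupBy object from the example above, with numeric columns data1 and data2), the same peak-to-peak statistic can be assembled from built-in aggregations:

# Equivalent to grouped.agg(peak_to_peak) on the numeric columns, but each
# aggregation runs on pandas' optimized code path instead of calling a
# Python function once per group.
num = grouped[['data1', 'data2']]
result = num.max() - num.min()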

Topics: Python Data Analysis