Chapter 7 Data Regulation: Cleanup, Conversion, Merge, Reshaping (3)

Posted by Zeekar on Tue, 07 Jul 2020 17:27:05 +0200

Chapter 7 Data Regulation: Cleanup, Conversion, Merge, Reshaping (3)

Merge overlapping data

Numpy.where()

A vectorized if-else

import pandas as pd;import numpy as np
from pandas import Series,DataFrame

a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
Out[8]: 
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

b = Series(np.arange(len(a)),dtype=np.float64,index=['f','e','d','c','b','a'])
b[-1]=np.nan
Out[10]: 
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

np.where(pd.isnull(a),b,a)  #if pd.isnull(a):b else:a
#Returns tuples of data type ndarray/ndarray
Out[11]: array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])
###Get book output
Series(np.where(pd.isnull(a),b,a),index=['f','e','d','c','b','a'])
Out[13]: 
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64

combine_first()

You can do the same thing where and align your data

In [14]:b[:-2]
Out[14]: 
f    0.0
e    1.0
d    2.0
c    3.0
dtype: float64
In [15]:a[2:]
Out[15]: 
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64
###if pd.isnull(b[:-2]):a[2:] else:b
###Attention to data alignment
b[:-2].combine_first(a[2:])
Out[16]: 
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64


#####The DataFrame can do the same thing, as if the data in the parameter object patches the missing data in the calling object
df1 = DataFrame({'a':[1,np.nan,5,np.nan],'b':[np.nan,2,np.nan,6],'c':range(2,18,4)})
Out[18]: 
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

df2 = DataFrame({'a':[5,4,np.nan,3,7],'b':[np.nan,3,4,6,8]})
Out[20]: 
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

df1.combine_first(df2)
Out[21]: 
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN

Reshape Hierarchical Index

  • stack: Rotate column to row
  • unstack: Rotate rows to columns
data = DataFrame(np.arange(6).reshape(2,3),index = pd.Index(['Ohio','Colorado'],name='state'),columns=pd.Index(['one','two','three'],name='number'))
Out[23]: 
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

###Rotate stack method column to row to get Series
result = data.stack()
Out[25]: 
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

####Similarly Hierarchized Series Rearranged to DataFrame
result.unstack()
Out[26]: 
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

######By default, stack/unstack operates on the innermost layer, that is, the lowest level, and then the lowest level as the axis of rotation.
#####unstack/stack other levels by passing in the number or name of the hierarchical level
result.unstack(0)##Equivalent toResult.unstack('state')
Out[28]: 
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

Details:
data

number one two three
state
Ohio 0 1 2
Colorado 3 4 5

data.stack() The method rotates a column into a row, where the column of data has only one layer of'number', so rotate the'number'layer. For each row, example,'Ohio' row contains three columns, one:0, two:1, three:2, rotate to a row, which becomes a hierarchical {'Ohio':{'one':0,'two':1,'three':2}} table as follows:

Ohio one 0
two 1
three 2

For row'Colorado'the same,data.stack() The result of rotation is

state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5

It retains the original data column name numberTo row name, because the stack default method is to go to the innermost layer, that is, the lowest layer, and the result of the transfer is also to go to the innermost layer, that is, the lowest layer. So'number'is inside the'state', and then to the hierarchical index, Seres, two-level index, level is, state=0, number=1, the smaller the number represents the higher level.

Handling of default values

s1 = Series([0,1,2,3],index=['a','b','c','d'])
Out[34]: 
a    0
b    1
c    2
d    3
dtype: int64

s2 = Series([4,5,6],index=['c','d','e'])
Out[35]: 
c    4
d    5
e    6
dtype: int64

data2 = pd.concat([s1,s2],keys=['one','two'])#####Connection function, keys used to distinguish which part of the result is s1 and which part is s2
Out[37]: 
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

data2.unstack() ####Two-level index, the innermost index with a default rotation level of 1
#####Encountered default value introducing NAN
Out[38]: 
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

#####stack filters out default data by default, so it can be inverted
#####Set dropna parameter changes to handle default values
#####Default dropna=True
data2.unstack().stack()
###Rotate result inserts innermost, lowest level by default
Out[39]: 
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

data2.unstack().stack(dropna=False)
###Rotate result inserts innermost, lowest level by default
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

Similar to the processing of DataFrame data, operations and insertion results default to the innermost, the lowest level of hierarchy and the highest number.example

df = DataFrame({'left':result,'right':result+5},columns=pd.Index(['left','right'],name='side'))
Out[42]: 
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

####Column has only one layer, side, row has two layers,'state'=0,'number'=1

df.unstack('state')####Row'state'rotates to column and inserts innermost/lowest layer
Out[43]: 
side   left          right         
state  Ohio Colorado  Ohio Colorado
number                             
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10

df.unstack('state').stack('side')###Rotate column'side'to row based on previous result and insert innermost
Out[44]: 
state         Colorado  Ohio
number side                 
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7