Chapter 7 Data Regulation: Cleanup, Conversion, Merge, Reshaping (3)
Merge overlapping data
Numpy.where()
A vectorized if-else
import pandas as pd;import numpy as np
from pandas import Series,DataFrame
a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
Out[8]:
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
b = Series(np.arange(len(a)),dtype=np.float64,index=['f','e','d','c','b','a'])
b[-1]=np.nan
Out[10]:
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
np.where(pd.isnull(a),b,a) #if pd.isnull(a):b else:a
#Returns tuples of data type ndarray/ndarray
Out[11]: array([ 0. , 2.5, 2. , 3.5, 4.5, nan])
###Get book output
Series(np.where(pd.isnull(a),b,a),index=['f','e','d','c','b','a'])
Out[13]:
f 0.0
e 2.5
d 2.0
c 3.5
b 4.5
a NaN
dtype: float64
combine_first()
You can do the same thing where and align your data
In [14]:b[:-2]
Out[14]:
f 0.0
e 1.0
d 2.0
c 3.0
dtype: float64
In [15]:a[2:]
Out[15]:
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
###if pd.isnull(b[:-2]):a[2:] else:b
###Attention to data alignment
b[:-2].combine_first(a[2:])
Out[16]:
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
#####The DataFrame can do the same thing, as if the data in the parameter object patches the missing data in the calling object
df1 = DataFrame({'a':[1,np.nan,5,np.nan],'b':[np.nan,2,np.nan,6],'c':range(2,18,4)})
Out[18]:
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
df2 = DataFrame({'a':[5,4,np.nan,3,7],'b':[np.nan,3,4,6,8]})
Out[20]:
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
df1.combine_first(df2)
Out[21]:
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
Reshape Hierarchical Index
- stack: Rotate column to row
- unstack: Rotate rows to columns
data = DataFrame(np.arange(6).reshape(2,3),index = pd.Index(['Ohio','Colorado'],name='state'),columns=pd.Index(['one','two','three'],name='number'))
Out[23]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
###Rotate stack method column to row to get Series
result = data.stack()
Out[25]:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
####Similarly Hierarchized Series Rearranged to DataFrame
result.unstack()
Out[26]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
######By default, stack/unstack operates on the innermost layer, that is, the lowest level, and then the lowest level as the axis of rotation.
#####unstack/stack other levels by passing in the number or name of the hierarchical level
result.unstack(0)##Equivalent toResult.unstack('state')
Out[28]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
Details:
data
number | one | two | three |
---|---|---|---|
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
data.stack() The method rotates a column into a row, where the column of data has only one layer of'number', so rotate the'number'layer. For each row, example,'Ohio' row contains three columns, one:0, two:1, three:2, rotate to a row, which becomes a hierarchical {'Ohio':{'one':0,'two':1,'three':2}} table as follows:
Ohio | one | 0 |
---|---|---|
two | 1 | |
three | 2 |
For row'Colorado'the same,data.stack() The result of rotation is
state | number | |
---|---|---|
Ohio | one | 0 |
two | 1 | |
three | 2 | |
Colorado | one | 3 |
two | 4 | |
three | 5 |
It retains the original data column name numberTo row name, because the stack default method is to go to the innermost layer, that is, the lowest layer, and the result of the transfer is also to go to the innermost layer, that is, the lowest layer. So'number'is inside the'state', and then to the hierarchical index, Seres, two-level index, level is, state=0, number=1, the smaller the number represents the higher level.
Handling of default values
s1 = Series([0,1,2,3],index=['a','b','c','d'])
Out[34]:
a 0
b 1
c 2
d 3
dtype: int64
s2 = Series([4,5,6],index=['c','d','e'])
Out[35]:
c 4
d 5
e 6
dtype: int64
data2 = pd.concat([s1,s2],keys=['one','two'])#####Connection function, keys used to distinguish which part of the result is s1 and which part is s2
Out[37]:
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data2.unstack() ####Two-level index, the innermost index with a default rotation level of 1
#####Encountered default value introducing NAN
Out[38]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
#####stack filters out default data by default, so it can be inverted
#####Set dropna parameter changes to handle default values
#####Default dropna=True
data2.unstack().stack()
###Rotate result inserts innermost, lowest level by default
Out[39]:
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
data2.unstack().stack(dropna=False)
###Rotate result inserts innermost, lowest level by default
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
Similar to the processing of DataFrame data, operations and insertion results default to the innermost, the lowest level of hierarchy and the highest number.example
df = DataFrame({'left':result,'right':result+5},columns=pd.Index(['left','right'],name='side'))
Out[42]:
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
####Column has only one layer, side, row has two layers,'state'=0,'number'=1
df.unstack('state')####Row'state'rotates to column and inserts innermost/lowest layer
Out[43]:
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
df.unstack('state').stack('side')###Rotate column'side'to row based on previous result and insert innermost
Out[44]:
state Colorado Ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7