# Chapter 7 Data Regulation: Cleanup, Conversion, Merge, Reshaping (3)

Posted by Zeekar on Tue, 07 Jul 2020 17:27:05 +0200

# Chapter 7 Data Regulation: Cleanup, Conversion, Merge, Reshaping (3)

## Merge overlapping data

### Numpy.where()

A vectorized if-else

```import pandas as pd;import numpy as np
from pandas import Series,DataFrame

a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
Out[8]:
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

b = Series(np.arange(len(a)),dtype=np.float64,index=['f','e','d','c','b','a'])
b[-1]=np.nan
Out[10]:
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

np.where(pd.isnull(a),b,a)  #if pd.isnull(a):b else:a
#Returns tuples of data type ndarray/ndarray
Out[11]: array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])
###Get book output
Series(np.where(pd.isnull(a),b,a),index=['f','e','d','c','b','a'])
Out[13]:
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64

```

### combine_first()

You can do the same thing where and align your data

```In [14]:b[:-2]
Out[14]:
f    0.0
e    1.0
d    2.0
c    3.0
dtype: float64
In [15]:a[2:]
Out[15]:
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64
###if pd.isnull(b[:-2]):a[2:] else:b
###Attention to data alignment
b[:-2].combine_first(a[2:])
Out[16]:
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64

#####The DataFrame can do the same thing, as if the data in the parameter object patches the missing data in the calling object
df1 = DataFrame({'a':[1,np.nan,5,np.nan],'b':[np.nan,2,np.nan,6],'c':range(2,18,4)})
Out[18]:
a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

df2 = DataFrame({'a':[5,4,np.nan,3,7],'b':[np.nan,3,4,6,8]})
Out[20]:
a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

df1.combine_first(df2)
Out[21]:
a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN

```

### Reshape Hierarchical Index

• stack: Rotate column to row
• unstack: Rotate rows to columns
```data = DataFrame(np.arange(6).reshape(2,3),index = pd.Index(['Ohio','Colorado'],name='state'),columns=pd.Index(['one','two','three'],name='number'))
Out[23]:
number    one  two  three
state
Ohio        0    1      2

###Rotate stack method column to row to get Series
result = data.stack()
Out[25]:
state     number
Ohio      one       0
two       1
three     2
two       4
three     5
dtype: int32

####Similarly Hierarchized Series Rearranged to DataFrame
result.unstack()
Out[26]:
number    one  two  three
state
Ohio        0    1      2

######By default, stack/unstack operates on the innermost layer, that is, the lowest level, and then the lowest level as the axis of rotation.
#####unstack/stack other levels by passing in the number or name of the hierarchical level
result.unstack(0)##Equivalent toResult.unstack('state')
Out[28]:
number
one        0         3
two        1         4
three      2         5
```

Details:
data

number one two three
state
Ohio 0 1 2

data.stack() The method rotates a column into a row, where the column of data has only one layer of'number', so rotate the'number'layer. For each row, example,'Ohio' row contains three columns, one:0, two:1, three:2, rotate to a row, which becomes a hierarchical {'Ohio':{'one':0,'two':1,'three':2}} table as follows:

Ohio one 0
two 1
three 2

For row'Colorado'the same,data.stack() The result of rotation is

state number
Ohio one 0
two 1
three 2
two 4
three 5

It retains the original data column name numberTo row name, because the stack default method is to go to the innermost layer, that is, the lowest layer, and the result of the transfer is also to go to the innermost layer, that is, the lowest layer. So'number'is inside the'state', and then to the hierarchical index, Seres, two-level index, level is, state=0, number=1, the smaller the number represents the higher level.

Handling of default values

```s1 = Series([0,1,2,3],index=['a','b','c','d'])
Out[34]:
a    0
b    1
c    2
d    3
dtype: int64

s2 = Series([4,5,6],index=['c','d','e'])
Out[35]:
c    4
d    5
e    6
dtype: int64

data2 = pd.concat([s1,s2],keys=['one','two'])#####Connection function, keys used to distinguish which part of the result is s1 and which part is s2
Out[37]:
one  a    0
b    1
c    2
d    3
two  c    4
d    5
e    6
dtype: int64

data2.unstack() ####Two-level index, the innermost index with a default rotation level of 1
#####Encountered default value introducing NAN
Out[38]:
a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

#####stack filters out default data by default, so it can be inverted
#####Set dropna parameter changes to handle default values
#####Default dropna=True
data2.unstack().stack()
###Rotate result inserts innermost, lowest level by default
Out[39]:
one  a    0.0
b    1.0
c    2.0
d    3.0
two  c    4.0
d    5.0
e    6.0
dtype: float64

data2.unstack().stack(dropna=False)
###Rotate result inserts innermost, lowest level by default
one  a    0.0
b    1.0
c    2.0
d    3.0
e    NaN
two  a    NaN
b    NaN
c    4.0
d    5.0
e    6.0
dtype: float64
```

Similar to the processing of DataFrame data, operations and insertion results default to the innermost, the lowest level of hierarchy and the highest number.example

```df = DataFrame({'left':result,'right':result+5},columns=pd.Index(['left','right'],name='side'))
Out[42]:
side             left  right
state    number
Ohio     one        0      5
two        1      6
three      2      7
two        4      9
three      5     10

####Column has only one layer, side, row has two layers,'state'=0,'number'=1

df.unstack('state')####Row'state'rotates to column and inserts innermost/lowest layer
Out[43]:
side   left          right
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10

df.unstack('state').stack('side')###Rotate column'side'to row based on previous result and insert innermost
Out[44]: