Pandas: a deep dive into pandas data structures

Brief introduction

This article explains the two basic data structures in pandas, Series and DataFrame, and walks through how to create them, how to index them, and their other basic behaviors.

To use pandas, import the following libraries:

In [1]: import numpy as np

In [2]: import pandas as pd


A Series is a one-dimensional labeled array. Its axis labels are collectively called the index. The basic way to create a Series is:

>>> s = pd.Series(data, index=index)

Here, data can be a Python dict, a NumPy ndarray, or a scalar value.

index is a list of axis labels. Let's look at each way of creating a Series in turn.

Create from ndarray

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

s
a   -1.300797
b   -2.044172
c   -1.170739
d   -0.445290
e    1.208784
dtype: float64

Use the index attribute to get the labels:

In [68]: s.index
Out[68]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Create from dict

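When no index is passed, the index follows the insertion order of the dict keys (in pandas 0.23 and later on Python 3.6 and later; older versions sort the keys):

d = {'b': 1, 'a': 0, 'c': 2}

pd.Series(d)
b    1
a    0
c    2
dtype: int64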
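
When an index is passed together with a dict, the values are pulled out by label, and any label that is missing from the dict becomes NaN. A minimal sketch, reusing the dict from above (the extra label 'd' is an illustrative addition that is not a key in the dict):

```python
import numpy as np
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Values are looked up by label, in the order given by index.
# 'd' is not a key in the dict, so its value is NaN and the
# dtype is upcast from int64 to float64.
s_from_dict = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s_from_dict)
```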

Create from scalar

pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Series and ndarray

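A Series behaves much like an ndarray: it supports positional indexing, slicing and boolean masks. Use iloc for positional access, because plain integer keys on a labeled Series are deprecated in recent pandas versions:

In [72]: s.iloc[0]
Out[72]: -1.3007972194268396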

s.iloc[:3]
a   -1.300797
b   -2.044172
c   -1.170739
dtype: float64

s[s > s.median()]
d   -0.445290
e    1.208784
dtype: float64

s.iloc[[4, 3, 1]]
e    1.208784
d   -0.445290
b   -2.044172
dtype: float64
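
A Series can also be passed to most NumPy functions, and Series.to_numpy() returns the values as a plain ndarray when one is needed. A small sketch with made-up values (not the random ones used above):

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

arr = s_demo.to_numpy()   # plain ndarray, the labels are dropped
roots = np.sqrt(s_demo)   # a NumPy ufunc returns a Series and keeps the labels

print(type(arr), arr)
print(roots)
```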

Series and dict

When you access a Series by label, it behaves much like a dict:

In [80]: s['a']
Out[80]: -1.3007972194268396

s['e'] = 12.

s
a    -1.300797
b    -2.044172
c    -1.170739
d    -0.445290
e    12.000000
dtype: float64
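
The dict analogy extends to membership tests and to get(). A short sketch on made-up data, showing that a missing label raises KeyError with square brackets but not with get():

```python
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

print('a' in s_demo)   # membership tests the labels, not the values
print('z' in s_demo)

# get() returns None (or the supplied default) instead of raising.
print(s_demo.get('z'))
print(s_demo.get('z', 0.0))

try:
    s_demo['z']
except KeyError as exc:
    print('KeyError:', exc)
```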

Vectorization and label alignment

Series supports vectorized operations, so there is usually no need to loop over the values one by one:

s + s
a    -2.601594
b    -4.088344
c    -2.341477
d    -0.890581
e    24.000000
dtype: float64

s * 2
a    -2.601594
b    -4.088344
c    -2.341477
d    -0.890581
e    24.000000
dtype: float64

np.exp(s)
a         0.272315
b         0.129487
c         0.310138
d         0.640638
e    162754.791419
dtype: float64
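
Unlike ndarray operations, operations between two Series align on labels first. The result covers the union of both indexes, and a label found in only one operand yields NaN. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# The operands cover different labels: 'a' is missing on the left,
# 'd' is missing on the right. Only 'b' and 'c' exist in both.
aligned = s_demo.iloc[1:] + s_demo.iloc[:-1]
print(aligned)
```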

Name property

Series also has a name attribute, which we can set when creating:

s = pd.Series(np.random.randn(5), name='something')

s
0    0.192272
1    0.110410
2    1.442358
3   -0.375792
4    1.228111
Name: something, dtype: float64

Series also has a rename method, which returns a new Series with a different name:

s2 = s.rename("different")
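
A quick check, on made-up data, that rename returns a new object and leaves the name of the original untouched:

```python
import pandas as pd

s_named = pd.Series([1, 2, 3], name='something')
s2 = s_named.rename('different')

print(s_named.name)   # the original keeps its name
print(s2.name)
```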


A DataFrame is a two-dimensional labeled data structure whose columns can have different types. You can think of it as a spreadsheet, or as a dict of Series objects. A DataFrame can be created from the following kinds of data:

  • A dict of one-dimensional ndarrays, lists, dicts, or Series
  • A structured or record ndarray
  • A two-dimensional NumPy ndarray
  • Another DataFrame

Create from Series

You can create a DataFrame from a dict of Series. The resulting index is the union of the indexes of the individual Series:

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

Pass index to select and reorder the rows:

pd.DataFrame(d, index=['d', 'b', 'a'])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

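Pass columns as well to select and reorder the columns. A column that is not in the dict, such as 'three' here, is filled with NaN: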

pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
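
The row and column labels of the result are available through the index and columns attributes. A small sketch that reuses the dict of Series from above (df_demo is a name introduced here for illustration):

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df_demo = pd.DataFrame(d)

print(df_demo.index)     # union of both Series indexes
print(df_demo.columns)   # the dict keys
```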

Create from ndarrays and lists

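The arrays or lists must all have the same length. If no index is passed, the rows are labeled 0 to n - 1:

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}

pd.DataFrame(d)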
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
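
If the lengths do not match, or a passed index has a different length than the arrays, pandas raises a ValueError. A sketch of both cases (the mismatched data is made up for illustration):

```python
import pandas as pd

bad = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2.]}   # lengths 4 and 3

try:
    pd.DataFrame(bad)
except ValueError as exc:
    print('ValueError:', exc)

# An index of the wrong length is rejected as well.
good = {'one': [1., 2.], 'two': [4., 3.]}
try:
    pd.DataFrame(good, index=['a', 'b', 'c'])
except ValueError as exc:
    print('ValueError:', exc)
```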

Create from structured array

You can create a DataFrame from a structured array:

In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])

In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]

In [49]: pd.DataFrame(data)
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'

In [50]: pd.DataFrame(data, index=['first', 'second'])
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'

In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0

Create from a list of dicts

In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [53]: pd.DataFrame(data2)
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [54]: pd.DataFrame(data2, index=['first', 'second'])
        a   b     c
first   1   2   NaN
second  5  10  20.0

In [55]: pd.DataFrame(data2, columns=['a', 'b'])
   a   b
0  1   2
1  5  10

Create from tuples

You can create a DataFrame with a MultiIndex on both axes by passing a dict of tuples:

In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
       a              b      
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
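
With a MultiIndex in place, an outer label selects a whole group of columns, and a pair of tuples addresses a single cell. A reduced sketch of the example above (df_multi, sub and cell are names introduced here for illustration):

```python
import pandas as pd

df_multi = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                         ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
                         ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8}})

# Both the rows and the columns carry two index levels.
print(df_multi.columns.nlevels, df_multi.index.nlevels)

sub = df_multi['a']                           # all columns under outer label 'a'
cell = df_multi.loc[('A', 'B'), ('a', 'b')]   # one value: row tuple, column tuple
print(sub)
print(cell)
```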

Column selection, addition and deletion

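A DataFrame can be treated like a dict of Series that share an index. Getting, setting and deleting columns uses the same syntax as the corresponding dict operations: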

In [64]: df['one']
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [65]: df['three'] = df['one'] * df['two']

In [66]: df['flag'] = df['one'] > 2

In [67]: df
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False

Columns can be deleted with del, or removed and returned with pop:

In [68]: del df['two']

In [69]: three = df.pop('three')

In [70]: df
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
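
Both del and pop modify the DataFrame in place. If you would rather keep the original intact, DataFrame.drop returns a new object instead. A small sketch on made-up data:

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1., 2.], 'two': [3., 4.], 'three': [5., 6.]})

trimmed = df_demo.drop(columns=['two'])   # new DataFrame without 'two'
popped = df_demo.pop('three')             # removes 'three' in place and returns it

print(trimmed.columns.tolist())
print(df_demo.columns.tolist())
print(popped.tolist())
```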

When you insert a scalar value, it is broadcast to fill the entire column:

In [71]: df['foo'] = 'bar'

In [72]: df
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar

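When you insert a Series whose index differs from that of the DataFrame, it is aligned to the DataFrame's index:

In [73]: df['one_trunc'] = df['one'][:2]

By default, new columns are appended at the end. Use insert to place a column at a specific position: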

In [75]: df.insert(1, 'bar', df['one'])

In [76]: df
   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
c  3.0  3.0   True  bar        NaN
d  NaN  NaN  False  bar        NaN

Use assign to derive new columns from existing columns:

In [77]: iris = pd.read_csv('data/')

In [78]: iris.head()
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [79]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
   ....:      .head())
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

Note that assign returns a new DataFrame and leaves the original unchanged.
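
assign also accepts a callable, which receives the DataFrame being assigned to. This is useful in method chains where the intermediate result has no name. A minimal sketch, using a small hypothetical table in place of the iris file:

```python
import pandas as pd

# Hypothetical measurements standing in for the iris data.
flowers = pd.DataFrame({'SepalLength': [5.0, 4.0], 'SepalWidth': [3.5, 3.0]})

# The lambda is evaluated against the DataFrame that assign is called on.
result = flowers.assign(sepal_ratio=lambda x: x['SepalWidth'] / x['SepalLength'])

print(result)
print(flowers.columns.tolist())   # the original has no new column
```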

The basics of indexing and selection in a DataFrame are summarized in the table below:

Operation                        Syntax          Result
Select column                    df[col]         Series
Select row by label              df.loc[label]   Series
Select row by integer location   df.iloc[loc]    Series
Slice rows                       df[5:10]        DataFrame
Select rows by boolean vector    df[bool_vec]    DataFrame
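
Each row of the table can be tried on a small DataFrame. A sketch with made-up data (df_demo and the variable names are introduced here for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]},
                       index=['a', 'b', 'c'])

col = df_demo['one']                     # Series: one column
row_by_label = df_demo.loc['b']          # Series: one row, looked up by label
row_by_pos = df_demo.iloc[1]             # Series: one row, looked up by position
row_slice = df_demo[0:2]                 # DataFrame: the first two rows
filtered = df_demo[df_demo['one'] > 1]   # DataFrame: rows where the mask is True

for name, obj in [('col', col), ('row_by_label', row_by_label),
                  ('row_by_pos', row_by_pos), ('row_slice', row_slice),
                  ('filtered', filtered)]:
    print(name, type(obj).__name__)
```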

