Pandas Library of Python -- pandas data structure

Posted by norpel on Tue, 21 Jan 2020 15:04:09 +0100

This post is a set of reading notes on "Python for Data Analysis". Please do not reproduce it for commercial purposes.

1. Series

A Series is a one-dimensional array-like object that contains a sequence of values and an associated array of data labels, called its index. The simplest Series is built from nothing more than an array of data:

import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
print(obj)

#
0    4
1    7
2   -5
3    3
dtype: int64

This is the string representation of a Series as shown in an interactive environment, with the index on the left and the values on the right. Since no index was specified for the data, a default index of the integers 0 through N-1 (where N is the length of the data) is created. The values and the index of a Series are available through its values and index attributes, respectively:

print(obj.values)
print(obj.index)

#
[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)

Often you will want to create a Series with an index whose labels identify each data point:

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
print(obj2.index)

#
d    4
b    7
a   -5
c    3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')

Unlike with NumPy arrays, you can use the labels in the index when selecting single values or a set of values:

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2['a'])

obj2['d'] = 6
print(obj2[['c', 'a', 'd']])

#
-5
c    3
a   -5
d    6
dtype: int64

Note the double brackets used to select the values of obj2 at the labels 'c', 'a' and 'd': the inner ['c', 'a', 'd'] is a list of index labels, which here are strings rather than integers.
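The difference between a single label and a list of labels can be sketched as follows. This is only an illustration; the names s, single and subset are made up for the example.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# A single label returns the scalar stored under that label.
single = s['a']

# A list of labels returns a new Series, in the order requested.
subset = s[['c', 'a', 'd']]

print(single)
print(list(subset.index))
print(type(subset).__name__)
```

The outer brackets are the indexing operator, and the inner brackets build the list of labels that is passed to it.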

Using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, multiplying by a scalar, or applying math functions, preserves the link between index and value:

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2[obj2 > 0])
print(obj2 * 2)
print(np.exp(obj2))

#
d    4
b    7
c    3
dtype: int64

d     8
b    14
a   -10
c     6
dtype: int64

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dictionary, since it maps index values to data values. It can be used in many contexts where a dictionary would be:

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print('b' in obj2)
print('e' in obj2)

#
True
False
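A few more dictionary-style operations, as a rough sketch. The variable names are illustrative; get and to_dict are ordinary Series methods.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests look at the index labels, not at the values.
has_label_b = 'b' in s
has_label_7 = 7 in s        # 7 is a value, not a label

# get() returns a default instead of raising KeyError, as with a dict.
fallback = s.get('e', 0)

# Convert back to a plain Python dictionary.
as_dict = s.to_dict()

print(has_label_b, has_label_7, fallback)
print(as_dict)
```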

If you already have data in a Python dictionary, you can create a Series from it by passing the dictionary:

sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
obj3 = pd.Series(sdata)
print(obj3)

#
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When a dictionary is passed to the Series constructor, the index of the resulting Series follows the order of the dictionary's keys, as the output above shows (older pandas versions sorted the keys instead). You can override this by passing the keys in the order you want as the index argument:

sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)

#
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
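What happens to keys that are missing from, or absent in, the requested index can be checked directly. A minimal sketch, reusing the data from the text:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

s = pd.Series(sdata, index=states)

# 'Utah' is not in the requested index, so it is left out.
print('Utah' in s)

# 'California' has no value in sdata, so it is filled with NaN,
# which in turn forces the integer data to become float64.
print(pd.isnull(s['California']))
print(s.dtype)
```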

NaN ("not a number") marks missing, or NA, values in pandas; here 'California' has no entry in sdata, so its value is NaN. The isnull and notnull functions in pandas detect missing data:

print(pd.isnull(obj4))
print(pd.notnull(obj4))

#
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

isnull and notnull are also instance methods of Series:

print(obj4.isnull())

#
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
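Because isnull and notnull return Boolean Series, they combine naturally with sum and with Boolean filtering. A small sketch under that assumption; the variable names are illustrative:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000}
s = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# True counts as 1, so summing the mask counts the missing entries.
n_missing = int(s.isnull().sum())

# Using the notnull mask as a filter keeps only the present values.
present = s[s.notnull()]

print(n_missing)
print(list(present.index))
```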

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

obj4.name = 'population'
obj4.index.name = 'states'
print(obj4)

#
states
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

The index of a Series can be changed in place by assignment:

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)

#
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
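The new index has to have the same length as the data. The sketch below shows what one would expect to see when it does not; the try/except wrapper is only there to capture the error for display.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])
s.index = ['Bob', 'Steve', 'Jeff', 'Ryan']    # same length: accepted

try:
    s.index = ['Bob', 'Steve']                # wrong length: rejected
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__)
print(len(s.index))
```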

2. DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series that all share the same index. Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dictionary, or some other collection of one-dimensional arrays.

Although a DataFrame is two-dimensional, hierarchical indexing lets it represent higher-dimensional data in tabular form. Hierarchical indexing is one of the more advanced data-handling features in pandas.

There are many ways to build a DataFrame. One of the most common is from a dictionary of equal-length lists or NumPy arrays:

data = {'states': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
print(frame)

#
   states  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

The resulting DataFrame has its index assigned automatically, as with Series, and the columns appear in the order of the dictionary's keys (older pandas versions sorted them). For a large DataFrame, the head method selects only the first five rows:

print(frame.head())

#
   states  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

If the order of columns is specified, the columns of DataFrame will be arranged in the specified order:

frame1 = pd.DataFrame(data, columns=['year', 'states', 'pop'])
print(frame1)

#
   year  states  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

If you pass a column that is not contained in the dictionary, it appears in the result filled with missing values:

frame2 = pd.DataFrame(data, columns=['year', 'states', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)

#
       year  states  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
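The columns argument both selects and orders columns, so a key left out of it is dropped, while a name with no data behind it comes back as all NaN. A minimal sketch with a shortened version of the data above:

```python
import pandas as pd

data = {'states': ['Ohio', 'Nevada'],
        'year': [2000, 2001],
        'pop': [1.5, 2.4]}

# 'pop' is not requested, 'debt' has no data in the dictionary.
frame = pd.DataFrame(data, columns=['year', 'states', 'debt'])

print(list(frame.columns))
print(frame['debt'].isnull().all())
```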

A column in a DataFrame can be retrieved as a Series either with dictionary-like notation or as an attribute:

print(frame2['states'])
print(frame2.year)

#
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: states, dtype: object

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Note that the returned Series has the same index as the DataFrame, and its name attribute is set appropriately. Rows can also be retrieved by label with the special loc attribute (iloc does the same by integer position):

print(frame2.loc['three'])

#
year      2002
states    Ohio
pop        3.6
debt       NaN
Name: three, dtype: object
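As a small illustration of row selection, loc looks a row up by its index label and iloc by its integer position. The frame here is a shortened, hypothetical version of frame2:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

by_label = frame.loc['three']     # row whose index label is 'three'
by_position = frame.iloc[2]       # third row, counting from 0

print(by_label.name)
print(by_label.equals(by_position))
```

In both cases the row comes back as a Series whose index is the DataFrame's column labels and whose name is the row label.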

Columns can be modified by assignment. For example, the empty 'debt' column can be assigned a scalar value or an array of values:

frame2['debt'] = 16.5
print(frame2)

#
       year  states  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

frame2['debt'] = np.arange(6)
print(frame2)

#
       year  states  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5

When you assign a list or an array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are realigned exactly to the DataFrame's index, and missing values are inserted wherever the Series has no matching label:

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)

#
       year  states  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
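The two rules can be put side by side in a short sketch. The column name debt2 is hypothetical, and the try/except only captures the expected error:

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

# A Series is aligned on its labels; rows without a match become NaN.
frame['debt'] = pd.Series([-1.2, -1.7], index=['two', 'three'])

# A plain list carries no labels, so its length must match exactly.
try:
    frame['debt2'] = [1, 2]
    error = None
except ValueError as exc:
    error = exc

print(frame)
print(type(error).__name__)
```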

Assigning to a column that does not exist creates a new column. The del keyword deletes columns, just as it deletes keys from a dictionary.

As an example for del, we first add a new column of Boolean values that indicates whether the states column equals 'Ohio':

frame2['eastern'] = frame2.states == 'Ohio'
print(frame2)

#
       year  states  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False

Note: new columns cannot be created with the frame2.eastern attribute syntax; use the bracket notation instead.

The del keyword can then be used to remove the column:

del frame2['eastern']
print(frame2.columns)

#
Index(['year', 'states', 'pop', 'debt'], dtype='object')
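The note about attribute syntax can be verified with a short sketch. Here the warning that pandas emits for this pattern is silenced on purpose, and the frame is a shortened, hypothetical version of frame2:

```python
import warnings

import pandas as pd

frame = pd.DataFrame({'states': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Attribute assignment only sets a Python attribute on the object.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.eastern = frame.states == 'Ohio'
created_by_attribute = 'eastern' in frame.columns

# Bracket assignment creates the column; del removes it again.
frame['eastern'] = frame.states == 'Ohio'
created_by_brackets = 'eastern' in frame.columns
del frame['eastern']

print(created_by_attribute, created_by_brackets)
print(list(frame.columns))
```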

Note: in the pandas version these notes are based on, a column selected from a DataFrame is a view on the underlying data, not a copy, so in-place changes to the Series are reflected in the DataFrame. If you need a copy, call the copy method of the Series explicitly.
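Taking an explicit copy is the safe option whatever the version. Whether an uncopied column shares memory with its DataFrame may depend on the pandas version and its copy-on-write settings, so this sketch only demonstrates the copy case:

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

col = frame['pop'].copy()    # independent copy of the column
col['one'] = 99.0            # modifies the copy only

print(col['one'])
print(frame.loc['one', 'pop'])   # the DataFrame still holds 1.5
```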

Another common form of data is a nested dictionary, that is, a dictionary of dictionaries. If a nested dictionary is passed to DataFrame, pandas interprets the outer keys as the columns and the inner keys as the row index:

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
print(frame3)

#
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5

You can transpose the DataFrame (swap rows and columns) with syntax similar to that of a NumPy array:

print(frame3.T)

#
        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5

The keys of the inner dictionaries are combined to form the index of the result (older pandas versions also sorted them). This does not happen if an index is specified explicitly; the labels you pass are used as given:

print(pd.DataFrame(pop, index=[2001, 2002, 2003]))

#
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
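The handling of nested dictionaries can be checked without relying on a particular row order. A minimal sketch using the same pop data; the names frame and explicit are illustrative:

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame = pd.DataFrame(pop)

# Outer keys become columns; the union of inner keys becomes the index.
print(list(frame.columns))
print(sorted(frame.index))

# A year that only one state has is NaN for the other.
print(pd.isnull(frame.loc[2000, 'Nevada']))

# With an explicit index, exactly those labels are used.
explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])
print(list(explicit.index))
```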

Dictionaries of Series can also be used to construct DataFrames:

pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
print(pd.DataFrame(pdata))

#
      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9

As with Series, the values attribute of a DataFrame returns its data as a two-dimensional ndarray:

print(frame3.values)

#
[[2.4 1.7]
 [2.9 3.6]
 [nan 1.5]]

If the DataFrame's columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns:

print(frame2.values)

#
[[2000 'Ohio' 1.5 nan]
 [2001 'Ohio' 1.7 -1.2]
 [2002 'Ohio' 3.6 nan]
 [2001 'Nevada' 2.4 -1.5]
 [2002 'Nevada' 2.9 -1.7]
 [2003 'Nevada' 3.2 nan]]
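The dtype chosen for values can be inspected directly. A rough sketch with two small, made-up frames:

```python
import pandas as pd

numeric = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Integers and floats fit into a common float64 array.
print(numeric.values.dtype)

# Numbers and strings only fit together as generic Python objects.
print(mixed.values.dtype)
```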

Valid input for the DataFrame constructor

Type                                     Notes
2D ndarray                               A matrix of data; row and column labels are optional arguments
Dictionary of arrays, lists, or tuples   Each sequence becomes a column of the DataFrame; all sequences must be the same length
NumPy structured/record array            Treated the same as a dictionary of arrays
Dictionary of Series                     Each value becomes a column; the indexes of the Series are combined to form the row index of the result unless an index is passed explicitly
Dictionary of dictionaries               Each inner dictionary becomes a column; the keys are combined to form the row index of the result
List of dictionaries or Series           Each element becomes a row of the DataFrame; the dictionary keys or Series indexes are combined to form the column labels
List of lists or tuples                  Treated the same as a 2D ndarray
Another DataFrame                        The indexes of the original DataFrame are used unless different ones are passed explicitly
NumPy MaskedArray                        Like a 2D ndarray, except that masked values become NA/missing values in the resulting DataFrame
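A few of the input types listed above, as a hedged sketch. All variable names and the sample values are invented for the illustration:

```python
import numpy as np
import pandas as pd

# 2D ndarray, with optional row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          index=['a', 'b', 'c'], columns=['x', 'y'])

# List of dictionaries: one row per element, keys become columns.
from_records = pd.DataFrame([{'x': 1, 'y': 2}, {'x': 3, 'z': 4}])

# List of tuples: handled like a 2D ndarray.
from_tuples = pd.DataFrame([(1, 'a'), (2, 'b')],
                           columns=['num', 'letter'])

print(from_array.shape)
print(sorted(from_records.columns))
print(list(from_tuples.columns))
```

In from_records, a key that is missing from one of the dictionaries shows up as NaN in that row.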

3. Index object

Index objects in pandas hold the axis labels and other metadata, such as the axis name. Any array or other sequence of labels that you use when constructing a Series or DataFrame is converted internally to an Index:

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(obj)
print(index)
print(index[1:])

#
a    0
b    1
c    2
dtype: int64
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')

Index objects are immutable, so they cannot be modified by the user. Immutability makes it safer to share Index objects among data structures:

labels = pd.Index(np.arange(3))
print(labels)

obj2 = pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)

print(obj2.index is labels)

#
Int64Index([0, 1, 2], dtype='int64')

0    1.5
1   -2.5
2    0.0
dtype: float64

True
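What immutability means in practice can be seen by trying an item assignment. A small sketch; the try/except only captures the expected error:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'           # item assignment is not supported
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)
print(list(index))           # the labels are unchanged
```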

Many users will rarely need the capabilities of Index objects directly, but because some operations return results that contain indexed data, it is important to understand how they work.

Some Index methods and properties

Method           Description
append           Concatenates additional Index objects onto the original, producing a new Index
difference       Computes the set difference of two indexes as a new Index
intersection     Computes the set intersection of two indexes
union            Computes the set union of two indexes
isin             Computes a Boolean array indicating whether each value is contained in the passed collection
delete           Computes a new Index with the element at position i deleted
drop             Computes a new Index by deleting the passed values
insert           Computes a new Index by inserting an element at position i
is_monotonic     Returns True if each element is greater than or equal to the previous element (spelled is_monotonic_increasing in recent pandas versions)
is_unique        Returns True if the Index has no duplicate values
unique           Computes the array of unique values in the Index
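Several of these methods are shown together in the sketch below. The two Index objects, left and right, are made up for the example. Every call returns a new object and leaves left untouched:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(list(left.union(right)))          # labels found in either index
print(list(left.intersection(right)))   # labels found in both
print(list(left.difference(right)))     # labels only in left
print(list(left.isin(['a', 'c'])))      # membership test per label

print(list(left.delete(0)))             # remove by position
print(list(left.drop(['b'])))           # remove by value
print(list(left.insert(1, 'z')))        # insert at position 1

combined = left.append(right)           # concatenation keeps duplicates
print(combined.is_unique)
print(list(combined.unique()))
```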

 

 

 
