Learning python Pandas Library

Posted by doremi on Thu, 09 Dec 2021 08:58:33 +0100

Pandas

Series object

The Series object is used to represent a one-dimensional data structure. Each element of its main array will have an associated label. (roughly as shown in the figure below)

Object declaration

By PD Series(). If no label is specified, the value incremented once from 0 is used as the label by default.

s = pd.Series([12, -4, 7, 9])
print(s)
------------------------------------
0    12
1    -4
2     7
3     9
dtype: int64
s = pd.Series([12, -4, 7, 9], index=["A", "B", "C", "D"])
print(s)
-------------------------------
A    12
B    -4
C     7
D     9
dtype: int64

Of course, objects can also be declared in other forms (the operation is as follows:

Note that in this case, if the value of the original object changes, the elements in the Series object will also change.

arr = np.array([1, 2, 3, 4])
s = pd.Series(arr)
print(s)
--------------------------------
0    1
1    2
2    3
3    4
dtype: int32

Series can also be initialized using a dictionary (its format is almost the same as that of a dictionary).

my_dict = {"red":100, "blue":200,"green":300}
s = pd.Series(my_dict)
print(s)
-----------------------------------
red      100
blue     200
green    300
dtype: int64

Property acquisition

There is no doubt that you can get two properties of Seris, Index and Values.

s = pd.Series([12, -4, 7, 9], index = ["A", "B", "C", "D"])
print(s)
print("values:",s.values)
print("index:",s.index)
----------------------------------
A    12
B    -4
C     7
D     9
dtype: int64
values: [12 -4  7  9]
index: Index(['A', 'B', 'C', 'D'], dtype='object')

Object operation

Element assignment

Assign values to elements by subscript or index.

s = pd.Series([12, -4, 7, 9], index=["A", "B", "C", "D"])
print(s)
s[0] = 0
print("0:",s[0])
s["B"] = 0
print("B",s["B"])
-----------------------------------
A    12
B    -4
C     7
D     9
dtype: int64
0: 0
B 0

Element filtering

S [S > x], used to filter elements in the Series array whose element value is greater than x.

s = pd.Series([10, 5, -5, 12], index=["A", "B", "C", "D"])
print(s)
print()
print(s[s > 8])
--------------------------
A    10
B     5
C    -5
D    12
dtype: int64

A    10
D    12
dtype: int64

Element operation

Since Series can be regarded as an upgraded version of np.array, the performance of both object operation and mathematical function is the same. (addition, subtraction, multiplication and division, logarithm, etc., no more demonstration

import pandas as pd
s = pd.Series([12, 5, 4, 9])
print(s)
s = s - 2
print(s)
-------------------------
0    12
1     5
2     4
3     9
dtype: int64
0    10
1     3
2     2
3     7
dtype: int64

Element de duplication

To de duplicate a Series object, you need to call the function unique(). Note that the order may change after de duplication.

import pandas as pd
s = pd.Series([12, 5, 9, 4, 4, 5, 5])
print("Before weight removal:",s.values)
s = s.unique()
print("After weight removal",s)
--------------------------------------
Before weight removal: [12  5  9  4  4  5  5]
After weight removal [12  5  9  4]

Element occurrence frequency

The frequency of elements in the function $Series.value_counts() can be calculated.

s = pd.Series([12, 5, 5, 5, 1, 1, 2])
print(s.value_counts())
---------------------------------
5     3
1     2
12    1
2     1
dtype: int64

Ownership judgment

Use the function Series.isin() to judge. In addition, it can also be used as a filter condition.

s = pd.Series([1, 5, 8, 10, 15])
print(s.isin([5,10]))
print()
print(s[s.isin([5, 10])])
--------------------------------------
0    False
1     True
2    False
3     True
4    False
dtype: bool

1     5
3    10
dtype: int64

NaN

NaN, i.e. Not a Number, can be used when a property is lost (or not collected) during data processing. In addition, it can be used in combination with isnull() and notnull(). (No

In addition, please note that although NaN is introduced in Pandas, its declaration is np.NaN.

isnull() is a function that isnull can regard as an attribute and can be used to filter elements.

s = pd.Series([12, 5, np.NaN, 10])
print(s[pd.isnull])
print()
print(s.isnull())
print()
print(s.notnull())
-----------------------------------
2   NaN
dtype: float64

0    False
1    False
2     True
3    False
dtype: bool

0     True
1     True
2    False
3     True
dtype: bool

Operation of object

Considering the operation between Series objects, it is actually similar to the broadcast mechanism of array, but the difference is that only the attributes with a common label (index) will have results, otherwise it is Nan (non number).

s1 = pd.Series([10, 5, 2],index=["R", "G", "B"])
s2 = pd.Series([5, 2, 1],index=["G", "B", "F"])
print(s1 + s2)
----------------------------------
B     4.0
F     NaN
G    10.0
R     NaN
dtype: float64

DataFrame object

DateFrame is actually a multidimensional Series (or a combination of multiple Series). It has two indexes, row index and column index (you can think of the column index in DataFrame as the row index in a single Series).

You can also think of DataFrame as a dictionary composed of Series.

Object declaration

It can be initialized through a dictionary + list.

my_dict = {"color":["R", "G", "B"],
           "object":["you", "me", "him"],
           "value":[1, 3, 5]}

frame = pd.DataFrame(my_dict,index=["A", "B", "C"])
print(frame)
-------------------------------------
  color object  value
A     R    you      1
B     G     me      3
C     B    him      5

The attribute clolumns exists in the DataFrame, which can purify the list or be used when initializing the DataFrame.

my_dict = {"color":["R", "G", "B"],
           "object":["you", "me", "him"],
           "value":[1, 3, 5]}
frame = pd.DataFrame(my_dict, columns=["color","value"])
print(frame)
-------------------------------------
  color  value
0     R      1
1     G      3
2     B      5
frame = pd.DataFrame(np.arange(25).reshape(5, 5), columns=["A", "B", "C", "D", "E"],
                     index=["1", "2", "3", "4", "5"])
print(frame)
------------------------------------
    A   B   C   D   E
1   0   1   2   3   4
2   5   6   7   8   9
3  10  11  12  13  14
4  15  16  17  18  19
5  20  21  22  23  24

Object operation

Element selection

Considering that DataFrames have many properties, there are many methods to select elements.

DataFrame.columns: get the names of all columns (actually the top row)

dict_data = {"color":["R", "G", "B"],
             "object":["pen","pan","pencil"],
             "value":[100, 200, 300]}
frame = pd.DataFrame(dict_data)
print(frame.columns)
--------------------------------
Index(['color', 'object', 'value'], dtype='object')

DataFrame.values: get each element in it and store it as a list. ([column name 1_1, column name 1_2,...], [],...)

dict_data = {"color":["R", "G", "B"],
             "object":["pen","pan","pencil"],
             "value":[100, 200, 300]}
frame = pd.DataFrame(dict_data)
print(frame.values)
------------------------------------------
[['R' 'pen' 100]
 ['G' 'pan' 200]
 ['B' 'pencil' 300]]

DataFrame.index: get the row index (actually the row standing on the left ~)

dict_data = {"color":["R", "G", "B"],
             "object":["pen","pan","pencil"],
             "value":[100, 200, 300]}
frame = pd.DataFrame(dict_data)
print(frame.index)
-------------------------------------
RangeIndex(start=0, stop=3, step=1)

Get directly through the index (since it is [], you can certainly get pull ~)

dict_data = {"color":["R", "G", "B"],
             "object":["pen","pan","pencil"],
             "value":[100, 200, 300]}
frame = pd.DataFrame(dict_data)
print(frame["value"])
print()
print(frame.loc[1])
print()
print(frame["object"][1])
----------------------------
0    100
1    200
2    300
Name: value, dtype: int64
        
color       G
object    pan
value     200
Name: 1, dtype: object
        
pan

Multiline selection (i.e. slicing of arrays)

It should be noted here that frame[0:2] is the correct operation, while frame[0] is the wrong operation!

dict_data = {"color":["R", "G", "B"],
             "object":["pen","pan","pencil"],
             "value":[100, 200, 300]}
frame = pd.DataFrame(dict_data)
print(frame[0:2])
---------------------------------
  color object  value
0     R    pen    100
1     G    pan    200

Element assignment

The assignment of some values is like a Series object, but there is also the operation of adding columns in the DataFrame~

ps: I can't understand why here A is in the front

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     columns=["A", "B", "C", "D", "E"],
                     index=["a", "b", "c", "d", "e"])
frame["A"]["a"] = 100
print(frame)
-------------------------------
     A   B   C   D   E
a  100   1   2   3   4
b    5   6   7   8   9
c   10  11  12  13  14
d   15  16  17  18  19
e   20  21  22  23  24

Add a new column. Simply use frame [] = [], which is similar to adding a dictionary.

However, please keep the number of elements consistent~

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     columns=["A", "B", "C", "D", "E"],
                     index=["a", "b", "c", "d", "e"])
frame["hahaha!"] = [1, 2, 3, 4, 5]
print(frame)
---------------------------
    A   B   C   D   E  hahaha!
a   0   1   2   3   4        1
b   5   6   7   8   9        2
c  10  11  12  13  14        3
d  15  16  17  18  19        4
e  20  21  22  23  24        5

Affiliation

It is roughly the same as that in Series, which is not explained too much.

(ps: you can compare the results above and below

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     columns=["A", "B", "C", "D", "E"],
                     index=["a", "b", "c", "d", "e"])
print(frame.isin([4, 1]))
------------------------------------
	   A      B      C      D      E  hahaha!
a  False   True  False  False   True     True
b  False  False  False  False  False    False
c  False  False  False  False  False    False
d  False  False  False  False  False     True
e  False  False  False  False  False    False

Element filtering

Same as above. (just look at the code

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     columns=["A", "B", "C", "D", "E"],
                     index=["a", "b", "c", "d", "e"])
print(frame[frame >= 20])
#Here, the array name may also represent the variable name of each element
-------------------------------------
      A     B     C     D     E  hahaha!
a   NaN   NaN   NaN   NaN   NaN      NaN
b   NaN   NaN   NaN   NaN   NaN      NaN
c   NaN   NaN   NaN   NaN   NaN      NaN
d   NaN   NaN   NaN   NaN   NaN      NaN
e  20.0  21.0  22.0  23.0  24.0      NaN

Delete column

Call del DataFrame [column name] (it is widely used, but it is also explained)

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     columns=["A", "B", "C", "D", "E"],
                     index=["a", "b", "c", "d", "e"])
del frame["A"]
print(frame)
-----------------------------
    B   C   D   E
a   1   2   3   4
b   6   7   8   9
c  11  12  13  14
d  16  17  18  19
e  21  22  23  24

Transpose

Transpose by calling the property T in the DataFrame.

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     columns=["A", "B", "C", "D", "E"],
                     index=["a", "b", "c", "d", "e"])
print(frame.T)
---------------------------
   a  b   c   d   e
B  1  6  11  16  21
C  2  7  12  17  22
D  3  8  13  18  23
E  4  9  14  19  24

Index object

The Index object is the axis (horizontal and vertical) of the array used above.

Method use

Get the minimum value of the index, XXX.idxmin().

s = pd.Series([5, 10, 15, 20],index=["a1", "b2", "c3", "d4"])
print(s.index)
print(s.idxmin())
---------------------------
Index(['a1', 'b2', 'c3', 'd4'], dtype='object')
a1

Gets the maximum value of the index, XXX.idxmax().

ps: the comparison method is to compare various strings. The comparison method between strings will not be explained too much.

s = pd.Series([5, 10, 15, 20],index=["a1", "b2", "c3", "d4"])
print(s.index)
print(s.idxmax())
----------------------------------------
Index(['a1', 'b2', 'c3', 'd4'], dtype='object')
d4

Index replacement

When you are not satisfied with the current index in the array, you can use the reindex() function to replace the index of the object.

In addition, if a new index is added, NaN will be added as the value of the new index.

(in addition, why not initialize the index to your satisfaction from the beginning (No

s = pd.Series([5, 10, 15, 20],index=["a1", "b2", "c3", "d4"])
print(s)
s = s.reindex(["a1", "B", "c3", "D","d4"])
print(s)
-----------------------------
a1     5
b2    10
c3    15
d4    20
dtype: int64

a1     5.0
B      NaN
c3    15.0
D      NaN
d4    20.0
dtype: float64

reindex() has the parameter method, which is the filling method of the parameter.

If method = "fill", i.e. front fill, fill the current NaN value with the previous subscript.

method = "bfill", i.e. back fill, is opposite to the above description.

frame = pd.DataFrame(np.arange(9).reshape(3, 3),columns=["A", "B", "C"],
                     index=["a", "b", "c"])
print(frame)
frame = frame.reindex(["a", "d", "daa!", "b"],method="ffill")
print(frame)
------------------------------------------
   A  B  C
a  0  1  2
b  3  4  5
c  6  7  8

       A  B  C
a      0  1  2
d      6  7  8
daa!  6  7  8
b      3  4  5

Or, you can directly use fill_value = 0, that is, fill the default NaN to 0.

(this will not be demonstrated anymore) ▔﹏▔▔▔▔▔▔

Delete index

It is the same as del Array [], but it is handled by another function. drop().

ps: Series(DataFrame, too) seems to be immutable, so use s = s.drop("A") instead of just s.drop("A"). Remember.

s = pd.Series(np.arange(4), index=["A", "B", "C", "d"])
print(s)
print()
s = s.drop("A")
print(s)
--------------------------
A    0
B    1
C    2
d    3
dtype: int32

B    1
C    2
d    3
dtype: int32

Data alignment

As with the addition between Series, due to the length is too long, it will not be explained.

Operations between data structures

Data structures can be directly operated by arithmetic operators. Of course, the same purpose can be achieved by the following methods.

​ add()

​ sub()

​ div ()

​ sub()

Words are like people, no longer explain.

Operation

Demonstrate the principle of operation through addition.

Note that the Series here is aligned with the column names of the DataFrame, not the index.

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index = ["A", "B", "C"],
                     columns=["a", "b", "c"])

ser = pd.Series(np.arange(3), index=["a", "b", "c"])

print(frame + ser)
-------------------------------
   a  b   c
A  0  2   4
B  3  5   7
C  6  8  10

Other operations are the same as the above demonstration.

Function application and mapping

General function

square root

np.sqrt() can find the square root of each element in the sequence.

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     columns=["A", "B", "C"],
                     index=["a", "B", "c"])
print(np.sqrt(frame))
--------------------------------------------
    A         B         C
a  0.000000  1.000000  1.414214
B  1.732051  2.000000  2.236068
c  2.449490  2.645751  2.828427

Summation and average

np.sum() is used to find the sum of the elements of the sequence, and np.mean() is used to find the average value of the sequence. Among them, the more important is. describe().

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     columns=["A", "B", "C"],
                     index=["a", "B", "c"])
print(np.sum(frame))
print()
print(np.mean(frame))
print()
print(frame.describe())
------------------------------
A     9
B    12
C    15
dtype: int64

A    3.0
B    4.0
C    5.0
dtype: float64
    
         A    B    C
count  3.0  3.0  3.0
mean   3.0  4.0  5.0
std    3.0  3.0  3.0
min    0.0  1.0  2.0
25%    1.5  2.5  3.5
50%    3.0  4.0  5.0
75%    4.5  5.5  6.5
max    6.0  7.0  8.0    

Relevance

This can be achieved through DataFrame.corr().

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     columns = ["B", "C", "A", "D", "E"],
                     index=["a", "c", "b", "d", "e"])
print(frame.corr())
----------------------------
     B    C    A    D    E
B  1.0  1.0  1.0  1.0  1.0
C  1.0  1.0  1.0  1.0  1.0
A  1.0  1.0  1.0  1.0  1.0
D  1.0  1.0  1.0  1.0  1.0
E  1.0  1.0  1.0  1.0  1.0

Covariance

It can be realized simply through. cov() (please don't write the function yourself!

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     columns = ["B", "C", "A", "D", "E"],
                     index=["a", "c", "b", "d", "e"])
print(frame.cov())
-------------------------------
      B     C     A     D     E
B  62.5  62.5  62.5  62.5  62.5
C  62.5  62.5  62.5  62.5  62.5
A  62.5  62.5  62.5  62.5  62.5
D  62.5  62.5  62.5  62.5  62.5
E  62.5  62.5  62.5  62.5  62.5

Sorting and ranking

Through the function dataframe sort_ Index(), which has two parameters: ascending, which is used to specify whether to sort in descending order (False) or ascending order (True and the default value); for $axis, when axis = 0, it is sorted by column (and the default value), and when axis = 1, it is sorted by row.

Note that the permutations mentioned here are for Index rather than Value.

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     columns = ["B", "C", "A", "D", "E"],
                     index=["a", "c", "b", "d", "e"])
print(frame)
print()
print(frame.sort_index(ascending=False))
print()
print(frame.sort_index(axis=1))
----------------------------------------------
    B   C   A   D   E
a   0   1   2   3   4
c   5   6   7   8   9
b  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24

    B   C   A   D   E
e  20  21  22  23  24
d  15  16  17  18  19
c   5   6   7   8   9
b  10  11  12  13  14
a   0   1   2   3   4

    A   B   C   D   E
a   2   0   1   3   4
c   7   5   6   8   9
b  12  10  11  13  14
d  17  15  16  18  19
e  22  20  21  23  24

Of course, you can also sort based on the columns you give, such as sort_values(by = ["A", "B"]), and the results will not be demonstrated.

Ranking operation

Use DataFrame.rank() to arrange seats (but I can't understand the practical significance).

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     columns = ["B", "C", "A", "D", "E"],
                     index=["a", "c", "b", "d", "e"])
print(frame.rank())
--------------------------------------
     B    C    A    D    E
a  1.0  1.0  1.0  1.0  1.0
c  2.0  2.0  2.0  2.0  2.0
b  3.0  3.0  3.0  3.0  3.0
d  4.0  4.0  4.0  4.0  4.0
e  5.0  5.0  5.0  5.0  5.0

NaN data

NaN filtration

. dropna() and notnull() filtering can achieve the purpose of filtering out NaN elements.

When using dropna(), it should be noted that the whole row or column may be NaN, so you need to set the how parameter variable. If how = "any", the row or column will be deleted as long as there is a default value. If how = "all", it will be deleted only when all values are NaN.

frame = pd.DataFrame([[1, 2, 3], [np.nan, np.nan, np.nan], [4, 5, 6]]
                     ,columns=["A", "B", "C"],
                     index=["a", "b", "c"])
print(frame)

print()

print(frame.dropna(how="all"))
--------------------------------------
     A    B    C
a  1.0  2.0  3.0
b  NaN  NaN  NaN
c  4.0  5.0  6.0


     A    B    C
a  1.0  2.0  3.0
c  4.0  5.0  6.0

Fill NaN

Use. fillna(x) to transform the NaN of the sequence into an x value~

frame = pd.DataFrame([[1, 2, 3], [np.nan, np.nan, np.nan], [4, 5, 6]]
                     ,columns=["A", "B", "C"],
                     index=["a", "b", "c"])
print(frame)
print()
print(frame.fillna(0))
------------------------------------
     A    B    C
a  1.0  2.0  3.0
b  NaN  NaN  NaN
c  4.0  5.0  6.0

     A    B    C
a  1.0  2.0  3.0
b  0.0  0.0  0.0
c  4.0  5.0  6.0

Adjustment of hierarchical order

(it's too complicated and not commonly used. Let's skip it...)....

Topics: Python pandas