Ten minute introduction to pandas

Posted by filmixt on Fri, 19 Nov 2021 22:29:22 +0100

Translated from https://pandas.pydata.org/docs/user_guide/10min.html

Based on pandas version 0.23.0

pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

pandas provides two core data structures:

  • Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.)
  • DataFrame is a two-dimensional labeled data structure whose columns can have different types. You can think of it as a spreadsheet or SQL table, or as a dictionary of Series objects (see the sketch below). It is used more often than Series.
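
For example, selecting a single column from a DataFrame returns a Series. A minimal sketch (the column names x and y are chosen purely for illustration):

import pandas as pd
# A DataFrame behaves like a dict of Series objects sharing one index
frame = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
col = frame["x"]           # selecting one column yields a Series
print(type(col).__name__)  # Series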

Create objects

Pass in a list to create a Series object with the default integer index

Series

import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

DataFrame

Pass in a NumPy array, with the DatetimeIndex dates as the row index and list("ABCD") as the column labels, to create a DataFrame object

import numpy as np
import pandas as pd
dates = pd.date_range("20130101", periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2013-01-01 -1.072976 -0.935738 -0.239158 -0.294826
2013-01-02 -1.127384 -0.567880  0.182306 -2.934319
2013-01-03  0.544147  0.183749 -1.041106 -1.440546
2013-01-04  0.115221 -0.152783 -1.984568 -0.561250
2013-01-05  1.126787 -0.290174  0.340003 -2.309574
2013-01-06 -0.560692 -0.647581 -0.563737 -0.377166

Pass in a dictionary to create a DataFrame object: the dictionary keys become the column names, and the default integer index is used

Different columns in a DataFrame can have different dtypes

f2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
print(f2)
print(f2.dtypes)
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

View object

View the top and bottom rows

head() and tail() return the first and last rows of the DataFrame

f2.head(2)
# 2 is the number of rows to show; if omitted, head() returns the first 5 rows by default
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
f2.tail(2)
# 2 is the number of rows to show; if omitted, tail() returns the last 5 rows by default
     A          B    C  D      E    F
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

View index and column names

f2.index

Int64Index([0, 1, 2, 3], dtype='int64')
f2.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
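
The underlying NumPy data can also be inspected with the values attribute (a standard pandas attribute):

f2.values
# Returns a single object-dtype NumPy array, since the columns of f2 have mixed dtypes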

Obtain statistics

describe() returns summary statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns of the DataFrame

print(f2.describe())

# Only columns A, C and D (numeric types float64, float32, int32) are included
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
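
describe() also accepts an include parameter; passing include="all" summarizes the non-numeric columns as well (count, unique, top, freq), leaving NaN where a statistic does not apply. A quick sketch:

print(f2.describe(include="all"))
# Adds columns B, E and F to the summary; non-applicable cells show NaN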

Transpose

print(df.T)
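
df.T swaps the row and column labels. A small self-contained illustration with fixed values (not the random df above):

small = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
print(small.T)
#    0  1
# x  1  2
# y  3  4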

Sort

raw data:

                   A         B         C         D
2013-01-01  0.187485 -2.743918  1.000627  1.730537
2013-01-02 -0.248065  0.051679  1.200797  2.189407
2013-01-03 -1.029446  0.525787 -0.737382  1.494209
2013-01-04 -0.882341 -1.104881  0.277582  0.154353
2013-01-05  0.332548 -0.874347 -0.188176 -0.546178
2013-01-06 -0.237779  0.196372 -0.304217 -0.596813
  1. Sort by label. axis=1 sorts the columns by their labels, axis=0 sorts the rows by the index, and ascending controls whether the order is ascending or descending
print(df.sort_index(axis=1, ascending=False))
                   D         C         B         A
2013-01-01  1.730537  1.000627 -2.743918  0.187485
2013-01-02  2.189407  1.200797  0.051679 -0.248065
2013-01-03  1.494209 -0.737382  0.525787 -1.029446
2013-01-04  0.154353  0.277582 -1.104881 -0.882341
2013-01-05 -0.546178 -0.188176 -0.874347  0.332548
2013-01-06 -0.596813 -0.304217  0.196372 -0.237779
  2. Sort by the values of a column
print(df.sort_values(by="B"))

                   A         B         C         D
2013-01-01  0.187485 -2.743918  1.000627  1.730537
2013-01-04 -0.882341 -1.104881  0.277582  0.154353
2013-01-05  0.332548 -0.874347 -0.188176 -0.546178
2013-01-02 -0.248065  0.051679  1.200797  2.189407
2013-01-06 -0.237779  0.196372 -0.304217 -0.596813
2013-01-03 -1.029446  0.525787 -0.737382  1.494209
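
sort_values can also sort in descending order or by several columns at once; ascending accepts one flag per sort key. These are standard parameters, sketched here without output:

print(df.sort_values(by="B", ascending=False))                 # descending by B
print(df.sort_values(by=["A", "B"], ascending=[True, False]))  # A ascending, ties broken by B descending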

Select data

raw data:

                   A         B         C         D
2013-01-02  0.865408 -2.301539  1.744812 -0.761207
2013-01-05 -0.172428 -0.877858  0.042214  0.582815
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969
2013-01-04 -0.322417 -0.384054  1.133769 -1.099891
2013-01-03  0.319039 -0.249370  1.462108 -2.060141
2013-01-06 -1.100619  1.144724  0.901591  0.502494

Select a single column by column name; the result is a Series

Select data by column

print(df["A"])

2013-01-01    1.624345
2013-01-02    0.865408
2013-01-03    0.319039
2013-01-04   -0.322417
2013-01-05   -0.172428
2013-01-06   -1.100619
Freq: D, Name: A, dtype: float64

Select multiple columns by column name; the result is a DataFrame

print(df.loc[:, ["A", "B"]])

                   A         B
2013-01-01  1.624345 -0.611756
2013-01-02  0.865408 -2.301539
2013-01-03  0.319039 -0.249370
2013-01-04 -0.322417 -0.384054
2013-01-05 -0.172428 -0.877858
2013-01-06 -1.100619  1.144724

Select data by row

Select a row of data by row name

print(df.loc["2013-01-03"])

A    0.319039
B   -0.249370
C    1.462108
D   -2.060141
Name: 2013-01-03 00:00:00, dtype: float64

Select multiple rows of data with a positional slice

print(df[0:2])

                   A         B         C         D
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969
2013-01-02  0.865408 -2.301539  1.744812 -0.761207

Select data by row and column

Specify a row label and a column name to select a single value in the DataFrame

print(df.loc[dates[0], "A"])
print(df.at[dates[0], "A"])
# The two lines above are equivalent; at is optimized for fast scalar access

1.6243453636632417
1.6243453636632417

Specify a row name and multiple column names and select data

print(df.loc["20130102", ["A", "B"]])

A    0.865408
B   -2.301539
Name: 2013-01-02 00:00:00, dtype: float64

Specify start and end row labels and multiple column names to select data

print(df.loc["20130102":"20130104", ["A", "B"]])

                   A         B
2013-01-02  0.865408 -2.301539
2013-01-03  0.319039 -0.249370
2013-01-04 -0.322417 -0.384054
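
Note that, unlike ordinary Python slicing, label slices with loc include both endpoints, which is why 2013-01-04 appears above. Selecting a single column over the same label slice returns a Series:

print(df.loc["20130102":"20130104", "A"])
# A Series indexed by the three dates from 2013-01-02 through 2013-01-04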

Select data by location

Select the row at integer position 3 (the fourth row)

print(df.iloc[3])
# Same as df.loc["2013-01-04"]
# Cannot use df[3]
A   -0.322417
B   -0.384054
C    1.133769
D   -1.099891
Name: 2013-01-04 00:00:00, dtype: float64

Select the rows at positions 3-4 and the columns at positions 0-1

print(df.iloc[3:5, 0:2])

                   A         B
2013-01-04 -0.322417 -0.384054
2013-01-05 -0.172428 -0.877858

Select the rows at positions 1, 2 and 4 and the columns at positions 0 and 2

print(df.iloc[[1, 2, 4], [0, 2]])

                   A         C
2013-01-02  0.865408  1.744812
2013-01-03  0.319039  1.462108
2013-01-05 -0.172428  0.042214

Omitting a dimension (or using :) selects all rows or all columns

print(df.iloc[1:3, :])

                   A         B         C         D
2013-01-02  0.865408 -2.301539  1.744812 -0.761207
2013-01-03  0.319039 -0.249370  1.462108 -2.060141
print(df.iloc[:, 1:3])

                   B         C
2013-01-01 -0.611756 -0.528172
2013-01-02 -2.301539  1.744812
2013-01-03 -0.249370  1.462108
2013-01-04 -0.384054  1.133769
2013-01-05 -0.877858  0.042214
2013-01-06  1.144724  0.901591

Specify a row position and a column position to select a single value

print(df.iloc[1, 1])
print(df.iat[1, 1])

-2.3015386968802827
-2.3015386968802827

Boolean indexing

Data is selected according to a mask of Boolean (True/False) values

Select the rows of df where the value in column A is greater than 0

print(df[df["A"] > 0])

                   A         B         C         D
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969
2013-01-02  0.865408 -2.301539  1.744812 -0.761207
2013-01-03  0.319039 -0.249370  1.462108 -2.060141

Select the values in df that are greater than 0; all other values become NaN

print(df[df > 0])

                   A         B         C         D
2013-01-01  1.624345       NaN       NaN       NaN
2013-01-02  0.865408       NaN  1.744812       NaN
2013-01-03  0.319039       NaN  1.462108       NaN
2013-01-04       NaN       NaN  1.133769       NaN
2013-01-05       NaN       NaN  0.042214  0.582815
2013-01-06       NaN  1.144724  0.901591  0.502494
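
Another common form of Boolean filtering is isin(). The sketch below adds a temporary label column purely for illustration, then keeps only the matching rows:

df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
print(df2[df2["E"].isin(["two", "four"])])
# Keeps only the rows whose E value is "two" or "four"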

Modify the original data

Add a new column F to df. The Series is aligned to df by index; positions with no match are filled with NaN

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
print(s1)
df["F"] = s1
print(df)
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
                   A         B         C         D    F
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969  NaN
2013-01-02  0.865408 -2.301539  1.744812 -0.761207  1.0
2013-01-03  0.319039 -0.249370  1.462108 -2.060141  2.0
2013-01-04 -0.322417 -0.384054  1.133769 -1.099891  3.0
2013-01-05 -0.172428 -0.877858  0.042214  0.582815  4.0
2013-01-06 -1.100619  1.144724  0.901591  0.502494  5.0

Set existing values in df by label, by position, or for an entire column

df.at[dates[0], "A"] = 0
df.iat[0, 1] = 0
df.loc[:, "D"] = np.array([5] * len(df))
print(df)
                   A         B         C  D
2013-01-01  0.000000  0.000000 -0.528172  5
2013-01-02  0.865408 -2.301539  1.744812  5
2013-01-03  0.319039 -0.249370  1.462108  5
2013-01-04 -0.322417 -0.384054  1.133769  5
2013-01-05 -0.172428 -0.877858  0.042214  5
2013-01-06 -1.100619  1.144724  0.901591  5
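
Values can also be set with a Boolean condition. A where-style sketch, applied to a copy so the original df is untouched:

df2 = df.copy()
df2[df2 > 0] = -df2
print(df2)
# Every positive entry is replaced by its negative, so no value in df2 is greater than 0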

Handling missing data

pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
print(df1)
                   A         B         C         D    E
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969  1.0
2013-01-02  0.865408 -2.301539  1.744812 -0.761207  1.0
2013-01-03  0.319039 -0.249370  1.462108 -2.060141  NaN
2013-01-04 -0.322417 -0.384054  1.133769 -1.099891  NaN

Remove rows that contain missing values

print(df1.dropna(axis=0, how="any"))

                   A         B         C         D    E
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969  1.0
2013-01-02  0.865408 -2.301539  1.744812 -0.761207  1.0

Remove columns that contain missing values

print(df1.dropna(axis=1, how="any"))

                   A         B         C         D
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969
2013-01-02  0.865408 -2.301539  1.744812 -0.761207
2013-01-03  0.319039 -0.249370  1.462108 -2.060141
2013-01-04 -0.322417 -0.384054  1.133769 -1.099891

Fill missing values with a given value

print(df1.fillna(value=5))

                   A         B         C         D    E
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969  1.0
2013-01-02  0.865408 -2.301539  1.744812 -0.761207  1.0
2013-01-03  0.319039 -0.249370  1.462108 -2.060141  5.0
2013-01-04 -0.322417 -0.384054  1.133769 -1.099891  5.0

Get the Boolean mask marking missing values

print(pd.isna(df1))

                A      B      C      D      E
2013-01-01  False  False  False  False  False
2013-01-02  False  False  False  False  False
2013-01-03  False  False  False  False   True
2013-01-04  False  False  False  False   True
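
pd.notna() gives the complementary mask, and dropna(how="all") drops a row only when every value in it is missing; both are standard pandas calls:

print(pd.notna(df1))           # True where a value is present
print(df1.dropna(how="all"))   # keeps every row here, since no row is entirely NaN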

Operations

Mean of each column (averaging down the rows, axis=0)

print(df.mean(0))

A    0.202221
B   -0.546642
C    0.792720
D   -0.651483
dtype: float64

Mean of each row (averaging across the columns, axis=1)

print(df.mean(1))

2013-01-01   -0.147138
2013-01-02   -0.113132
2013-01-03   -0.132091
2013-01-04   -0.168148
2013-01-05   -0.106314
2013-01-06    0.362047
Freq: D, dtype: float64

The values in s are automatically aligned and broadcast along the specified axis to match df, then subtracted element-wise from the corresponding positions

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print(s)
print(df.sub(s, axis="index"))
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

                   A         B         C         D
2013-01-01       NaN       NaN       NaN       NaN
2013-01-02       NaN       NaN       NaN       NaN
2013-01-03 -0.680961 -1.249370  0.462108 -3.060141
2013-01-04 -3.322417 -3.384054 -1.866231 -4.099891
2013-01-05 -5.172428 -5.877858 -4.957786 -4.417185
2013-01-06       NaN       NaN       NaN       NaN

Apply a function along the specified axis

Cumulative sum down each column (axis=0)

print(df.apply(np.cumsum,axis=0))

                   A         B         C         D
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969
2013-01-02  2.489753 -2.913295  1.216640 -1.834176
2013-01-03  2.808792 -3.162665  2.678748 -3.894316
2013-01-04  2.486375 -3.546720  3.812517 -4.994207
2013-01-05  2.313947 -4.424578  3.854731 -4.411392
2013-01-06  1.213328 -3.279855  4.756322 -3.908898

Cumulative sum across each row (axis=1)

print(df.apply(np.cumsum,axis=1))

                   A         B         C         D
2013-01-01  1.624345  1.012589  0.484417 -0.588551
2013-01-02  0.865408 -1.436131  0.308681 -0.452526
2013-01-03  0.319039  0.069669  1.531777 -0.528364
2013-01-04 -0.322417 -0.706472  0.427298 -0.672593
2013-01-05 -0.172428 -1.050287 -1.008073 -0.425258
2013-01-06 -1.100619  0.044105  0.945695  1.448190

Difference between the maximum and minimum of each column (axis=0)

print(df.apply(lambda x: x.max() - x.min(),axis=0))

A    2.724965
B    3.446262
C    2.272984
D    2.642956
dtype: float64

Difference between the maximum and minimum of each row (axis=1)

print(df.apply(lambda x: x.max() - x.min(),axis=1))

2013-01-01    2.697314
2013-01-02    4.046350
2013-01-03    3.522249
2013-01-04    2.233661
2013-01-05    1.460674
2013-01-06    2.245343
Freq: D, dtype: float64

Count the number of occurrences of each value

s = pd.Series(np.random.randint(0, 7, size=10))
# Default index
print(s)
print(s.value_counts())
0    5
1    3
2    1
3    2
4    0
5    4
6    1
7    2
8    2
9    1
dtype: int32

2    3
1    3
5    1
4    1
3    1
0    1
dtype: int64

String operations

Convert strings to lowercase

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
print(s)
print(s.str.lower())
0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Merge

Concatenate rows with concat

df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[3:7], df[7:]]
print(pd.concat(pieces))
# The concatenated result is identical to the original df
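
A small deterministic sketch of concat (the values are chosen for illustration); ignore_index=True rebuilds the integer index after concatenation:

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})
print(pd.concat([a, b], ignore_index=True))
#    x
# 0  1
# 1  2
# 2  3
# 3  4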

Join on a key column with merge (SQL-style)

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
print(left)
print(right)
print(pd.merge(left, right, on="key"))
   key  lval
0  foo     1
1  bar     2

   key  rval
0  foo     4
1  bar     5

   key  lval  rval
0  foo     1     4
1  bar     2     5
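
If the key column contains duplicate values, merge returns the Cartesian product of the matching rows. A short sketch:

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
print(pd.merge(left, right, on="key"))
# 4 rows: each lval is paired with each rval for the shared key "foo"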

Grouping

  1. Split the data into groups based on some criteria
  2. Apply a function to each group independently
  3. Combine the results into a data structure
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8)
    })

print(df)

print(df.groupby("A").sum())
print(df.groupby(["A", "B"]).sum())

            C         D
A                      
bar -3.986264 -2.693565
foo  2.945186  1.492608

                  C         D
A   B                        
bar one   -0.611756 -0.249370
    three -1.072969 -2.060141
    two   -2.301539 -0.384054
foo one    3.369157  1.452809
    three -0.761207 -1.099891
    two    0.337236  1.139691

For more advanced usage, see the cookbook: https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook

Topics: Python, pandas