Translated from https://pandas.pydata.org/docs/user_guide/10min.html
Based on pandas version 0.23.0
pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.
pandas provides two primary data structures:
- Series: a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.)
- DataFrame: a two-dimensional labeled data structure whose columns can have different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects. It is used more often than Series.
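As a quick illustration of the "dictionary of Series" view, each DataFrame column is itself a Series (a minimal sketch; the column names here are arbitrary):

```python
import pandas as pd

# Each column of a DataFrame is a Series sharing the DataFrame's index.
df = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.0]})
col = df["A"]
print(type(col).__name__)  # Series
```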
Create objects
Pass in a list and create a Series object with the default integer index
Series
```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
```
```
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
```
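A Series can also be given an explicit index, in which case the labels are used for lookup (a small sketch; the labels here are arbitrary):

```python
import pandas as pd

# With an explicit index, elements are addressable by label.
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s2["b"])  # 20
```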
DataFrame
Pass in a NumPy array, using the DatetimeIndex dates as the index and setting the column labels with columns, to create a DataFrame object
```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df)
```
```
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2013-01-01 -1.072976 -0.935738 -0.239158 -0.294826
2013-01-02 -1.127384 -0.567880  0.182306 -2.934319
2013-01-03  0.544147  0.183749 -1.041106 -1.440546
2013-01-04  0.115221 -0.152783 -1.984568 -0.561250
2013-01-05  1.126787 -0.290174  0.340003 -2.309574
2013-01-06 -0.560692 -0.647581 -0.563737 -0.377166
```
Pass in a dictionary whose keys become the column names; the index defaults to an integer index. This creates a DataFrame object.
Different columns in a DataFrame can have different dtypes.
```python
f2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
print(f2)
print(f2.dtypes)
```
```
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
```
View data
View the top and bottom rows of a DataFrame
```python
f2.head(2)  # 2 is the number of rows; shows the first 5 rows if not specified
```
```
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
```
```python
f2.tail(2)  # 2 is the number of rows; shows the last 5 rows if not specified
```
```
     A          B    C  D      E    F
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
```
View index and column names
```python
f2.index
```
```
Int64Index([0, 1, 2, 3], dtype='int64')
```
```python
f2.columns
```
```
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
```
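Besides the index and columns, the underlying NumPy array can be inspected via df.values (a sketch with made-up data; note the resulting array has a single dtype for all cells):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))
arr = df.values  # NumPy representation of the data
print(arr.shape)  # (3, 4)
```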
Obtain statistics
describe() returns summary statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns of a DataFrame
```python
print(f2.describe())  # only the numeric columns A, C, D (float64, float32, int32) are included
```
```
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
```
Transpose
```python
print(df.T)
```
Sort
raw data:
```
                   A         B         C         D
2013-01-01  0.187485 -2.743918  1.000627  1.730537
2013-01-02 -0.248065  0.051679  1.200797  2.189407
2013-01-03 -1.029446  0.525787 -0.737382  1.494209
2013-01-04 -0.882341 -1.104881  0.277582  0.154353
2013-01-05  0.332548 -0.874347 -0.188176 -0.546178
2013-01-06 -0.237779  0.196372 -0.304217 -0.596813
```
- Sort by axis labels. axis=1 sorts the column labels, axis=0 sorts the row index; ascending=False sorts in descending order
```python
print(df.sort_index(axis=1, ascending=False))
```
```
                   D         C         B         A
2013-01-01  1.730537  1.000627 -2.743918  0.187485
2013-01-02  2.189407  1.200797  0.051679 -0.248065
2013-01-03  1.494209 -0.737382  0.525787 -1.029446
2013-01-04  0.154353  0.277582 -1.104881 -0.882341
2013-01-05 -0.546178 -0.188176 -0.874347  0.332548
2013-01-06 -0.596813 -0.304217  0.196372 -0.237779
```
- Sort by the value of a column
```python
print(df.sort_values(by="B"))
```
```
                   A         B         C         D
2013-01-01  0.187485 -2.743918  1.000627  1.730537
2013-01-04 -0.882341 -1.104881  0.277582  0.154353
2013-01-05  0.332548 -0.874347 -0.188176 -0.546178
2013-01-02 -0.248065  0.051679  1.200797  2.189407
2013-01-06 -0.237779  0.196372 -0.304217 -0.596813
2013-01-03 -1.029446  0.525787 -0.737382  1.494209
```
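sort_values also accepts multiple keys and per-key sort directions (a sketch with hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"grp": ["b", "a", "b", "a"], "val": [2, 3, 1, 4]})
# Sort by grp ascending, then by val descending within each group.
out = df.sort_values(by=["grp", "val"], ascending=[True, False])
print(out)
```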
Select data
raw data:
```
                   A         B         C         D
2013-01-02  0.865408 -2.301539  1.744812 -0.761207
2013-01-05 -0.172428 -0.877858  0.042214  0.582815
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969
2013-01-04 -0.322417 -0.384054  1.133769 -1.099891
2013-01-03  0.319039 -0.249370  1.462108 -2.060141
2013-01-06 -1.100619  1.144724  0.901591  0.502494
```
Select data by column
Selecting a single column by name returns a Series.
```python
print(df["A"])
```
```
2013-01-01    1.624345
2013-01-02    0.865408
2013-01-03    0.319039
2013-01-04   -0.322417
2013-01-05   -0.172428
2013-01-06   -1.100619
Freq: D, Name: A, dtype: float64
```
Selecting multiple columns by name returns a DataFrame.
```python
print(df.loc[:, ["A", "B"]])
```
```
                   A         B
2013-01-01  1.624345 -0.611756
2013-01-02  0.865408 -2.301539
2013-01-03  0.319039 -0.249370
2013-01-04 -0.322417 -0.384054
2013-01-05 -0.172428 -0.877858
2013-01-06 -1.100619  1.144724
```
Select data by row
Select a row of data by row name
```python
print(df.loc["2013-01-03"])
```
```
A    0.319039
B   -0.249370
C    1.462108
D   -2.060141
Name: 2013-01-03 00:00:00, dtype: float64
```
Select multiple rows of data by row index
```python
print(df[0:2])
```
```
                   A         B         C         D
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969
2013-01-02  0.865408 -2.301539  1.744812 -0.761207
```
Select data by row and column
Specify a row name and a column name to select a single value in the DataFrame
```python
print(df.loc[dates[0], "A"])
print(df.at[dates[0], "A"])  # equivalent to the line above; .at is optimized for scalar access
```
```
1.6243453636632417
1.6243453636632417
```
Specify a row name and multiple column names and select data
```python
print(df.loc["20130102", ["A", "B"]])
```
```
A    0.865408
B   -2.301539
Name: 2013-01-02 00:00:00, dtype: float64
```
Specify the start row name and end row name, specify multiple column names, and select data
```python
print(df.loc["20130102":"20130104", ["A", "B"]])
```
```
                   A         B
2013-01-02  0.865408 -2.301539
2013-01-03  0.319039 -0.249370
2013-01-04 -0.322417 -0.384054
```
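Note that label slices with .loc include both endpoints, unlike position slices. A minimal sketch with arbitrary labels:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(s.loc["b":"d"])  # label slice: both endpoints included (3 values)
print(s.iloc[1:3])     # position slice: end point excluded (2 values)
```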
Select data by location
Select the row at position 3 (the fourth row)
```python
print(df.iloc[3])  # same as df.loc["2013-01-04"]; df[3] is not valid
```
```
A   -0.322417
B   -0.384054
C    1.133769
D   -1.099891
Name: 2013-01-04 00:00:00, dtype: float64
```
Select the rows at positions 3 and 4, and the columns at positions 0 and 1
```python
print(df.iloc[3:5, 0:2])
```
```
                   A         B
2013-01-04 -0.322417 -0.384054
2013-01-05 -0.172428 -0.877858
```
Select the rows at positions 1, 2, and 4, and the columns at positions 0 and 2
```python
print(df.iloc[[1, 2, 4], [0, 2]])
```
```
                   A         C
2013-01-02  0.865408  1.744812
2013-01-03  0.319039  1.462108
2013-01-05 -0.172428  0.042214
```
A bare slice (:) selects all rows or all columns
```python
print(df.iloc[1:3, :])
```
```
                   A         B         C         D
2013-01-02  0.865408 -2.301539  1.744812 -0.761207
2013-01-03  0.319039 -0.249370  1.462108 -2.060141
```
```python
print(df.iloc[:, 1:3])
```
```
                   B         C
2013-01-01 -0.611756 -0.528172
2013-01-02 -2.301539  1.744812
2013-01-03 -0.249370  1.462108
2013-01-04 -0.384054  1.133769
2013-01-05 -0.877858  0.042214
2013-01-06  1.144724  0.901591
```
Specify a row position and a column position to select a single value
```python
print(df.iloc[1, 1])
print(df.iat[1, 1])  # equivalent; .iat is optimized for scalar access
```
```
-2.3015386968802827
-2.3015386968802827
```
Boolean indexing
Select data using boolean (True/False) conditions.
Select the rows where the value in column A is greater than 0
```python
print(df[df["A"] > 0])
```
```
                   A         B         C         D
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969
2013-01-02  0.865408 -2.301539  1.744812 -0.761207
2013-01-03  0.319039 -0.249370  1.462108 -2.060141
```
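Boolean filtering combines naturally with isin() for membership tests (a sketch with hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"E": ["one", "two", "three", "four"], "x": [1, 2, 3, 4]})
# Keep only the rows whose E value is in the given list.
out = df[df["E"].isin(["two", "four"])]
print(out)
```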
Select the values in df that are greater than 0; every other position becomes NaN.
```python
print(df[df > 0])
```
```
                   A         B         C         D
2013-01-01  1.624345       NaN       NaN       NaN
2013-01-02  0.865408       NaN  1.744812       NaN
2013-01-03  0.319039       NaN  1.462108       NaN
2013-01-04       NaN       NaN  1.133769       NaN
2013-01-05       NaN       NaN  0.042214  0.582815
2013-01-06       NaN  1.144724  0.901591  0.502494
```
Change original data
Add a new column named F to df. Values are aligned by index; positions with no matching index become NaN
```python
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
print(s1)
df["F"] = s1
print(df)
```
```
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
                   A         B         C         D    F
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969  NaN
2013-01-02  0.865408 -2.301539  1.744812 -0.761207  1.0
2013-01-03  0.319039 -0.249370  1.462108 -2.060141  2.0
2013-01-04 -0.322417 -0.384054  1.133769 -1.099891  3.0
2013-01-05 -0.172428 -0.877858  0.042214  0.582815  4.0
2013-01-06 -1.100619  1.144724  0.901591  0.502494  5.0
```
Change existing values in df in place
```python
df.at[dates[0], "A"] = 0
df.iat[0, 1] = 0
df.loc[:, "D"] = np.array([5] * len(df))
print(df)
```
```
                   A         B         C  D
2013-01-01  0.000000  0.000000 -0.528172  5
2013-01-02  0.865408 -2.301539  1.744812  5
2013-01-03  0.319039 -0.249370  1.462108  5
2013-01-04 -0.322417 -0.384054  1.133769  5
2013-01-05 -0.172428 -0.877858  0.042214  5
2013-01-06 -1.100619  1.144724  0.901591  5
```
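A boolean mask can also appear on the left-hand side of an assignment to modify values in place (a small sketch with made-up numbers):

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, -2, 3], "B": [-4, 5, -6]})
df2[df2 > 0] = -df2  # flip the sign of every positive entry
print(df2)
```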
Missing data
pandas primarily uses the value np.nan to represent missing data; by default it is excluded from computations
```python
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1
print(df1)
```
```
                   A         B         C         D    E
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969  1.0
2013-01-02  0.865408 -2.301539  1.744812 -0.761207  1.0
2013-01-03  0.319039 -0.249370  1.462108 -2.060141  NaN
2013-01-04 -0.322417 -0.384054  1.133769 -1.099891  NaN
```
Remove rows that contain missing values
```python
print(df1.dropna(axis=0, how="any"))
```
```
                   A         B         C         D    E
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969  1.0
2013-01-02  0.865408 -2.301539  1.744812 -0.761207  1.0
```
Remove columns that contain missing values
```python
print(df1.dropna(axis=1, how="any"))
```
```
                   A         B         C         D
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969
2013-01-02  0.865408 -2.301539  1.744812 -0.761207
2013-01-03  0.319039 -0.249370  1.462108 -2.060141
2013-01-04 -0.322417 -0.384054  1.133769 -1.099891
```
Fill missing values with a given value
```python
print(df1.fillna(value=5))
```
```
                   A         B         C         D    E
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969  1.0
2013-01-02  0.865408 -2.301539  1.744812 -0.761207  1.0
2013-01-03  0.319039 -0.249370  1.462108 -2.060141  5.0
2013-01-04 -0.322417 -0.384054  1.133769 -1.099891  5.0
```
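fillna also accepts a Series, whose values are matched by column name; a common pattern is filling each column's NaNs with that column's mean (a sketch with hypothetical data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 2.0, 4.0]})
filled = df.fillna(df.mean())  # each column's NaNs replaced by that column's mean
print(filled)
```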
Get a boolean mask marking the missing values
```python
print(pd.isna(df1))
```
```
                A      B      C      D      E
2013-01-01  False  False  False  False  False
2013-01-02  False  False  False  False  False
2013-01-03  False  False  False  False   True
2013-01-04  False  False  False  False   True
```
Operations
Mean of each column (computed down the rows, axis=0)
```python
print(df.mean(0))
```
```
A    0.202221
B   -0.546642
C    0.792720
D   -0.651483
dtype: float64
```
Mean of each row (computed across the columns, axis=1)
```python
print(df.mean(1))
```
```
2013-01-01   -0.147138
2013-01-02   -0.113132
2013-01-03   -0.132091
2013-01-04   -0.168148
2013-01-05   -0.106314
2013-01-06    0.362047
Freq: D, dtype: float64
```
The Series s is automatically aligned with df along the specified axis and broadcast across the other axis, then subtracted element-wise
```python
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print(s)
print(df.sub(s, axis="index"))
```
```
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
                   A         B         C         D
2013-01-01       NaN       NaN       NaN       NaN
2013-01-02       NaN       NaN       NaN       NaN
2013-01-03 -0.680961 -1.249370  0.462108 -3.060141
2013-01-04 -3.322417 -3.384054 -1.866231 -4.099891
2013-01-05 -5.172428 -5.877858 -4.957786 -4.417185
2013-01-06       NaN       NaN       NaN       NaN
```
apply() applies a function along the specified axis
Cumulative sum down each column (axis=0)
```python
print(df.apply(np.cumsum, axis=0))
```
```
                   A         B         C         D
2013-01-01  1.624345 -0.611756 -0.528172 -1.072969
2013-01-02  2.489753 -2.913295  1.216640 -1.834176
2013-01-03  2.808792 -3.162665  2.678748 -3.894316
2013-01-04  2.486375 -3.546720  3.812517 -4.994207
2013-01-05  2.313947 -4.424578  3.854731 -4.411392
2013-01-06  1.213328 -3.279855  4.756322 -3.908898
```
Cumulative sum across each row (axis=1)
```python
print(df.apply(np.cumsum, axis=1))
```
```
                   A         B         C         D
2013-01-01  1.624345  1.012589  0.484417 -0.588551
2013-01-02  0.865408 -1.436131  0.308681 -0.452526
2013-01-03  0.319039  0.069669  1.531777 -0.528364
2013-01-04 -0.322417 -0.706472  0.427298 -0.672593
2013-01-05 -0.172428 -1.050287 -1.008073 -0.425258
2013-01-06 -1.100619  0.044105  0.945695  1.448190
```
Difference between the maximum and minimum of each column (axis=0)
```python
print(df.apply(lambda x: x.max() - x.min(), axis=0))
```
```
A    2.724965
B    3.446262
C    2.272984
D    2.642956
dtype: float64
```
Difference between the maximum and minimum of each row (axis=1)
```python
print(df.apply(lambda x: x.max() - x.min(), axis=1))
```
```
2013-01-01    2.697314
2013-01-02    4.046350
2013-01-03    3.522249
2013-01-04    2.233661
2013-01-05    1.460674
2013-01-06    2.245343
Freq: D, dtype: float64
```
Count the number of occurrences of each value
```python
s = pd.Series(np.random.randint(0, 7, size=10))  # default integer index
print(s)
print(s.value_counts())
```
```
0    5
1    3
2    1
3    2
4    0
5    4
6    1
7    2
8    2
9    1
dtype: int32
2    3
1    3
5    1
4    1
3    1
0    1
dtype: int64
```
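value_counts sorts by count in descending order, so the most frequent value comes first (a minimal sketch):

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "a", "c"])
counts = s.value_counts()
print(counts.index[0], counts.iloc[0])  # most frequent value and its count
```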
String methods
Convert string to lowercase
```python
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
print(s)
print(s.str.lower())
```
```
0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
```
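The .str accessor methods can be chained, for example to build a boolean mask for filtering (a sketch with made-up strings):

```python
import pandas as pd

s = pd.Series(["Apple", "banana", "Cherry"])
mask = s.str.lower().str.startswith("b")  # case-insensitive prefix test
print(s[mask])
```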
Merge
Concatenate along rows with concat()
```python
df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[3:7], df[7:]]
print(pd.concat(pieces))  # the concatenated result is identical to the original df
```
Join on a specified key column, as in SQL
```python
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
print(left)
print(right)
print(pd.merge(left, right, on="key"))
```
```
   key  lval
0  foo     1
1  bar     2
   key  rval
0  foo     4
1  bar     5
   key  lval  rval
0  foo     1     4
1  bar     2     5
```
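When keys do not fully match, the how parameter controls which rows are kept (a sketch; the key values here are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})
inner = pd.merge(left, right, on="key")               # only matching keys
outer = pd.merge(left, right, on="key", how="outer")  # all keys; NaN where absent
print(inner)
print(outer)
```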
Grouping
- Split the data into groups based on some criteria
- Apply a function to each group independently
- Combine the results into a data structure
```python
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
print(df)
print(df.groupby("A").sum())
print(df.groupby(["A", "B"]).sum())
```
```
            C         D
A
bar -3.986264 -2.693565
foo  2.945186  1.492608
                  C         D
A   B
bar one   -0.611756 -0.249370
    three -1.072969 -2.060141
    two   -2.301539 -0.384054
foo one    3.369157  1.452809
    three -0.761207 -1.099891
    two    0.337236  1.139691
```
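Beyond sum(), a group can be aggregated with several functions at once via agg (a sketch with hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]})
out = df.groupby("A")["C"].agg(["sum", "mean"])  # one row per group, one column per function
print(out)
```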
For more complex use cases, see the cookbook: https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook