[Python] Basic Learning Notes of DS 11: Pandas Library

Posted by edtlov on Mon, 03 Jan 2022 13:18:21 +0100

Pandas Library

Introduction

Numpy performs well at vectorized numerical computation, but for more flexible and complex data tasks,
such as attaching labels to data, handling missing values, grouping, and pivot tables,
Numpy falls short.
The Pandas library, built on top of Numpy, provides high-level data structures and operation tools that make data analysis faster and simpler.

11.1 Object creation

11.1.1 Pandas Series object

Series is a one-dimensional array of labeled data

Creation of Series objects

General structure: pd.Series(data, index=index, dtype=dtype)
data: the data; can be a list, dictionary, or Numpy array
index: the index; an optional parameter
dtype: the data type; an optional parameter

1. Create with a list

  • index defaults to an integer sequence
import pandas as pd


data = pd.Series([1.5, 3, 4.5, 6])
print(data)

0 1.5
1 3.0
2 4.5
3 6.0
dtype: float64

  • Add index
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
print(data)

a 1.5
b 3.0
c 4.5
d 6.0
dtype: float64

  • Specify the data type
    By default, the dtype is inferred from the input data
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype=float)
print(data)

a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64

Note: data can mix multiple data types

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
print(data)

a 1
b 2
c 3
d 4
dtype: object

print(data["a"])
print(data["c"])

1
3

The data type can be forcibly cast

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
print(data)

a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64

print(data["c"])

3.0

If a value cannot be cast, an error is raised

data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)  # raises ValueError
print(data)  # never reached: "a" cannot be converted to float

2. Create with a one-dimensional Numpy array

import pandas as pd
import numpy as np

x = np.arange(5)
y = pd.Series(x)
print(y)

0 0
1 1
2 2
3 3
4 4
dtype: int32

3. Create with a dictionary

  • By default, the dictionary keys become the index and the values become the data
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict)
print(population)

BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
dtype: int64

  • When creating from a dictionary, if index is specified, entries are filtered by the given keys; keys that are not found get the value NaN
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])
print(population)

BeiJing 2154.0
HangZhou 981.0
c NaN
d NaN
dtype: float64

4. When data is a scalar

print(pd.Series(5, index=[100, 200, 300]))

100 5
200 5
300 5
dtype: int64

11.1.2 Pandas DataFrame object

DataFrame is a two-dimensional array of labeled data

Creation of DataFrame object

General structure: pd.DataFrame(data, index=index, columns=columns)
data: the data; can be a list, dictionary, or Numpy array
index: the row index; an optional parameter
columns: the column labels; an optional parameter

1. Create from a Series object

population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981}
population = pd.Series(population_dict)
print(pd.DataFrame(population))

print(pd.DataFrame(population, columns=["population"]))

2. Create from a dictionary of Series objects

population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981}
population = pd.Series(population_dict)
GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468}
GDP = pd.Series(GDP_dict)
print(pd.DataFrame({"population": population,
                    "GDP": GDP}))


Note: if a value is a scalar, it is automatically repeated to fill the whole column

print(pd.DataFrame({"population": population,
                    "GDP": GDP,
                    "Country": "China"}))

3. Create from a list of dictionaries

  • The list positions become the index and the dictionary keys become the columns
data = [{"a": i, "b": 2*i} for i in range(3)]
print(data)
print(pd.DataFrame(data))

  • Keys missing from a dictionary are filled with NaN
data = [{"a": 1, "b": 1}, {"b":3, "c": 4}]
print(pd.DataFrame(data))

4. Create from a two-dimensional Numpy array

print(pd.DataFrame(np.random.randint(10, size=(3, 2)), columns=["foo", "bar"], index=["a", "b", "c"]))

11.2 Properties of DataFrame

11.2.1 Properties

data = pd.DataFrame({"population":population, "GDP": GDP})
print(data)


1. df.values returns the data as a Numpy array

print(data.values)

[[ 2154 30320]
[ 2424 32680]
[ 1303 24222]
[ 981 13468]]

2. df.index returns the row index

print(data.index)

Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')

3. df.columns returns the column index

print(data.columns)

Index(['population', 'GDP'], dtype='object')

4. df.shape shape

print(data.shape)

(4, 2)

5. df.size returns the number of elements

print(data.size)

8

6. df.dtypes returns the data type of each column

print(data.dtypes)

population int64
GDP int64
dtype: object

11.2.2 Indexing


1. Get columns

  • Dictionary-style access
print(data["population"])

BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
Name: population, dtype: int64

print(data[["GDP", "population"]])  # Multiple columns are obtained in list form

  • Attribute-style access
print(data.GDP)

BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
Name: GDP, dtype: int64

2. Get row

  • Label-based (absolute) indexing: df.loc
print(data.loc["BeiJing"])

population 2154
GDP 30320
Name: BeiJing, dtype: int64

print(data.loc[["BeiJing", "HangZhou"]])

  • Position-based (relative) indexing: df.iloc
print(data.iloc[0])

population 2154
GDP 30320
Name: BeiJing, dtype: int64

print(data.iloc[[1, 3]])

3. Get a scalar

print(data.loc["BeiJing", "GDP"])

30320

print(data.iloc[0, 1])

30320

print(data.values[0][1])

30320
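
For fast scalar access, the at and iat accessors are also available; a quick sketch on the same data:

print(data.at["BeiJing", "GDP"])  # label-based scalar access
print(data.iat[0, 1])             # position-based scalar access

30320
30320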

4. Indexing a Series object

print(type(data.GDP))

<class 'pandas.core.series.Series'>

print(GDP["BeiJing"])

30320

11.2.3 Slicing

dates = pd.date_range(start='2019-01-01', periods=6)
print(dates)

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-06'],
dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["A", "B", "C", "D"])
print(df)


1. Row slicing

print(df["2019-01-01": "2019-01-03"])
print(df.loc["2019-01-01": "2019-01-03"])
print(df.iloc[0:3])  # End position not included

2. Column slicing

print(df.loc[:, "A": "C"])
print(df.iloc[:, 0: 3])


3. Mixed selection

  • Slice rows and columns at the same time
print(df.loc["2019-01-02": "2019-01-03", "C": "D"])  # slice rows and columns simultaneously
print(df.iloc[1: 3, 2:])

  • Slice rows, select scattered columns
print(df.loc["2019-01-04": "2019-01-06", ["A", "C"]])  # row slice, scattered column selection
print(df.iloc[3:, [0, 2]])

  • Select scattered rows, slice columns
print(df.loc[["2019-01-04", "2019-01-06"], "C": "D"])  # scattered row selection, column slice
print(df.iloc[[3, 5], 2:])

  • Select scattered rows and columns
print(df.loc[["2019-01-04", "2019-01-06"], ["C", "D"]])  # scattered selection in both dimensions
print(df.iloc[[3, 5], [2, 3]])

11.2.4 Boolean indexing

print(df)

print(df > 0)

print(df[df > 0])

print(df.A > 0)

print(df[df.A > 0])

  • isin() method
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
print(df2)

ind = df2['E'].isin(["two", "four"])
print(ind)
print(df2[ind])


11.2.5 Assignment

print(df)

  • Add a new column to the DataFrame
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
print(s1)

2019-01-01 1
2019-01-02 2
2019-01-03 3
2019-01-04 4
2019-01-05 5
2019-01-06 6
Freq: D, dtype: int64

df['E'] = s1
print(df)

  • Modify existing values
df.loc["2019-01-01", "A"] = 0
print(df)

df.iloc[0, 1] = 0
print(df)

df["D"] = np.array([5]*len(df))  # It can be simplified to df["D"] = 5
print(df)

  • Modify index and columns
df.index = [i for i in range(len(df))]
df.columns = [i for i in range(df.shape[1])]
print(df)
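
If only a few labels need to change, df.rename is an alternative to replacing the whole index; a minimal sketch (the new names here are purely illustrative):

df = df.rename(columns={0: "A", 1: "B"})  # map selected old labels to new ones
print(df)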

11.3 Numerical calculation and statistical analysis

11.3.1 Viewing data

dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["A", "B", "C", "D"])
print(df)


1. View the first rows

print(df.head())  # Defaults to 5 rows

print(df.head(2))


2. View the last rows

print(df.tail())  # Defaults to 5 rows

print(df.tail(3))


3. View general information
To make the summary informative, we first change one element to NaN

df.iloc[0, 3] = np.nan
print(df)

print(df.info())

11.3.2 Numpy universal functions also work on Pandas objects

1. Vectorized operations

x = pd.DataFrame(np.arange(4).reshape(1, 4))
print(x)

print(x + 5)

print(np.exp(x))

y = pd.DataFrame(np.arange(4, 8).reshape(1, 4))
print(x * y)

2. Matrix operations

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(10, 10)))
print(x)

  • Transpose
z = x.T
print(z)

  • Matrix multiplication
np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(10, 10)))
print(x.dot(y))


We can use the %timeit magic to compare the time taken by the two kinds of multiplication

%timeit x.dot(y)
%timeit np.dot(x, y)

We find that np.dot(x, y) takes less time

  • Comparing Numpy and Pandas on the same operation

In general, pure computation is faster in Numpy.
Numpy focuses more on calculation, while Pandas focuses more on data analysis.
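
A quick way to check this yourself (a sketch; timings vary by machine):

x_np = np.random.randn(1000, 1000)
x_pd = pd.DataFrame(x_np)
%timeit x_np * 2  # pure Numpy
%timeit x_pd * 2  # the same computation through Pandas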

3. Broadcasting

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
print(x)

  • Broadcast by row
print(x.iloc[0])

A 6
B 3
C 7
Name: 0, dtype: int32

print(x/x.iloc[0])

  • Broadcast by column
print(x.div(x.A, axis=0))  # similar methods: add, sub, mul


Here axis=0 aligns the Series with the row index, as in the example above; the default axis=1 aligns with the column labels, which is what x / x.iloc[0] does implicitly.
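
A minimal check of the axis semantics, using the same x as above:

# The default alignment for div is on the column labels (axis=1)
print((x / x.iloc[0]).equals(x.div(x.iloc[0], axis=1)))  # True
# axis=0 aligns the Series on the row index, e.g. subtracting column A from every column
print(x.sub(x["A"], axis=0))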

11.3.3 New usage

1. Index alignment

np.random.seed(42)
A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
print(A)
print(B)

  • Pandas automatically aligns the indexes of the two objects; positions without a match are filled with np.nan
print(A+B)

  • Missing positions can instead be filled via the fill_value parameter
print(A.add(B, fill_value=0))

2. Statistics-related methods

  • Counting value occurrences
y = np.random.randint(3, size=20)
print(y)

[2 0 2 2 0 0 2 1 2 2 2 2 0 2 1 0 1 1 1 1]

print(np.unique(y))

[0 1 2]

from collections import Counter
print(Counter(y))

Counter({2: 9, 1: 6, 0: 5})

y1 = pd.DataFrame(y, columns=["A"])
print(y1)

print(np.unique(y1))

[0 1 2]

print(y1["A"].value_counts())

2 9
1 6
0 5
Name: A, dtype: int64

  • Compute a new column and sort by it
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981}
population = pd.Series(population_dict)
GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468}
GDP = pd.Series(GDP_dict)
city_info = pd.DataFrame({"population": population, "GDP": GDP})
print(city_info)

city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
print(city_info)


Sort ascending

print(city_info.sort_values(by="per_GDP"))


Sort descending

print(city_info.sort_values(by="per_GDP", ascending=False))


Sort by axis

data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
print(data)

Row sorting

print(data.sort_index())


Column sorting

print(data.sort_index(axis=1))

  • Statistical methods
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)), columns=list("ABCD"))
print(df)


Count of non-null values

df.count()

A 6
B 6
C 6
D 6
dtype: int64

Sum (columns summed by default)

df.sum()

A 13.951465
B 5.381398
C 8.792280
D 5.701582
dtype: float64

df.sum(axis=1)  # Sum of rows

0 18.116673
1 15.513429
2 4.575753
3 -7.834093
4 -4.075646
5 7.530610
dtype: float64

Maximum and minimum

df.min()  # Default minimum value of each column

A -2.051324
B -5.653121
C -4.899671
D -3.698993
dtype: float64

df.max(axis=1)  # Maximum of each row after switching the axis

0 8.092119
1 8.316851
2 4.170240
3 2.967849
4 3.256989
5 7.862595
dtype: float64

print(df.idxmax())  # Index label of each column's maximum

A 5
B 2
C 1
D 0
dtype: int64

Mean

df.mean()

A 2.325244
B 0.896900
C 1.465380
D 0.950264
dtype: float64

variance

df.var()

A 11.887325
B 11.911567
C 21.841270
D 22.569365
dtype: float64

standard deviation

df.std()

A 3.447800
B 3.451314
C 4.673464
D 4.750723
dtype: float64

median

df.median()

A 2.015618
B 1.271919
C 1.208221
D -0.056035
dtype: float64

Mode

data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
print(data)

data.mode()


75% quantile

df.quantile(0.75)

A 3.732105
B 2.804478
C 4.010594
D 3.836574
Name: 0.75, dtype: float64

All of the above at once: describe()

df.describe()


Columns of strings also support describe, with different summary fields

data_2 = pd.DataFrame([["a", "a", "c", "d"],
                       ["c", "a", "c", "b"],
                       ["a", "a", "d", "c"]], columns=list("ABCD"))
print(data_2)
print(data_2.describe())


Correlation coefficient and covariance

print(df.corr())

print(df.corrwith(df["A"]))


Custom output
apply(func) applies func to each column by default

df.apply(np.cumsum)

print(df.apply(np.cumsum, axis=1))

df.apply(sum)

A 13.951465
B 5.381398
C 8.792280
D 5.701582
dtype: float64

df.apply(lambda x: x.max()-x.min())

A 9.913920
B 9.823361
C 13.216523
D 11.791112
dtype: float64

def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()],
                     index=["Count", "mean", "max", "idxmin", "std"])


print(df.apply(my_describe))

11.4 Missing value handling

11.4.1 Detecting missing values

import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]), columns=["A", "B", "C"])
print(data)


Note: when None, strings, etc. are present, the data type becomes object, which consumes more resources than int or float

data.dtypes

A object
B object
C object
dtype: object
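
Since every value here is either numeric or None, the columns can be cast back to a numeric type; a sketch:

data_float = data.astype(float)  # None becomes NaN
print(data_float.dtypes)         # all three columns are now float64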

data.isnull()

data.notnull()

11.4.2 Dropping missing values

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11, 12, 13]]), columns=["A", "B", "C", "D"])
print(data)


Note: np.nan is a special floating-point number

data.dtypes
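
Two properties of np.nan worth remembering (a quick check):

print(type(np.nan))      # <class 'float'>
print(np.nan == np.nan)  # False: NaN compares unequal even to itself, so use isnull()/notnull()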

1. Drop whole rows

data.dropna()


2. Drop whole columns

data.dropna(axis="columns")

data["D"] = np.nan
print(data)

data.dropna(axis="columns", how="all")  # how="all": drop only if every element is missing; how="any" (the default): drop if any element is missing

11.4.3 Filling missing values

data.fillna(value=5)

  • Replace with mean
fill = data.mean()
print(data.fillna(value=fill))


To fill with the overall mean of all values instead, flatten the DataFrame with stack() first and then take the mean

fill = data.stack().mean()
print(data.fillna(value=fill))
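
Another simple strategy is to propagate the previous valid value forward; a sketch (column D is all NaN here, so it stays NaN):

print(data.fillna(method="ffill"))  # forward-fill down each column; newer pandas spells this data.ffill()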

11.5 Merging data

  • Construct a helper function that produces a DataFrame
def make_df(cols, ind):
    """A simple DataFrame"""
    data = {c: [str(c)+str(i) for i in ind] for c in cols}
    print(data)
    return pd.DataFrame(data, ind)


make_df("ABC", range(3))

  • Vertical merge
    If row labels overlap, the duplicate labels are kept
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)

print(pd.concat([df_1, df_2]))

  • Horizontal merge
    If column labels overlap, the duplicate labels are kept
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)

print(pd.concat([df_3, df_4], axis=1))

  • Overlapping indexes
    When labels overlap, we sometimes want them renumbered; add the parameter ignore_index=True
df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
print(pd.concat([df_5, df_6], ignore_index=True))


The same applies to overlapping columns: add the parameter axis=1, as sketched below
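
A quick sketch of the column case (df_7 and df_8 are new frames following the naming pattern above):

df_7 = make_df("AB", [0, 1])
df_8 = make_df("AB", [0, 1])
print(pd.concat([df_7, df_8], axis=1, ignore_index=True))  # columns renumbered 0..3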

  • Aligned merge: merge()
    merge() joins two DataFrames on common columns with matching values
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)


Column B appears in both df_9 and df_10 with matching values; we want to merge on B so it appears only once

print(pd.merge(df_9, df_10))


[Example] Merging city information

population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                   "pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
print(population)

GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
            "GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
print(GDP)

city_info = pd.merge(population, GDP)
print(city_info)


merge keeps only matching rows by default (an inner join); to keep ShangHai and ShenZhen as well, request an outer join

city_info = pd.merge(population, GDP, how="outer")
print(city_info)
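
merge also accepts an explicit key column and other join types; a sketch with the same frames:

print(pd.merge(population, GDP, on="city", how="left"))  # keep every row of population; cities missing from GDP get NaN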

11.6 Grouping and pivot tables

df = pd.DataFrame({"key": ["A", "B", "C", "C", "B", "A"],
                   "data1": range(6),
                   "data2": np.random.randint(0, 10, size=6)})
print(df)

11.6.1 Grouping

  • Lazy evaluation (the grouping is stored first; computation happens only when an operation is applied)
df.groupby("key")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EDFF4444C0>

df.groupby("key").sum()

df.groupby("key").mean()


So what exactly is groupby?

for i in df.groupby("key"):
    print(str(i))

  • Select a column from the grouped object
df.groupby("key")["data2"].sum()

key
A 15
B 9
C 11
Name: data2, dtype: int32

  • Iterate by group
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))

  • Call Series methods on the groups
df.groupby("key")["data1"].describe()

  • Support for more complex operations
df.groupby("key").aggregate(["min", "median", "max"])

  • filter
np.random.seed(1)
df = pd.DataFrame({"key": ["A", "B", "C", "C", "B", "A"],
                   "data1": range(6),
                   "data2": np.random.randint(0, 10, size=6)})
print(df)


def filter_func(x):
    return x["data2"].std() > 3


print(df.groupby("key")["data2"].std())
print(df.groupby("key").filter(filter_func))

  • transformation
df = pd.DataFrame({"key": ["A", "B", "C", "C", "B", "A"],
                   "data1": range(6),
                   "data2": np.random.randint(0, 10, size=6)})
print(df)
print(df.groupby("key").transform(lambda x: x - x.mean()))

  • apply method
def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x

print(df.groupby("key").apply(norm_by_data2))

  • Use a list or array as the grouping key
print(df)
L = [0, 1, 0, 1, 2, 0]
print(df.groupby(L).sum())

  • Mapping indexes to groups with dictionaries
df2 = df.set_index("key")
print(df2)
mapping = {"A": "first", "B": "constant", "C": "constant"}
print(df2.groupby(mapping).sum())

  • Any Python function
print(df2.groupby(str.lower).mean())

  • A list combining any of the above
print(df2.groupby([str.lower, mapping]).mean())


[Example 1] Planetary observation data

import seaborn as sns


planets = sns.load_dataset('planets')
print(planets.shape)  # Get shape
print(planets.head())  # View the first five lines
print(planets.describe())  # statistical information 


Now we want to count, via grouping, the number of observations per decade under each observation method.
First, define the decade.

decade = 10 * (planets["year"]//10)
print(decade.head())
decade = decade.astype(str) + "s"
decade.name = "decade"
print(decade.head())


With the decade defined, group by method and decade.

print(planets.groupby(["method", decade]).sum())


In fact we only need the number column, so select it separately, reshape with unstack() for readability, and finally fill the missing values with 0 using fillna().

print(planets.groupby(["method", decade])["number"].sum().unstack().fillna(0))

11.6.2 Pivot tables

[Example 2] Titanic passenger data

import seaborn as sns


titanic = sns.load_dataset("titanic")
print(titanic.head())
print(titanic.describe())


Now we want to get the survival rate of different genders.

print(titanic.groupby("sex")["survived"].mean())

sex
female 0.742038
male 0.188908
Name: survived, dtype: float64

The index "served" is expressed as Series in one bracket and DataFrame in two brackets

print(titanic.groupby("sex")[["survived"]].mean())


More specifically, we want to see survival rates by cabin class for each gender.

print(titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack())


Writing this with groupby alone is complex and hard to read, but Pandas provides a pivot-table method. Let's see how to achieve the same result with pivot_table.

print(titanic.pivot_table("survived", index="sex", columns="class"))


We can use the pivot_table method directly; it returns the mean of the data by default.
Other parameters give different results:
aggfunc changes the aggregation method, and margins adds overall totals.

print(titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True))


With pivot_table, several columns can be aggregated differently by passing a dictionary to aggfunc. For example, here we want not only the number of survivors but also the average ticket price.

print(titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"}))

11.7 Others

  1. Vectorized string operations (a brief sketch follows the multilevel index example below)
  2. Processing time series (sketched below as well)
  3. Multilevel indexes: used for multidimensional data
base_data = np.array([[1771, 11115],
                      [2154, 30320],
                      [2141, 13070],
                      [2424, 32680],
                      [1077, 7806],
                      [1303, 24222],
                      [798, 4789],
                      [981, 13486]])
data = pd.DataFrame(base_data, index=[
    ["BeiJing", "BeiJing", "ShangHai", "ShangHai", "ShenZhen", "ShenZhen", "HangZhou", "HangZhou"],
    [2008, 2018] * 4], columns=["population", "GDP"])
print(data)


Modify the row index names

data.index.names = ["city", "year"]
print(data)


Select a column

print(data["GDP"])


Take GDP of a city

print(data.loc["ShangHai", "GDP"])


Take one city in a given year

print(data.loc["ShangHai", 2018])
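
Items 1 and 2 of the list above deserve at least a minimal sketch (the data here is illustrative):

# 1. Vectorized string operations through the .str accessor (NaN-aware)
s = pd.Series(["BeiJing", "ShangHai", None, "HangZhou"])
print(s.str.lower())
print(s.str.contains("Hang"))

# 2. Time series: a datetime index plus resampling
ts = pd.Series(np.arange(6), index=pd.date_range("2019-01-01", periods=6))
print(ts.resample("2D").sum())  # aggregate every two days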

  4. High-performance Pandas: eval()
df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
%timeit (df1+df2)/(df3+df4)

17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  • eval reduces the memory allocated for intermediate results in compound algebraic expressions, which improves running speed
%timeit pd.eval("(df1+df2)/(df3+df4)")

10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can verify that the results are identical

np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))

True

  • Expressions between columns
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()

df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()

  • Use local variables
column_mean = df.mean(axis=1)
res = df.eval("A+@column_mean")
res.head()

  5. High-performance Pandas: query()
df.head()

%timeit df[(df.A < 0.5) & (df.B > 0.5)]

1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.query("(A < 0.5)&(B > 0.5)")

2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Here query is actually slower: the data set is too small, so the overhead of the method call itself dominates
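
Like eval, query supports local variables with the @ prefix; a sketch:

limit = 0.5
print(df.query("A < @limit and B > @limit").head())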

  6. When to use eval() and query()
    When the data is small, the ordinary methods are faster
    When the amount of data is large, call eval and query to improve running speed
df.values.nbytes

32000

df1.values.nbytes

8000000

That concludes the exploration in section 11. Pandas is a high-level tool focused on data analysis.
The next section takes a closer look at the Matplotlib library.

Topics: Python