Detailed explanation of DataFrame of Pandas

Posted by troy_mccormick on Wed, 09 Mar 2022 13:10:02 +0100

Two dimensional data, Series container, with both row index and column index

1. Create DataFrame

1.1 create DataFrame through list

You need to specify data, index row and columns

Specify that data and index/columns are of type list or NP arange

df1 = pd.DataFrame(data=[[1, 2, 3], [11, 12, 13]], index=['r_1', 'r_2'], columns=['A', 'B', 'C'])
df2 = pd.DataFrame(data=[[1], [11]], index=['r_1', 'r_2'], columns=['A'])
df3 = pd.DataFrame(data=np.arange(12).reshape(3, 4), index=list("abc"), columns=list("ABCD"))
Copy code

A B C

r_1 1 2 3 r_2 11 12 13

A r_1 1 r_2 11

A B C D a 0 1 2 3 b 4 5 6 7 c 8 9 10 11

1.2 create DataFrame through dictionary

1.2.1 method 1: input a single dictionary. Note that it must be one key with multiple values (in case of single value, it must also add [])

dict = {"name": ["jack", "HanMeimei"], "age": ["100", "100"]}
# dict = {"name": "jack", "age": "100"}#Writing like this will report an error
# dict = {"name":["jack"], "age": ["100"]}#If it is a single value, you must add []
df3 = pd.DataFrame(dict, index=list("ab"))
Copy code

age age1 name

a 100.0 NaN MaYun1 b 100.0 NaN MaYun2 c NaN 100.0 MaYun3

1.2.2 method 2: input the dictionary list. Each dictionary is a row of data, and the missing column will supplement nan

dict = [{"name": "MaYun1", "age": 100}, {"name": "MaYun2", "age": 100}, {"name": "MaYun3", "age1": 100}]
# dict = {"name": "jack", "age": "100"}
df4 = pd.DataFrame(dict, index=list("abc"))
Copy code

2. Basic properties of dataframe

dict = {"name": ["jack", "HanMeimei", "Lucy"], "age": ["100", "90","98"], "salary": [30000, 50000, 999000]}
df5 = pd.DataFrame(dict)
print(df5)
print(df5.head(1))
print(df5.tail(1))
print(df5.info())
print(df5.index)
print(df5.columns)
print(df5.values)
print(df5.describe())
Copy code

3. All data are sorted according to the specified column

df5 = df5.sort_values(by='salary', ascending=True)
print(df5)
Copy code

4. DataFrame simple row and column slicing

dict = {"name": ["jack", "HanMeimei", "Lucy","Mr Green", "Mrs Han", "Lily"],
        "age": [100, 90,98,90,100,30], "salary": [30000, 50000, 999000,90000,80000,75000]}
df6 = pd.DataFrame(dict)
print(df6)

# Take out the first five lines
print(df6[0:5])
# Take out the name column
print(df6["name"])
# Take out the name column of the first three rows
print(df6[0:3]["name"])
Copy code

5. loc row and column slicing

5.2 * just remember the following one. It's troublesome to remember too much

The basic format is:

df7.loc[that 's ok,column]
Copy code

If you take consecutive rows or columns --- use slice:

If you take out discontinuous rows or columns - use list []

Slice and list can be mixed

Examples:

5.5.1 continuous multi row and multi column

df7.loc['a':'c','name':'age']

Note: includes b OK, because it's a row slice
>       name  age
a       jack  100
b  HanMeimei   90
c       Lucy   98
 Copy code

5.5.2 discontinuous multiple rows + continuous multiple columns

df7.loc[['a','c'],'name':'salary']
Note: the row is a discontinuous selection, only a and c
	 The column is a continuous slice, including the middle age
>   name  age  salary
a  jack  100   30000
c  Lucy   98  999000
 Copy code

5.5.3 discontinuous multiple rows + discontinuous multiple columns

df7.loc[['a','c'],['name','salary']]
Note: the row is a discontinuous selection, only a and c
	 Columns are also discontinuous selections, but name and salary
>    name  salary
a  jack   30000
c  Lucy  999000
 Copy code

5.5.4 all rows + discontinuous multiple columns (the same for all columns)

df7.loc[:,['name','salary']]
Note: just write an empty slice of the line:
>   	    name  salary
	a       jack   30000
	b  HanMeimei   50000
	c       Lucy  999000
	d   Mr Green   90000
	e    Mrs Han   80000
	f       Lily   75000
 Copy code

5.5.5 discontinuous multiple rows + single column (the same for single row)

df7.loc[['a','c'],'name']
Note: single column name is not added[]，The result is a Series
> 	a    jack
	c    Lucy
	Name: name, dtype: object
	<class 'pandas.core.series.Series'>
Copy code

df7.loc[['a','c'],['name']]
type(df7.loc[['a','c'],['name']])
Note: single column name plus[]，The result is a DataFrame
>	   name
	a  jack
	c  Lucy
	<class 'pandas.core.frame.DataFrame'>
Copy code

6. iloc row and column slicing

The principle is the same as that of loc

Just note that the slice does not contain the last number, which is different from loc

df7.iloc[[1,3],[0]]
> Get discontinuous ranks
        name
b  HanMeimei
d   Mr Green


df7.iloc[1:3,0:1]
> Not including 3 d ，Not including 1 age
        name
b  HanMeimei
c       Lucy
 Copy code

7. Assignment change data

You can use loc or iloc

df7.iloc[1:3,1:3]=99999999
print(df7)
>       name       age    salary
a       jack       100     30000
b  HanMeimei  99999999  99999999
c       Lucy  99999999  99999999
d   Mr Green        90     90000
e    Mrs Han       100     80000
f       Lily        30     75000
 Copy code

8. Boolean index

Let's see an example

Create a dataframe

Score = {"full name": ["zhang wuji", "Zhao Min", "Little Joe", "Big Joe", "Yang Yuhuan", "army officer's hat ornaments", "Xi Shi", "prince", "Jiang Ziya", "Li Bai", "Du Fu", "Wang Wei","Li Xiaoyu"],
         "language": [78, 90, 87, 88, 56, 94, 92, 85, 93, 91, 59, 100,100],
         "mathematics": [91, 59, 100, 75, 30, 95, 91, 59, 100, 10, 95, 85,100],
         "English": [91, 59, 100, 75, 30, 95, 10, 95, 85, 75, 30, 95,100]}
df_score = pd.DataFrame(Score)
print(df_score)
Copy code

8.1 take out the data of all people whose English scores are greater than 90

# What you get is a Series
loc_ = df_score.loc[:,"English"] > 90
print(loc_)
print(type(loc_))# <class 'pandas.core.series.Series'>
# dataframe Boolean index, which will filter out all rows with the value of true
print(df_score[loc_])

# It can also be abbreviated as
print(df_score[df_score.loc[:,"English"]>90])
Copy code

8.2 take out the data of all people whose English scores are less than 90 (~)

Note: add ~ reverse

print(df_score[~(df_score.loc[:, "English"] > 90)])
Copy code

8.3 take out the data of all people with English scores greater than 90 and Chinese scores greater than 80

print(df_score[(df_score.loc[:, "English"] > 90)&(df_score.loc[:, "language"] < 80)])
Copy code

8.4 screenshot of knowledge points

9. String method

# Create a dataframe
student = {"full name": ["zhang wuji", "Zhao Min", "Little Joe", "Big Joe", "Yang Yuhuan", "army officer's hat ornaments", "Xi Shi", "prince", "Jiang Ziya", "Li Bai", "Du Fu", "Wang Wei", "Li Xiaoyu"],
           "language": [78, 90, 87, 88, 56, 94, 92, 85, 93, 91, 59, 100, 100],
           "mathematics": [91, 59, 100, 75, 30, 95, 91, 59, 100, 10, 95, 85, 100],
           "English": [91, 59, 100, 75, 30, 95, 10, 95, 85, 75, 30, 95, 100],
           "class": ["Class 3, grade 1", "Class 1, grade 1", "Class 3, grade 2", "Class 1, grade 2", "Class 13, grade 1", "Class 7, grade 3", "Class 3, grade 5", "Class 3, grade 4", "Class 5, grade 1", "Class 7, grade 1", "Class 4, grade 1",
                  "Class 9, grade 1", "Class 10, grade 1"],
           }
df_student = pd.DataFrame(student)
print(df_student)
Copy code

9.1 len - select the data whose element string length in [class] column is greater than 5

print(df_student[df_student["class"].str.len() > 5])
Copy code

9.2 replace - change [grade] in the element of [class] column to [first grade of school]

# Note that the right side of the equal sign returns a Series, which should be assigned to the column corresponding to the original DataFrame
df_student["class"] = df_student["class"].str.replace("first grade", "First grade of school")
print(df_student)
# The following is the loc usage of column fetching
df_student.loc[:,"class"] = df_student.loc[:,"class"].str.replace("first grade", "First grade of school")
Copy code

9.3 contains - filter the data of "school" and "1" in the [class] column

print(df_student[
          (df_student["class"].str.contains("school"))
          &
          (df_student["class"].str.contains("1"))])
Copy code

9.4 split -- cut string

9.5 get - print the first character of the student's name (last name)

print((df_student["full name"].str.get(0)))
Copy code

9.6 match - regular expression matching to find the data whose name contains' Wang | Li '

reg = 'king|Lee'
print(df_student[df_student["full name"].str.match(reg)])
Copy code

9.7 pad - fill character*

# Note that width=10 means that the current character + the * to be filled, and the width is 10
# Both sides are added with *, and the length of the final string is 10, which is not enough to be added with * (you can also directly use the center function without writing side)
df_student["full name"] = df_student["full name"].str.pad(width=10, side='both', fillchar='*')
# Add - on the right, and the length of the final string is 20, which is insufficient - add
df_student["full name"] = df_student["full name"].str.pad(width=20, side='right', fillchar='-')
print(df_student)
Copy code

9.5 screenshot of knowledge points

10. Add a column of statistical total score apply method

10.1 direct addition

df_student["Total score"] = df_student["language"] + df_student["mathematics"] + df_student["English"]
Copy code

10.2 traversal using the apply method of Series (apply passes in a function, which is more powerful)

df_student['Total score'] = pd.Series(df_student.index.tolist()).apply(
    lambda i: df_student.loc[i, "language"] + df_student.loc[i, "mathematics"] + df_student.loc[i, "English"])
    
# 1. In order to use the apply method of Series, a Series is generated according to the Index of DataFrame,
pd.Series(df_student.index.tolist())
# 2. The following is a lambda expression, which can also be passed in by defining a function (you can do a lot of processing by writing a function), as shown in the following example
 Copy code

# Let the person whose Chinese is greater than 90 add 1000 points to his Chinese score, and then calculate the total score
def sum1(i):
    if df_student.loc[i, "language"] > 90:
        df_student.loc[i, "language"] = df_student.loc[i, "language"] + 1000
    return df_student.loc[i, "language"] + df_student.loc[i, "mathematics"] + df_student.loc[i, "English"]


df_student['Total score'] = pd.Series(df_student.index.tolist()).apply(
    lambda i: sum1(i))
Copy code

11. Missing data processing

# Use numpy to generate a set of random integers (between 0 and 100, with the shape of 5 rows and 7 columns)
rand = np.random.randint(0, 100, (5, 7))
# Upload data generated using DataFrame
df = pd.DataFrame(rand, columns=list("ABCDEFG"))
# Define some NaN
df.loc[0:3, "A":"B"] = np.nan
print(df)
Copy code

11.1 judge whether it is NaN

11.1.1 judge whether the whole df is Nan

# Is it null
print(pd.isnull(df))
The result is: DataFrame
 Copy code

# Isn't it null
print(pd.notnull(df))
Copy code

11.1.2 judge whether df specified column is Nan

# Print the data with NUll in column A
print(df[pd.isnull(df["A"])])
Copy code

# Print data whose data in column A is not NUll
print(df[pd.notnull(df["A"])])
Copy code

11.2 delete data with nan in df

# Do not enter the how parameter. The default value is any
# As long as one is NaN, the row will be deleted
print(df.dropna(axis=0))
Copy code

# This row will be deleted only if all of them are NaN
print(df.dropna(axis=0,how="all"))
Copy code

Topics: Python pandas DataFrame

Programmer Think