DataFrame, the second of the two main data structures of Pandas

Posted by guitarist809 on Wed, 19 Jan 2022 00:19:15 +0100

A DataFrame is a tabular data structure made up of an ordered set of columns. Each column can hold a different value type (numeric, string, Boolean, etc.).

A DataFrame has both a row index and a column index. Compared with similar data structures (such as R's data.frame), row-oriented and column-oriented operations on a DataFrame are treated roughly symmetrically.
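As a small illustration of the two indexes and of per-column value types, the following sketch builds a tiny frame and inspects it. The column names and values are made up for this example only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {"name": ["a", "b"], "age": [15, 20], "is_member": [True, False]}
)

row_labels = list(df.index)       # default row index: 0, 1
column_labels = list(df.columns)  # column index, in insertion order

print(row_labels)
print(column_labels)
print(df.dtypes)                  # one dtype per column
```

Each column keeps its own dtype, which is why `df.dtypes` returns one entry per column rather than a single type for the whole table.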

The data in the DataFrame is generally stored in one or more two-dimensional blocks (rather than lists, dictionaries or other one-dimensional data structures).

Create DataFrame

Create a DataFrame from a dictionary whose values are lists (the sessions below assume import pandas as pd and import numpy as np have already been run)

In [13]: people = {'name': ['a', 'b', 'c', 'd'],
  ...: 'age': [15, 20, 25, 30],
  ...: 'gender': ['male', 'female', 'female', 'male']}

In [14]: df = pd.DataFrame(people)

In [15]: df
Out[15]: 
  name  age  gender
0    a   15    male
1    b   20  female
2    c   25  female
3    d   30    male

Use the columns parameter to set the order of the columns

In [16]: df_test_columns = pd.DataFrame(people, columns=['gender', 'name', 'age'])

In [17]: df_test_columns
Out[17]: 
   gender name  age
0    male    a   15
1  female    b   20
2  female    c   25
3    male    d   30

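As with Series, if a name passed in columns is not found in the data, that column is filled with NA values. Here data is a dictionary of lists with the keys state, year and pop (its definition is reconstructed from the output below)

In [31]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    ...: 'year': [2000, 2001, 2002, 2001, 2002],
    ...: 'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}

In [32]: df2 = pd.DataFrame(data, columns=['pop', 'state', 'year', 'debt'],
    ...: index=['one', 'two', 'three', 'four', 'five'])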

In [33]: df2
Out[33]: 
       pop   state  year debt
one    1.5    Ohio  2000  NaN
two    1.7    Ohio  2001  NaN
three  3.6    Ohio  2002  NaN
four   2.4  Nevada  2001  NaN
five   3.9  Nevada  2002  NaN

In [34]: df2.columns
Out[34]: Index(['pop', 'state', 'year', 'debt'], dtype='object')

In practice, a DataFrame is more often created by reading a table from a file, so the other ways of constructing one are not covered here.
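To make the "read a table" route concrete, here is a minimal sketch using `pd.read_csv`. The CSV text is invented for this example and is read from an in-memory buffer so that the snippet runs without any file on disk. With a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; in real use this would live in a file.
csv_text = """name,age,gender
a,15,male
b,20,female
"""

# read_csv accepts a path or any file-like object.
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)
```

The first line of the text is used as the column index, and the row index defaults to 0, 1, 2, and so on.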

Column access and processing of DataFrame

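There are two ways to get a DataFrame column as a Series: attribute access (df2.state) and dictionary-style indexing (df2['state']). Attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method.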

In [35]: df2.state
Out[35]: 
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [36]: df2['state']
Out[36]: 
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

Modify a column by assigning to it, either a single scalar value or an array of values

In [37]: df2.debt = 16.5

In [38]: df2
Out[38]: 
       pop   state  year  debt
one    1.5    Ohio  2000  16.5
two    1.7    Ohio  2001  16.5
three  3.6    Ohio  2002  16.5
four   2.4  Nevada  2001  16.5
five   3.9  Nevada  2002  16.5

In [45]: df2['debt'] = np.arange(5.)

In [46]: df2
Out[46]: 
       pop   state  year  debt
one    1.5    Ohio  2000   0.0
two    1.7    Ohio  2001   1.0
three  3.6    Ohio  2002   2.0
four   2.4  Nevada  2001   3.0
five   3.9  Nevada  2002   4.0

When assigning a list or array to a column, its length must match the length of the DataFrame. If a Series is assigned, its labels are aligned exactly with the index of the DataFrame, and any row without a matching label is filled with a missing value.

In [47]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [48]: df2.debt = val

In [49]: df2
Out[49]: 
       pop   state  year  debt
one    1.5    Ohio  2000   NaN
two    1.7    Ohio  2001  -1.2
three  3.6    Ohio  2002   NaN
four   2.4  Nevada  2001  -1.5
five   3.9  Nevada  2002  -1.7
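The two rules above can be checked side by side. This is a minimal sketch on a made-up three-row frame: a list of the wrong length is rejected, while a Series is aligned by label, with labels that are absent from the frame ignored.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
frame = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

# A plain list must have exactly as many elements as the frame has rows.
mismatch_error = None
try:
    frame["debt"] = [1.0, 2.0]
except ValueError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)

# A Series is aligned on its labels: 'two' matches, 'ten' is not in the
# frame and is dropped, and the unmatched rows become NaN.
frame["debt"] = pd.Series([-1.2, 9.9], index=["two", "ten"])
print(frame)
```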

Assigning a value to a column that does not exist creates a new column.

In [55]: df2['eastern'] = df2.state == 'Ohio'

In [56]: df2
Out[56]: 
       pop   state  year  debt  eastern
one    1.5    Ohio  2000   NaN     True
two    1.7    Ohio  2001  -1.2     True
three  3.6    Ohio  2002   NaN     True
four   2.4  Nevada  2001  -1.5    False
five   3.9  Nevada  2002  -1.7    False

Delete a column with the del keyword or with DataFrame.drop()
```
In [57]: del df2['eastern']

In [58]: df2
Out[58]: 
       pop   state  year  debt
one    1.5    Ohio  2000   NaN
two    1.7    Ohio  2001  -1.2
three  3.6    Ohio  2002   NaN
four   2.4  Nevada  2001  -1.5
five   3.9  Nevada  2002  -1.7

In [65]: df2['debt'].loc['two']
Out[65]: -1.2
```
```
DataFrame.drop(labels=None,axis=0, index=None, columns=None, inplace=False)
```
Parameter meaning:
- labels: the row or column labels to delete, given as a single label or a list
- axis: the default value is 0, which means rows are deleted; axis must be set to 1 when deleting columns by labels
- index: directly specify the rows to delete. To delete multiple rows, pass a list
- columns: directly specify the columns to delete. To delete multiple columns, pass a list
- inplace: the default value is False, in which case a new DataFrame is returned and the original data is not changed; when inplace=True, the original data is changed and None is returned

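The parameters listed above can be tried out on a small frame. This is a sketch with made-up data: `columns=` and `labels=` with `axis=1` are two spellings of the same column deletion, `index=` deletes rows, and `inplace=True` modifies the frame itself.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
frame = pd.DataFrame(
    {
        "pop": [1.5, 1.7, 3.6],
        "state": ["Ohio", "Ohio", "Nevada"],
        "year": [2000, 2001, 2002],
    },
    index=["one", "two", "three"],
)

# Two equivalent ways to drop a column; both return a new DataFrame.
without_year = frame.drop(columns=["year"])
same_result = frame.drop(labels=["year"], axis=1)

# Drop rows by label; the original frame is still unchanged.
without_rows = frame.drop(index=["one", "three"])

# inplace=True changes frame itself and returns None.
frame.drop(columns=["pop"], inplace=True)
print(frame)
```
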
The column returned by indexing can be a view of the underlying data rather than a copy, so in-place modifications to the returned Series may affect the source DataFrame. The exact behavior depends on the pandas version and on whether Copy-on-Write is enabled. To get an independent column, copy it explicitly with the copy method of Series.
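A short sketch of the explicit copy, on a made-up frame: after `copy()`, changes to the copied Series never reach the source DataFrame, whatever the pandas version.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
frame = pd.DataFrame({"debt": [0.0, 1.0, 2.0]}, index=["one", "two", "three"])

# copy() returns an independent Series with its own data.
debt_copy = frame["debt"].copy()
debt_copy.loc["one"] = 99.0

print(debt_copy.loc["one"])       # the copy was changed
print(frame.loc["one", "debt"])   # the source frame still holds the old value
```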

This is where SettingWithCopyWarning is most likely to appear. For the causes and solutions, refer to: Principle and solution of SettingWithCopyWarning in Pandas
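One commonly recommended way to avoid the warning is to select rows and the column in a single `.loc` call instead of chaining two indexing steps. The sketch below uses made-up data and is not a summary of the referenced article.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
frame = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "debt": [0.0, 1.0, 2.0]}
)

is_ohio = frame["state"] == "Ohio"

# Chained form to avoid: frame[is_ohio]["debt"] = -1.0
# It assigns into an intermediate object, so the frame may stay unchanged.

# Single-step form: rows and column are selected in one .loc call,
# so the assignment is applied to frame itself.
frame.loc[is_ohio, "debt"] = -1.0
print(frame)
```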

Row access for DataFrame

DataFrame has two indexers for rows: iloc and loc.

iloc selects rows by integer position, while loc selects rows by index label. When the index labels are 0, 1, 2, 3, both return the same rows, but they work in different ways.

When the index is not numeric (here w, x, y, z)

In [10]: df1 = pd.DataFrame(people, index=['w', 'x', 'y', 'z'])

In [11]: df1
Out[11]: 
  name  age  gender
w    a   15    male
x    b   20  female
y    c   25  female
z    d   30    male

In [12]: df1.iloc[0]
Out[12]: 
name         a
age         15
gender    male
Name: w, dtype: object

In [13]: df1.loc['w']
Out[13]: 
name         a
age         15
gender    male
Name: w, dtype: object

When the index is numeric but is not 0, 1, 2, 3 (here 1 to 4)

In [15]: df2 = pd.DataFrame(people, index=range(1,5))

In [16]: df2
Out[16]: 
  name  age  gender
1    a   15    male
2    b   20  female
3    c   25  female
4    d   30    male

In [18]: df2.iloc[0]
Out[18]: 
name         a
age         15
gender    male
Name: 1, dtype: object

In [19]: df2.loc[1]
Out[19]: 
name         a
age         15
gender    male
Name: 1, dtype: object

When the index is numeric, both indexers accept integers, but they still interpret them differently.

iloc[0] selects by position, that is, the row at position 0 (the first row), while loc[1] selects by label, that is, the row whose index value equals 1.
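The difference becomes visible as soon as positions and labels disagree. The sketch below rebuilds a frame with the index 1 to 4 from a reduced version of the people dictionary and shows two consequences: a label that does not exist raises KeyError with loc, and slices behave differently (positional slices exclude the end, label slices include it).

```python
import pandas as pd

# Reduced version of the people data, with index labels 1..4.
people = {"name": ["a", "b", "c", "d"], "age": [15, 20, 25, 30]}
frame = pd.DataFrame(people, index=range(1, 5))

first_by_position = frame.iloc[0]  # row at position 0
first_by_label = frame.loc[1]      # row whose label is 1 (the same row)

# There is no row labeled 0, so loc raises KeyError.
label_zero_missing = False
try:
    frame.loc[0]
except KeyError:
    label_zero_missing = True

by_position = frame.iloc[1:3]  # positions 1 and 2, end excluded
by_label = frame.loc[1:3]      # labels 1 through 3, end included

print(list(by_position.index))
print(list(by_label.index))
```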
