A DataFrame is a tabular data structure containing an ordered set of columns. Each column can hold a different value type (numeric, string, Boolean, etc.).
A DataFrame has both a row index and a column index. Compared with similar data structures (such as R's data.frame), row-oriented and column-oriented operations in a DataFrame are treated roughly symmetrically.
The data in the DataFrame is generally stored in one or more two-dimensional blocks (rather than lists, dictionaries or other one-dimensional data structures).
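As a small illustration of the description above, the following sketch builds a frame whose columns hold different value types and then inspects its two indexes. The sample data and column names are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different value type.
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],        # strings
    'age': [15, 20, 25],            # integers
    'score': [1.5, 2.0, 3.5],       # floats
    'member': [True, False, True],  # Booleans
})

print(df.dtypes)         # one dtype per column
print(list(df.columns))  # the column index
print(list(df.index))    # the row index (default labels 0, 1, 2)
```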
Create DataFrame
- Create a DataFrame using a dictionary whose values are lists (the sessions below assume import pandas as pd and import numpy as np)
In [13]: people = {'name': ['a', 'b', 'c', 'd'],
    ...:           'age': [15, 20, 25, 30],
    ...:           'gender': ['male', 'female', 'female', 'male']}

In [14]: df = pd.DataFrame(people)

In [15]: df
Out[15]:
  name  age  gender
0    a   15    male
1    b   20  female
2    c   25  female
3    d   30    male
You can use the columns parameter to change the order of the columns

In [16]: df_test_columns = pd.DataFrame(people, columns=['gender', 'name', 'age'])

In [17]: df_test_columns
Out[17]:
   gender name  age
0    male    a   15
1  female    b   20
2  female    c   25
3    male    d   30
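- As with Series, a column named in columns that is not found in the data is filled with NA values. Here data is assumed to hold the columns pop, state, and year; its definition is not shown in this section.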
In [32]: df2 = pd.DataFrame(data, columns=['pop', 'state', 'year', 'debt'],
    ...:                    index=['one', 'two', 'three', 'four', 'five'])

In [33]: df2
Out[33]:
       pop   state  year debt
one    1.5    Ohio  2000  NaN
two    1.7    Ohio  2001  NaN
three  3.6    Ohio  2002  NaN
four   2.4  Nevada  2001  NaN
five   3.9  Nevada  2002  NaN

In [34]: df2.columns
Out[34]: Index(['pop', 'state', 'year', 'debt'], dtype='object')
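Because the definition of data is not shown, here is a self-contained sketch of the same behavior. The dictionary below is an assumed stand-in with fewer rows, not the original data.

```python
import pandas as pd

# Assumed stand-in for `data`: a dict of equal-length lists.
data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}

# 'debt' is not a key of `data`, so that column is created and filled with NaN.
frame = pd.DataFrame(data, columns=['pop', 'state', 'year', 'debt'])

print(frame)
print(frame['debt'].isna().all())
```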
In practice, a DataFrame is more often created by reading a table from a file, so the other ways of constructing a DataFrame are not covered here.
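A minimal sketch of reading a table into a DataFrame, assuming the table is CSV text. An in-memory string stands in for a file so the example runs on its own; with a real file, its path would be passed to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO lets read_csv treat it like a file.
csv_text = """name,age,gender
a,15,male
b,20,female
"""

df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```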
Column access and processing of DataFrame
- A DataFrame column can be retrieved in two ways: dictionary-style notation (df2['state']) or attribute access (df2.state)
In [35]: df2.state
Out[35]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [36]: df2['state']
Out[36]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
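The two forms are not always interchangeable. The sketch below, using made-up data, shows two cases where only the bracket form reaches the column: a name that collides with a DataFrame method (pop) and a name that is not a valid Python identifier.

```python
import pandas as pd

# Hypothetical frame; 'pop' collides with the DataFrame.pop method.
df = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                   'pop': [1.5, 2.4],
                   'year built': [2000, 2001]})

# For an ordinary name, both forms return the same Series.
same = df.state.equals(df['state'])

# df.pop is the method, not the column, so it is callable.
pop_is_method = callable(df.pop)

# Bracket notation always refers to the column.
pop_col = df['pop']
spaced = df['year built']   # no attribute form exists for this name

print(same, pop_is_method)
```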
- Modify a column by assignment, using either a single value or an array of values
In [37]: df2.debt = 16.5

In [38]: df2
Out[38]:
       pop   state  year  debt
one    1.5    Ohio  2000  16.5
two    1.7    Ohio  2001  16.5
three  3.6    Ohio  2002  16.5
four   2.4  Nevada  2001  16.5
five   3.9  Nevada  2002  16.5

In [45]: df2['debt'] = np.arange(5.)

In [46]: df2
Out[46]:
       pop   state  year  debt
one    1.5    Ohio  2000   0.0
two    1.7    Ohio  2001   1.0
three  3.6    Ohio  2002   2.0
four   2.4  Nevada  2001   3.0
five   3.9  Nevada  2002   4.0
When assigning a list or array to a column, its length must match the length of the DataFrame. If a Series is assigned, its labels are aligned exactly to the DataFrame's index, and any position without a matching label is filled with a missing value.
In [47]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [48]: df2.debt = val

In [49]: df2
Out[49]:
       pop   state  year  debt
one    1.5    Ohio  2000   NaN
two    1.7    Ohio  2001  -1.2
three  3.6    Ohio  2002   NaN
four   2.4  Nevada  2001  -1.5
five   3.9  Nevada  2002  -1.7
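The two assignment rules can be checked side by side. This sketch uses a small made-up frame: the Series is aligned by label (an unmatched label is dropped), while a list of the wrong length is rejected with a ValueError.

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A Series is aligned on index labels: 'one' and 'three' get NaN,
# and the label 'nine' (absent from df) is ignored.
df['debt'] = pd.Series([-1.2, -9.9], index=['two', 'nine'])

# A plain list must have exactly len(df) items.
try:
    df['bad'] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(df)
print(length_error)
```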
- Assigning a value to a column that does not exist creates a new column.
In [55]: df2['eastern'] = df2.state == 'Ohio'

In [56]: df2
Out[56]:
       pop   state  year  debt  eastern
one    1.5    Ohio  2000   NaN     True
two    1.7    Ohio  2001  -1.2     True
three  3.6    Ohio  2002   NaN     True
four   2.4  Nevada  2001  -1.5    False
five   3.9  Nevada  2002  -1.7    False
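Note that the session above creates the column with bracket notation. The sketch below, on made-up data, illustrates why: assigning to a new name through attribute syntax sets a plain Python attribute and does not add a column (pandas emits a UserWarning, silenced here).

```python
import warnings

import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Nevada']})

# Bracket notation creates the new column.
df['eastern'] = df['state'] == 'Ohio'

# Attribute syntax on a new name only sets an attribute on the object.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # pandas warns about this assignment
    df.western = [False, True]

print(list(df.columns))
```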
- Delete a column with the del keyword or with DataFrame.drop()
In [57]: del df2['eastern']

In [58]: df2
Out[58]:
       pop   state  year  debt
one    1.5    Ohio  2000   NaN
two    1.7    Ohio  2001  -1.2
three  3.6    Ohio  2002   NaN
four   2.4  Nevada  2001  -1.5
five   3.9  Nevada  2002  -1.7

In [65]: df2['debt'].loc['two']
Out[65]: -1.2
DataFrame.drop(labels=None, axis=0, index=None, columns=None, inplace=False)
Parameter meaning:
- labels: the row or column labels to delete, given as a single label or a list
- axis: the default value is 0, which means rows are deleted; axis=1 must be specified to delete columns
- index: directly specifies the rows to delete. To delete multiple rows, pass a list
- columns: directly specifies the columns to delete. To delete multiple columns, pass a list
- inplace: the default value is False, in which case the deletion returns a new object and does not change the original data. When inplace=True, the original data is changed and the method returns None
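The parameters listed above can be tried on a small frame. This is a sketch with made-up data showing the equivalent spellings for dropping a column, dropping a row, and the effect of inplace.

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'age': [15, 20, 25],
                   'gender': ['male', 'female', 'female']},
                  index=['x', 'y', 'z'])

# Two equivalent ways to drop a column; df itself is unchanged.
no_age = df.drop(columns=['age'])
no_age_axis = df.drop(['age'], axis=1)

# Drop a row by its index label.
no_x = df.drop(index=['x'])

# inplace=True modifies the object it is called on and returns None.
df_inplace = df.copy()
result = df_inplace.drop(columns=['gender'], inplace=True)

print(list(no_age.columns), list(no_x.index), list(df_inplace.columns))
```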
The column returned by indexing a DataFrame can be a view on the underlying data rather than a copy. In that case, in-place modifications to the returned Series are reflected in the source DataFrame; the exact behavior depends on the pandas version and on whether copy-on-write is enabled. To copy a column explicitly, use the copy method of Series.
This is where SettingWithCopyWarning most often appears. For the causes and solutions, refer to: Principle and solution of SettingWithCopyWarning in Pandas
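A minimal sketch of the explicit-copy approach, on made-up data. Changes to the copied Series never reach the source frame; to change the frame itself, a single .loc assignment on the DataFrame avoids chained indexing, which is one common way to sidestep the warning.

```python
import pandas as pd

df = pd.DataFrame({'debt': [1.0, 2.0, 3.0]}, index=['one', 'two', 'three'])

# Explicit copy: edits to debt_copy do not touch df.
debt_copy = df['debt'].copy()
debt_copy.loc['two'] = -1.2

# To modify the frame itself, assign through one .loc call on the DataFrame.
df.loc['three', 'debt'] = 0.0

print(df)
print(debt_copy)
```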
Row access for DataFrame
DataFrame has two indexers for rows: iloc and loc.
iloc selects rows by integer position, while loc selects rows by index label. When the index labels are the default 0, 1, 2, 3, both return the same rows, but they are different in essence.
- The index is not a numeric index such as 0, 1, 2, 3
In [10]: df1 = pd.DataFrame(people, index=['w', 'x', 'y', 'z'])

In [11]: df1
Out[11]:
  name  age  gender
w    a   15    male
x    b   20  female
y    c   25  female
z    d   30    male

In [12]: df1.iloc[0]
Out[12]:
name         a
age         15
gender    male
Name: w, dtype: object

In [13]: df1.loc['w']
Out[13]:
name         a
age         15
gender    male
Name: w, dtype: object
- The index is a numeric index (here 1, 2, 3, 4)
In [15]: df2 = pd.DataFrame(people, index=range(1,5))

In [16]: df2
Out[16]:
  name  age  gender
1    a   15    male
2    b   20  female
3    c   25  female
4    d   30    male

In [18]: df2.iloc[0]
Out[18]:
name         a
age         15
gender    male
Name: 1, dtype: object

In [19]: df2.loc[1]
Out[19]:
name         a
age         15
gender    male
Name: 1, dtype: object
As shown above, both indexers succeed when the index is numeric, but they still look rows up in different ways.
iloc selects by position in the index (iloc[0] is the first row), while loc selects the row whose index label equals the given value (loc[1] is the row labeled 1).
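The difference becomes visible as soon as a position and a label disagree. The sketch below reuses the people dictionary with the index 1 to 4: loc[0] fails because no row is labeled 0, and slices differ because iloc excludes the end position while loc includes the end label.

```python
import pandas as pd

people = {'name': ['a', 'b', 'c', 'd'],
          'age': [15, 20, 25, 30],
          'gender': ['male', 'female', 'female', 'male']}
df = pd.DataFrame(people, index=range(1, 5))

first_by_position = df.iloc[0]   # first row, whatever its label
first_by_label = df.loc[1]       # row whose label is 1

# No row carries the label 0, so loc raises KeyError.
try:
    df.loc[0]
    label_zero_missing = False
except KeyError:
    label_zero_missing = True

by_position = df.iloc[1:3]   # positions 1 and 2 (end excluded)
by_label = df.loc[1:3]       # labels 1, 2 and 3 (end included)

print(list(by_position.index), list(by_label.index))
```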