DataFrame, the second of the two main data structures of Pandas

Posted by guitarist809 on Wed, 19 Jan 2022 00:19:15 +0100

DataFrame is a tabular data structure. It contains an ordered set of columns, and each column can hold a different value type (numeric, string, Boolean, etc.).

DataFrame has both a row index and a column index. Compared with similar data structures (such as R's data.frame), DataFrame treats row-oriented and column-oriented operations roughly symmetrically.

The data in the DataFrame is generally stored in one or more two-dimensional blocks (rather than lists, dictionaries or other one-dimensional data structures).
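
As a minimal sketch of the points above (the column names and values here are made up for illustration), each column keeps its own dtype, and the frame exposes both a row index and a column index:

```python
import pandas as pd

# Illustrative data: one numeric, one string, and one Boolean column.
df = pd.DataFrame({
    'age': [15, 20, 25],
    'name': ['a', 'b', 'c'],
    'adult': [False, True, True],
})

print(df.dtypes)    # one dtype per column
print(df.index)     # row index
print(df.columns)   # column index
```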

Create DataFrame

  • Create a DataFrame from a dictionary whose values are lists

    In [13]: people = {'name': ['a', 'b', 'c', 'd'],
      ...: 'age': [15, 20, 25, 30],
      ...: 'gender': ['male', 'female', 'female', 'male']}
    
    In [14]: df = pd.DataFrame(people)
    
    In [15]: df
    Out[15]: 
      name  age  gender
    0    a   15    male
    1    b   20  female
    2    c   25  female
    3    d   30    male
    

    Use the columns parameter to set the order of the columns

    In [16]: df_test_columns = pd.DataFrame(people, columns=['gender', 'name', 'age'])
    
    In [17]: df_test_columns
    Out[17]: 
       gender name  age
    0    male    a   15
    1  female    b   20
    2  female    c   25
    3    male    d   30
    
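  • As with Series, a column named in columns that is not found in the data is filled with NA values. (Here data is a dictionary with the keys 'state', 'year' and 'pop'; its definition is not shown.)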

    In [32]: df2 = pd.DataFrame(data, columns=['pop', 'state', 'year', 'debt'],
        ...: index=['one', 'two', 'three', 'four', 'five'])
    
    In [33]: df2
    Out[33]: 
           pop   state  year debt
    one    1.5    Ohio  2000  NaN
    two    1.7    Ohio  2001  NaN
    three  3.6    Ohio  2002  NaN
    four   2.4  Nevada  2001  NaN
    five   3.9  Nevada  2002  NaN
    
    In [34]: df2.columns
    Out[34]: Index(['pop', 'state', 'year', 'debt'], dtype='object')
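
The session above does not show how data was defined. The following is a self-contained sketch in which data is a hypothetical dictionary, reconstructed only so that it matches the output shown; the 'debt' column is not a key in it, so it comes back as all NA:

```python
import pandas as pd

# Hypothetical input, chosen to reproduce the table printed above.
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 3.9],
}

# 'debt' is requested as a column but is missing from data.
df2 = pd.DataFrame(data, columns=['pop', 'state', 'year', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])

all_missing = df2['debt'].isna().all()
print(all_missing)
```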
    

In actual use, it is more common to create a DataFrame by reading a table from a file, so the other ways of constructing a DataFrame are not covered in detail here.
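
As a rough sketch of reading a table, pd.read_csv accepts a path or a file-like object. The CSV text below is made up for illustration and is wrapped in io.StringIO so that the example needs no file on disk:

```python
import io
import pandas as pd

# Illustrative CSV content; in practice this would be a file path.
csv_text = """name,age,gender
a,15,male
b,20,female
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```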

Column access and processing of DataFrame

  • There are two ways to get a DataFrame column: dictionary-style notation (df['col']) and attribute notation (df.col)

    In [35]: df2.state
    Out[35]: 
    one        Ohio
    two        Ohio
    three      Ohio
    four     Nevada
    five     Nevada
    Name: state, dtype: object
    
    In [36]: df2['state']
    Out[36]: 
    one        Ohio
    two        Ohio
    three      Ohio
    four     Nevada
    five     Nevada
    Name: state, dtype: object
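
The two forms are not fully interchangeable. A small sketch (with made-up data) of where attribute access falls short: a column whose name collides with a DataFrame method, such as 'pop', or a name that is not a valid Python identifier, can only be reached with the dictionary-style form:

```python
import pandas as pd

df = pd.DataFrame({
    'pop': [1.5, 1.7],
    'state': ['Ohio', 'Nevada'],
    'birth year': [2000, 2001],   # not a valid identifier
})

# For an ordinary name, both forms return the same Series.
same = df.state.equals(df['state'])

# df.pop is the DataFrame.pop method, not the 'pop' column.
pop_is_method = callable(df.pop)

# The bracket form always works.
pop_col = df['pop']
year_col = df['birth year']
```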
    
  • Modify the value of a column by assigning (a value or a group of values)

    In [37]: df2.debt = 16.5
    
    In [38]: df2
    Out[38]: 
           pop   state  year  debt
    one    1.5    Ohio  2000  16.5
    two    1.7    Ohio  2001  16.5
    three  3.6    Ohio  2002  16.5
    four   2.4  Nevada  2001  16.5
    five   3.9  Nevada  2002  16.5
    
    In [45]: df2['debt'] = np.arange(5.)
    
    In [46]: df2
    Out[46]: 
           pop   state  year  debt
    one    1.5    Ohio  2000   0.0
    two    1.7    Ohio  2001   1.0
    three  3.6    Ohio  2002   2.0
    four   2.4  Nevada  2001   3.0
    five   3.9  Nevada  2002   4.0
    

    When assigning a list or array to a column, its length must match the length of the DataFrame. If a Series is assigned, its labels are aligned exactly with the index of the DataFrame, and any positions without a matching label are filled with missing values.

    In [47]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
    
    In [48]: df2.debt = val
    
    In [49]: df2
    Out[49]: 
           pop   state  year  debt
    one    1.5    Ohio  2000   NaN
    two    1.7    Ohio  2001  -1.2
    three  3.6    Ohio  2002   NaN
    four   2.4  Nevada  2001  -1.5
    five   3.9  Nevada  2002  -1.7
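
A short sketch of both rules, using a made-up three-row frame: a list of the wrong length raises ValueError, while a Series is aligned on labels, with unmatched rows set to NaN and labels absent from the DataFrame dropped:

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A list must have exactly one value per row.
try:
    df['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = str(exc)

# A Series is matched by label; 'six' is not in df.index and is ignored.
df['debt'] = pd.Series([-1.2, 9.9], index=['two', 'six'])
print(df)
```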
    
  • Assigning a value to a column that does not exist creates a new column.

    In [55]: df2['eastern'] = df2.state == 'Ohio'
    
    In [56]: df2
    Out[56]: 
           pop   state  year  debt  eastern
    one    1.5    Ohio  2000   NaN     True
    two    1.7    Ohio  2001  -1.2     True
    three  3.6    Ohio  2002   NaN     True
    four   2.4  Nevada  2001  -1.5    False
    five   3.9  Nevada  2002  -1.7    False
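
Note that a new column can only be created with the dictionary-style form. A sketch with made-up data: assigning to a new attribute name sets a plain Python attribute on the object (pandas emits a warning, which is silenced here) and leaves the columns unchanged:

```python
import warnings
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Nevada']})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')          # pandas warns about this pattern
    df.eastern = df['state'] == 'Ohio'       # plain attribute, no new column
attr_created_column = 'eastern' in df.columns

# The bracket form does create the column.
df['eastern'] = df['state'] == 'Ohio'
```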
    
  • Delete a column. You can use the del keyword or DataFrame.drop()

    In [57]: del df2['eastern']
    
    In [58]: df2
    Out[58]: 
           pop   state  year  debt
    one    1.5    Ohio  2000   NaN
    two    1.7    Ohio  2001  -1.2
    three  3.6    Ohio  2002   NaN
    four   2.4  Nevada  2001  -1.5
    five   3.9  Nevada  2002  -1.7
    
    In [65]: df2['debt'].loc['two']
    Out[65]: -1.2
    
    DataFrame.drop(labels=None, axis=0, index=None, columns=None, inplace=False)
    

    Parameter meaning:

    • labels: the rows or columns to delete, given as a single label or a list
    • axis: the default value is 0, which deletes rows; specify axis=1 to delete columns
    • index: directly specify the rows to delete. To delete multiple rows, pass a list
    • columns: directly specify the columns to delete. To delete multiple columns, pass a list
    • inplace: the default value is False, so the deletion returns a new object and does not change the original data; when inplace=True, the original data is changed
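
The parameters above can be sketched as follows, on a small made-up frame. By default drop returns a new DataFrame; only inplace=True changes the original:

```python
import pandas as pd

df = pd.DataFrame(
    {'pop': [1.5, 1.7, 3.6],
     'state': ['Ohio', 'Ohio', 'Nevada'],
     'debt': [0.0, 1.0, 2.0]},
    index=['one', 'two', 'three'],
)

# Two equivalent ways to drop a column.
no_debt = df.drop(columns=['debt'])
no_debt_axis = df.drop(['debt'], axis=1)

# Drop a row by its index label.
no_row = df.drop(index=['one'])

# The original still has the column at this point.
original_unchanged = 'debt' in df.columns

# With inplace=True the original is modified and None is returned.
result = df.drop(columns=['debt'], inplace=True)
```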

    A column retrieved by indexing may be a view of the underlying data rather than a copy. In that case, in-place modifications to the returned Series also affect the source DataFrame (the exact behavior depends on the pandas version and on whether Copy-on-Write is enabled). To work on an independent copy, call the copy method of the Series explicitly.

    This is a very common source of SettingWithCopyWarning. For the causes and solutions, refer to: Principle and solution of SettingWithCopyWarning in Pandas
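
A minimal sketch of the explicit copy, with made-up data. Whatever the view semantics of the installed pandas version, a Series obtained through copy() is independent of the source DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'debt': [0.0, 1.0, 2.0]}, index=['one', 'two', 'three'])

# Take an explicit copy of the column, then modify the copy.
debt_copy = df['debt'].copy()
debt_copy.loc['one'] = 99.0

print(df.loc['one', 'debt'])   # the source DataFrame is unchanged
```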

Row access for DataFrame

DataFrame has two indexers for rows: iloc and loc.

iloc selects rows by integer position, while loc selects rows by index label. When the index labels happen to be 0, 1, 2, 3, ... both return the same rows, but they are fundamentally different.

  • When the index is not a numeric index such as 0, 1, 2, 3

    In [10]: df1 = pd.DataFrame(people, index=['w', 'x', 'y', 'z'])
    
    In [11]: df1
    Out[11]: 
      name  age  gender
    w    a   15    male
    x    b   20  female
    y    c   25  female
    z    d   30    male
    
    In [12]: df1.iloc[0]
    Out[12]: 
    name         a
    age         15
    gender    male
    Name: w, dtype: object
    
    In [13]: df1.loc['w']
    Out[13]: 
    name         a
    age         15
    gender    male
    Name: w, dtype: object
    
  • When the index is numeric (here 1, 2, 3, 4)

    In [15]: df2 = pd.DataFrame(people, index=range(1,5))
    
    In [16]: df2
    Out[16]: 
      name  age  gender
    1    a   15    male
    2    b   20  female
    3    c   25  female
    4    d   30    male
    
    In [18]: df2.iloc[0]
    Out[18]: 
    name         a
    age         15
    gender    male
    Name: 1, dtype: object
    
    In [19]: df2.loc[1]
    Out[19]: 
    name         a
    age         15
    gender    male
    Name: 1, dtype: object
    

    When the index is numeric, both indexers succeed, but they still look rows up in different ways.

    iloc looks up by position (is this the 0th row?), while loc looks up by label (is the index value equal to 1?).
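
The difference shows up as soon as position and label disagree. A sketch that reuses the people data from above with the 1-based index: label 0 does not exist, so loc[0] raises KeyError, and position 4 does not exist, so iloc[4] raises IndexError. Slices differ too: loc includes the end label, iloc excludes the end position.

```python
import pandas as pd

people = {'name': ['a', 'b', 'c', 'd'],
          'age': [15, 20, 25, 30],
          'gender': ['male', 'female', 'female', 'male']}
df2 = pd.DataFrame(people, index=range(1, 5))

first_by_position = df2.iloc[0]   # 0th row
first_by_label = df2.loc[1]       # row whose label is 1

try:
    df2.loc[0]                    # no row is labeled 0
    loc_zero_found = True
except KeyError:
    loc_zero_found = False

try:
    df2.iloc[4]                   # only positions 0..3 exist
    iloc_four_found = True
except IndexError:
    iloc_four_found = False

last_by_label = df2.loc[4]        # label 4 is the last row

rows_loc = len(df2.loc[1:2])      # labels 1 and 2, end included
rows_iloc = len(df2.iloc[1:2])    # position 1 only, end excluded
```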
