Pandas-01-DataFrame|Series Common Methods

Posted by designationlocutus on Sat, 06 Jun 2020 02:56:32 +0200

This post introduces Series and DataFrame, the two core data structures of pandas, Python's go-to data analysis package, with practical examples.

Copyright statement: this is the blogger's original article. Please credit the original source when reproducing it.

Writing time: 2020-06-06 0:08

Catalog

1. Introduction to pandas
2. Series of pandas data structure
#Pandas.Series: Quick Creation
#Pandas.Series: Get All Values (values)
#Pandas.Series: Get the Index (index)
#Pandas.Series: numpy.ndarray-like Behavior
#Pandas.Series: Select Values by Index Label
#Pandas.Series: Dictionary-like Behavior
3. DataFrame of pandas Data Structure
#DataFrame Creation
#Pandas.DataFrame: Selecting Columns
Pandas.DataFrame: Select One Column
Pandas.DataFrame: Select Multiple Columns, Method 1
Pandas.DataFrame: Select Multiple Columns, Method 2
#Pandas.DataFrame: Selecting Rows
Pandas.DataFrame: Select One Row
Pandas.DataFrame: Select Multiple Rows
Pandas.DataFrame: Filter Rows by One Condition
Pandas.DataFrame: Filter Rows by Multiple Conditions
Pandas.DataFrame: Filter Rows with map and lambda
Pandas.DataFrame: Filter Rows with isin
#Pandas.DataFrame: Select Rows and Columns Together
#Pandas.DataFrame: Replace Values (where method)
#Pandas.DataFrame: Look Up Values by Position (lookup method)
#Pandas.DataFrame: Add a New Column at the End
#Pandas.DataFrame: Add a New Column at a Specified Position (insert method)
#Pandas.DataFrame: Delete a Column (del)
#Pandas.DataFrame: Copy (copy method)
#Pandas.DataFrame: View the First and Last Rows (head|tail)
#Series, Pandas.DataFrame: Unique Values of a Column (unique)
#Pandas.DataFrame: Count How Often Each Value Occurs in a Column (value_counts method)
#Pandas.DataFrame: Group by Columns and Count (groupby+count)
#Pandas.DataFrame: Group by a Column and Sum (groupby+sum)
#Pandas.DataFrame: Group by a Column and Take Out One Group (groupby+get_group)
#Pandas.DataFrame: Sorting
Pandas.DataFrame: Sort by Row or Column Labels (sort_index method)
Pandas.DataFrame: Sort by Column Values (sort_values method, by parameter)
#Pandas.DataFrame: Descriptive Statistics
#Pandas.DataFrame: Get Row and Column Labels
4. References

1. Introduction to pandas

  • The pandas module is built on numpy and is fast (its underlying algorithms are optimized with Cython); it offers a large number of functions that operate directly on every element.
  • It is often used together with packages such as Matplotlib, seaborn, statsmodels, and scikit-learn.
  • pandas handles tabular data better than numpy: each column can have its own data type, whereas a numpy array requires all elements to share one data type.
  • There are two main data structures: the one-dimensional Series (similar to numpy.ndarray, but with row labels) and the two-dimensional DataFrame (richer than R's data.frame), which has both row and column labels.
  • The pandas API reference can be searched here: https://pandas.pydata.org/pandas-docs/stable/reference/index.html .
  • Import the pandas package. By convention the Python community abbreviates pandas as pd; the two data structures can be imported at the same time (numpy, abbreviated np, is also imported because later examples call it):
In [8]: import pandas as pd
In [9]: from pandas import Series,DataFrame#Import the two data structures.
   ...: import numpy as np#Needed by later examples that use np.
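
To make the data type point above concrete, here is a minimal sketch (the names arr and mixed are made up for illustration): a numpy array coerces mixed input to a single data type, while a DataFrame keeps one data type per column.

```python
import numpy as np
import pandas as pd

# A numpy array holds a single dtype, so mixing numbers and strings
# turns every element into a string.
arr = np.array([[1, "a"], [2, "b"]])
print(arr.dtype.kind)  # 'U' means Unicode string

# A DataFrame stores each column separately, so dtypes can differ.
mixed = pd.DataFrame({"num": [1, 2], "txt": ["a", "b"]})
print(mixed.dtypes)
```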

2. Series of pandas data structure

  • Series is a one-dimensional labeled array with its own tags, structured as follows:

  • Series has a number of properties and functions:
In [27]: len(dir(Series))
Out[27]: 471#More than 400 attributes and methods

See: https://pandas.pydata.org/pandas-docs/stable/reference/series.html

#Pandas.Series: Quick Creation

  • Syntax: pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
  • data: a Python dictionary, an iterable, a numpy.ndarray, or a scalar value
  • index: optional; defaults to the integer labels 0, ..., len(data) - 1, or you can pass in your own
  • dtype: optional, sets the data type
In [38]: data = pd.Series([1,2,3,4,5])#No index specified, so the default index is 0, ..., len(data) - 1
In [39]: data
Out[39]:
0 1
1 2
2 3
3 4
4 5
dtype: int64


#Pandas.Series: Custom Index Values
In [47]: data1 = pd.Series([1,2,3,4,5],index=list('abcde'))
In [48]: data1
Out[48]:
a 1
b 2
c 3
d 4
e 5
dtype: int64

#Pandas.Series: Get All Values (values)

Returns a numpy array.

In [49]: data1.values
Out[49]: array([1, 2, 3, 4, 5], dtype=int64)

#Pandas.Series: Get the Index (index)

Returns an Index object.

In [50]: data1.index
Out[50]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object') 

#Pandas.Series: numpy.ndarray-like Behavior

Like an ndarray, a Series exposes its data type (dtype), shape, number of dimensions (ndim), and number of elements (size), and it supports indexing, slicing, Boolean filtering, and so on.
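
The attributes and operations listed above can be tried out as follows. This is a small sketch that rebuilds the same five-element Series under the hypothetical name s.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))

print(s.dtype)  # element data type
print(s.shape)  # shape as a tuple
print(s.ndim)   # number of dimensions
print(s.size)   # number of elements

first_two = s.iloc[:2]  # positional slice, as with an ndarray
large = s[s > 3]        # Boolean filtering keeps the matching labels
doubled = s * 2         # element-wise arithmetic
```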

#Pandas.Series: Select Values by Index Label

In [58]: data1['a']#Take a single value, similar to looking up a key in a dictionary
Out[58]: 1

In [10]: data1[['a','b']]#Pass a list of index labels to take several values
Out[10]:
a    1
b    2
dtype: int64
#Equivalent to the following positional access (newer pandas versions prefer data1.iloc[0] and data1.iloc[[0,1]])
In [14]: data1[0]
Out[14]: 1

In [15]: data1[[0,1]]
Out[15]:
a    1
b    2
dtype: int64
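
Because plain brackets can mean either a label or a position, the explicit accessors loc (labels) and iloc (positions) are often clearer. A minimal sketch, with s standing in for the Series above:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))

by_label = s.loc["a"]         # label-based lookup
by_position = s.iloc[0]       # position-based lookup
several = s.iloc[[0, 1]]      # several positions at once

label_slice = s.loc["a":"c"]  # label slices include the end label
pos_slice = s.iloc[0:2]       # position slices exclude the end position
```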

#Pandas.Series: Dictionary-like Behavior

In [18]: 'a' in data1#Test whether a label is in the Pandas.Series index
Out[18]: True

In [17]: 1 in data1.values#Test whether a value is among the Pandas.Series elements
Out[17]: True

In [20]: data1['a'] = 1234#Modify a Pandas.Series element value
In [21]: data1
Out[21]:
a    1234
b       2
c       3
d       4
e       5
dtype: int64
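
The dictionary analogy goes a little further than membership tests and assignment. The sketch below (the Series name prices is made up for illustration) builds a Series from a dictionary and uses get, keys, and items.

```python
import pandas as pd

# Dictionary keys become the index labels.
prices = pd.Series({"a": 1, "b": 2, "c": 3})

fallback = prices.get("z", 0)  # a missing label returns the default instead of raising
labels = list(prices.keys())   # keys() returns the index
pairs = dict(prices.items())   # items() yields (label, value) pairs
print(fallback, labels, pairs)
```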

3. DataFrame of pandas Data Structure

  • A DataFrame can be understood as a table with both a row index and a column index.
  • Each column is a Series object, and different columns can have different data types.
  • It has a large number of attributes and methods; for details see https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

#DataFrame Creation

  • Syntax: pandas.DataFrame(data=None, index: Optional[Collection] = None, columns: Optional[Collection] = None, dtype: Union[str, numpy.dtype, ExtensionDtype, None] = None, copy: bool = False)
  • data: the input data, for example a dictionary, a numpy.ndarray, an iterable, or another DataFrame
  • index: optional, row labels
  • columns: optional, column labels
  • dtype: optional, element data type

There are many ways to create it, two of which are listed:

#Create a Pandas.DataFrame from a dictionary
In [40]: d = {'col1': [1, 2], 'col2': [3, 4]}
    ...: df = pd.DataFrame(d,dtype=np.int8)#dtype specifies the element data type
In [41]: df
Out[41]:
   col1  col2
0     1     3
1     2     4

In [42]: df.dtypes#View data types
Out[42]:
col1    int8
col2    int8
dtype: object

#Create a Pandas.DataFrame from a 2-D numpy.ndarray
In [43]: df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
    ...:                    columns=['a', 'b', 'c'])#columns specifies the column labels

In [44]: df2
Out[44]:
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
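
As one more of the many creation routes, a list of dictionaries (one per row) also works. This is only a sketch; the records list and its contents are invented for illustration.

```python
import pandas as pd

# Each dictionary is one row; the keys become column labels.
records = [
    {"name": "falcon", "speed": 389.0},
    {"name": "parrot", "speed": 24.0},
    {"name": "lion"},  # a missing key is filled with NaN
]
animals = pd.DataFrame(records)
print(animals)
```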

#Pandas.DataFrame: Selecting Columns

#Create an example DataFrame
In [140]: d = {'one': pd.Series([2., 2., 3.,4.], index=['a', 'b', 'c','d']),'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),'three': pd.Series([3., 1., 3., 4.], index=['a', 'b', 'c', 'd'])}
In [141]: df = pd.DataFrame(d)
In [142]: df
Out[142]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0

Pandas.DataFrame: Select One Column

In [143]: df['one']
Out[143]:
a    2.0
b    2.0
c    3.0
d    4.0
Name: one, dtype: float64

Pandas.DataFrame: Select Multiple Columns, Method 1

In [144]: df[['one','three']]#Double brackets: pass a list of column labels inside the indexing brackets
Out[144]:
   one  three
a  2.0    3.0
b  2.0    1.0
c  3.0    3.0
d  4.0    4.0

Pandas.DataFrame: Select Multiple Columns, Method 2

In [76]: df.iloc[:,[0,2]]#df.iloc[row positions, [column positions]]
Out[76]:
   one  three
a  2.0    3.0
b  2.0    1.0
c  3.0    3.0
d  4.0    4.0

#Pandas.DataFrame: Selecting Rows

Pandas.DataFrame: Select One Row

In [94]: df
Out[94]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0

In [95]: df.loc[['a']]
Out[95]:
   one  two  three
a  2.0  1.0    3.0
In [106]: type(df.loc[['a']])#Returns a DataFrame object
Out[106]: pandas.core.frame.DataFrame

In [96]: df.loc['a']#Note the difference between ['a'] and [['a']]
Out[96]:
one      2.0
two      1.0
three    3.0
Name: a, dtype: float64
In [105]: type(df.loc['a'])#Returns a Series object
Out[105]: pandas.core.series.Series

Pandas.DataFrame: Select Multiple Rows

In [97]: df.loc[['a','d']]
Out[97]:
   one  two  three
a  2.0  1.0    3.0
d  4.0  4.0    4.0

Pandas.DataFrame: Filter Rows by One Condition

Pass in a Boolean vector to filter the rows.

In [118]: df
Out[118]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0

In [114]: df['two']#Select the 'two' column
Out[114]:
a    1.0
b    2.0
c    3.0
d    4.0
Name: two, dtype: float64

In [115]: df['two'] > 2#Comparing the 'two' column returns a Boolean Series
Out[115]:
a    False
b    False
c     True
d     True
Name: two, dtype: bool

In [116]: df[df['two'] > 2]#Take the rows whose 'two' value is greater than 2
Out[116]:
   one  two  three
c  3.0  3.0    3.0
d  4.0  4.0    4.0

Pandas.DataFrame: Filter Rows by Multiple Conditions

Combine the conditions with logical operators:

| or

& and

~ not

Note that each condition must be enclosed in parentheses, as follows:

In [125]: df[(df['two'] >= 2) & (df['three'] >= 3)]#Take rows where the 'two' column is greater than or equal to 2 and the 'three' column is greater than or equal to 3
Out[125]:
   one  two  three
c  3.0  3.0    3.0
d  4.0  4.0    4.0
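
The other two operators work the same way. The sketch below rebuilds the example frame and shows | and ~; the parentheses are needed because Python's &, | and ~ bind more tightly than comparisons such as >=.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [2.0, 2.0, 3.0, 4.0],
     "two": [1.0, 2.0, 3.0, 4.0],
     "three": [3.0, 1.0, 3.0, 4.0]},
    index=list("abcd"),
)

# Rows where 'two' is at least 4 OR 'three' is below 3.
either = df[(df["two"] >= 4) | (df["three"] < 3)]

# Rows where 'one' is NOT equal to 2.
negated = df[~(df["one"] == 2)]

print(either)
print(negated)
```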

Pandas.DataFrame: Filter Rows with map and lambda

In [136]: df
Out[136]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0

In [137]: filter_ = df['one'].map(lambda x: str(x).startswith('2'))#Flag the rows whose 'one' value, written as a string, starts with 2
In [138]: df[filter_]
Out[138]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
Pandas.DataFrame: Filter Rows with isin

In [139]: df
Out[139]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0

In [140]: df['one'].isin([1,2])#Flag the rows whose 'one' value is 1 or 2
Out[140]:
a     True
b     True
c    False
d    False
Name: one, dtype: bool

In [141]: df[df['one'].isin([1,2])]
Out[141]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0

#Pandas.DataFrame: Select Rows and Columns Together

In [101]: df
Out[101]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0

In [102]: df.iloc[[0,3],[0,1]]#Rows at positions 0 and 3, columns at positions 0 and 1
Out[102]:
   one  two
a  2.0  1.0
d  4.0  4.0

#Pandas.DataFrame: Replace Values (where method)

where keeps the elements that meet the condition, applies the replacement to those that do not, and returns a new object without changing the original data.

In [151]: df
Out[151]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0

In [152]: df.where(df['one'] < 3, df-1)#Subtract 1 from the rows (c and d) whose 'one' value is greater than or equal to 3
Out[152]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  2.0  2.0    2.0
d  3.0  3.0    3.0
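
A simpler element-wise case may make the direction of where easier to remember, together with its mirror image mask, which replaces the elements that do meet the condition. This is a sketch on the same example frame; the names kept and masked are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [2.0, 2.0, 3.0, 4.0],
     "two": [1.0, 2.0, 3.0, 4.0],
     "three": [3.0, 1.0, 3.0, 4.0]},
    index=list("abcd"),
)

kept = df.where(df > 2, 0)   # keep values greater than 2, replace the rest with 0
masked = df.mask(df > 2, 0)  # the opposite: replace values greater than 2 with 0

print(kept)
print(masked)
print(df)  # the original frame is unchanged
```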

#Pandas.DataFrame: Look Up Values by Position (lookup method)

(Personally, I don't find it very useful.)

In [166]: df
Out[166]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0

In [167]: df.lookup(['d','a'],['one','three'])#Find the values at coordinates ('d','one') and ('a','three')
Out[167]: array([4., 3.])

In [168]: df.lookup(['a','d'],['one','three'])
Out[168]: array([2., 4.])
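
DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the calls above only run on older versions. One possible replacement, sketched here under the assumption that every requested label exists in the frame, converts the labels to positions and indexes the underlying numpy array.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [2.0, 2.0, 3.0, 4.0],
     "two": [1.0, 2.0, 3.0, 4.0],
     "three": [3.0, 1.0, 3.0, 4.0]},
    index=list("abcd"),
)

rows = ["d", "a"]
cols = ["one", "three"]

# Translate labels to integer positions (a label that is missing would give -1).
row_pos = df.index.get_indexer(rows)
col_pos = df.columns.get_indexer(cols)

# Pairwise indexing on the numpy array picks (d, one) and (a, three).
values = df.to_numpy()[row_pos, col_pos]
print(values)
```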

#Pandas.DataFrame: Add a New Column at the End

This modifies the original data directly, so use it with caution.

In [169]: df1
Out[169]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0

In [170]: df1['five'] = -df1['three']#The new column is added at the end by default
In [171]: df1
Out[171]:
   one  two  three  five
a  2.0  1.0    3.0  -3.0
b  2.0  2.0    1.0  -1.0
c  3.0  3.0    3.0  -3.0
d  4.0  4.0    4.0  -4.0

#Pandas.DataFrame: Add a New Column at a Specified Position (insert method)

In [186]: df.insert(1, 'new_one', df['one'])#Add a new column after the first column
In [187]: df
Out[187]:
   one  new_one  two  three
a  2.0      2.0  1.0    3.0
b  2.0      2.0  2.0    1.0
c  3.0      3.0  3.0    3.0
d  4.0      4.0  4.0    4.0

#Pandas.DataFrame: Delete a Column (del)

This modifies the original data directly, so use it with caution.

In [177]: del df1['five']
In [178]: df1
Out[178]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0
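
When modifying the original in place is not wanted, assign and drop return new frames instead. A minimal sketch on the same example data; the names with_five and without_five are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [2.0, 2.0, 3.0, 4.0],
     "two": [1.0, 2.0, 3.0, 4.0],
     "three": [3.0, 1.0, 3.0, 4.0]},
    index=list("abcd"),
)

# assign returns a new frame with the extra column; df itself is untouched.
with_five = df.assign(five=-df["three"])

# drop returns a new frame without the column; with_five is untouched.
without_five = with_five.drop(columns=["five"])

print(list(df.columns), list(with_five.columns), list(without_five.columns))
```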

#Pandas.DataFrame: Copy (copy method)

Working on a copy keeps del and insert from modifying the original data directly.

In [189]: df1 = df.copy()
In [190]: df1
Out[190]:
   one  new_one  two  three
a  2.0      2.0  1.0    3.0
b  2.0      2.0  2.0    1.0
c  3.0      3.0  3.0    3.0
d  4.0      4.0  4.0    4.0

In [192]: del df1['new_one']
In [193]: df1#df1 is modified
Out[193]:
   one  two  three
a  2.0  1.0    3.0
b  2.0  2.0    1.0
c  3.0  3.0    3.0
d  4.0  4.0    4.0

In [194]: df#df is not modified
Out[194]:
   one  new_one  two  three
a  2.0      2.0  1.0    3.0
b  2.0      2.0  2.0    1.0
c  3.0      3.0  3.0    3.0
d  4.0      4.0  4.0    4.0

#Pandas.DataFrame: View the First and Last Rows (head|tail)

In [199]: df
Out[199]:
   one  new_one  two  three
a  2.0      2.0  1.0    3.0
b  2.0      2.0  2.0    1.0
c  3.0      3.0  3.0    3.0
d  4.0      4.0  4.0    4.0

In [200]: df.head(3)#View the first three rows
Out[200]:
   one  new_one  two  three
a  2.0      2.0  1.0    3.0
b  2.0      2.0  2.0    1.0
c  3.0      3.0  3.0    3.0

In [201]: df.tail(3)#View the last three rows
Out[201]:
   one  new_one  two  three
b  2.0      2.0  2.0    1.0
c  3.0      3.0  3.0    3.0
d  4.0      4.0  4.0    4.0

#Series, Pandas.DataFrame: Unique Values of a Column (unique)

In [68]: pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c' , 'c']).unique()
Out[68]: array(['c', 'a', 'd', 'b'], dtype=object)
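
unique returns the distinct values themselves, in order of first appearance. If only their number is needed, nunique gives it directly, as this small sketch shows (the name letters is made up):

```python
import pandas as pd

letters = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

distinct = letters.unique()     # the distinct values, in order of first appearance
n_distinct = letters.nunique()  # how many distinct values there are
print(distinct, n_distinct)
```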

#Pandas.DataFrame: Count How Often Each Value Occurs in a Column (value_counts method)

In [251]: df1
Out[251]:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  NaN
d  NaN  4.0

In [252]: df1['one'].value_counts()
Out[252]:
3.0    1
2.0    1
1.0    1
Name: one, dtype: int64

In [253]: df1['one'].value_counts(dropna=False)#Do not skip missing values (NaN)
Out[253]:
NaN    1
3.0    1
2.0    1
1.0    1
Name: one, dtype: int64
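
value_counts can also report relative frequencies through its normalize parameter. A minimal sketch on a made-up Series named ratings:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1.0, 2.0, 2.0, np.nan])

counts = ratings.value_counts()                # NaN is skipped by default
shares = ratings.value_counts(normalize=True)  # fractions of the non-missing values
with_nan = ratings.value_counts(dropna=False)  # NaN gets its own entry

print(counts)
print(shares)
print(with_nan)
```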

#Pandas.DataFrame: Group by Columns and Count (groupby+count)

In [5]: df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
   ...: ('bird', 'Psittaciformes', 24.0),
   ...: ('mammal', 'Carnivora', 80.2),
   ...: ('mammal', 'Primates', np.nan),
   ...: ('mammal', 'Carnivora', 58)],
   ...: index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
   ...: columns=('class', 'order', 'max_speed'))

In [6]: df
Out[6]: 
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0

In [7]: df.groupby(['class','order']).count()
Out[7]: 
                       max_speed
class  order                    
bird   Falconiformes           1
       Psittaciformes          1
mammal Carnivora               2
       Primates                0
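
The 0 for Primates appears because count only counts non-missing values, and the monkey's max_speed is NaN. To count rows regardless of missing values, size can be used instead. A sketch on the same data (the names non_missing and row_counts are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [("bird", "Falconiformes", 389.0),
     ("bird", "Psittaciformes", 24.0),
     ("mammal", "Carnivora", 80.2),
     ("mammal", "Primates", np.nan),
     ("mammal", "Carnivora", 58)],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=["class", "order", "max_speed"],
)

grouped = df.groupby(["class", "order"])
non_missing = grouped["max_speed"].count()  # non-NaN values per group
row_counts = grouped.size()                 # rows per group, NaN included

print(non_missing)
print(row_counts)
```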

#Pandas.DataFrame: Group by a Column and Sum (groupby+sum)

In [17]: df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})
In [18]: df2
Out[18]:
   X  Y
0  B  1
1  B  2
2  A  3
3  A  4

In [19]: df2.groupby(['X']).sum()
Out[19]:
   Y
X
A  7
B  3

In [20]: df2.groupby(['X'],sort=False).sum()
Out[20]:
   Y
X
B  3
A  7

#Pandas.DataFrame: Group by a Column and Take Out One Group (groupby+get_group)

In [27]: df2.groupby('X').get_group('A')
Out[27]:
   X  Y
2  A  3
3  A  4

#Count the number of rows in one group
In [28]: len(df2.groupby('X',sort=False).get_group('A'))
Out[28]: 2
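
A groupby object can also report how many groups there are and how large each one is, and it can be iterated group by group. A minimal sketch on the same df2 data (the name sums is made up):

```python
import pandas as pd

df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})
groups = df2.groupby("X")

print(groups.ngroups)  # number of groups
print(groups.size())   # number of rows in each group

# Iterating yields (group key, sub-DataFrame) pairs.
sums = {name: int(part["Y"].sum()) for name, part in groups}
print(sums)
```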

#Pandas.DataFrame: Sorting

Pandas.DataFrame: Sort by Row or Column Labels (sort_index method)

Sorting is ascending by default; the ascending parameter sets the sort order.

In [31]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
In [36]: frame
Out[36]: 
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [37]: frame.sort_index()#Sort by row labels
Out[37]: 
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [38]: frame.sort_index(axis=1)#Sort by column labels
Out[38]: 
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

In [39]: frame.sort_index(axis=1, ascending=False)#Sort by column labels; ascending chooses ascending or descending order
Out[39]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

Pandas.DataFrame: Sort by Column Values (sort_values method, by parameter)

In [41]: frame.sort_values(by='d')#Sort by the values of column d, ascending by default
Out[41]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [42]: frame.sort_values(by='d',ascending=False)#ascending=False sorts in descending order
Out[42]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [43]: frame.sort_values(by=['c','d'],ascending=False)#Sort by column c first, then by column d where the column c values are equal
Out[43]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
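
The example frame has no ties in column c, so the second sort key never comes into play. The sketch below uses a made-up frame with ties and passes a list to ascending to give each column its own direction.

```python
import pandas as pd

scores = pd.DataFrame(
    {"c": [1, 1, 2, 2], "d": [10, 20, 30, 40]},
    index=list("wxyz"),
)

# Column c ascending; within equal c values, column d descending.
ordered = scores.sort_values(by=["c", "d"], ascending=[True, False])
print(ordered)
```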

#Pandas.DataFrame: Descriptive Statistics

Common pandas methods for descriptive and summary statistics

Method          Description
count           Number of non-NA values
describe        Summary statistics for a Series or for each DataFrame column; the result differs for numeric and string data
min, max        Minimum and maximum
argmin, argmax  Integer index positions of the minimum and maximum
idxmin, idxmax  Index labels of the minimum and maximum
quantile        Sample quantile (0 to 1)
sum             Sum of the values
mean            Mean
median          Median
mad             Mean absolute deviation from the mean
var             Variance
std             Standard deviation
skew            Sample skewness (third-order moment)
kurt            Sample kurtosis (fourth-order moment)
cumsum          Cumulative sum
cummin, cummax  Cumulative minimum and cumulative maximum
cumprod         Cumulative product
diff            First-order difference (useful for time series)
pct_change      Percentage change

In [44]: df = DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns=['one','two'])
In [45]: df
Out[45]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

In [46]: df.sum()#Sums each column by default
Out[46]:
one    9.25
two   -5.80
dtype: float64

In [47]: df.sum(axis = 1)#Pass axis=1 to sum each row instead
Out[47]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [48]: df.describe()#Outputs several statistics for each column; NaN values are skipped
Out[48]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000

In [53]: pd.Series(['a', 'a', 'b', 'c']).describe()
Out[53]:
count     4
unique    3#Number of distinct values after deduplication
top       a
freq      2
dtype: object
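
In the row sums above, row c comes out as 0.00 even though both of its values are NaN, because NaN is skipped by default. The skipna and min_count parameters of sum change this, as the following sketch on the same data shows (the names strict_rows and no_skip are made up).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

default_rows = df.sum(axis=1)              # NaN skipped; an all-NaN row sums to 0.0
strict_rows = df.sum(axis=1, min_count=1)  # require at least one valid value, else NaN
no_skip = df.sum(skipna=False)             # any NaN in a column makes its sum NaN

print(default_rows)
print(strict_rows)
print(no_skip)
```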

#Pandas.DataFrame: Get Row and Column Labels

In [75]: df
Out[75]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

In [76]: list(df.columns.values)#Column labels
Out[76]: ['one', 'two']

In [77]: list(df.index.values)#Row labels
Out[77]: ['a', 'b', 'c', 'd']
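
An equivalent and slightly shorter route is the tolist method of the index objects. A small sketch on a made-up two-row frame:

```python
import pandas as pd

small = pd.DataFrame({"one": [1.4, 7.1], "two": [None, -4.5]}, index=["a", "b"])

column_labels = small.columns.tolist()  # column labels as a plain Python list
row_labels = small.index.tolist()       # row labels as a plain Python list
print(column_labels, row_labels, small.shape)
```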

4. References

https://pandas.pydata.org/pandas-docs/stable/index.html

Python for Data Analysis, 2nd Edition

Python Data Science Handbook
