Python beginner's notes: getting started with pandas

Posted by bdlang on Sat, 01 Jan 2022 07:00:25 +0100

These notes are based on Data Analysis with Python and on the Applied Data Science with Python online course from the University of Michigan on Coursera.

Introduction to Pandas

pandas has two main data structures: Series and DataFrame.

1 Series

  1. A Series is a one-dimensional array-like object that holds a sequence of values together with an index; it can be thought of as a NumPy array with labels.
    Another way to think about it is as a fixed-length, ordered dictionary that maps index labels to data values.
import numpy as np
import pandas as pd

obj=pd.Series([4,7,-5,3],index=['d','a','b','c'])
obj2=pd.Series({'d':4,'a':7,'b':-5,'c':3})
obj2
-------------------------------------------
Out[]:
d  4
a  7
b  -5
c  3

#Change the index in place by assigning a new list of labels
obj.index=['0','1','2','3']
obj
--------
out[]:
0  4
1  7
2  -5
3  3

#Index with a list of labels; note the double brackets
obj2[['c','a','b']]
-----------------
Out[]:
c  3
a  7
b  -5

#Filtering with a boolean array, multiplying by a scalar, or applying a math function all preserve the link between index and value
obj2[obj2>0]
obj2*2
np.exp(obj2)

#Check whether a label is in the index
'b' in obj2
-------------
out[]:
True
  1. About missing values
    pandas marks missing values as NaN (NA). The isnull() and notnull() functions return boolean values that show which entries are missing or present.
states=['a','c','b','d','e']
obj3=pd.Series({'d':4,'a':7,'b':-5,'c':3},index=states)
obj3
----------
out[3]:
a  7.0
c  3.0
b  -5.0
d  4.0
e  NaN

#Two ways to test for missing values: the top-level function or the Series method
pd.isnull(obj3)
obj3.isnull()
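
As a small runnable sketch of the missing-value checks above (the variable names s, missing and present are illustrative, not from the course material), the boolean results can be used both to count gaps and to filter them out:

```python
import pandas as pd

# Label 'e' has no entry in the dict, so its value becomes NaN.
sdata = {'d': 4, 'a': 7, 'b': -5, 'c': 3}
s = pd.Series(sdata, index=['a', 'c', 'b', 'd', 'e'])

missing = s.isnull()    # True where the value is missing
present = s.notnull()   # the element-wise opposite of isnull()

print(missing)
print(s[present])           # keep only the labels that have a value
print(int(missing.sum()))   # number of missing values: 1
```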

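Arithmetic between two Series aligns them on index labels automatically; a label that is missing (or NaN) on either side produces NaN in the result.

obj2+obj3
----------------
out[4]:
a  14.0
b  -10.0
c  6.0
d  8.0
e  NaN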
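
A minimal sketch of the same alignment with two Series whose labels only partly overlap (left and right are hypothetical names chosen for this example):

```python
import pandas as pd

left = pd.Series({'d': 4, 'a': 7, 'b': -5, 'c': 3})
right = pd.Series({'a': 1, 'b': 2, 'e': 10})

# Values are matched by label, not by position.
# Labels found in only one operand ('c', 'd', 'e') come out as NaN.
total = left + right
print(total)

# dropna() keeps only the labels that were present on both sides.
overlap = total.dropna()
print(overlap)
```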

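Calling the reindex method on a Series (which rebuilds the index) also introduces missing values for labels that were not present before.

obj3=obj2.reindex(['a','b','c','d','e'])   #'e' is not a label in obj2, so its value is NaN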
  1. Both the Series object itself and its index have a name attribute, so the Series and its index can each be given a name
obj.name='population'
obj.index.name='state'
  1. The values property of a Series returns a one-dimensional array
obj.values
----
out[]:
array([ 4,  7, -5,  3])

2 DataFrame

  1. A DataFrame is a rectangular table of data made up of an ordered collection of columns, each of which can hold a different value type (int, string, bool, and so on)
  2. It has both a row index and a column index; it can be thought of as a dictionary of Series that all share the same index.

2.1 First form of building a DataFrame

Built from a dictionary of equal-length lists or NumPy arrays. It can also be built from a dictionary of dictionaries.

data={'a':['oh','yes','ok'],'b':['wha','tht','wh'],'c':[10,20,30]}
frame=pd.DataFrame(data,columns=['b','c','a'])

# Select the first 5 rows
frame.head()

#Select a column with dictionary-style notation or as an attribute
frame['a']
frame.a
  1. As with Series, you can pass an index parameter; the parameter for the column index is columns. Row and column indexes can also be changed by assignment. (code omitted)
  2. A column can be modified by assignment, with either a scalar value or an array of values.
  3. Naming a column in the columns parameter that is not in the data creates a column of missing values, which can then be assigned. Assigning to a column that does not exist creates a new column directly.
#Assign a scalar
frame['d']=16
frame
out[]:
     b   c    a   d
0  wha  10   oh  16
1  tht  20  yes  16
2   wh  30   ok  16

#Assign an array
frame['d']=np.arange(3.)
  1. A comparison on a column returns a boolean Series, which can be assigned as a new column
frame['e']=(frame['b']=='wha')
  1. Delete a column with del
del frame['e']
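  1. A column selected from a DataFrame can be a view of the underlying data rather than a copy, so in-place modifications to the Series may show up in the DataFrame (this depends on the pandas version and its copy-on-write setting). Use the Series' copy() method to get an independent copy.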
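
A short sketch of the safe pattern, taking an explicit copy of a column before modifying it (the small frame here is rebuilt for the example, and col is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame({'a': ['oh', 'yes', 'ok'], 'c': [10, 20, 30]})

col = frame['c'].copy()   # explicit copy: independent of frame
col.iloc[0] = 999         # modifies only the copy

print(col.tolist())          # [999, 20, 30]
print(frame['c'].tolist())   # [10, 20, 30], the frame is untouched
```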

2.2 Constructing the second form of the DataFrame

A dictionary of dictionaries: the outer keys become the columns, and the inner keys become the row index.
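
A minimal sketch of this form, using made-up keys and values (nested and frame2 are hypothetical names). Inner keys that appear under only one outer key produce NaN in the other column:

```python
import pandas as pd

# Outer keys -> column labels, inner keys -> row labels.
nested = {
    'x': {'r1': 1.0, 'r2': 2.0},
    'y': {'r2': 20.0, 'r3': 30.0},
}
frame2 = pd.DataFrame(nested)

print(frame2)     # rows r1, r2, r3; gaps are filled with NaN
print(frame2.T)   # transposing swaps rows and columns
```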

2.3 Constructing the third form of the DataFrame

A dictionary of Series.

pdata={'a':frame['a'][:-1],'b':frame['b'][:-1]}
  1. The values property of a DataFrame returns the data as a two-dimensional ndarray.

  2. Index objects are immutable, so their elements cannot be changed individually. Any array or other sequence of labels passed when building a Series or DataFrame is converted internally to an Index.

index=obj.index
index[1]='m'   #raises TypeError
  1. Membership tests with in work on a DataFrame's columns or index attribute, for example 'a' in frame.columns.
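
The two points about Index objects can be checked with a short sketch like the following (the frame is rebuilt here so the example is self-contained; mutated, has_col and has_row are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame({'a': ['oh', 'yes', 'ok'], 'c': [10, 20, 30]})
index = frame.index

try:
    index[1] = 5          # Index objects do not support item assignment
    mutated = True
except TypeError:
    mutated = False

has_col = 'a' in frame.columns   # True: 'a' is a column label
has_row = 5 in frame.index       # False: the row labels are 0, 1, 2
print(mutated, has_col, has_row)

# Replacing the whole index is allowed; only element-wise edits are rejected.
frame.index = ['x', 'y', 'z']
```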

3 Basic Functions

3.1 Filling values when reindexing

After reindexing, labels that did not exist before have no value. The optional method parameter fills these gaps while reindexing; for example, ffill fills forward, carrying the last valid value onward. This requires the original index to be sorted (monotonic).

obj3.reindex(range(3),method='ffill') #method takes a string, so note the quotation marks; obj3 is assumed here to have a sorted integer index
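
A self-contained sketch of forward filling, assuming a small Series with a sorted integer index (colors and filled are names made up for this example):

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Labels 1, 3 and 5 do not exist in colors; ffill copies the
# previous valid value forward into each gap.
filled = colors.reindex(range(6), method='ffill')
print(filled)
```

Without method='ffill', the new labels 1, 3 and 5 would hold missing values instead.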

reindex can change row indexes, column indexes, or both.

3.2 The commonly used loc method: label-based indexing

loc selects rows and columns by label. The iloc method works like loc, except that it takes integer positions instead of label names.
Note the brackets.

states=['Texas','Utah','California']
frame.loc[['a','b','c','d'],states]

#loc also accepts label slices; the end label is included
frame.loc[:'c',states]
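
To compare loc and iloc side by side, here is a sketch on a small frame built for the purpose (the values are arbitrary, and the result names are illustrative):

```python
import numpy as np
import pandas as pd

states = ['Texas', 'Utah', 'California']
frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=['a', 'b', 'c', 'd'], columns=states)

by_label = frame.loc[['a', 'c'], ['Utah', 'Texas']]   # lists of labels
by_position = frame.iloc[[0, 2], [1, 0]]              # the same cells by integer position

label_slice = frame.loc[:'c', 'Texas']   # label slice: includes row 'c'
pos_slice = frame.iloc[:2, 0]            # position slice: excludes position 2

print(by_label)
print(label_slice.tolist())   # [0, 3, 6]
print(pos_slice.tolist())     # [0, 3]
```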

3.3 Dropping entries from an axis

The drop method removes entries from an axis. Passing inplace=True modifies the original object directly. The default, inplace=False, leaves the original untouched and returns a new object.

data.drop(['Texas','Utah'],axis='columns',inplace=True) # or axis=1; data is assumed to be a DataFrame with these columns
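
The difference between the two modes can be seen in a small sketch (data is rebuilt here as a DataFrame with arbitrary values; smaller, n_before and result are illustrative names):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(9).reshape(3, 3),
                    index=['a', 'b', 'c'],
                    columns=['Texas', 'Utah', 'California'])

# Default (inplace=False): a new object is returned, data is unchanged.
smaller = data.drop(['Texas', 'Utah'], axis='columns')
n_before = data.shape[1]   # still 3 columns

# inplace=True: data itself is modified and the call returns None.
result = data.drop('a', inplace=True)

print(list(smaller.columns))   # ['California']
print(list(data.index))        # ['b', 'c']
```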

3.4 Indexing, Selection and Filtering

Ordinary Python slices exclude the endpoint, but label-based slices on a Series include both the start and the end label.
A DataFrame differs from a Series here: direct [] indexing on a DataFrame selects columns, so selecting rows by label requires the loc method.
If an axis index contains integers, plain [] indexing is ambiguous between labels and positions, so use loc (labels) or iloc (positions) explicitly.
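
These three rules can be illustrated with a brief sketch (all names and values below are made up for the example):

```python
import pandas as pd

ser = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

inclusive = ser.loc['b':'c']   # label slice: both 'b' and 'c' are returned
exclusive = ser.iloc[1:2]      # position slice: the end position is left out

# With an integer index, labels and positions can disagree,
# so being explicit avoids surprises.
int_ser = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])
by_label = int_ser.loc[0]      # the value whose label is 0
by_position = int_ser.iloc[0]  # the first value in the Series

# On a DataFrame, [] picks a column and loc picks a row.
frame = pd.DataFrame({'p': [1, 2], 'q': [3, 4]}, index=['r1', 'r2'])
column_p = frame['p']
row_r1 = frame.loc['r1']

print(inclusive.tolist(), exclusive.tolist())   # [1, 2] [1]
print(by_label, by_position)                    # y x
```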

3.5 Arithmetic data alignment

When the indexes differ, the result's index is the union of the two indexes, and the positions that do not overlap are filled with NaN.
To avoid those missing values, use the arithmetic methods with fill_value:

df1.add(df2,fill_value=0) #Where a label exists in only one of df1 and df2, treat the missing side as 0

df1.reindex(columns=df2.columns,fill_value=0)
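
A runnable sketch of both calls, assuming two small frames that share rows but differ in columns (the shapes and values of df1 and df2 are invented for this example):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.).reshape(2, 2),
                   index=['r1', 'r2'], columns=['a', 'b'])
df2 = pd.DataFrame(np.arange(6.).reshape(2, 3),
                   index=['r1', 'r2'], columns=['a', 'b', 'c'])

plain = df1 + df2                      # column 'c' exists only in df2 -> NaN
filled = df1.add(df2, fill_value=0)    # df1's missing 'c' is treated as 0
widened = df1.reindex(columns=df2.columns, fill_value=0)  # df1 with a zero-filled 'c'

print(plain)
print(filled)
print(widened)
```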

3.6 Operations between DataFrame and Series: broadcasting

arr-arr[0]  #arr[0] is subtracted from every row of arr; the operation is broadcast down the rows

frame.sub(series1,axis=0)  # or axis='index'; matches on the row index and broadcasts across the columns

If an index value is not found in both the DataFrame's columns and the Series' index, the objects are reindexed to form the union, and the non-overlapping entries are filled with missing values.
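
A small sketch of both broadcasting directions, using a frame with arbitrary values (first_row, col_b and the result names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.).reshape(3, 2),
                     index=['r1', 'r2', 'r3'], columns=['b', 'd'])

first_row = frame.iloc[0]   # Series indexed by the column labels
col_b = frame['b']          # Series indexed by the row labels

# Default: match the Series index on the columns, broadcast down the rows.
by_rows = frame - first_row

# axis='index': match on the row index, broadcast across the columns.
by_cols = frame.sub(col_b, axis='index')

print(by_rows)
print(by_cols)
```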

3.7 Function Application

NumPy's universal functions (element-wise array methods) also work on pandas objects.
Another common operation is to apply a function that works on one-dimensional arrays to each column or row, using the DataFrame's apply method.

frame=pd.DataFrame(np.random.randn(4,3),index=['U','O','T','Or'],columns=['b','d','e'])
f=lambda x:x.max()-x.min()
frame.apply(f)
-------------------------------
Out[]:
b    1.920418
d    1.788882
e    1.255265
dtype: float64

applymap applies an element-wise function to every value in a DataFrame (pandas 2.1 and later deprecate applymap in favor of DataFrame.map); the corresponding Series method is map.

f=lambda x:'%.2f'%x
frame['e'].map(f)
---------------
out[]:
U      0.47
O     -0.41
T     -0.28
Or     0.72
Name: e, dtype: object
-----------------------------
frame.applymap(f)
----------
out[]:
	b	d	e
U	1.53	0.85	0.47
O	-0.34	-1.10	-0.41
T	-0.31	-1.91	-0.28
Or	0.29	0.99	0.72
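
As a sketch with fixed numbers instead of random ones, the following shows apply along each axis and element-wise formatting. Column-wise map is used for the whole frame here as one approach that does not depend on applymap being available; the names spread and fmt are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.5, -0.25], 'd': [0.5, 2.0]}, index=['U', 'O'])

spread = lambda x: x.max() - x.min()
per_column = frame.apply(spread)                 # one result per column
per_row = frame.apply(spread, axis='columns')    # one result per row

fmt = lambda v: '%.2f' % v
formatted_col = frame['e'].map(fmt) if 'e' in frame else frame['b'].map(fmt)
formatted_all = frame.apply(lambda col: col.map(fmt))   # every cell, column by column

print(per_column.tolist())      # [1.75, 1.5]
print(per_row.tolist())         # [1.0, 2.25]
print(formatted_col.tolist())   # ['1.50', '-0.25']
```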

3.8 Sorting and Ranking

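Use sort_index to sort by row or column index; it returns a new, sorted object. To sort by values, use sort_values; missing values are sorted to the end by default. For a DataFrame, the by parameter names the column or columns to sort on, much like ORDER BY in SQL.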

frame.sort_values(by='b')
--------
out[]:
	b	d	e
O	-0.344316	-1.104560	-0.413454
T	-0.313397	-1.911685	-0.279829
Or	0.294956	0.992794	0.717557
U	1.525719	0.848279	0.471139
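
A short sketch covering the cases mentioned above: sorting by index, sorting values with missing data, and sorting a DataFrame on two columns (all data and names here are made up):

```python
import numpy as np
import pandas as pd

ser = pd.Series([4, np.nan, 7, np.nan, -3, 2], index=list('fbdace'))
by_index = ser.sort_index()    # labels in order a..f
by_value = ser.sort_values()   # -3, 2, 4, 7, then the two NaN values last

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
two_keys = frame.sort_values(by=['a', 'b'])   # sort on 'a', break ties with 'b'
cols_sorted = frame.sort_index(axis=1)        # sort the column labels

print(by_value)
print(two_keys)
```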

Ranking assigns ranks from 1 up to the number of valid data points in an array, using the rank() method of Series and DataFrame. By default, tied values each receive the mean rank of their group.

In [10]:obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
In [11]:obj.rank()
Out [11]:
0    6.5	# index 0 and index 2 both hold 7 and would take ranks 6 and 7; each gets the mean (6+7)/2 = 6.5
1    1.0
2    6.5	# tied with index 0, so it also gets (6+7)/2 = 6.5
3    4.5	# index 3 and index 6 both hold 4 and would take ranks 4 and 5; each gets the mean (4+5)/2 = 4.5
4    3.0
5    2.0
6    4.5	# tied with index 3, so it also gets (4+5)/2 = 4.5
dtype: float64

In [12]:obj.rank(method='min')
Out [12]:
0    6.0	# index 0 and index 2 both hold 7 and would take ranks 6 and 7; each gets the smallest rank, 6
1    1.0
2    6.0	# tied with index 0, so it also gets the smallest rank, 6
3    4.0	# index 3 and index 6 both hold 4 and would take ranks 4 and 5; each gets the smallest rank, 4
4    3.0
5    2.0
6    4.0	# tied with index 3, so it also gets the smallest rank, 4
dtype: float64

In [14]:obj.rank(method='first')
Out [14]:
0    6.0	# index 0 and index 2 both hold 7; index 0 appears first in the data, so it gets rank 6
1    1.0
2    7.0	# index 2 appears after index 0, so it gets rank 7
3    4.0	# index 3 and index 6 both hold 4; index 3 appears first in the data, so it gets rank 4
4    3.0
5    2.0
6    5.0	# index 6 appears after index 3, so it gets rank 5
dtype: float64
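
For comparison, a sketch of three further options on the same data that are not covered in the notes above: method='max', method='dense', and ascending=False.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

highest = obj.rank(method='max')        # ties take the largest rank in the group
dense = obj.rank(method='dense')        # like 'min', but ranks increase by 1 between groups
descending = obj.rank(ascending=False)  # largest value gets rank 1; ties still averaged

print(highest.tolist())      # [7.0, 1.0, 7.0, 5.0, 3.0, 2.0, 5.0]
print(dense.tolist())        # [5.0, 1.0, 5.0, 4.0, 3.0, 2.0, 4.0]
print(descending.tolist())   # [1.5, 7.0, 1.5, 3.5, 5.0, 6.0, 3.5]
```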

Topics: Python Data Analysis pandas