Notes based on Data Analysis with Python and the University of Michigan's Applied Data Science with Python course on Coursera.
Introduction to Pandas
pandas has two core data structures: Series and DataFrame.
1 Series
- A Series is a one-dimensional array-like object containing a sequence of values and an associated index; the values behave much like a NumPy array.
Another way to think about it is as a fixed-length, ordered dictionary that maps index labels to data values.
import numpy as np
import pandas as pd

obj=pd.Series([4,7,-5,3],index=['d','a','b','c'])
obj2=pd.Series({'d':4,'a':7,'b':-5,'c':3})
obj2
-------------------------------------------
Out[]:
d    4
a    7
b   -5
c    3
# Change the index by assigning a new sequence of the same length
obj.index=['0','1','2','3']
obj
--------
Out[]:
0    4
1    7
2   -5
3    3
# Index with a list of labels; note the double brackets
obj2[['c','a','b']]
-----------------
Out[]:
c    3
a    7
b   -5
# Filtering with a boolean array, multiplying by a scalar, and applying math functions all preserve the index-value link
obj2[obj2>0]
obj2*2
np.exp(obj2)
# Test whether the label 'b' is in obj2
'b' in obj2
-------------
Out[]: True
- About missing values
pandas marks missing values as NaN (NA). The isnull() and notnull() functions return boolean values that show which entries are missing.
states=['a','c','b','d','e']
obj3=pd.Series({'d':4,'a':7,'b':-5,'c':3},index=states)
obj3
----------
Out[3]:
a    7.0
c    3.0
b   -5.0
d    4.0
e    NaN
# Test for missing values (two equivalent ways)
pd.isnull(obj3)
obj3.isnull()
Arithmetic between Series aligns on index labels automatically, which also produces NaN for any label missing from either side.
obj2+obj3
----------------
Out[4]:
a    14.0
b   -10.0
c     6.0
d     8.0
e     NaN
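Calling reindex on a Series (which builds a new object that conforms to the new index) also introduces missing values for labels that were not present before.
obj3=obj2.reindex(['a','b','c','d','e']) # 'e' is not in obj2, so its value is NaN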
- Both the Series object and its index have a name attribute, which labels the Series as a whole and its index respectively.
obj.name='population'
obj.index.name='state'
- The values attribute of a Series returns a one-dimensional array.
obj.values
----
Out[]: array([ 4,  7, -5,  3])
2 DataFrame
- A DataFrame is a rectangular table of data containing an ordered collection of columns, each of which can hold a different value type (int, string, bool, and so on).
- It has both a row index and a column index, and it can be thought of as a dictionary of Series that all share the same row index.
2.1 First form of building a DataFrame
Build it from a dictionary of equal-length lists or NumPy arrays. A dictionary of dictionaries also works.
data={'a':['oh','yes','ok'],'b':['wha','tht','wh'],'c':[10,20,30]}
frame=pd.DataFrame(data,columns=['b','c','a'])
# Show the first 5 rows (here there are only 3)
frame.head()
# Select a column with dictionary-style notation or as an attribute
frame['a']
frame.a
- As with a Series, you can pass an index argument to set the row labels; the columns argument sets the column labels and their order. Row and column indexes can also be replaced by assigning a new sequence of the same length. (code omitted)
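As a minimal sketch of the omitted code: the example below passes index and columns to the constructor and then replaces both sets of labels by assignment. The names frame2 and the labels 'x', 'y', 'z', 'r1', 'r2', 'r3' are made up for illustration.

```python
import pandas as pd

data = {'a': ['oh', 'yes', 'ok'], 'b': ['wha', 'tht', 'wh'], 'c': [10, 20, 30]}

# index sets the row labels; columns sets the column order
frame2 = pd.DataFrame(data, columns=['b', 'c', 'a'], index=['x', 'y', 'z'])

# Replace the row and column labels by assignment
# (the new sequence must have the same length as the old one)
frame2.index = ['r1', 'r2', 'r3']
frame2.columns = ['B', 'C', 'A']
print(frame2)
```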
- A column can be modified by assigning it a scalar value or an array of values.
- Naming a column in the columns argument that is not in the data creates it filled with NaN, and it can then be assigned values. Assigning to a column that does not exist creates it directly.
# Assign a scalar
frame['d']=16
frame
Out[]:
     b   c    a   d
0  wha  10   oh  16
1  tht  20  yes  16
2   wh  30   ok  16
# Assign an array (its length must match the number of rows)
frame['d']=np.arange(3.)
- A comparison on a column returns a boolean Series, which can be stored as a new column.
frame['e']=(frame['b']=='wha')
- Delete a column from the frame with del.
del frame['e']
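- A column selected from a DataFrame has traditionally been a view on the underlying data, not a copy, so in-place modifications to the Series could show up in the DataFrame. Do not rely on this: with Copy-on-Write enabled (the default from pandas 3.0) the selection behaves like a copy. To get an independent copy in any version, call the Series' copy() method.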
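A small sketch of the safe pattern, assuming a throwaway frame with made-up values: taking an explicit copy of a column guarantees that later changes to the Series never reach the DataFrame, whatever the pandas version.

```python
import pandas as pd

frame = pd.DataFrame({'a': ['oh', 'yes', 'ok'], 'c': [10, 20, 30]})

# Take an explicit copy of the column so that later changes
# cannot affect the original frame
col = frame['c'].copy()
col.iloc[0] = 999

print(col.tolist())         # the copy was modified
print(frame['c'].tolist())  # the frame is unchanged
```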
2.2 Second form of building a DataFrame
A dictionary of dictionaries: the outer keys become the columns, and the inner keys become the row index.
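For illustration, a minimal sketch with made-up data (the names nested and frame3 are hypothetical). Inner keys that are missing from one of the inner dictionaries produce NaN in that column.

```python
import pandas as pd

# Outer keys become columns, inner keys become the row index
nested = {'x': {2001: 2.4, 2002: 2.9},
          'y': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(nested)
print(frame3)

# Row 2000 exists only in the 'y' dictionary, so column 'x' is NaN there
print(frame3.loc[2000])
```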
2.3 Third form of building a DataFrame
A dictionary of Series.
pdata={'a':frame['a'][:-1],'b':frame['b'][:-1]}
pd.DataFrame(pdata)
- The values attribute of a DataFrame returns the data as a two-dimensional ndarray.
- Index objects are immutable, so individual labels cannot be changed in place; any array or other sequence of labels passed in is converted internally to an Index object.
index=frame.index
index[1]='m' # raises TypeError
- Membership tests with in work on the columns or index attribute of a DataFrame, for example 'a' in frame.columns.
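The three points above can be checked with a short sketch; the frame and its labels are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

# values returns a two-dimensional NumPy array
print(frame.values.shape)

# Index objects do not support item assignment
index = frame.index
try:
    index[1] = 'm'
except TypeError as err:
    print('TypeError:', err)

# Membership tests against the column and row labels
print('a' in frame.columns)
print('z' in frame.index)
```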
3 Basic Functions
3.1 Interpolation when reindexing
After reindexing, some of the new index labels may not have existed before, and their values can be filled in by interpolation. The optional method parameter controls the filling; method='ffill' is a forward fill, which carries the last valid value forward.
obj3.reindex(range(3),method='ffill') # note the quotes around 'ffill'; assumes obj3 has a sorted integer index
reindex can change row indexes, column indexes, or both.
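A self-contained sketch of both behaviors, using made-up data: forward filling on a Series with a sorted integer index, and reindexing rows and columns of a DataFrame in one call.

```python
import pandas as pd

# Forward fill: new labels take the value of the previous existing label
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
filled = colors.reindex(range(6), method='ffill')
print(filled)

# Reindex rows and columns together; the new row 'z' is filled with NaN
frame = pd.DataFrame({'b': [1, 2], 'c': [3, 4]}, index=['x', 'y'])
both = frame.reindex(index=['x', 'y', 'z'], columns=['c', 'b'])
print(both)
```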
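3.2 The more commonly used loc method: label-based indexing
loc selects rows and columns by label. The iloc method works like loc except that it takes integer positions instead of labels.
Notice the brackets:
states=['Texas','Utah','California']
frame.loc[['a','b','c','d'],states] # every label listed must exist in the frame, otherwise loc raises KeyError
# loc also accepts slices; a label slice includes its endpoint and is written without the extra brackets
frame.loc[:'c',states]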
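A runnable sketch of the same selections, assuming a small frame whose row labels and state columns are made up to match the snippet above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Utah', 'California'])

# Select by label: a list of row labels and a list of column labels
by_label = frame.loc[['a', 'c'], ['Utah', 'Texas']]

# Label slices include the endpoint, so row 'c' is part of the result
by_slice = frame.loc[:'c', 'Texas']

# The same selection as by_label, but by integer position
by_pos = frame.iloc[[0, 2], [1, 0]]

print(by_label)
print(by_slice)
print(by_pos)
```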
3.3 Dropping entries from an axis
The drop method removes entries from an axis. With inplace=True it modifies the original object directly; with the default inplace=False it leaves the original unchanged and returns a new object.
data.drop(['Texas','Utah'],axis='columns',inplace=True) # or axis=1
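A short sketch contrasting the two modes, with a made-up frame. Note that with inplace=True the call returns None, so its result should not be assigned.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['a', 'b'],
                     columns=['Texas', 'Utah', 'California'])

# Default (inplace=False): a new object is returned, frame is untouched
smaller = frame.drop(['Texas', 'Utah'], axis='columns')
print(list(smaller.columns))
print(list(frame.columns))

# inplace=True: frame itself loses row 'a'
frame.drop('a', inplace=True)
print(list(frame.index))
```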
3.4 Indexing, Selection and Filtering
Ordinary Python slices exclude the endpoint, but label-based slices on a Series include both the start and the end label.
A DataFrame differs from a Series here: direct indexing with a label (frame['a']) selects a column, so selecting rows by label requires loc.
If an axis index contains integers, direct indexing is ambiguous between labels and positions; use loc (labels) or iloc (positions) to be explicit.
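The difference between the slicing rules, and between loc and iloc on an integer index, can be seen in a small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# A plain Python slice excludes the endpoint
numbers = list(range(4))
print(numbers[1:3])

# A label slice on a Series includes the end label 'c'
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj['b':'c'])

# With an integer index, loc looks up the label and iloc the position
ser = pd.Series(np.arange(3.), index=[2, 0, 1])
print(ser.loc[0])   # the value stored under label 0
print(ser.iloc[0])  # the first value in the Series
```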
3.5 Arithmetic data alignment
When two objects have different indexes, the index of the result is the union of the two, and positions where the labels do not overlap are filled with NaN.
To handle these missing values, pass fill_value:
# Add df1 and df2, treating a value that is missing from one of the frames as 0
df1.add(df2,fill_value=0)
# fill_value can also be used when reindexing
df1.reindex(columns=df2.columns,fill_value=0)
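A minimal sketch with two made-up frames that overlap in only one cell. It also shows a detail worth knowing: fill_value substitutes for a value missing from one frame, but a cell missing from both frames stays NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1., 2.], 'b': [3., 4.]}, index=['x', 'y'])
df2 = pd.DataFrame({'b': [10., 20.], 'c': [30., 40.]}, index=['y', 'z'])

# Plain addition: only the cell present in both frames (row 'y', column 'b') gets a value
plain = df1 + df2
print(plain)

# With fill_value=0, a value missing from one frame is treated as 0
filled = df1.add(df2, fill_value=0)
print(filled)
```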
3.6 Operations between DataFrame and Series: broadcasting
arr-arr[0] # subtracting the first row of arr is broadcast to every row
frame.sub(series1,axis=0) # or axis='index'; matches on the row index and broadcasts across the columns
If an index value is not found in both the DataFrame's columns and the Series' index, the objects are reindexed to form the union, and the non-overlapping entries are filled with missing values.
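A sketch of the three cases with a made-up frame: the default row-wise broadcast, broadcasting over columns with axis='index', and the union that results when labels do not match.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     index=['U', 'O', 'T', 'Or'],
                     columns=['b', 'd', 'e'])

# Default: the Series index is matched against the columns
# and the operation is broadcast down the rows
first_row = frame.iloc[0]
by_row = frame - first_row

# axis='index': the Series index is matched against the row index
# and the operation is broadcast across the columns
col_d = frame['d']
by_col = frame.sub(col_d, axis='index')

# Labels found on only one side produce NaN columns in the union
other = pd.Series([1., 2.], index=['b', 'f'])
unioned = frame + other

print(by_row)
print(by_col)
print(unioned)
```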
3.7 Function Application
NumPy's universal functions (ufuncs, the element-wise array methods) also work on pandas objects.
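As a quick illustration with made-up values, np.abs applied to a DataFrame works element by element and returns a DataFrame with the same labels:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[-1.5, 2.0], [3.0, -4.5]],
                     index=['U', 'O'], columns=['b', 'd'])

# The ufunc is applied to every element; index and columns are preserved
absolute = np.abs(frame)
print(absolute)
```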
Another common operation is to apply a function that works on one-dimensional arrays to each column or row, using the DataFrame's apply method.
frame=pd.DataFrame(np.random.randn(4,3),index=['U','O','T','Or'],columns=['b','d','e'])
f=lambda x:x.max()-x.min()
frame.apply(f) # called once per column by default
-------------------------------
Out[]: # output from one run; the random values differ every time
b    1.920418
d    1.788882
e    1.255265
dtype: float64
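To make the row-versus-column distinction concrete, here is a sketch with fixed, made-up values in place of random ones. Passing axis='columns' calls the function once per row instead of once per column.

```python
import pandas as pd

frame = pd.DataFrame([[1., 5., 3.], [4., 2., 8.]],
                     index=['U', 'O'], columns=['b', 'd', 'e'])

def spread(x):
    # Difference between the largest and smallest value of a 1-D Series
    return x.max() - x.min()

per_column = frame.apply(spread)                 # one result per column
per_row = frame.apply(spread, axis='columns')    # one result per row

print(per_column)
print(per_row)
```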
applymap applies an element-wise function to a DataFrame; the equivalent method on a Series is map. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)
f=lambda x:'%.2f'%x
frame['e'].map(f)
---------------
Out[]:
U      0.47
O     -0.41
T     -0.28
Or     0.72
Name: e, dtype: object
-----------------------------
frame.applymap(f)
----------
Out[]:
        b      d      e
U    1.53   0.85   0.47
O   -0.34  -1.10  -0.41
T   -0.31  -1.91  -0.28
Or   0.29   0.99   0.72
3.8 Sorting and Ranking
sort_index sorts by row or column labels and returns a new, sorted object. To sort by values, use sort_values; missing values are sorted to the end by default. The by parameter of sort_values names the column or columns to sort on, much like ORDER BY in SQL.
frame.sort_values(by='b')
--------
Out[]:
           b         d         e
O  -0.344316 -1.104560 -0.413454
T  -0.313397 -1.911685 -0.279829
Or  0.294956  0.992794  0.717557
U   1.525719  0.848279  0.471139
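A sketch of the remaining cases with made-up data: missing values going to the end, sorting by labels, and sorting on several columns at once.

```python
import numpy as np
import pandas as pd

# Missing values are placed at the end by default
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
sorted_obj = obj.sort_values()
print(sorted_obj)

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]},
                     index=['d', 'a', 'c', 'b'])

# Sort by row labels
by_label = frame.sort_index()

# Sort by column 'a' first, then by column 'b' within equal values of 'a'
by_values = frame.sort_values(by=['a', 'b'])

print(by_label)
print(by_values)
```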
Ranking assigns ranks from 1 to the number of valid data points, using the rank() method of Series and DataFrame. By default, tied values each receive the mean of the ranks they would otherwise occupy.
In [10]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
In [11]: obj.rank()
Out[11]:
0    6.5  # index 0 and index 2 are both 7; they would take ranks 6 and 7, so each gets (6+7)/2 = 6.5
1    1.0
2    6.5
3    4.5  # index 3 and index 6 are both 4; they would take ranks 4 and 5, so each gets (4+5)/2 = 4.5
4    3.0
5    2.0
6    4.5
dtype: float64
In [12]: obj.rank(method='min')
Out[12]:
0    6.0  # index 0 and index 2 are both 7; they would take ranks 6 and 7, so each gets the smallest rank, 6
1    1.0
2    6.0
3    4.0  # index 3 and index 6 are both 4; they would take ranks 4 and 5, so each gets the smallest rank, 4
4    3.0
5    2.0
6    4.0
dtype: float64
In [14]: obj.rank(method='first')
Out[14]:
0    6.0  # index 0 and index 2 are both 7; index 0 appears first in the data, so it gets rank 6
1    1.0
2    7.0  # index 2 appears after index 0, so it gets rank 7
3    4.0  # index 3 and index 6 are both 4; index 3 appears first in the data, so it gets rank 4
4    3.0
5    2.0
6    5.0  # index 6 appears after index 3, so it gets rank 5
dtype: float64
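Beyond the three methods shown, rank() accepts a few other options. The sketch below reuses the same Series to show descending ranks with method='max' and the 'dense' method, then ranks a small made-up DataFrame across its columns.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Rank in descending order; ties take the largest rank in their group
descending = obj.rank(ascending=False, method='max')

# 'dense' is like 'min', but ranks always increase by 1 between groups
dense = obj.rank(method='dense')

# A DataFrame can be ranked down its rows (default) or across its columns
frame = pd.DataFrame({'b': [4.0, 7.0], 'a': [0.0, 9.0]})
across = frame.rank(axis='columns')

print(descending)
print(dense)
print(across)
```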