Fundamentals of Python data analysis 02 -- pandas operations

Posted by dkjohnson on Mon, 07 Feb 2022 21:40:28 +0100

pandas advanced operations

import pandas as pd
import numpy as np
from pandas import DataFrame

Replace operation

  • The replace operation works on both Series and DataFrame

  • Single-value replacement

    • Plain replacement: replace every matching element: to_replace=15, value='e'
    • Column-specific replacement: to_replace={column label: value to replace}, value='replacement value'
  • Multi-value replacement

    • List replacement: to_replace=[], value=[]
    • Dictionary replacement (recommended): to_replace={old1: new1, old2: new2}
df = DataFrame(data=np.random.randint(0,20,size=(8,5)))
df
    0   1   2   3   4
0  17   0  17  11   6
1   8   3   7   7   3
2   9  13   4  19   3
3  12   3   0  10  16
4   8   5   5   4  14
5  14  18   2   1   3
6  14   6   8  15   5
7   2   7   7  18  11
#Global replacement
df.replace(to_replace=1,value='one')
    0   1   2    3   4
0  17   0  17   11   6
1   8   3   7    7   3
2   9  13   4   19   3
3  12   3   0   10  16
4   8   5   5    4  14
5  14  18   2  one   3
6  14   6   8   15   5
7   2   7   7   18  11
#Multi-value replacement: replace several values at once via a dictionary
df.replace(to_replace={1:'one',3:'three'})
    0      1   2    3      4
0  17      0  17   11      6
1   8  three   7    7  three
2   9     13   4   19  three
3  12  three   0   10     16
4   8      5   5    4     14
5  14     18   2  one  three
6  14      6   8   15      5
7   2      7   7   18     11
#Column-specific replacement: the dictionary key names the column index, and the dictionary value is the element to replace (the frame below comes from a re-run of the random cell, so its values differ from the ones above)
df.replace(to_replace={3:17},value='one')
    0   1   2    3   4
0  13  10  14    2  17
1  17   2   4  one   9
2   2   3  10  one  13
3  17  15  19   13  15
4  10   5  17    7   4
5   3   2   0    0  13
6   6   8   0   16   1
7  15  13  10    6  18
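Note that replace returns a new object by default and leaves the original frame untouched; assign the result (or pass inplace=True) to keep it. A minimal sketch:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame(data=np.zeros((2, 2), dtype=int))
result = df.replace(to_replace=0, value='zero')

print((df == 0).all().all())           # True: the original still holds zeros
print((result == 'zero').all().all())  # True: only the returned copy changed
```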

Mapping operation

  • Concept: build a mapping table that binds values to specific labels or strings (i.e., provides an alternative representation for each value)

  • Create a df with the two columns name and salary, then attach the corresponding English name for each name

dic = {
    'name':['Zhang San','Li Si','Wang Wu','Zhang San'],
    'salary':[5000,6000,5500,5000]
}
df = DataFrame(data=dic)
df
        name  salary
0  Zhang San    5000
1      Li Si    6000
2    Wang Wu    5500
3  Zhang San    5000
#Mapping relation table
dic = {
    'Li Si':'Jerry',
    'Wang Wu':'Jack',
    'Zhang San':'Tom'
}
e_name = df['name'].map(dic)
df['e_name'] = e_name
df
        name  salary e_name
0  Zhang San    5000    Tom
1      Li Si    6000  Jerry
2    Wang Wu    5500   Jack
3  Zhang San    5000    Tom
  • map is a Series method; it can only be called on a Series
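One behavior worth knowing (a small sketch, not from the original text): keys missing from the mapping dict become NaN rather than raising an error:

```python
import pandas as pd

s = pd.Series(['Zhang San', 'Li Si', 'Wang Wu'])
# 'Wang Wu' is deliberately left out of the mapping
mapped = s.map({'Zhang San': 'Tom', 'Li Si': 'Jerry'})
print(mapped.isna().tolist())  # [False, False, True]
```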

Computing tools

  • Tax the portion of each salary above 3000 at 50% and compute everyone's after-tax salary
def after_sal(s):
    return s - (s-3000)*0.5
after_salary = df['salary'].map(after_sal)
df['after_salary'] = after_salary
df
        name  salary e_name  after_salary
0  Zhang San    5000    Tom        4000.0
1      Li Si    6000  Jerry        4500.0
2    Wang Wu    5500   Jack        4250.0
3  Zhang San    5000    Tom        4000.0
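Note that after_sal as written assumes every salary exceeds 3000; for a salary below that threshold it would actually add money. A guarded variant (my own sketch, not from the original):

```python
def after_sal_piecewise(s):
    # Only the portion above 3000 is taxed at 50%; lower salaries are untouched
    if s <= 3000:
        return float(s)
    return s - (s - 3000) * 0.5

print(after_sal_piecewise(5000))  # 4000.0
print(after_sal_piecewise(2500))  # 2500.0
```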
  • The apply method of Series can also act as a computing tool, just like map
    • As a computing tool, apply is typically faster than map
def after_sal(s):
    return s - (s-3000)*0.5
after_salary = df['salary'].apply(after_sal)
after_salary
0    4000.0
1    4500.0
2    4250.0
3    4000.0
Name: salary, dtype: float64
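To check the speed claim yourself, time both on a larger Series; the exact ratio depends on the pandas version, but the results are identical (a quick benchmark sketch):

```python
import time
import numpy as np
import pandas as pd

salaries = pd.Series(np.random.randint(3000, 10000, size=100_000))

def after_sal(s):
    return s - (s - 3000) * 0.5

t0 = time.perf_counter()
via_map = salaries.map(after_sal)
map_time = time.perf_counter() - t0

t0 = time.perf_counter()
via_apply = salaries.apply(after_sal)
apply_time = time.perf_counter() - t0

print(via_map.equals(via_apply))  # True: both compute the same result
print(f'map: {map_time:.4f}s, apply: {apply_time:.4f}s')
```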

Random sampling via shuffling

  • take()
  • np.random.permutation(n): returns a random permutation of the integers 0 through n-1
df = DataFrame(data=np.random.random(size=(100,3)),columns=['A','B','C'])
df.head()
          A         B         C
0  0.443835  0.973331  0.626923
1  0.366619  0.225035  0.719683
2  0.431639  0.154259  0.051419
3  0.406994  0.065048  0.145229
4  0.965849  0.385788  0.208316
#Shuffle the rows
df.take(np.random.permutation(100),axis=0)
           A         B         C
88  0.882121  0.368332  0.158629
96  0.319008  0.251546  0.009901
28  0.696431  0.178469  0.125718
23  0.268010  0.639662  0.137387
22  0.599899  0.089535  0.930574
..       ...       ...       ...
50  0.468952  0.562839  0.671872
79  0.237526  0.228043  0.699325
87  0.932147  0.394211  0.540368
90  0.980401  0.823402  0.624730
92  0.699101  0.108088  0.542962

100 rows × 3 columns

#Shuffle both rows and columns
df.take(np.random.permutation(100),axis=0).take(indices=np.random.permutation(3),axis=1)
           B         A         C
46  0.274198  0.689843  0.119163
6   0.493573  0.854749  0.134710
55  0.572823  0.159019  0.124300
65  0.982499  0.822922  0.788931
5   0.578393  0.904401  0.866710
..       ...       ...       ...
30  0.734058  0.948066  0.029316
59  0.103037  0.748319  0.025966
48  0.988017  0.194801  0.162126
80  0.594385  0.471618  0.945606
70  0.750110  0.592528  0.470297

100 rows × 3 columns

#Shuffle rows and columns, then slice the first five rows as a random sample
df.take(np.random.permutation(100),axis=0).take(indices=np.random.permutation(3),axis=1)[0:5]
           A         B         C
38  0.605719  0.843236  0.339796
32  0.779512  0.944551  0.109526
70  0.243308  0.799665  0.374344
37  0.929103  0.907324  0.259208
21  0.963018  0.868906  0.874337
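pandas also ships a built-in shortcut for this pattern: DataFrame.sample draws rows without replacement, and frac=1 shuffles the whole frame (a brief sketch, shown here as an alternative to take):

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame(data=np.random.random(size=(100, 3)), columns=['A', 'B', 'C'])

sampled = df.sample(n=5)      # 5 random rows without replacement
shuffled = df.sample(frac=1)  # the whole frame in random row order

print(sampled.shape)   # (5, 3)
print(shuffled.shape)  # (100, 3)
```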

Grouped processing of data

  • The core of grouped data processing:

    • the groupby() function
    • the groups attribute, which shows the resulting groups
df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                'price':[4,3,3,2.5,4,2],
               'color':['red','yellow','yellow','green','green','green'],
               'weight':[12,20,50,30,20,44]})
df
     item  price   color  weight
0   Apple    4.0     red      12
1  Banana    3.0  yellow      20
2  Orange    3.0  yellow      50
3  Banana    2.5   green      30
4  Orange    4.0   green      20
5   Apple    2.0   green      44
df.groupby(by='item')#Returns a DataFrameGroupBy object that holds the grouping result
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fea8c556460>
#The groups attribute shows the group membership
df.groupby(by='item').groups
{'Apple': [0, 5], 'Banana': [1, 3], 'Orange': [2, 4]}
#Calculate the average price of each fruit
df.groupby(by='item').mean()
df.groupby(by='item').mean()['price'] #Not recommended: aggregates every column first, wasting computation
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
df.groupby(by='item')['price'].mean()#Recommended way
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
#Multiple grouping conditions
df.groupby(by=['item','color'])
df.groupby(by=['color','item']).sum()
               price  weight
color  item
green  Apple     2.0      44
       Banana    2.5      30
       Orange    4.0      20
red    Apple     4.0      12
yellow Banana    3.0      20
       Orange    3.0      50
#Demand: calculate the average price of each fruit, and then summarize the average price into the original table
mean_price_s = df.groupby(by='item')['price'].mean()
mean_price_s
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
dic = mean_price_s.to_dict() #Mapping relation table
df['mean_price'] = df['item'].map(dic)
df
     item  price   color  weight  mean_price
0   Apple    4.0     red      12        3.00
1  Banana    3.0  yellow      20        2.75
2  Orange    3.0  yellow      50        3.50
3  Banana    2.5   green      30        2.75
4  Orange    4.0   green      20        3.50
5   Apple    2.0   green      44        3.00
#Average weight of fruit of each color
mean_weight_s = df.groupby(by='color')['weight'].mean()
dic = mean_weight_s.to_dict()
mean_weight = df['color'].map(dic)
df['mean_weight'] = mean_weight
df
     item  price   color  weight  mean_price  mean_weight
0   Apple    4.0     red      12        3.00    12.000000
1  Banana    3.0  yellow      20        2.75    35.000000
2  Orange    3.0  yellow      50        3.50    35.000000
3  Banana    2.5   green      30        2.75    31.333333
4  Orange    4.0   green      20        3.50    31.333333
5   Apple    2.0   green      44        3.00    31.333333

Advanced data aggregation

  • After grouping with groupby, the transform and apply methods accept custom functions for richer operations
  • df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
  • Both transform and apply perform the computation; simply pass the function to them
  • transform and apply also accept a lambda expression
def my_mean(p):
    # Manual mean, written so as not to shadow the built-in sum
    total = 0
    for i in p:
        total += i
    return total / len(p)
# transform broadcasts each group's mean back to the original rows, so no mapping is needed
df.groupby(by='item')['price'].transform(my_mean)
0    3.00
1    2.75
2    3.50
3    2.75
4    3.50
5    3.00
Name: price, dtype: float64
# apply returns one value per group, so the result must be mapped before it can be written back to the original table
df.groupby(by='item')['price'].apply(my_mean)
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

Data loading

  • Data from a '-'-separated text file
df = pd.read_csv('./data/type-.txt')
df
   Hello - I'm fine - He's fine too
0  Maybe - Probably - Maybe
1  However - Not necessarily - Not necessarily
df = pd.read_csv('./data/type-.txt',sep='-')
df
     Hello         I'm fine    He's fine too
0    Maybe         Probably            Maybe
1  However  Not necessarily  Not necessarily
df = pd.read_csv('./data/type-.txt',sep='-',header=None)
df
         0                1                2
0    Hello         I'm fine    He's fine too
1    Maybe         Probably            Maybe
2  However  Not necessarily  Not necessarily
#\s+ matches one or more whitespace characters
df = pd.read_csv('./data/test.txt',sep='\s+',header=None)
df
         0                1                2
0    Hello         I'm fine    He's fine too
1    Maybe         Probably            Maybe
2  However  Not necessarily  Not necessarily
#Modify column index
df.columns = ['a','b','c']
df
         a                b                c
0    Hello         I'm fine    He's fine too
1    Maybe         Probably            Maybe
2  However  Not necessarily  Not necessarily
df.rename(index={0:'a1',1:'a2',2:'a3'})
          a                b                c
a1    Hello         I'm fine    He's fine too
a2    Maybe         Probably            Maybe
a3  However  Not necessarily  Not necessarily
#The parameter names can directly modify the column index when reading data
df = pd.read_csv('./data/test.txt',sep='\s+',header=None,names=['A','B','C'])
df
         A                B                C
0    Hello         I'm fine    He's fine too
1    Maybe         Probably            Maybe
2  However  Not necessarily  Not necessarily

Pivot table

  • A pivot table is a table format that dynamically rearranges and summarizes data. Most people have used PivotTables in Excel and seen how powerful they are; in pandas the equivalent is called pivot_table.

  • Advantages of pivot tables:

    • Highly flexible: analysis and calculation requirements can be customized at will
    • Clear structure that makes the data easy to read
    • Easy to operate: a killer tool for building reports
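Since the CSV file used below is not bundled with this post, here is a self-contained sketch on made-up data (the column names venue/result/points are hypothetical) showing the same index/columns/values/aggfunc parameters:

```python
import pandas as pd
from pandas import DataFrame

games = DataFrame({'venue': ['home', 'away', 'home', 'away'],
                   'result': ['win', 'win', 'loss', 'loss'],
                   'points': [30, 27, 22, 25]})

# Rows grouped by venue, columns split by result, cells hold summed points
pt = games.pivot_table(index='venue', columns='result',
                       values='points', aggfunc='sum')
print(pt.loc['home', 'win'])   # 30
print(pt.loc['away', 'loss'])  # 25
```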
df = pd.read_csv('./data/Pivot table-Basketball match.csv')
df.head()
    opponent Victory or defeat Home and away  Hit  Number of shots  Shooting percentage  3-point hit rate  backboard  Assists  score
0    warrior               win     passenger   10               23                0.435             0.444          6       11     27
1       king               win     passenger    8               21                0.381             0.286          3        9     27
2       Calf               win          main   10               19                0.526             0.462          3        7     29
3  Grizzlies          negative          main    8               20                0.400             0.250          5        8     22
4  76 people               win     passenger   10               20                0.500             0.250          3       13     27

pivot_table has four most important parameters: index, values, columns and aggfunc

  • index parameter: the criterion for grouping the summary
    • Every pivot_table needs an index. To see Harden's stats under wins versus losses, classify by result and average each stat:
df.pivot_table(index='Victory or defeat')
                   3-point hit rate   Assists        Hit      score  Shooting percentage  Number of shots  backboard
Victory or defeat
win                         0.42819  9.666667  10.238095  32.952381             0.481524        21.142857   5.142857
negative                    0.29350  8.500000   7.500000  27.250000             0.354250        21.250000   4.750000
  • To see Harden's data against each opponent at home versus away, classify by both opponent and home/away:
df.pivot_table(index=['opponent','Home and away'])
                          3-point hit rate  Assists   Hit  score  Shooting percentage  Number of shots  backboard
opponent   Home and away
76 people  main                     0.4290      7.0   8.0   29.0                0.381             21.0        4.0
           passenger                0.2500     13.0  10.0   27.0                0.500             20.0        3.0
warrior    passenger                0.4440     11.0  10.0   27.0                0.435             23.0        6.0
king       passenger                0.2860      9.0   8.0   27.0                0.381             21.0        3.0
sun        passenger                0.5450      7.0  12.0   48.0                0.545             22.0        2.0
Calf       main                     0.4620      7.0  10.0   29.0                0.526             19.0        3.0
Knicks     main                     0.3850     10.0  12.0   37.0                0.444             27.0        2.0
           passenger                0.3530      9.0   9.0   31.0                0.391             23.0        5.0
pioneer    passenger                0.5710      3.0  16.0   48.0                0.552             29.0        8.0
Nuggets    main                     0.1430      9.0   6.0   21.0                0.375             16.0        8.0
Walker     main                     0.3330     10.0   8.0   29.0                0.364             22.0        8.0
           passenger                0.2500     15.0   9.0   26.0                0.429             21.0        5.0
Lakers     passenger                0.4440      9.0  13.0   36.0                0.591             22.0        4.0
Grizzlies  main                     0.3395      8.0   9.5   30.0                0.420             22.5        4.5
           passenger                0.3610      7.5   7.5   24.5                0.383             19.5        4.5
sir        main                     0.8750     13.0  19.0   56.0                0.760             25.0        2.0
           passenger                0.3330      3.0   8.0   29.0                0.421             19.0        5.0
Raptor     main                     0.2730     11.0   8.0   38.0                0.320             25.0        6.0
basket net main                     0.6150      8.0  13.0   37.0                0.650             20.0       10.0
eagle      passenger                0.5450     11.0   8.0   29.0                0.533             15.0        3.0
knight     main                     0.4290     13.0   8.0   35.0                0.381             21.0       11.0
pelican    main                     0.4000     17.0   8.0   26.0                0.500             16.0        1.0
wasp       passenger                0.4000     11.0   8.0   27.0                0.444             18.0       10.0
  • values parameter: filters which columns are computed
    • If we only need Harden's points, rebounds and assists at home and away and under wins and losses:
df.pivot_table(index=['Home and away','Victory or defeat'],values=['score','backboard','Assists'])
                                   Assists      score  backboard
Home and away Victory or defeat
main          win                10.555556  34.222222   5.444444
              negative            8.666667  29.666667   5.000000
passenger     win                 9.000000  32.000000   4.916667
              negative            8.000000  20.000000   4.000000
  • aggfunc parameter: sets the aggregation function applied during data aggregation

    • When aggfunc is not set, it defaults to aggfunc='mean', computing the mean.
  • To get James Harden's total points, total rebounds and total assists at home and away and under wins and losses:

df.pivot_table(index=['Home and away','Victory or defeat'],values=['score','backboard','Assists'],aggfunc='sum')
                                 Assists  score  backboard
Home and away Victory or defeat
main          win                     95    308         49
              negative                26     89         15
passenger     win                    108    384         59
              negative                 8     20          4
df.pivot_table(index=['Home and away','Victory or defeat'],aggfunc={'score':'sum','backboard':'mean','Assists':'min'})
                                 Assists  score  backboard
Home and away Victory or defeat
main          win                      7    308   5.444444
              negative                 7     89   5.000000
passenger     win                      3    384   4.916667
              negative                 8     20   4.000000
  • columns parameter: sets fields for the column hierarchy
    • Splits the values field into categories
#Get the total score of all teams at home and away
df.pivot_table(index='Home and away',values='score',aggfunc='sum')
               score
Home and away
main             397
passenger        404
#Break the home and away totals down by opponent
df.pivot_table(index='Home and away',values='score',aggfunc='sum',columns='opponent',fill_value=0)
opponent       76 people  warrior  king  sun  Calf  Knicks  pioneer  Nuggets  Walker  Lakers  Grizzlies  sir  Raptor  basket net  eagle  knight  pelican  wasp
Home and away
main                  29        0     0    0    29      37        0       21      29       0         60   56      38          37      0      35       26     0
passenger             27       27    27   48     0      31       48        0      26      36         49   29       0           0     29       0        0    27

Cross table

  • A special pivot table used to compute grouped frequencies
  • pd.crosstab(index, columns)
    • index: the data to group by; the row index of the crosstab
    • columns: the column index of the crosstab
import pandas as pd
from pandas import DataFrame
df = DataFrame({'sex':['man','man','women','women','man','women','man','women','women'],
               'age':[15,23,25,17,35,57,24,31,22],
               'smoke':[True,False,False,True,True,False,False,True,False],
               'height':[168,179,181,166,173,178,188,190,160]})
df
     sex  age  smoke  height
0    man   15   True     168
1    man   23  False     179
2  women   25  False     181
3  women   17   True     166
4    man   35   True     173
5  women   57  False     178
6    man   24  False     188
7  women   31   True     190
8  women   22  False     160
  • Find the number of smokers of each sex
#grouping
df.groupby(by='sex')['smoke'].sum()
#perspective
df.pivot_table(index='sex',values='smoke',aggfunc='sum')
       smoke
sex
man        2
women      2
pd.crosstab(df.smoke,df.sex)
sex    man  women
smoke
False    2      3
True     2      2
pd.crosstab(df.sex,df.smoke)
smoke  False  True
sex
man        2     2
women      3     2
  • Look at the smoking status of people at each age
pd.crosstab(df.smoke,df.age)
age    15  17  22  23  24  25  31  35  57
smoke
False   0   0   1   1   1   1   0   0   1
True    1   1   0   0   0   0   1   1   0
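crosstab can also report proportions instead of raw counts via its normalize parameter (a short sketch on the same data; normalize='index' divides each row by its row total):

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame({'sex': ['man', 'man', 'women', 'women', 'man', 'women', 'man', 'women', 'women'],
                'smoke': [True, False, False, True, True, False, False, True, False]})

rates = pd.crosstab(df.sex, df.smoke, normalize='index')
print(rates.loc['man', True])    # 0.5  (2 of 4 men smoke)
print(rates.loc['women', True])  # 0.4  (2 of 5 women smoke)
```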

Topics: Python Data Analysis Data Mining