Fundamentals of python data analysis 03 - exercise items

Posted by saidbakr on Mon, 07 Feb 2022 20:57:01 +0100

Population analysis cases

  • Requirements:
    • Import the file and view the original data
    • Summarize population data and state abbreviation data
    • Delete the duplicate abstraction column in the summarized data
    • View the columns with missing data in the summarized data
    • Find out which states / regions make the value of state NaN in the summarized data, and perform the de duplication operation
    • Fill in the correct values for the state items of these states / regions found, so as to remove all NaN in the state column
    • The summarized data and the area data of each state are summarized
    • We will find the missing data in the area(sq.mi) column and find out which rows
    • Remove rows with missing data
    • Find the population data for 2010
    • Calculate the population density of each state
import pandas as pd
import numpy as np
# The full name of state is abbreviated as abreviation state
abb = pd.read_csv('./state-abbrevs.csv')
abb
stateabbreviation
0AlabamaAL
1AlaskaAK
2ArizonaAZ
3ArkansasAR
4CaliforniaCA
5ColoradoCO
6ConnecticutCT
7DelawareDE
8District of ColumbiaDC
9FloridaFL
10GeorgiaGA
11HawaiiHI
12IdahoID
13IllinoisIL
14IndianaIN
15IowaIA
16KansasKS
17KentuckyKY
18LouisianaLA
19MaineME
20MontanaMT
21NebraskaNE
22NevadaNV
23New HampshireNH
24New JerseyNJ
25New MexicoNM
26New YorkNY
27North CarolinaNC
28North DakotaND
29OhioOH
30OklahomaOK
31OregonOR
32MarylandMD
33MassachusettsMA
34MichiganMI
35MinnesotaMN
36MississippiMS
37MissouriMO
38PennsylvaniaPA
39Rhode IslandRI
40South CarolinaSC
41South DakotaSD
42TennesseeTN
43TexasTX
44UtahUT
45VermontVT
46VirginiaVA
47WashingtonWA
48West VirginiaWV
49WisconsinWI
50WyomingWY
#state/region: abbreviation of state: ages: age year time population
pop = pd.read_csv('./state-population.csv')
#Full name of state area (sq. mi)
area = pd.read_csv('./state-areas.csv')
abb_pop = pd.merge(left=abb,right=pop,left_on='abbreviation',right_on='state/region',how='outer')
abb_pop
stateabbreviationstate/regionagesyearpopulation
0AlabamaALALunder1820121117489.0
1AlabamaALALtotal20124817528.0
2AlabamaALALunder1820101130966.0
3AlabamaALALtotal20104785570.0
4AlabamaALALunder1820111125763.0
.....................
2539NaNNaNUSAtotal2010309326295.0
2540NaNNaNUSAunder18201173902222.0
2541NaNNaNUSAtotal2011311582564.0
2542NaNNaNUSAunder18201273708179.0
2543NaNNaNUSAtotal2012313873685.0

2544 rows × 6 columns

# Delete the duplicate abstraction column
abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
# View columns with missing data
abb_pop.isnull().any(axis=0)
state            True
state/region    False
ages            False
year            False
population       True
dtype: bool
# Summarize the data to find out which states / regions make the value of state NaN and perform the operation of de duplication
# 1. Locate null values in the state column
ex = abb_pop['state'].isnull()
# 2. Take out the row data corresponding to the empty value of state
abb_pop.loc[ex]
# 3. Get the abbreviation data corresponding to empty state
abb_pop.loc[ex]['state/region']
# 4. De duplicate the abbreviation obtained in step 3
abb_pop.loc[ex]['state/region'].unique()
array(['PR', 'USA'], dtype=object)
# Fill in the correct values for the state items of these states / regions found, so as to remove all NaN in the state column
# 1. Locate the full name corresponding to the abbreviation of USA
abb_pop['state/region'] == 'USA'
# 2. The above Boolean value is used as the row index of metadata to get the row data of USA
abb_pop.loc[abb_pop['state/region'] == 'USA']
# 3. Get row index of USA
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
#4. Batch fill the values of the state column of the indexs rows into the full name of USA
abb_pop.loc[indexs,'state'] = 'United State'
abb_pop['state/region'] == 'PR'
abb_pop.loc[abb_pop['state/region'] == 'PR']
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index
abb_pop.loc[indexs,'state'] = 'PPPRRR'
#Detect whether there is still a null value in the state column
abb_pop['state'].isnull().sum()
# The summarized data and the area data of each state are summarized
abb_pop_area = pd.merge(left=abb_pop,right=area,on='state',how='outer')
abb_pop_area.head()
statestate/regionagesyearpopulationarea (sq. mi)
0AlabamaALunder182012.01117489.052423.0
1AlabamaALtotal2012.04817528.052423.0
2AlabamaALunder182010.01130966.052423.0
3AlabamaALtotal2010.04785570.052423.0
4AlabamaALunder182011.01125763.052423.0
# Find the missing data in the area(sq.mi) column, find out which rows are and remove the rows containing the missing data
abb_pop_area['area (sq. mi)'].isnull()
# Take out the row data corresponding to the hollow area (sq. mi) column
abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()]
# The row index corresponding to the area (sq. mi) column is obtained
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
Int64Index([2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458,
            2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2468, 2469,
            2470, 2471, 2472, 2473, 2474, 2475, 2476, 2477, 2478, 2479, 2480,
            2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489, 2490, 2491,
            2492, 2493, 2494, 2495, 2496, 2497, 2498, 2499, 2500, 2501, 2502,
            2503, 2504, 2505, 2506, 2507, 2508, 2509, 2510, 2511, 2512, 2513,
            2514, 2515, 2516, 2517, 2518, 2519, 2520, 2521, 2522, 2523, 2524,
            2525, 2526, 2527, 2528, 2529, 2530, 2531, 2532, 2533, 2534, 2535,
            2536, 2537, 2538, 2539, 2540, 2541, 2542, 2543],
           dtype='int64')
# Remove rows with missing data
abb_pop_area.drop(labels=indexs,axis=0,inplace=True)
# Find the population data for 2010
abb_pop_area.query('year == 2010 & ages == "total"')
statestate/regionagesyearpopulationarea (sq. mi)
3AlabamaALtotal2010.04785570.052423.0
91AlaskaAKtotal2010.0713868.0656425.0
101ArizonaAZtotal2010.06408790.0114006.0
189ArkansasARtotal2010.02922280.053182.0
197CaliforniaCAtotal2010.037333601.0163707.0
283ColoradoCOtotal2010.05048196.0104100.0
293ConnecticutCTtotal2010.03579210.05544.0
379DelawareDEtotal2010.0899711.01954.0
389District of ColumbiaDCtotal2010.0605125.068.0
475FloridaFLtotal2010.018846054.065758.0
485GeorgiaGAtotal2010.09713248.059441.0
570HawaiiHItotal2010.01363731.010932.0
581IdahoIDtotal2010.01570718.083574.0
666IllinoisILtotal2010.012839695.057918.0
677IndianaINtotal2010.06489965.036420.0
762IowaIAtotal2010.03050314.056276.0
773KansasKStotal2010.02858910.082282.0
858KentuckyKYtotal2010.04347698.040411.0
869LouisianaLAtotal2010.04545392.051843.0
954MaineMEtotal2010.01327366.035387.0
965MontanaMTtotal2010.0990527.0147046.0
1050NebraskaNEtotal2010.01829838.077358.0
1061NevadaNVtotal2010.02703230.0110567.0
1146New HampshireNHtotal2010.01316614.09351.0
1157New JerseyNJtotal2010.08802707.08722.0
1242New MexicoNMtotal2010.02064982.0121593.0
1253New YorkNYtotal2010.019398228.054475.0
1338North CarolinaNCtotal2010.09559533.053821.0
1349North DakotaNDtotal2010.0674344.070704.0
1434OhioOHtotal2010.011545435.044828.0
1445OklahomaOKtotal2010.03759263.069903.0
1530OregonORtotal2010.03837208.098386.0
1541MarylandMDtotal2010.05787193.012407.0
1626MassachusettsMAtotal2010.06563263.010555.0
1637MichiganMItotal2010.09876149.096810.0
1722MinnesotaMNtotal2010.05310337.086943.0
1733MississippiMStotal2010.02970047.048434.0
1818MissouriMOtotal2010.05996063.069709.0
1829PennsylvaniaPAtotal2010.012710472.046058.0
1914Rhode IslandRItotal2010.01052669.01545.0
1925South CarolinaSCtotal2010.04636361.032007.0
2010South DakotaSDtotal2010.0816211.077121.0
2021TennesseeTNtotal2010.06356683.042146.0
2106TexasTXtotal2010.025245178.0268601.0
2117UtahUTtotal2010.02774424.084904.0
2202VermontVTtotal2010.0625793.09615.0
2213VirginiaVAtotal2010.08024417.042769.0
2298WashingtonWAtotal2010.06742256.071303.0
2309West VirginiaWVtotal2010.01854146.024231.0
2394WisconsinWItotal2010.05689060.065503.0
2405WyomingWYtotal2010.0564222.097818.0
# Calculate the population density of each state
midu = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area['midu'] = midu
abb_pop_area.head()
statestate/regionagesyearpopulationarea (sq. mi)midu
0AlabamaALunder182012.01117489.052423.021.316769
1AlabamaALtotal2012.04817528.052423.091.897221
2AlabamaALunder182010.01130966.052423.021.573851
3AlabamaALtotal2010.04785570.052423.091.287603
4AlabamaALunder182011.01125763.052423.021.474601

Data analysis of 2012 US election donation project

import numpy as np
import pandas as pd
# Define month, candidate and political party:
months = {'JAN' : 1, 'FEB' : 2, 'MAR' : 3, 'APR' : 4, 'MAY' : 5, 'JUN' : 6,
          'JUL' : 7, 'AUG' : 8, 'SEP' : 9, 'OCT': 10, 'NOV': 11, 'DEC' : 12}

parties = {
  'Bachmann, Michelle': 'Republican',
  'Romney, Mitt': 'Republican',
  'Obama, Barack': 'Democrat',
  "Roemer, Charles E. 'Buddy' III": 'Reform',
  'Pawlenty, Timothy': 'Republican',
  'Johnson, Gary Earl': 'Libertarian',
  'Paul, Ron': 'Republican',
  'Santorum, Rick': 'Republican',
  'Cain, Herman': 'Republican',
  'Gingrich, Newt': 'Republican',
  'McCotter, Thaddeus G': 'Republican',
  'Huntsman, Jon': 'Republican',
  'Perry, Rick': 'Republican'           
 }

demand

  • Load data
  • View basic information of data
  • Specify data interception, extract the data of the following fields, and discard other data
    • cand_nm: name of candidate
    • contbr_nm: name of donor
    • contbr_st: donor state
    • contbr_ Employee: the company of the donor
    • contbr_occupation: donor occupation
    • contb_receipt_amt: donation amount (USD)
    • contb_receipt_dt: date of donation
  • Summarize the new data and check whether there is missing data
  • Use statistical indicators to quickly describe the summary of numerical attributes.
  • Null value processing. Relevant fields may have blank values due to forgetting to fill in or confidentiality. Fill them with NOT PROVIDE
  • Exception handling. Delete the data of donation amount < = 0
  • Create a new column for the party of each candidate
  • Check the different elements in the party column
  • Count the occurrence times of each element in the party column
  • Check the total amount of political contributions received by each party_ receipt_ amt
  • Check the total amount of political contributions received by each party every day_ receipt_ amt
  • Convert the date format in the table to 'yyyy MM DD'.
  • See who veterans (donor profession) DISABLED VETERAN mainly supports
df = pd.read_csv('./data/usa_election.txt',low_memory=False)
df = df[['cand_nm','contbr_nm','contbr_st','contbr_employer','contbr_occupation','contb_receipt_amt','contb_receipt_dt']]
df.head()
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dt
0Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED250.020-JUN-11
1Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED50.023-JUN-11
2Bachmann, MichelleSMITH, LANIERALINFORMATION REQUESTEDINFORMATION REQUESTED250.005-JUL-11
3Bachmann, MichelleBLEVINS, DARONDAARNONERETIRED250.001-AUG-11
4Bachmann, MichelleWARDENBURG, HAROLDARNONERETIRED300.020-JUN-11
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   cand_nm            536041 non-null  object 
 1   contbr_nm          536041 non-null  object 
 2   contbr_st          536040 non-null  object 
 3   contbr_employer    525088 non-null  object 
 4   contbr_occupation  530520 non-null  object 
 5   contb_receipt_amt  536041 non-null  float64
 6   contb_receipt_dt   536041 non-null  object 
dtypes: float64(1), object(6)
memory usage: 28.6+ MB
#Fill in all missing data as NOT PROVIDE
df.fillna(value='NOT PROVIDE',inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   cand_nm            536041 non-null  object 
 1   contbr_nm          536041 non-null  object 
 2   contbr_st          536041 non-null  object 
 3   contbr_employer    536041 non-null  object 
 4   contbr_occupation  536041 non-null  object 
 5   contb_receipt_amt  536041 non-null  float64
 6   contb_receipt_dt   536041 non-null  object 
dtypes: float64(1), object(6)
memory usage: 28.6+ MB
#Exception handling. Delete the data of donation amount < = 0
(df['contb_receipt_amt'] <= 0).sum()
5727
~(df['contb_receipt_amt'] <= 0)
df = df.loc[~(df['contb_receipt_amt'] <= 0)]
#Create a new column for the party of each candidate
df['party'] = df['cand_nm'].map(parties)
df.head()
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dtparty
0Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED250.020-JUN-11Republican
1Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED50.023-JUN-11Republican
2Bachmann, MichelleSMITH, LANIERALINFORMATION REQUESTEDINFORMATION REQUESTED250.005-JUL-11Republican
3Bachmann, MichelleBLEVINS, DARONDAARNONERETIRED250.001-AUG-11Republican
4Bachmann, MichelleWARDENBURG, HAROLDARNONERETIRED300.020-JUN-11Republican
#Check the different elements in the party column
df['party'].unique()
array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)
df
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dtparty
0Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED250.020-JUN-11Republican
1Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED50.023-JUN-11Republican
2Bachmann, MichelleSMITH, LANIERALINFORMATION REQUESTEDINFORMATION REQUESTED250.005-JUL-11Republican
3Bachmann, MichelleBLEVINS, DARONDAARNONERETIRED250.001-AUG-11Republican
4Bachmann, MichelleWARDENBURG, HAROLDARNONERETIRED300.020-JUN-11Republican
...........................
536036Perry, RickANDERSON, MARILEE MRS.XXINFORMATION REQUESTED PER BEST EFFORTSINFORMATION REQUESTED PER BEST EFFORTS2500.031-AUG-11Republican
536037Perry, RickTOLBERT, DARYL MR.XXT.A.C.C.LONGWALL MAINTENANCE FOREMAN500.030-SEP-11Republican
536038Perry, RickGRANE, BRYAN F. MR.XXINFORMATION REQUESTED PER BEST EFFORTSINFORMATION REQUESTED PER BEST EFFORTS500.029-SEP-11Republican
536039Perry, RickDUFFY, DAVID A. MR.XXDUFFY EQUIPMENT COMPANY INC.BUSINESS OWNER2500.030-SEP-11Republican
536040Perry, RickGORMAN, CHRIS D. MR.XXINFORMATION REQUESTED PER BEST EFFORTSINFORMATION REQUESTED PER BEST EFFORTS5000.029-SEP-11Republican

530314 rows × 8 columns

#Count the occurrence times of each element in the party column
df['party'].value_counts()
Democrat       289999
Republican     234300
Reform           5313
Libertarian       702
Name: party, dtype: int64
#Check the total amount of political contributions received by each party_ receipt_ amt
df.groupby(by='party')['contb_receipt_amt'].sum()
party
Democrat       8.259441e+07
Libertarian    4.132769e+05
Reform         3.429658e+05
Republican     1.251181e+08
Name: contb_receipt_amt, dtype: float64
#Check the total amount of political contributions received by each party every day_ receipt_ amt
df.groupby(by=['contb_receipt_dt','party'])['contb_receipt_amt'].sum()
contb_receipt_dt  party      
01-APR-11         Reform             50.00
                  Republican      12635.00
01-AUG-11         Democrat       182198.00
                  Libertarian      1000.00
                  Reform           1847.00
                                   ...    
31-MAY-11         Republican     313839.80
31-OCT-11         Democrat       216971.87
                  Libertarian      4250.00
                  Reform           3205.00
                  Republican     751542.36
Name: contb_receipt_amt, Length: 1183, dtype: float64
#Convert the date format in the table to 'yyyy MM DD'
def transform_date(d):
    day,month,year = d.split('-')
    month = months[month]
    return '20'+year+'-'+str(month)+'-'+day
    
df['contb_receipt_dt'] = df['contb_receipt_dt'].map(transform_date)
df.head()
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dtparty
0Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED250.02011-6-20Republican
1Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED50.02011-6-23Republican
2Bachmann, MichelleSMITH, LANIERALINFORMATION REQUESTEDINFORMATION REQUESTED250.02011-7-05Republican
3Bachmann, MichelleBLEVINS, DARONDAARNONERETIRED250.02011-8-01Republican
4Bachmann, MichelleWARDENBURG, HAROLDARNONERETIRED300.02011-6-20Republican
# See who the disabled veterans most support
# Take out the data containing only veterans' occupations
df['contbr_occupation'] == 'DISABLED VETERAN'
old_bing_df = df.loc[df['contbr_occupation'] == 'DISABLED VETERAN']
old_bing_df.head()
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dtparty
149790Obama, BarackMAHURIN, DAVIDFLVETERANS ADMINISTRATIONDISABLED VETERAN10.02012-1-17Democrat
150910Obama, BarackMAHURIN, DAVIDFLVETERANS ADMINISTRATIONDISABLED VETERAN20.02012-1-01Democrat
174041Obama, BarackKRUCHTEN, MICHAELILDISABLEDDISABLED VETERAN50.02011-12-02Democrat
175244Obama, BarackKRUCHTEN, MICHAELILDISABLEDDISABLED VETERAN250.02011-10-12Democrat
183790Obama, BarackBRYANT, J.L.KSRET ARMYDISABLED VETERAN100.02011-10-12Democrat
old_bing_df.groupby(by='cand_nm')['contb_receipt_amt'].sum()
cand_nm
Cain, Herman       300.00
Obama, Barack     4205.00
Paul, Ron         2425.49
Santorum, Rick     250.00
Name: contb_receipt_amt, dtype: float64

Topics: Python Data Analysis Data Mining