Principal Component Analysis and Factor Analysis

Posted by Irresistable on Tue, 02 Nov 2021 19:35:48 +0100

Question 1

# Import packages
import numpy as np
import pandas as pd
# The given correlation matrix
R1 = np.array([[1, 0.577, 0.509, 0.387, 0.462], [0.577, 1, 0.599, 0.389, 0.322], [0.509, 0.599, 1, 0.436, 0.426], [0.387, 0.389, 0.436, 1, 0.523], [0.462, 0.322, 0.426, 0.523, 1]])
# Compute the eigenvalues and eigenvectors of the matrix
eigValues1, eigVects1 = np.linalg.eig(R1)
eigValues1, eigVects1
(array([2.85671099, 0.80916372, 0.34294995, 0.53967524, 0.4515001 ]),
 array([[ 0.46360519,  0.24033901,  0.45126217,  0.61170545, -0.38663456],
        [ 0.45710783,  0.50930473, -0.67622331, -0.17818949, -0.20647436],
        [ 0.47017564,  0.26044828,  0.40000721, -0.33505645,  0.66244469],
        [ 0.42145876, -0.5256649 ,  0.17559859, -0.54076277, -0.47200602],
        [ 0.42122445, -0.58196989, -0.38502448,  0.43517554,  0.38243876]]))

The $i$-th principal component is $Y_i = \sum_{k=1}^{n} e_k X_k$. By mathematical derivation, the orthonormal eigenvectors corresponding to the eigenvalues ordered from largest to smallest are the combination coefficients of the corresponding principal components. The resulting principal components satisfy: $Y_1$ contains as much information as possible, each $Y_i$ ($2 \leq i \leq n$) contains as much of the information not already captured by $Y_k$ ($k \leq i-1$) as possible, and the information contained in each principal component equals its variance.
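Note that np.linalg.eig does not return the eigenvalues in any particular order, so before reading off the principal components it helps to sort the eigenpairs; a minimal sketch reusing eigValues1 and eigVects1 from above:

# Sort the eigenpairs so that lambda_1 >= lambda_2 >= ... >= lambda_5
idx = np.argsort(eigValues1)[::-1]    # indices of the eigenvalues, largest first
eigValues1_sorted = eigValues1[idx]   # 2.857, 0.809, 0.540, 0.452, 0.343
eigVects1_sorted = eigVects1[:, idx]  # column k-1 holds the coefficients e_k of Y_k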

Performing principal component analysis on the correlation matrix is equivalent to performing principal component analysis on the covariance matrix of the standardized data. The eigenvalues and eigenvectors are:

Eigenvalue            Eigenvector
$\lambda_1$ = 2.857   $e_1$ = (0.464, 0.457, 0.470, 0.421, 0.421)
$\lambda_2$ = 0.809   $e_2$ = (0.240, 0.509, 0.260, -0.526, -0.582)
$\lambda_3$ = 0.540   $e_3$ = (0.612, -0.178, -0.335, -0.541, 0.435)
$\lambda_4$ = 0.452   $e_4$ = (-0.387, -0.206, 0.662, -0.472, 0.382)
$\lambda_5$ = 0.343   $e_5$ = (0.451, -0.676, 0.400, 0.176, -0.385)

Therefore the five principal components of $X^{*}$ are $Y_k = e_k X^{*}$.
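The raw data behind R1 are not given, so purely as an illustration, the component scores for one hypothetical standardized observation x_star would be computed with the sorted eigenvectors from the sketch above:

# Component scores for a single hypothetical standardized observation
x_star = np.array([0.5, -1.2, 0.3, 0.8, -0.4])  # hypothetical values of X*
Y = eigVects1_sorted.T @ x_star  # Y[k-1] = e_k' x_star, the k-th principal component score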

# Calculate contribution ratio
# Note: np.linalg.eig does not sort its eigenvalues, so the cumulative column
# below follows the returned order; sort descending first for a scree-style reading
def contributionRate(values):
    v = list(values)
    l = len(v)
    result = np.array([v[i]*100 / sum(v) for i in range(l)])  # contribution ratio of each eigenvalue (%)
    add = np.empty([l, ])
    add[0] = result[0]
    for i in range(1, l):
        add[i] = add[i-1] + result[i]  # running total of the ratios
    return [result, add]
contribution = pd.DataFrame() # Data frame holding the eigenvalues and their contribution ratios
contribution['eigValues'] = eigValues1 # Eigenvalues
contribution['Contribution ratio'] = contributionRate(eigValues1)[0]
contribution['Cumulative contribution ratio'] = contributionRate(eigValues1)[1]
contribution
   eigValues  Contribution ratio  Cumulative contribution ratio
0   2.856711           57.134220                      57.134220
1   0.809164           16.183274                      73.317494
2   0.342950            6.858999                      80.176493
3   0.539675           10.793505                      90.969998
4   0.451500            9.030002                     100.000000

Because the contribution ratio of the $k$-th principal component is $\frac{\lambda_k}{\sum_{i=1}^{5} \lambda_i}$, the contribution ratios of the principal components are 57.1%, 16.2%, 6.9%, 10.8% and 9.0% (in the order returned above). The cumulative contribution of the first two principal components is 73.3%.
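Because contributionRate accumulates the eigenvalues in the order np.linalg.eig returns them, the cumulative column above mixes large and small eigenvalues; sorting first gives the usual monotone reading (a quick check):

vals = np.sort(eigValues1)[::-1]            # eigenvalues in decreasing order
print((vals * 100 / vals.sum()).cumsum())   # 57.13, 73.32, 84.11, 93.14, 100.00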

The first principal component is a weighted sum of all the indicators: when it is large, every rebound rate tends to be large, so it can be called the "total rebound rate" factor. The second principal component contrasts the first three indicators (positive coefficients) with the last two (negative coefficients): when it is large, the first three rebound rates are large while the last two are small, so it can be called the "oil stock rebound rate" factor.

Question 2

Field information:

index   Meaning
X1      Per capita staple food expenditure
X2      Per capita non-staple food expenditure
X3      Per capita expenditure on tobacco, alcohol and tea
X4      Per capita expenditure on other non-staple food
X5      Per capita clothing expenditure
X6      Per capita commodity expenditure
X7      Per capita fuel expenditure
X8      Per capita non-commodity expenditure

Question 1: Find the correlation coefficient matrix R

# Data preprocessing
data = pd.read_excel('4.2.xlsx',index_col=0)
# Standardize to zero mean and unit variance
data = (data - data.mean()) / data.std()
# Correlation matrix
R2 = data.corr()
R2
          X1        X2        X3        X4        X5        X6        X7        X8
X1  1.000000  0.333647 -0.054539 -0.061254 -0.289361  0.198796  0.348698  0.318677
X2  0.333647  1.000000 -0.022902  0.398931 -0.156304  0.711134  0.413595  0.834952
X3 -0.054539 -0.022902  1.000000  0.533329  0.496763  0.032830 -0.139086 -0.258358
X4 -0.061254  0.398931  0.533329  1.000000  0.698424  0.467917 -0.171274  0.312757
X5 -0.289361 -0.156304  0.496763  0.698424  1.000000  0.280129 -0.208277 -0.081234
X6  0.198796  0.711134  0.032830  0.467917  0.280129  1.000000  0.416821  0.701586
X7  0.348698  0.413595 -0.139086 -0.171274 -0.208277  0.416821  1.000000  0.398868
X8  0.318677  0.834952 -0.258358  0.312757 -0.081234  0.701586  0.398868  1.000000

The table above shows fairly strong correlations among the variables (e.g. 0.83 between X2 and X8), so principal component analysis and factor analysis are appropriate for dimension reduction.

Question 2: Principal component analysis

# Compute eigenvalues and eigenvectors
eigValues2, eigVects2 = np.linalg.eig(R2)
eig = pd.DataFrame() # Data frame listing the variable names alongside the eigenvalues
eig['names'] = data.columns # Column names
eig['eigValues'] = eigValues2 # Eigenvalues
eig
  names  eigValues
0    X1   3.096288
1    X2   2.367223
2    X3   0.919987
3    X4   0.705925
4    X5   0.498442
5    X6   0.051470
6    X7   0.130776
7    X8   0.229888
# Determine the number of common factors
from math import sqrt  # used below when building the loading matrix
for k in range(1, 9):
    # Stop once the first k eigenvalues explain at least 80% of the total variance
    # (np.linalg.eig does not sort; here the leading eigenvalues happen to be the largest)
    if eig['eigValues'][:k].sum() / eig['eigValues'].sum() >= 0.8:
        print(k)
        break
4
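An alternative rule of thumb is the Kaiser criterion, which keeps only eigenvalues greater than 1 and would suggest just two factors here; a one-liner for comparison:

print((eigValues2 > 1).sum())  # Kaiser criterion: count of eigenvalues above 1 (2 here)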

The contribution ratio of the $k$-th principal component is $\frac{\lambda_k}{\sum_{i=1}^{8} \lambda_i}$:

# Calculate contribution ratios
contribution = pd.DataFrame() # Data frame with variable names and eigenvalues
contribution['names'] = data.columns # Column names
contribution['eigValues'] = eigValues2 # Eigenvalues
contribution['Contribution ratio'] = contributionRate(eigValues2)[0]
contribution['Cumulative contribution ratio'] = contributionRate(eigValues2)[1]
contribution
  names  eigValues  Contribution ratio  Cumulative contribution ratio
0    X1   3.096288           38.703604                      38.703604
1    X2   2.367223           29.590288                      68.293892
2    X3   0.919987           11.499842                      79.793734
3    X4   0.705925            8.824067                      88.617801
4    X5   0.498442            6.230529                      94.848330
5    X6   0.051470            0.643369                      95.491699
6    X7   0.130776            1.634697                      97.126396
7    X8   0.229888            2.873604                     100.000000

Question 3: Factor analysis

# Construct the initial factor loading matrix: column j is sqrt(lambda_j) * e_j
col0 = list(sqrt(eigValues2[0]) * eigVects2[:,0]) # Column 1 of the factor loading matrix
col1 = list(sqrt(eigValues2[1]) * eigVects2[:,1]) # Column 2
col2 = list(sqrt(eigValues2[2]) * eigVects2[:,2]) # Column 3
col3 = list(sqrt(eigValues2[3]) * eigVects2[:,3]) # Column 4
A = pd.DataFrame([col0, col1, col2, col3]).T # Assemble the factor loading matrix A
A.columns=['X1','X2','X3','X4'] # Names of the four common-factor columns
A
         X1        X2        X3        X4
0  0.439215  0.371164 -0.665578  0.316560
1  0.913659  0.057862  0.068351  0.188935
2 -0.032518 -0.731499 -0.554221 -0.027204
3  0.447107 -0.827879  0.020887  0.194140
4  0.038175 -0.885373  0.046123 -0.239764
5  0.866903 -0.207209  0.139412 -0.188390
6  0.558060  0.401080 -0.274694 -0.645366
7  0.896234  0.133982  0.260200  0.148706
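The same loading matrix can be built in one vectorized step; a sketch equivalent to the column-by-column code above (it relies on the first four eigenvalues happening to be the largest here):

# a_ij = sqrt(lambda_j) * e_ij for the first four eigenpairs, via broadcasting
A_vec = pd.DataFrame(eigVects2[:, :4] * np.sqrt(eigValues2[:4]), columns=['X1', 'X2', 'X3', 'X4'])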
# Establish the factor model
h = np.zeros(8) # Communalities: share of each variable's variance explained by the common factors; the closer to 1, the better the factor analysis works
D = np.mat(np.eye(8)) # Specific-factor (uniqueness) variances
A = np.mat(A) # Convert the loading matrix A to a matrix type
for i in range(8):
    a = A[i,:] * A[i,:].T # Sum of squared loadings in row i of A
    h[i] = a[0,0]  # Communality of X_i: the part of Var(X_i) explained by all common factors
    D[i,i] = 1 - a[0,0] # Standardized variables have variance 1, so Var(X_i) = h_i + specific variance
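As a sanity check on the factor model, the reproduced correlation matrix $\hat{R} = AA^{T} + D$ should be close to R2; a minimal sketch using the A and D just computed:

# Reproduced correlation matrix under the 4-factor model
R_hat = A * A.T + D # np.mat arithmetic: (8x4)(4x8) plus the diagonal specific variances
print(np.abs(np.asarray(R_hat) - R2.values).max()) # largest absolute residual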

Varimax rotation: express the factors as more interpretable linear combinations of the variables

def varimax(Phi, gamma = 1.0, q = 10, tol = 1e-6): # Maximum variance (varimax) rotation
    p, k = Phi.shape # Number of rows and columns of the loading matrix Phi
    R = np.eye(k) # Start from the k*k identity rotation matrix
    d = 0
    for i in range(q):
        d_old = d
        Lambda = np.dot(Phi, R) # Currently rotated loadings
        u, s, vh = np.linalg.svd(np.dot(Phi.T, np.asarray(Lambda) ** 3 - (gamma / p) * np.dot(Lambda, np.diag(np.diag(np.dot(Lambda.T,Lambda)))))) # SVD of the varimax criterion gradient
        R = np.dot(u,vh) # Update the orthogonal rotation matrix R
        d = np.sum(s) # Sum of singular values
        if d_old != 0 and d / d_old < 1 + tol: # Converged within tolerance: stop early
            break
    return np.dot(Phi, R) # Rotated loading matrix Phi*R

rotation_mat = varimax(A) # Calling the maximum variance rotation function
rotation_mat = pd.DataFrame(rotation_mat) # Data Frame
rotation_mat
          0         1         2         3
0  0.203150  0.062124 -0.892652 -0.178661
1  0.889045 -0.029496 -0.262154 -0.135988
2 -0.214454 -0.871857 -0.194565  0.008240
3  0.481798 -0.792082  0.077324  0.240652
4  0.033408 -0.819782  0.413444  0.029366
5  0.804396 -0.274344  0.064507 -0.350578
6  0.255416  0.127864 -0.165170 -0.924869
7  0.933336  0.109739 -0.110305 -0.125212
# Calculate factor scores
data_ = np.mat(data) # Convert the standardized data to a matrix
factor_score = (data_).dot(A) # Simple weighted-sum scores using the (unrotated) loading matrix A
factor_score = pd.DataFrame(factor_score) # Back to a data frame
factor_score.columns = ['X1','X2','X3','X4'] # Factor score columns (named to match A)
factor_score.index = data.index
factor_score['sum'] = factor_score.sum(axis=1) # Total score per region
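The scores above simply weight the standardized data by the loadings; a common alternative is the regression (Thomson) method with weights $R^{-1}A$. A sketch under the same unrotated A:

# Regression (Thomson) factor scores: F = X R^{-1} A
weights = np.linalg.solve(R2.values, np.asarray(A)) # solve R W = A for W = R^{-1} A
scores_reg = pd.DataFrame(data.values @ weights, index=data.index)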

Sorting results

factor_score = factor_score.sort_values(by = ['sum'], ascending=False)
factor_score
                       X1        X2        X3        X4        sum
region
Guangdong       12.134223  3.505384 -0.780327 -1.320523  13.538757
Hainan           2.465522  4.890210 -1.567251  1.625074   7.413554
Guangxi          1.859072  1.900515  1.539035 -0.063227   5.235395
Shanghai         5.716003 -3.940215  1.603553  1.666736   5.046078
Fujian           2.030434  2.114319 -0.799141  0.606531   3.952143
Hunan           -0.911514  0.849711  1.107383 -0.187768   0.857812
Jiangxi         -3.450478  3.243735  0.989421 -0.075377   0.707301
Sichuan         -0.237351 -0.528700  1.121904  0.133922   0.489775
Tianjin          0.766193 -0.729420 -0.647183  1.044525   0.434115
Beijing          3.153503 -4.443551  1.185859  0.457701   0.353514
Yunnan          -1.173142  0.454616 -0.215400  0.962859   0.028933
Xinjiang        -1.440261  0.648165  1.237420 -0.432477   0.012847
Jiangsu          0.269736 -0.174160  0.491318 -0.609942  -0.023048
Hubei           -0.299919  0.907169 -0.123593 -0.584738  -0.101081
Shaanxi         -1.078188  0.434750  0.037484  0.254255  -0.351699
Zhejiang         2.665959 -2.113285 -0.145554 -0.905556  -0.498436
Guizhou         -2.215974  0.658934  0.294098  0.350311  -0.912631
Anhui           -1.961910  0.677642 -0.210394  0.521765  -0.972897
HEBEI           -0.690123  0.454840 -0.414944 -0.635875  -1.286102
Jilin           -2.276431  1.323683  0.105062 -0.566823  -1.414510
Liaoning         0.079538 -1.508676  0.178047 -0.325763  -1.576853
Shandong        -0.248310 -1.049270 -0.064522 -0.451012  -1.813114
Ningxia         -0.757336 -0.993462  0.136768 -0.373849  -1.987879
Gansu           -2.080281  0.296880 -0.726723  0.027185  -2.482939
Inner Mongolia  -2.213952  0.203457 -0.289416 -0.190353  -2.490265
Henan           -2.614718  0.540896 -0.702427  0.131479  -2.644770
Shanxi          -2.964064 -0.257612  0.155919  0.194874  -2.870882
Heilongjiang    -2.332689  0.157537  0.222618 -1.009044  -2.961577
Qinghai         -1.959088  0.027927 -1.477425  0.137006  -3.271581
Tibet           -0.234453 -7.552021 -2.241588 -0.381896 -10.409958

The first factor is a weighted sum of all the indicators except per capita non-staple food expenditure. When it is large, all kinds of daily living expenditure can be inferred to be large, so it can be called the "daily living expenditure" factor. When the second factor is large, per capita expenditure on staple food, non-staple food, fuel and non-commodities is smaller, while per capita expenditure on tobacco, alcohol and tea, other non-staple food, clothing and commodities is larger, so it can be called the "unnecessary expenditure" factor.

Factor analysis is a multivariate statistical method that condenses many original variables into a few factors with minimal loss of information, in such a way that the factors admit a meaningful interpretation.
Its matrix form is $X = AF + \varepsilon$.

Before factor analysis, the KMO test and Bartlett's test of sphericity should be carried out on the data. Here the Bartlett P value = 0.000 is highly significant, so the data meet the preconditions of factor analysis.
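Those tests are not computed anywhere in the code above; one way to obtain them is the third-party factor_analyzer package (a sketch, assuming the package is installed and data is the standardized frame from earlier):

# KMO measure and Bartlett's test of sphericity
from factor_analyzer.factor_analyzer import calculate_kmo, calculate_bartlett_sphericity
chi_square, p_value = calculate_bartlett_sphericity(data) # Bartlett: tests H0 that the correlation matrix is the identity
kmo_per_variable, kmo_total = calculate_kmo(data) # overall KMO above roughly 0.6 is usually considered acceptable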

Question 3

Table 3 below gives the head length (X1) and head width (X2) of one adult son, and the head length (Y1) and head width (Y2) of another adult son, from each of 25 families. Starting from the sample covariance matrix Σ and from the sample correlation coefficient matrix R respectively, carry out canonical correlation analysis: find the canonical variable pairs and the canonical correlation coefficients, and test whether the canonical variable pairs are significantly correlated (α = 0.05). What are the differences between the results in the two cases?

# Read the data
df = pd.read_excel('4.3.xlsx')
# Input variables
X = df[['X1', 'X2']]
Y = df[['Y1', 'Y2']]
# Modeling
from scipy import stats
from sklearn.cross_decomposition import CCA
def cca(X, Y, k):
    l = len(k)
    reu = []
    for i in range(l):
        model = CCA(n_components=k[i])
        # Fit on the data
        model.fit(X, Y)
        # Project both sets onto the canonical variables
        X_train_r, Y_train_r = model.transform(X, Y)
        reu.append(stats.pearsonr(X_train_r[:, k[i]-1], Y_train_r[:, k[i]-1])) # (correlation, p-value) of the k[i]-th pair
    return reu
analysis = pd.DataFrame()
r = cca(X, Y, [1, 2])
analysis['corr'] = [r[0][0], r[1][0]] # canonical correlation coefficients
analysis['p-value'] = [r[0][1], r[1][1]] # corresponding p-values
analysis
       corr   p-value
0  0.788508  0.053740
1  0.000003  0.798626

For both the covariance-matrix (non-standardized) and correlation-matrix (standardized) starting points, the first pair of canonical variables has a canonical correlation coefficient of 0.789 with a P value of 0.054, which is marginally non-significant at α = 0.05. The second pair has a canonical correlation coefficient of essentially zero (3e-06) with a P value of 0.799, so it clearly provides no information about the relation between X and Y.
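To see why the two starting points agree, note that the canonical correlations are the square roots of the eigenvalues of $S_{11}^{-1}S_{12}S_{22}^{-1}S_{21}$ and are invariant to the per-variable rescaling that turns Σ into R; only the coefficient vectors of the canonical variables change. A sketch with a hypothetical helper (canonical_correlations is not a library function):

# Canonical correlations straight from a partitioned covariance/correlation matrix
def canonical_correlations(S, p): # p = number of variables in the first set
    S11, S12 = S[:p, :p], S[:p, p:]
    S21, S22 = S[p:, :p], S[p:, p:]
    M = np.linalg.inv(S11) @ S12 @ np.linalg.inv(S22) @ S21
    return np.sqrt(np.sort(np.linalg.eigvals(M).real)[::-1])
Sigma = np.cov(df[['X1', 'X2', 'Y1', 'Y2']].values.T)     # start from the covariance matrix
Rmat = np.corrcoef(df[['X1', 'X2', 'Y1', 'Y2']].values.T) # start from the correlation matrix
print(canonical_correlations(Sigma, 2)) # the two printed vectors coincide
print(canonical_correlations(Rmat, 2))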

Topics: Python Data Analysis