Principal Component Analysis and Factor Analysis

Posted by Irresistable on Tue, 02 Nov 2021 19:35:48 +0100

Question 1

# Import packages
import numpy as np
import pandas as pd
# The given correlation matrix
R1 = np.array([[1, 0.577, 0.509, 0.387, 0.462], [0.577, 1, 0.599, 0.389, 0.322], [0.509, 0.599, 1, 0.436, 0.426], [0.387, 0.389, 0.436, 1, 0.523], [0.462, 0.322, 0.426, 0.523, 1]])
# Compute the eigenvalues and eigenvectors of the matrix
eigValues1, eigVects1 = np.linalg.eig(R1)
eigValues1, eigVects1
(array([2.85671099, 0.80916372, 0.34294995, 0.53967524, 0.4515001 ]),
 array([[ 0.46360519,  0.24033901,  0.45126217,  0.61170545, -0.38663456],
        [ 0.45710783,  0.50930473, -0.67622331, -0.17818949, -0.20647436],
        [ 0.47017564,  0.26044828,  0.40000721, -0.33505645,  0.66244469],
        [ 0.42145876, -0.5256649 ,  0.17559859, -0.54076277, -0.47200602],
        [ 0.42122445, -0.58196989, -0.38502448,  0.43517554,  0.38243876]]))

The $i$-th principal component is $Y_i = \sum_{k=1}^{n} e_k X_k$. By mathematical derivation, the orthonormal eigenvectors corresponding to the eigenvalues ordered from largest to smallest are the combination coefficients of the corresponding principal components. The resulting principal components satisfy: $Y_1$ contains as much information as possible, each $Y_i$ ($2 \leq i \leq n$) contains as much of the information not already captured by $Y_k$ ($k \leq i-1$) as possible, and the information contained in each principal component equals its variance.
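Note that np.linalg.eig does not return the eigenvalues in any particular order, so before reading off the principal components it helps to sort the eigenpairs; a minimal sketch reusing eigValues1 and eigVects1 from above:

# Sort the eigenpairs so that lambda_1 >= lambda_2 >= ... >= lambda_5
idx = np.argsort(eigValues1)[::-1]    # indices of the eigenvalues, largest first
eigValues1_sorted = eigValues1[idx]   # 2.857, 0.809, 0.540, 0.452, 0.343
eigVects1_sorted = eigVects1[:, idx]  # column k-1 holds the coefficients e_k of Y_k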

Performing principal component analysis on the correlation matrix is equivalent to performing principal component analysis on the covariance matrix of the standardized data. The eigenvalues and eigenvectors are:

Eigenvalue            Eigenvector
$\lambda_1$ = 2.857   $e_1$ = (0.464, 0.457, 0.470, 0.421, 0.421)
$\lambda_2$ = 0.809   $e_2$ = (0.240, 0.509, 0.260, -0.526, -0.582)
$\lambda_3$ = 0.540   $e_3$ = (0.612, -0.178, -0.335, -0.541, 0.435)
$\lambda_4$ = 0.452   $e_4$ = (-0.387, -0.206, 0.662, -0.472, 0.382)
$\lambda_5$ = 0.343   $e_5$ = (0.451, -0.676, 0.400, 0.176, -0.385)

Therefore the five principal components of $X^{*}$ are $Y_k = e_k X^{*}$.
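The raw data behind R1 are not given, so purely as an illustration, the component scores for one hypothetical standardized observation x_star would be computed with the sorted eigenvectors from the sketch above:

# Component scores for a single hypothetical standardized observation
x_star = np.array([0.5, -1.2, 0.3, 0.8, -0.4])  # hypothetical values of X*
Y = eigVects1_sorted.T @ x_star  # Y[k-1] = e_k' x_star, the k-th principal component score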

# Calculate contribution ratio
# Note: np.linalg.eig does not sort its eigenvalues, so the cumulative column
# below follows the returned order; sort descending first for a scree-style reading
def contributionRate(values):
    v = list(values)
    l = len(v)
    result = np.array([v[i]*100 / sum(v) for i in range(l)])  # contribution ratio of each eigenvalue (%)
    add = np.empty([l, ])
    add[0] = result[0]
    for i in range(1, l):
        add[i] = add[i-1] + result[i]  # running total of the ratios
    return [result, add]
contribution = pd.DataFrame() # Data frame holding the eigenvalues and their contribution ratios
contribution['eigValues'] = eigValues1 # Eigenvalues
contribution['Contribution ratio'] = contributionRate(eigValues1)[0]
contribution['Cumulative contribution ratio'] = contributionRate(eigValues1)[1]
contribution
   eigValues  Contribution ratio  Cumulative contribution ratio
0   2.856711           57.134220                      57.134220
1   0.809164           16.183274                      73.317494
2   0.342950            6.858999                      80.176493
3   0.539675           10.793505                      90.969998
4   0.451500            9.030002                     100.000000

Because the contribution ratio of the $k$-th principal component is $\frac{\lambda_k}{\sum_{i=1}^{5} \lambda_i}$, the contribution ratios of the principal components are 57.1%, 16.2%, 6.9%, 10.8% and 9.0% (in the order returned above). The cumulative contribution of the first two principal components is 73.3%.
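Because contributionRate accumulates the eigenvalues in the order np.linalg.eig returns them, the cumulative column above mixes large and small eigenvalues; sorting first gives the usual monotone reading (a quick check):

vals = np.sort(eigValues1)[::-1]            # eigenvalues in decreasing order
print((vals * 100 / vals.sum()).cumsum())   # 57.13, 73.32, 84.11, 93.14, 100.00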

The first principal component is a weighted sum of all the indicators: when it is large, every rebound rate tends to be large, so it can be called the "total rebound rate" factor. The second principal component contrasts the first three indicators (positive coefficients) with the last two (negative coefficients): when it is large, the first three rebound rates are large while the last two are small, so it can be called the "oil stock rebound rate" factor.

Question 2

Field information:

index   Meaning
X1      Per capita staple food expenditure
X2      Per capita non-staple food expenditure
X3      Per capita expenditure on tobacco, alcohol and tea
X4      Per capita expenditure on other non-staple food
X5      Per capita clothing expenditure
X6      Per capita commodity expenditure
X7      Per capita fuel expenditure
X8      Per capita non-commodity expenditure

Question 1: Find the correlation coefficient matrix R

# Data preprocessing
data = pd.read_excel('4.2.xlsx',index_col=0)
# Standardize to zero mean and unit variance
data = (data - data.mean()) / data.std()
# Correlation matrix
R2 = data.corr()
R2
          X1        X2        X3        X4        X5        X6        X7        X8
X1  1.000000  0.333647 -0.054539 -0.061254 -0.289361  0.198796  0.348698  0.318677
X2  0.333647  1.000000 -0.022902  0.398931 -0.156304  0.711134  0.413595  0.834952
X3 -0.054539 -0.022902  1.000000  0.533329  0.496763  0.032830 -0.139086 -0.258358
X4 -0.061254  0.398931  0.533329  1.000000  0.698424  0.467917 -0.171274  0.312757
X5 -0.289361 -0.156304  0.496763  0.698424  1.000000  0.280129 -0.208277 -0.081234
X6  0.198796  0.711134  0.032830  0.467917  0.280129  1.000000  0.416821  0.701586
X7  0.348698  0.413595 -0.139086 -0.171274 -0.208277  0.416821  1.000000  0.398868
X8  0.318677  0.834952 -0.258358  0.312757 -0.081234  0.701586  0.398868  1.000000

The table above shows fairly strong correlations among the variables (e.g. 0.83 between X2 and X8), so principal component analysis and factor analysis are appropriate for dimension reduction.

Question 2: Principal component analysis

# Compute eigenvalues and eigenvectors
eigValues2, eigVects2 = np.linalg.eig(R2)
eig = pd.DataFrame() # Data frame listing the variable names alongside the eigenvalues
eig['names'] = data.columns # Column names
eig['eigValues'] = eigValues2 # Eigenvalues
eig
  names  eigValues
0    X1   3.096288
1    X2   2.367223
2    X3   0.919987
3    X4   0.705925
4    X5   0.498442
5    X6   0.051470
6    X7   0.130776
7    X8   0.229888
# Determine the number of common factors
from math import sqrt  # used below when building the loading matrix
for k in range(1, 9):
    # Stop once the first k eigenvalues explain at least 80% of the total variance
    # (np.linalg.eig does not sort; here the leading eigenvalues happen to be the largest)
    if eig['eigValues'][:k].sum() / eig['eigValues'].sum() >= 0.8:
        print(k)
        break
4
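An alternative rule of thumb is the Kaiser criterion, which keeps only eigenvalues greater than 1 and would suggest just two factors here; a one-liner for comparison:

print((eigValues2 > 1).sum())  # Kaiser criterion: count of eigenvalues above 1 (2 here)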

The contribution ratio of the $k$-th principal component is $\frac{\lambda_k}{\sum_{i=1}^{8} \lambda_i}$:

# Calculate contribution ratios
contribution = pd.DataFrame() # Data frame with variable names and eigenvalues
contribution['names'] = data.columns # Column names
contribution['eigValues'] = eigValues2 # Eigenvalues
contribution['Contribution ratio'] = contributionRate(eigValues2)[0]
contribution['Cumulative contribution ratio'] = contributionRate(eigValues2)[1]
contribution
  names  eigValues  Contribution ratio  Cumulative contribution ratio
0    X1   3.096288           38.703604                      38.703604
1    X2   2.367223           29.590288                      68.293892
2    X3   0.919987           11.499842                      79.793734
3    X4   0.705925            8.824067                      88.617801
4    X5   0.498442            6.230529                      94.848330
5    X6   0.051470            0.643369                      95.491699
6    X7   0.130776            1.634697                      97.126396
7    X8   0.229888            2.873604                     100.000000

Question 3: Factor analysis

# Construct the initial factor loading matrix: column j is sqrt(lambda_j) * e_j
col0 = list(sqrt(eigValues2[0]) * eigVects2[:,0]) # Column 1 of the factor loading matrix
col1 = list(sqrt(eigValues2[1]) * eigVects2[:,1]) # Column 2
col2 = list(sqrt(eigValues2[2]) * eigVects2[:,2]) # Column 3
col3 = list(sqrt(eigValues2[3]) * eigVects2[:,3]) # Column 4
A = pd.DataFrame([col0, col1, col2, col3]).T # Assemble the factor loading matrix A
A.columns=['X1','X2','X3','X4'] # Names of the four common-factor columns
A
         X1        X2        X3        X4
0  0.439215  0.371164 -0.665578  0.316560
1  0.913659  0.057862  0.068351  0.188935
2 -0.032518 -0.731499 -0.554221 -0.027204
3  0.447107 -0.827879  0.020887  0.194140
4  0.038175 -0.885373  0.046123 -0.239764
5  0.866903 -0.207209  0.139412 -0.188390
6  0.558060  0.401080 -0.274694 -0.645366
7  0.896234  0.133982  0.260200  0.148706
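The same loading matrix can be built in one vectorized step; a sketch equivalent to the column-by-column code above (it relies on the first four eigenvalues happening to be the largest here):

# a_ij = sqrt(lambda_j) * e_ij for the first four eigenpairs, via broadcasting
A_vec = pd.DataFrame(eigVects2[:, :4] * np.sqrt(eigValues2[:4]), columns=['X1', 'X2', 'X3', 'X4'])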
# Establish the factor model
h = np.zeros(8) # Communalities: share of each variable's variance explained by the common factors; the closer to 1, the better the factor analysis works
D = np.mat(np.eye(8)) # Specific-factor (uniqueness) variances
A = np.mat(A) # Convert the loading matrix A to a matrix type
for i in range(8):
    a = A[i,:] * A[i,:].T # Sum of squared loadings in row i of A
    h[i] = a[0,0]  # Communality of X_i: the part of Var(X_i) explained by all common factors
    D[i,i] = 1 - a[0,0] # Standardized variables have variance 1, so Var(X_i) = h_i + specific variance
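As a sanity check on the factor model, the reproduced correlation matrix $\hat{R} = AA^{T} + D$ should be close to R2; a minimal sketch using the A and D just computed:

# Reproduced correlation matrix under the 4-factor model
R_hat = A * A.T + D # np.mat arithmetic: (8x4)(4x8) plus the diagonal specific variances
print(np.abs(np.asarray(R_hat) - R2.values).max()) # largest absolute residual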

Varimax rotation: express the factors as more interpretable linear combinations of the variables

def varimax(Phi, gamma = 1.0, q = 10, tol = 1e-6): # Maximum variance (varimax) rotation
    p, k = Phi.shape # Number of rows and columns of the loading matrix Phi
    R = np.eye(k) # Start from the k*k identity rotation matrix
    d = 0
    for i in range(q):
        d_old = d
        Lambda = np.dot(Phi, R) # Currently rotated loadings
        u, s, vh = np.linalg.svd(np.dot(Phi.T, np.asarray(Lambda) ** 3 - (gamma / p) * np.dot(Lambda, np.diag(np.diag(np.dot(Lambda.T,Lambda)))))) # SVD of the varimax criterion gradient
        R = np.dot(u,vh) # Update the orthogonal rotation matrix R
        d = np.sum(s) # Sum of singular values
        if d_old != 0 and d / d_old < 1 + tol: # Converged within tolerance: stop early
            break
    return np.dot(Phi, R) # Rotated loading matrix Phi*R

rotation_mat = varimax(A) # Calling the maximum variance rotation function
rotation_mat = pd.DataFrame(rotation_mat) # Data Frame
rotation_mat
          0         1         2         3
0  0.203150  0.062124 -0.892652 -0.178661
1  0.889045 -0.029496 -0.262154 -0.135988
2 -0.214454 -0.871857 -0.194565  0.008240
3  0.481798 -0.792082  0.077324  0.240652
4  0.033408 -0.819782  0.413444  0.029366
5  0.804396 -0.274344  0.064507 -0.350578
6  0.255416  0.127864 -0.165170 -0.924869
7  0.933336  0.109739 -0.110305 -0.125212
# Calculate factor scores
data_ = np.mat(data) # Convert the standardized data to a matrix
factor_score = (data_).dot(A) # Simple weighted-sum scores using the (unrotated) loading matrix A
factor_score = pd.DataFrame(factor_score) # Back to a data frame
factor_score.columns = ['X1','X2','X3','X4'] # Factor score columns (named to match A)
factor_score.index = data.index
factor_score['sum'] = factor_score.sum(axis=1) # Total score per region
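The scores above simply weight the standardized data by the loadings; a common alternative is the regression (Thomson) method with weights $R^{-1}A$. A sketch under the same unrotated A:

# Regression (Thomson) factor scores: F = X R^{-1} A
weights = np.linalg.solve(R2.values, np.asarray(A)) # solve R W = A for W = R^{-1} A
scores_reg = pd.DataFrame(data.values @ weights, index=data.index)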

Sorting results

factor_score = factor_score.sort_values(by = ['sum'], ascending=False)
factor_score
                       X1        X2        X3        X4        sum
region
Guangdong       12.134223  3.505384 -0.780327 -1.320523  13.538757
Hainan           2.465522  4.890210 -1.567251  1.625074   7.413554
Guangxi          1.859072  1.900515  1.539035 -0.063227   5.235395
Shanghai         5.716003 -3.940215  1.603553  1.666736   5.046078
Fujian           2.030434  2.114319 -0.799141  0.606531   3.952143
Hunan           -0.911514  0.849711  1.107383 -0.187768   0.857812
Jiangxi         -3.450478  3.243735  0.989421 -0.075377   0.707301
Sichuan         -0.237351 -0.528700  1.121904  0.133922   0.489775
Tianjin          0.766193 -0.729420 -0.647183  1.044525   0.434115
Beijing          3.153503 -4.443551  1.185859  0.457701   0.353514
Yunnan          -1.173142  0.454616 -0.215400  0.962859   0.028933
Xinjiang        -1.440261  0.648165  1.237420 -0.432477   0.012847
Jiangsu          0.269736 -0.174160  0.491318 -0.609942  -0.023048
Hubei           -0.299919  0.907169 -0.123593 -0.584738  -0.101081
Shaanxi         -1.078188  0.434750  0.037484  0.254255  -0.351699
Zhejiang         2.665959 -2.113285 -0.145554 -0.905556  -0.498436
Guizhou         -2.215974  0.658934  0.294098  0.350311  -0.912631
Anhui           -1.961910  0.677642 -0.210394  0.521765  -0.972897
HEBEI           -0.690123  0.454840 -0.414944 -0.635875  -1.286102
Jilin           -2.276431  1.323683  0.105062 -0.566823  -1.414510
Liaoning         0.079538 -1.508676  0.178047 -0.325763  -1.576853
Shandong        -0.248310 -1.049270 -0.064522 -0.451012  -1.813114
Ningxia         -0.757336 -0.993462  0.136768 -0.373849  -1.987879
Gansu           -2.080281  0.296880 -0.726723  0.027185  -2.482939
Inner Mongolia  -2.213952  0.203457 -0.289416 -0.190353  -2.490265
Henan           -2.614718  0.540896 -0.702427  0.131479  -2.644770
Shanxi          -2.964064 -0.257612  0.155919  0.194874  -2.870882
Heilongjiang    -2.332689  0.157537  0.222618 -1.009044  -2.961577
Qinghai         -1.959088  0.027927 -1.477425  0.137006  -3.271581
Tibet           -0.234453 -7.552021 -2.241588 -0.381896 -10.409958

The first factor is a weighted sum of all the indicators except per capita non-staple food expenditure. When it is large, all kinds of daily living expenditure can be inferred to be large, so it can be called the "daily living expenditure" factor. When the second factor is large, per capita expenditure on staple food, non-staple food, fuel and non-commodities is smaller, while per capita expenditure on tobacco, alcohol and tea, other non-staple food, clothing and commodities is larger, so it can be called the "unnecessary expenditure" factor.

Factor analysis is a multivariate statistical method that condenses many original variables into a few factors with minimal loss of information, in such a way that the factors admit a meaningful interpretation.
Its matrix form is $X = AF + \varepsilon$.

Before factor analysis, the KMO test and Bartlett's test of sphericity should be carried out on the data. Here the Bartlett P value = 0.000 is highly significant, so the data meet the preconditions of factor analysis.
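Those tests are not computed anywhere in the code above; one way to obtain them is the third-party factor_analyzer package (a sketch, assuming the package is installed and data is the standardized frame from earlier):

# KMO measure and Bartlett's test of sphericity
from factor_analyzer.factor_analyzer import calculate_kmo, calculate_bartlett_sphericity
chi_square, p_value = calculate_bartlett_sphericity(data) # Bartlett: tests H0 that the correlation matrix is the identity
kmo_per_variable, kmo_total = calculate_kmo(data) # overall KMO above roughly 0.6 is usually considered acceptable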

Question 3

Table 3 below gives the head length (X1) and head width (X2) of one adult son, and the head length (Y1) and head width (Y2) of another adult son, from each of 25 families. Starting from the sample covariance matrix Σ and from the sample correlation coefficient matrix R respectively, carry out canonical correlation analysis: find the canonical variable pairs and the canonical correlation coefficients, and test whether the canonical variable pairs are significantly correlated (α = 0.05). What are the differences between the results in the two cases?

# Read the data
df = pd.read_excel('4.3.xlsx')
# Input variables
X = df[['X1', 'X2']]
Y = df[['Y1', 'Y2']]
# Modeling
from scipy import stats
from sklearn.cross_decomposition import CCA
def cca(X, Y, k):
    l = len(k)
    reu = []
    for i in range(l):
        model = CCA(n_components=k[i])
        # Fit on the data
        model.fit(X, Y)
        # Project both sets onto the canonical variables
        X_train_r, Y_train_r = model.transform(X, Y)
        reu.append(stats.pearsonr(X_train_r[:, k[i]-1], Y_train_r[:, k[i]-1])) # (correlation, p-value) of the k[i]-th pair
    return reu
analysis = pd.DataFrame()
r = cca(X, Y, [1, 2])
analysis['corr'] = [r[0][0], r[1][0]] # canonical correlation coefficients
analysis['p-value'] = [r[0][1], r[1][1]] # corresponding p-values
analysis
       corr   p-value
0  0.788508  0.053740
1  0.000003  0.798626

For both the covariance-matrix (non-standardized) and correlation-matrix (standardized) starting points, the first pair of canonical variables has a canonical correlation coefficient of 0.789 with a P value of 0.054, which is marginally non-significant at α = 0.05. The second pair has a canonical correlation coefficient of essentially zero (3e-06) with a P value of 0.799, so it clearly provides no information about the relation between X and Y.
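To see why the two starting points agree, note that the canonical correlations are the square roots of the eigenvalues of $S_{11}^{-1}S_{12}S_{22}^{-1}S_{21}$ and are invariant to the per-variable rescaling that turns Σ into R; only the coefficient vectors of the canonical variables change. A sketch with a hypothetical helper (canonical_correlations is not a library function):

# Canonical correlations straight from a partitioned covariance/correlation matrix
def canonical_correlations(S, p): # p = number of variables in the first set
    S11, S12 = S[:p, :p], S[:p, p:]
    S21, S22 = S[p:, :p], S[p:, p:]
    M = np.linalg.inv(S11) @ S12 @ np.linalg.inv(S22) @ S21
    return np.sqrt(np.sort(np.linalg.eigvals(M).real)[::-1])
Sigma = np.cov(df[['X1', 'X2', 'Y1', 'Y2']].values.T)     # start from the covariance matrix
Rmat = np.corrcoef(df[['X1', 'X2', 'Y1', 'Y2']].values.T) # start from the correlation matrix
print(canonical_correlations(Sigma, 2)) # the two printed vectors coincide
print(canonical_correlations(Rmat, 2))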

Topics: Python Data Analysis