# Catalogue of series articles

Near infrared spectroscopy is a cross cutting field, which requires cooperation in chemistry, computer science, bioscience and other fields. To this end, Under the guidance of (teacher Yang Huihua team of Beijing University of Posts and Telecommunications), we will prepare to open source the traditional classical algorithms such as PLS, SVM, ANN and RF, preprocessing such as SG, MSC, first-order derivative and second-order derivative, wavelength selection algorithms such as GA and the latest deep learning algorithms such as CNN and AE, so as to help other majors easier to establish near-infrared spectrum models with good prediction ability and robustness.

# preface

NIRS is an electromagnetic wave between visible light and mid infrared light, The wavelength range is (1100 ∼ 2526 nm The frequency combination of (OH, NH, CH, SH) vibration is consistent with the absorption region of all levels of frequency doubling. The characteristic information of hydrogen containing groups of organic molecules in the sample can be obtained by scanning the near-infrared spectrum of the sample, which is often used as an effective carrier to obtain the sample information. The detection method based on NIRS has the advantages of convenience, high efficiency, accuracy, low cost, on-site detection and no damage to the sample It is widely used in various detection fields. However, near infrared spectroscopy has some problems, such as spectral bandwidth, serious overlap, weak absorption signal and complex information analysis. Different from common chemical analysis methods, it can only be used as an indirect measurement method, and can not directly analyze the content or category of the measured samples. It depends on chemical metrology methods, A correlation model (or Calibration Model) is established between the attribute value to be measured and the near-infrared spectrum data of the sample, and then the predicted value of each property component is obtained by predicting the near-infrared spectrum of the unknown sample through the model. The existing near-infrared modeling methods mainly include classical modeling (preprocessing + wavelength screening for feature dimensionality reduction, and then modeling through pls and svm algorithms) and deep learning methods (end-to-end modeling, which has little dependence on preprocessing and wavelength selection)This article mainly describes the common wavelength selection algorithm (it is currently in the matlab version, and the python version has time to rewrite it)

# 1, SPA algorithm

Continuous projection algorithm (successful projections algorithm, SPA) is a forward feature variable selection method. Spa uses vector projection analysis to project the wavelength onto other wavelengths, compare the size of the projection vector, take the wavelength with the largest projection vector as the wavelength to be selected, and then select the final feature wavelength based on the correction model. Spa selects the variable with the least redundant information and the least collinearity Quantity combination.

function chain = projections_qr(X,k,M) % Projections routine for the Successive Projections Algorithm using the % built-in QR function of Matlab % % chain = projections(X,k,M) % % X --> Matrix of predictor variables (# objects N x # variables K) % k --> Index of the initial column for the projection operations % M --> Number of variables to include in the chain % % chain --> Index set of the variables resulting from the projection operations X_projected = X; norms = sum(X_projected.^2); % Square norm of each column vector norm_max = max(norms); % Norm of the "largest" column vector X_projected(:,k) = X_projected(:,k)*2*norm_max/norms(k); % Scales the kth column so that it becomes the "largest" column [dummy1,dummy2,order] = qr(X_projected,0); chain = order(1:M)';

# 2, UVE algorithm

The uninformative variable elimination (UVE) algorithm can remove the wavelength variables with low common efficiency for modeling, select the characteristic wavelength variables, and the removed wavelength variables are called uninformative variables. The establishment of uninformative variable removal algorithm is based on partial least squares (PLS) algorithm. Removing uninformative variables reduces the number of variables used in modeling and reduces the complexity of the model. In order to select uninformative variables, UVE algorithm adds a group of white noise variables with the same number of original variables to PLS model, and then obtains the regression coefficient corresponding to each variable, including noise variables, based on the cross leave one method of PLS model. Divide the stable value of each variable coefficient by the standard deviation, compare their quotient with the stable value obtained from the random variable matrix, and delete those wavelength variables that are as invalid for modeling as random variables.

function [B,C,P,T,U,R,R2X,R2Y]=plssim(X,Y,A,S,XtX); [n,px] = size(X); [n,m] = size(Y); % size of the input data matrices if nargin<5, S = []; end, if isempty(S), S=(Y'*X)'; end % if XtX not inputted, S=[]; always when S=[] then S=(Y'*X)' if nargin<4, XtX=[]; end % if S is not inputted, XtX=[]; if isempty(XtX) & n>3*px, XtX = X'*X; end % when XtX=[] and X is very "tall", the booster XtX is calculated if nargin<3, A=10; end, A = min([A px n-1]); % if A is not inputted, then the defaul A is min[10 px n-1] T = zeros(n ,A); U = T; % initialization of variables R = zeros(px,A); P = R; V = R; C = zeros(m ,A); R2Y = zeros(1,A); z = zeros(m,1); v = zeros(px,1); if n>px, S0 = S; end, StS = S'*S; % SIMPLS algorithm nm1 = n-1; tol = 0; for a = 1:A StS = StS-z*z'; [Q,LAMBDA] = eig(StS); [lambda,j] = max(diag(LAMBDA)); q = Q(:,j(1)); r = S*q; t = X*r; if isempty(XtX), p = (t'*X)'; else p = XtX*r; end if n>px, d = sqrt(r'*p/nm1); else d = sqrt(t'*t/nm1); end if d<tol, disp(' ') disp('WARNING: the required number of factors (A) is too high !') disp('Less PLS factors were extracted from the data in the PLSSIM program !') disp(' ') break, , else tol=max(tol,d/1e5); end v = p-V(:,1:max(1,a-1))*(p'*V(:,1:max(1,a-1)))'; v = v/sqrt(v'*v); z = (v'*S)'; S = S-v*z'; % save results V(:,a) = v; R(:,a) = r/d; % X weights P(:,a) = p/(d*nm1); % X loadings T(:,a) = t/d; % X scores U(:,a) = Y*q; % Y scores C(:,a) = q*(lambda(1)/(nm1*d)); % Y loadings R2Y(1,a) = lambda(1)/d; % Y-variance accounted for end clear StS V LAMBDA Q p q r t v z; if d<tol, A=a-1; a=A; T=T(:,1:A); U=U(:,1:A); R=R(:,1:A); P=P(:,1:A); C=C(:,1:A); end while a>1 U(:,a) = U(:,a)-T(:,1:a-1)*(U(:,a)'*T(:,1:a-1)/nm1)'; a=a-1; end B = R*C'; % B-coefficients of the regression Y on X if isempty(XtX), sumX2=sum(X.^2); else sumX2 = sum(diag(XtX)); end R2X = 100*nm1/sum(sumX2)*cumsum(sum(P.^2)); R2Y = 100/nm1/sum(sum(Y.^2))*cumsum(R2Y(1:A).^2);

# 3, LAR algorithm

LAR(Least Angel Regression), a variable selection method proposed by Efron in 2004, is similar to the form of Forward Stepwise regression. It is an efficient solution of lasso region. The difference of Forward Stepwise regression is that each time Forward Stepwise completely fits the linear model and calculates RSS according to the selected subset of variables, Redesign statistics (such as AIC) punish the higher model complexity, and LAR is to find the variable with the highest correlation with the dependent variable each time, and then adjust the coefficient of the predictor a little bit along the direction of LSE. In this process, the correlation coefficient between the variable and the residual will gradually decrease. When the correlation is not so significant, it is necessary to select a new variable with the highest correlation And then change along the direction of LSE again. In the end, all variables are selected, which is the same as LSE.

function [b info] = lar(X, y, stop, storepath, verbose) %% Input checking % Set default values. if nargin < 5 verbose = false; end if nargin < 4 storepath = true; end if nargin < 3 stop = 0; end if nargin < 2 error('SpaSM:lar', 'Input arguments X and y must be specified.'); end %% LARS variable setup [n p] = size(X); maxVariables = min(n-1,p); % Maximum number of active variables useGram = false; % if n is approximately a factor 10 bigger than p it is faster to use a % precomputed Gram matrix rather than Cholesky factorization when solving % the partial OLS Partial least square method problem. Make sure the resulting Gram matrix is not % prohibitively large. if (n/p) > 10 && p < 1000 useGram = true; Gram = X'*X; end % set up the LAR coefficient vector if storepath b = zeros(p, p+1); else b = zeros(p, 1); b_prev = b; end mu = zeros(n, 1); % current "position" as LARS travels towards lsq solution I = 1:p; % inactive set A = []; % active set if ~useGram R = []; % Cholesky factorization R'R = X'X where R is upper triangular end stopCond = 0; % Early stopping condition boolean step = 1; % step count if verbose fprintf('Step\tAdded\tActive set size\n'); end %% LARS main loop % while not at OLS solution or early stopping criterion is met while length(A) < maxVariables && ~stopCond r = y - mu; % find max correlation c = X(:,I)'*r; % X Correlation coefficient between each dimension and the current residual [cmax cidx] = max(abs(c)); % add variable if ~useGram R = cholinsert(R,X(:,I(cidx)),X(:,A)); end if verbose fprintf('%d\t\t%d\t\t%d\n', step, I(cidx), length(A) + 1); end A = [A I(cidx)]; I(cidx) = []; c(cidx) = []; % After deleting the original item, the correlation coefficients of other items (i.e. deleted) cidx This one-dimensional correlation coefficient (because the second cidx Vega's in active set))，Fill in the following items. The same is true for the previous line % partial OLS solution and direction from current position to the OLS % solution of X_A if useGram b_OLS = Gram(A,A)\(X(:,A)'*y); % same as X(:,A)\y, but faster else % b_OLS = R\(R'\(X(:,A)'*y)); % same as X(:,A)\y, but faster b_OLS = X(:,A)\y; %\yes matlab Inside the left division. To ask (take your question as an example) X*a=y The (least squares) solution of this system of linear equations is sometimes warned during operation: 'The matrix is singular working accuracy'，So replace the above line with this form, although the operation will be slower. Refer to: http://www.ilovematlab.cn/thread-301697-1-1.html and http://zhidao.baidu.com/link?url=GzrsyLqNSkvuryZvr4UAMpxi4PIxjpUtJ9HmJ2nL8jVadDjz5CRMbpoxfmk-mRJZ6bsvqgsPHbV7pslq248SpURQkYUmSNhYBh2VoPYSUQe end d = X(:,A)*b_OLS - mu; if isempty(I) % if all variables active, go all the way to the OLS solution gamma = 1; else % compute length of walk along equiangular direction cd = (X(:,I)'*d); gamma = [ (c - cmax)./(cd - cmax); (c + cmax)./(cd + cmax) ]; gamma = min(gamma(gamma > 0)); % take gamma Minimum value of positive part end % update beta if storepath b(A,step + 1) = b(A,step) + gamma*(b_OLS - b(A,step)); % update beta else b_prev = b; b(A) = b(A) + gamma*(b_OLS - b(A)); % update beta end % update position mu = mu + gamma*d; % increment step counter step = step + 1; % Early stopping at specified bound on L1 norm of beta if stop > 0 if storepath t2 = sum(abs(b(:,step))); if t2 >= stop t1 = sum(abs(b(:,step - 1))); s = (stop - t1)/(t2 - t1); % interpolation factor 0 < s < 1 b(:,step) = b(:,step - 1) + s*(b(:,step) - b(:,step - 1)); stopCond = 1; end else t2 = sum(abs(b)); if t2 >= stop t1 = sum(abs(b_prev)); s = (stop - t1)/(t2 - t1); % interpolation factor 0 < s < 1 b = b_prev + s*(b - b_prev); stopCond = 1; end end end % Early stopping at specified number of variables if stop < 0 stopCond = length(A) >= -stop; end end % trim beta if storepath && size(b,2) > step b(:,step + 1:end) = []; end %% Compute auxilliary measures if nargout == 2 % only compute if asked for info.steps = step - 1; b0 = pinv(X)*y; % regression coefficients of low-bias model penalty0 = sum(abs(b0)); % L1 constraint size of low-bias model Low bias model indices = (1:p)'; if storepath % for entire path q = info.steps + 1; info.df = zeros(1,q); info.Cp = zeros(1,q); info.AIC = zeros(1,q); info.BIC = zeros(1,q); info.s = zeros(1,q); sigma2e = sum((y - X*b0).^2)/n; for step = 1:q A = indices(b(:,step) ~= 0); % active set The active set, the valid set, is the index of those numbers whose columns are not 0 % compute godness of fit measurements Cp, AIC and BIC r = y - X(:,A)*b(A,step); % residuals residual rss = sum(r.^2); % residual sum-of-squares Sum of squares of residuals info.df(step) = step - 1; info.Cp(step) = rss/sigma2e - n + 2*info.df(step); info.AIC(step) = rss + 2*sigma2e*info.df(step); info.BIC(step) = rss + log(n)*sigma2e*info.df(step); info.s(step) = sum(abs(b(A,step)))/penalty0; end else % for single solution info.s = sum(abs(b))/penalty0; info.df = info.steps; end

# 4, CARS algorithm

Competitive adaptive reweighted sampling (CARS) is a characteristic variable selection method combining Monte Carlo sampling and PLS model regression coefficient, imitating the principle of "survival of the fittest" in Darwin theory (Li et al., 2009). In CARS algorithm, adaptive weighted sampling is used every time (adaptive reweighted sampling, ARS) keep the points with large weight of the absolute value of the regression coefficient in the PLS model as a new subset, remove the points with small weight, and then establish the PLS model based on the new subset. After multiple calculations, select the wavelength in the subset with the smallest root mean square error (RMSECV) of the PLS model interactive verification as the characteristic wavelength

function F=carspls(X,y,A,fold,method,num) tic; %+++ Initial settings. if nargin<6;num=50;end; if nargin<5;method='center';end; if nargin<4;fold=5;end; if nargin<3;A=2;end; %+++ Initial settings. [Mx,Nx]=size(X); A=min([Mx Nx A]); index=1:Nx; ratio=0.9; r0=1; r1=2/Nx; Vsel=1:Nx; Q=floor(Mx*ratio); W=zeros(Nx,num); Ratio=zeros(1,num); %+++ Parameter of exponentially decreasing function. b=log(r0/r1)/(num-1); a=r0*exp(b); %+++ Main Loop for iter=1:num perm=randperm(Mx); Xcal=X(perm(1:Q),:); ycal=y(perm(1:Q)); %+++ Monte-Carlo Sampling. PLS=pls(Xcal(:,Vsel),ycal,A,method); %+++ PLS model w=zeros(Nx,1);coef=PLS.coef_origin(1:end-1,end); w(Vsel)=coef;W(:,iter)=w; w=abs(w); %+++ weights [ws,indexw]=sort(-w); %+++ sort weights ratio=a*exp(-b*(iter+1)); %+++ Ratio of retained variables. Ratio(iter)=ratio; K=round(Nx*ratio); w(indexw(K+1:end))=0; %+++ Eliminate some variables with small coefficients. Vsel=randsample(Nx,Nx,true,w); %+++ Reweighted Sampling from the pool of retained variables. Vsel=unique(Vsel); fprintf('The %dth variable sampling finished.\n',iter); %+++ Screen output. end %+++ Cross-Validation to choose an optimal subset; RMSEP=zeros(1,num); Q2_max=zeros(1,num); Rpc=zeros(1,num); for i=1:num vsel=find(W(:,i)~=0); CV=plscvfold(X(:,vsel),y,A,fold,method,0); RMSEP(i)=CV.RMSECV; Q2_max(i)=CV.Q2_max; Rpc(i)=CV.optPC; fprintf('The %d/%dth subset finished.\n',i,num); end [Rmin,indexOPT]=min(RMSEP); Q2_max=max(Q2_max); %+++ save results; time=toc; %+++ output F.W=W; F.time=time; F.cv=RMSEP; F.Q2_max=Q2_max; F.minRMSECV=Rmin; F.iterOPT=indexOPT; F.optPC=Rpc(indexOPT); Ft.ratio=Ratio; F.vsel=find(W(:,indexOPT)~=0)'; function sel=weightsampling_in(w) %Bootstrap sampling %2007.9.6,H.D. Li. w=w/sum(w); N1=length(w); min_sec(1)=0; max_sec(1)=w(1); for j=2:N1 max_sec(j)=sum(w(1:j)); min_sec(j)=sum(w(1:j-1)); end % figure;plot(max_sec,'r');hold on;plot(min_sec); for i=1:N1 bb=rand(1); ii=1; while (min_sec(ii)>=bb | bb>max_sec(ii)) & ii<N1; ii=ii+1; end sel(i)=ii; end % w is related to the bootstrap chance %+++ subfunction: booststrap sampling % function sel=bootstrap_in(w); % V=find(w>0); % L=length(V); % interval=linspace(0,1,L+1); % for i=1:L; % rn=rand(1); % k=find(interval<rn); % sel(i)=V(k(end)); % end

# summary

The complete code is available from GitHub warehouse If it is useful to you, please like it!

The code is for academic use only. If you have any questions, contact: wechat: Fu_siry