Spectral wavelength screening algorithm

Posted by leatherback on Mon, 03 Jan 2022 10:57:29 +0100

Catalogue of series articles

Near-infrared spectroscopy is a cross-disciplinary field that requires cooperation between chemistry, computer science, bioscience and other areas. To this end, under the guidance of Professor Yang Huihua's team at Beijing University of Posts and Telecommunications, we plan to open-source the traditional classical modeling algorithms (PLS, SVM, ANN, RF), preprocessing methods (SG smoothing, MSC, first- and second-order derivatives), wavelength selection algorithms (such as GA), and the latest deep learning algorithms (such as CNN and AE), so that researchers from other disciplines can more easily build near-infrared spectral models with good predictive ability and robustness.

Preface

NIRS is electromagnetic radiation between visible and mid-infrared light, with a wavelength range of roughly 1100–2526 nm. This region coincides with the absorption bands of the overtones and combination vibrations of hydrogen-containing groups (O–H, N–H, C–H, S–H), so scanning the near-infrared spectrum of a sample yields characteristic information about the hydrogen-containing groups of its organic molecules; the spectrum is therefore often used as an effective carrier of sample information. Detection based on NIRS is convenient, efficient, accurate, inexpensive, suitable for on-site use and non-destructive, and is widely applied in many detection fields. However, near-infrared spectra also have drawbacks: the bands are broad and severely overlapping, the absorption signal is weak, and the information is complex to analyse. Unlike common chemical analysis methods, NIRS can only serve as an indirect measurement: it cannot directly give the content or category of the measured sample and relies on chemometric methods. A correlation model (or calibration model) is first established between the property values to be measured and the near-infrared spectra of the samples, and the model is then applied to the spectrum of an unknown sample to predict each property component. Existing near-infrared modeling approaches fall mainly into classical modeling (preprocessing plus wavelength screening for feature dimensionality reduction, followed by modeling with algorithms such as PLS or SVM) and deep learning methods (end-to-end modeling, with little dependence on preprocessing and wavelength selection).

This article mainly describes the common wavelength selection algorithms (the code is currently in MATLAB; a Python version will be rewritten when time permits).

1, SPA algorithm

The successive projections algorithm (SPA) is a forward feature-variable selection method. SPA uses vector projection analysis: it projects each wavelength onto the other wavelengths, compares the sizes of the projection vectors, takes the wavelength with the largest projection vector as the candidate wavelength, and then selects the final characteristic wavelengths based on a calibration model. SPA selects the combination of variables containing the least redundant information and the least collinearity.

function chain = projections_qr(X,k,M)

% Projections routine for the Successive Projections Algorithm using the
% built-in QR function of Matlab
%
% chain = projections(X,k,M)
%
% X --> Matrix of predictor variables (# objects N x # variables K)
% k --> Index of the initial column for the projection operations
% M --> Number of variables to include in the chain
%
% chain --> Index set of the variables resulting from the projection operations

X_projected = X;

norms = sum(X_projected.^2);    % Square norm of each column vector
norm_max = max(norms); % Norm of the "largest" column vector

X_projected(:,k) = X_projected(:,k)*2*norm_max/norms(k); % Scales the kth column so that it becomes the "largest" column
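% The economy-size QR below ([Q,R,E] = qr(A,0)) uses column pivoting: at each
% step it selects the remaining column with the largest norm after projection
% onto the orthogonal complement of the columns already chosen, which is
% exactly the successive-projection step of SPA (started from column k).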

[dummy1,dummy2,order] = qr(X_projected,0); 
chain = order(1:M)';
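
A minimal usage sketch (the variable names and sizes below are assumptions for illustration, not part of the original code): X is an N x K matrix of spectra, the projection chain starts at column 1, and 10 wavelengths are kept.

% X is a hypothetical N x K spectral matrix already in the workspace
chain = projections_qr(X, 1, 10);        % indices of the 10 selected wavelength variables
selected_wavelengths = sort(chain);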

2, UVE algorithm

The uninformative variable elimination (UVE) algorithm removes wavelength variables that contribute little to the model and keeps the characteristic wavelength variables; the removed variables are called uninformative variables. The algorithm is built on the partial least squares (PLS) method: eliminating uninformative variables reduces the number of variables used in modeling and therefore the complexity of the model. To identify uninformative variables, UVE appends a group of white-noise variables, equal in number to the original variables, to the PLS model, and then obtains the regression coefficient of every variable (including the noise variables) by leave-one-out cross-validation of the PLS model. The stability of each variable is computed as the mean of its regression coefficient across the cross-validation runs divided by its standard deviation; these stabilities are compared with those obtained from the random-variable matrix, and the wavelength variables that are no more useful for modeling than the random variables are deleted.

function [B,C,P,T,U,R,R2X,R2Y]=plssim(X,Y,A,S,XtX);

[n,px] = size(X); [n,m] = size(Y);   				% size of the input data matrices
if nargin<5, XtX=[]; end					% if XtX is not supplied, XtX=[]
if nargin<4, S = []; end, if isempty(S), S=(Y'*X)'; end		% if S is not supplied (or empty), S=(Y'*X)'
if isempty(XtX) & n>3*px, XtX = X'*X; end			% when XtX=[] and X is very "tall", the booster XtX is calculated
if nargin<3, A=10; end, A = min([A px n-1]);			% if A is not supplied, the default A is min([10 px n-1])
T = zeros(n ,A); U = T;						% initialization of variables
R = zeros(px,A); P = R; V = R;
C = zeros(m ,A); 
R2Y = zeros(1,A);
z = zeros(m,1); v = zeros(px,1);
if n>px, S0 = S; end, StS = S'*S;				% SIMPLS algorithm
nm1 = n-1;
tol = 0;
for a = 1:A
  StS = StS-z*z'; 
  [Q,LAMBDA] = eig(StS); 
  [lambda,j] = max(diag(LAMBDA)); 
  q = Q(:,j(1));
  r = S*q;
  t = X*r;
  if isempty(XtX), p = (t'*X)'; else p = XtX*r; end
  if n>px, d = sqrt(r'*p/nm1); else d = sqrt(t'*t/nm1); end
  if d<tol
	disp(' ')
	disp('WARNING: the required number of factors (A) is too high !')
	disp('Less PLS factors were extracted from the data in the PLSSIM program !')
	disp(' ')
	break
  else
	tol=max(tol,d/1e5);
  end
  v = p-V(:,1:max(1,a-1))*(p'*V(:,1:max(1,a-1)))'; v = v/sqrt(v'*v); 
  z = (v'*S)'; 
  S = S-v*z'; 
								% save results
  V(:,a) = v;
  R(:,a) = r/d; 						% X weights
  P(:,a) = p/(d*nm1); 						% X loadings
  T(:,a) = t/d;							% X scores
  U(:,a) = Y*q;							% Y scores
  C(:,a) = q*(lambda(1)/(nm1*d)); 				% Y loadings
  R2Y(1,a) =  lambda(1)/d;					% Y-variance accounted for
end
clear StS V LAMBDA Q p q r t v z;
if d<tol,
 A=a-1; a=A; T=T(:,1:A); U=U(:,1:A); R=R(:,1:A); P=P(:,1:A); C=C(:,1:A);
end
while a>1
  U(:,a) = U(:,a)-T(:,1:a-1)*(U(:,a)'*T(:,1:a-1)/nm1)'; 
  a=a-1; 
end
B = R*C';							% B-coefficients of the regression Y on X
if isempty(XtX), sumX2=sum(X.^2); else sumX2 = sum(diag(XtX)); end
R2X = 100*nm1/sum(sumX2)*cumsum(sum(P.^2)); 
R2Y = 100/nm1/sum(sum(Y.^2))*cumsum(R2Y(1:A).^2);
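
The plssim routine above only returns the PLS regression coefficients; the UVE stability analysis described earlier is built on top of it. Below is a minimal sketch of that step (the variable names Xnoise, Bcv, stability and the noise scaling are illustrative assumptions, not part of the original code):

% X: N x K matrix of spectra, y: N x 1 reference values, A: number of PLS components
[N,K]   = size(X);
Xnoise  = [X, rand(N,K)*1e-10];                  % append K tiny random (uninformative) columns
A       = 10;
Bcv     = zeros(2*K, N);                         % regression coefficients from leave-one-out runs
for i = 1:N
    idx      = true(N,1); idx(i) = false;        % leave sample i out
    B        = plssim(Xnoise(idx,:), y(idx), A); % PLS regression on the remaining samples
    Bcv(:,i) = B;
end
stability = mean(Bcv,2)./std(Bcv,0,2);           % mean/std of each coefficient
cutoff    = max(abs(stability(K+1:end)));        % largest stability among the noise columns
selected  = find(abs(stability(1:K)) > cutoff);  % keep wavelengths that beat the noise level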

3, LAR algorithm

LAR (Least Angle Regression) is a variable selection method proposed by Efron in 2004. It is similar in form to Forward Stepwise regression and provides an efficient way to compute the lasso solution path. The difference is that Forward Stepwise completely fits the linear model on the currently selected subset of variables at each step, computes the RSS, and then penalizes higher model complexity with a criterion such as AIC, whereas LAR at each step finds the variable most correlated with the response and moves its coefficient a little at a time in the direction of the least-squares estimate (LSE). During this process the correlation between that variable and the residual gradually decreases; when that correlation is no longer the largest, a new variable with the highest correlation is brought in and the coefficients again move along the new LSE direction. In the end all variables are selected, and the solution coincides with the LSE.

function [b info] = lar(X, y, stop, storepath, verbose)
%% Input checking
% Set default values.
if nargin < 5
  verbose = false;
end
if nargin < 4
  storepath = true;
end
if nargin < 3
  stop = 0;
end
if nargin < 2
  error('SpaSM:lar', 'Input arguments X and y must be specified.');
end

%% LARS variable setup
[n p] = size(X);
maxVariables = min(n-1,p); % Maximum number of active variables

useGram = false;
% if n is approximately a factor 10 bigger than p it is faster to use a
% precomputed Gram matrix rather than Cholesky factorization when solving
% the partial OLS problem. Make sure the resulting Gram matrix is not
% prohibitively large.
if (n/p) > 10 && p < 1000
  useGram = true;
  Gram = X'*X;
end

% set up the LAR coefficient vector
if storepath
  b = zeros(p, p+1);
else
  b = zeros(p, 1);
  b_prev = b;
end

mu = zeros(n, 1); % current "position" as LARS travels towards lsq solution

I = 1:p; % inactive set
A = []; % active set
if ~useGram
  R = []; % Cholesky factorization R'R = X'X where R is upper triangular
end

stopCond = 0; % Early stopping condition boolean
step = 1; % step count

if verbose
  fprintf('Step\tAdded\tActive set size\n');
end

%% LARS main loop
% while not at OLS solution or early stopping criterion is met
while length(A) < maxVariables && ~stopCond
  r = y - mu;

  % find max correlation
  c = X(:,I)'*r; % correlation of each inactive column of X with the current residual
  [cmax cidx] = max(abs(c));

  % add variable
  if ~useGram
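    % note: cholinsert is a helper function from the SpaSM toolbox (it updates the
    % Cholesky factorization when a column is added) and must be on the MATLAB path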
    R = cholinsert(R,X(:,I(cidx)),X(:,A));
  end
  if verbose
    fprintf('%d\t\t%d\t\t%d\n', step, I(cidx), length(A) + 1);
  end
  A = [A I(cidx)];
  I(cidx) = [];
  c(cidx) = []; % drop the cidx-th correlation as well, since that variable has just joined the active set; the remaining entries shift up, matching the update of I on the previous line

  % partial OLS solution and direction from current position to the OLS
  % solution of X_A
  if useGram
    b_OLS = Gram(A,A)\(X(:,A)'*y); % same as X(:,A)\y, but faster
  else
%     b_OLS = R\(R'\(X(:,A)'*y)); % same as X(:,A)\y, but faster
    b_OLS = X(:,A)\y; % \ is MATLAB left division: it returns the (least-squares) solution of X(:,A)*b = y. The Cholesky form above is faster but can raise the warning 'Matrix is singular to working precision', so this slower, more robust form is used instead. See: http://www.ilovematlab.cn/thread-301697-1-1.html and http://zhidao.baidu.com/link?url=GzrsyLqNSkvuryZvr4UAMpxi4PIxjpUtJ9HmJ2nL8jVadDjz5CRMbpoxfmk-mRJZ6bsvqgsPHbV7pslq248SpURQkYUmSNhYBh2VoPYSUQe
  end
  d = X(:,A)*b_OLS - mu;

  if isempty(I)
    % if all variables active, go all the way to the OLS solution
    gamma = 1;
  else
    % compute length of walk along equiangular direction
    cd = (X(:,I)'*d);
    gamma = [ (c - cmax)./(cd - cmax); (c + cmax)./(cd + cmax) ];
    gamma = min(gamma(gamma > 0)); % take the smallest positive value of gamma
  end

  % update beta
  if storepath
    b(A,step + 1) = b(A,step) + gamma*(b_OLS - b(A,step)); % update beta
  else
    b_prev = b;
    b(A) = b(A) + gamma*(b_OLS - b(A)); % update beta
  end

  % update position
  mu = mu + gamma*d;
  
  % increment step counter
  step = step + 1;

  % Early stopping at specified bound on L1 norm of beta
  if stop > 0
    if storepath
      t2 = sum(abs(b(:,step)));
      if t2 >= stop
        t1 = sum(abs(b(:,step - 1)));
        s = (stop - t1)/(t2 - t1); % interpolation factor 0 < s < 1
        b(:,step) = b(:,step - 1) + s*(b(:,step) - b(:,step - 1));
        stopCond = 1;
      end
    else
      t2 = sum(abs(b));
      if t2 >= stop
        t1 = sum(abs(b_prev));
        s = (stop - t1)/(t2 - t1); % interpolation factor 0 < s < 1
        b = b_prev + s*(b - b_prev);
        stopCond = 1;
      end
    end
  end
    
  % Early stopping at specified number of variables
  if stop < 0
    stopCond = length(A) >= -stop;
  end
end

% trim beta
if storepath && size(b,2) > step
  b(:,step + 1:end) = [];
end

%% Compute auxiliary measures
if nargout == 2 % only compute if asked for
  info.steps = step - 1;
  b0 = pinv(X)*y; % regression coefficients of low-bias model
  penalty0 = sum(abs(b0)); % L1 constraint size of the low-bias model
  indices = (1:p)';
  
  if storepath % for entire path
    q = info.steps + 1;
    info.df = zeros(1,q);
    info.Cp = zeros(1,q);
    info.AIC = zeros(1,q);
    info.BIC = zeros(1,q);
    info.s = zeros(1,q);
    sigma2e = sum((y - X*b0).^2)/n;
    for step = 1:q
      A = indices(b(:,step) ~= 0); % active set: indices of the variables with nonzero coefficients at this step
      % compute godness of fit measurements Cp, AIC and BIC
      r = y - X(:,A)*b(A,step); % residuals
      rss = sum(r.^2); % residual sum-of-squares
      info.df(step) = step - 1;
      info.Cp(step) = rss/sigma2e - n + 2*info.df(step);
      info.AIC(step) = rss + 2*sigma2e*info.df(step);
      info.BIC(step) = rss + log(n)*sigma2e*info.df(step);
      info.s(step) = sum(abs(b(A,step)))/penalty0;
    end
    
  else % for single solution
    info.s = sum(abs(b))/penalty0;
    info.df = info.steps;
  end
end
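
A minimal usage sketch (the data and the stopping value are assumptions for illustration, not part of the original code): X is an n x p matrix of mean-centered spectra, y a mean-centered response; stop = -20 stops once 20 variables are active, and the full coefficient path is stored.

[b, info] = lar(X, y, -20, true, true);
selected = find(b(:,end) ~= 0);   % wavelengths active at the final step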

4, CARS algorithm

Competitive adaptive reweighted sampling (CARS) is a characteristic-variable selection method that combines Monte Carlo sampling with the regression coefficients of a PLS model, imitating the "survival of the fittest" principle of Darwinian theory (Li et al., 2009). In each iteration of CARS, adaptive reweighted sampling (ARS) keeps the wavelengths whose PLS regression coefficients have large absolute values as a new subset, removes those with small weights, and then builds a new PLS model on the subset. After many iterations, the wavelengths in the subset whose PLS model has the smallest root mean square error of cross-validation (RMSECV) are selected as the characteristic wavelengths.

function F=carspls(X,y,A,fold,method,num) 
tic;
%+++ Initial settings.
if nargin<6;num=50;end;
if nargin<5;method='center';end;
if nargin<4;fold=5;end;
if nargin<3;A=2;end;

%+++ Initial settings.
[Mx,Nx]=size(X);
A=min([Mx Nx A]);
index=1:Nx;
ratio=0.9;
r0=1;
r1=2/Nx;
Vsel=1:Nx;
Q=floor(Mx*ratio);
W=zeros(Nx,num);
Ratio=zeros(1,num);

%+++ Parameter of exponentially decreasing function. 
b=log(r0/r1)/(num-1);  a=r0*exp(b);

%+++ Main Loop
for iter=1:num
     
     perm=randperm(Mx);   
     Xcal=X(perm(1:Q),:); ycal=y(perm(1:Q));   %+++ Monte-Carlo Sampling.
     
     PLS=pls(Xcal(:,Vsel),ycal,A,method);    %+++ PLS model
     w=zeros(Nx,1);coef=PLS.coef_origin(1:end-1,end);
     w(Vsel)=coef;W(:,iter)=w; 
     w=abs(w);                                  %+++ weights
     [ws,indexw]=sort(-w);                      %+++ sort weights
     
     ratio=a*exp(-b*(iter+1));                      %+++ Ratio of retained variables.
     Ratio(iter)=ratio;
     K=round(Nx*ratio);  
     
     
     w(indexw(K+1:end))=0;                      %+++ Eliminate some variables with small coefficients.  
     
     Vsel=randsample(Nx,Nx,true,w);                 %+++ Reweighted Sampling from the pool of retained variables.                 
     Vsel=unique(Vsel);              
     fprintf('The %dth variable sampling finished.\n',iter);    %+++ Screen output.
 end

%+++  Cross-Validation to choose an optimal subset;
RMSEP=zeros(1,num);
Q2_max=zeros(1,num);
Rpc=zeros(1,num);
for i=1:num
   vsel=find(W(:,i)~=0);
 
   CV=plscvfold(X(:,vsel),y,A,fold,method,0);  
   RMSEP(i)=CV.RMSECV;
   Q2_max(i)=CV.Q2_max;   
   
   Rpc(i)=CV.optPC;
   fprintf('The %d/%dth subset finished.\n',i,num);
end
[Rmin,indexOPT]=min(RMSEP);
Q2_max=max(Q2_max);




%+++ save results;
time=toc;
%+++ output
F.W=W;
F.time=time;
F.cv=RMSEP;
F.Q2_max=Q2_max;
F.minRMSECV=Rmin;
F.iterOPT=indexOPT;
F.optPC=Rpc(indexOPT);
F.ratio=Ratio;
F.vsel=find(W(:,indexOPT)~=0)';



function sel=weightsampling_in(w)
%Bootstrap sampling
%2007.9.6,H.D. Li.

w=w/sum(w);
N1=length(w);
min_sec(1)=0; max_sec(1)=w(1);
for j=2:N1
   max_sec(j)=sum(w(1:j));
   min_sec(j)=sum(w(1:j-1));
end
% figure;plot(max_sec,'r');hold on;plot(min_sec);
      
for i=1:N1
  bb=rand(1);
  ii=1;
  while (min_sec(ii)>=bb | bb>max_sec(ii)) & ii<N1;
    ii=ii+1;
  end
    sel(i)=ii;
end      % w is related to the bootstrap chance

%+++ subfunction:  bootstrap sampling
% function sel=bootstrap_in(w);
% V=find(w>0);
% L=length(V);
% interval=linspace(0,1,L+1);
% for i=1:L;
%     rn=rand(1);
%     k=find(interval<rn);
%     sel(i)=V(k(end));    
% end
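
A minimal usage sketch (the data and settings are placeholders, not part of the original code). Note that carspls relies on the pls and plscvfold helpers (e.g. from H.D. Li's libPLS toolbox) and on randsample from the Statistics Toolbox:

% X: N x K matrix of spectra, y: N x 1 reference values (placeholders)
F = carspls(X, y, 10, 5, 'center', 50);   % 10 latent variables, 5-fold CV, 50 Monte-Carlo runs
selected = F.vsel;                        % indices of the selected wavelengths
rmsecv   = F.minRMSECV;                   % smallest cross-validated RMSE over the 50 subsets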

Summary

The complete code is available from the GitHub repository. If it is useful to you, please give it a star!
The code is for academic use only. If you have any questions, contact me on WeChat: Fu_siry

Topics: Algorithm Machine Learning