Double threshold method for endpoint detection

Posted by magicmoose on Sun, 16 Jan 2022 23:21:51 +0100

The double threshold method of endpoint detection relies mainly on two features: short-time energy and the short-time zero crossing rate (ZCR). Short-time energy separates voiced sounds (high energy) from unvoiced sounds (low energy). The short-time zero crossing rate separates unvoiced sounds (more precisely, voiceless consonants) from silence: voiceless consonants have a high ZCR, while silence has a low ZCR. The two ends of a speech segment are generally consonants. Some terminology: a vowel is a sound produced without obstruction as air flows out of the mouth; a consonant is a sound produced when the airflow is obstructed in the mouth or nose; an unvoiced sound is one in which the vocal cords do not vibrate; a voiced sound is one in which the vocal cords vibrate.
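For reference, the two features can be written as follows, in the form in which the MATLAB script later in this post computes them. The symbols are notation assumed here rather than taken from the original post: x_n(m) is the m-th sample of windowed frame n, N is the frame length (200 in the script), E_n is the short-time energy and Z_n is the number of sign changes between adjacent samples.

```latex
E_n = \sum_{m=1}^{N} x_n(m)^2,
\qquad
Z_n = \sum_{m=1}^{N-1} \mathbf{1}\!\left[\, x_n(m)\, x_n(m+1) < 0 \,\right]
```

Note that Z_n here is a raw count per frame, not a rate normalized by the frame length, which is why the thresholds zcr1 and zcr2 in the script are expressed as counts.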
1. Theoretical basis: a speech signal can generally be divided into silent, unvoiced and voiced segments. The silent segment is background noise and has the lowest average energy. The voiced segment is the part of the signal produced by vocal cord vibration and has the highest average energy. The unvoiced segment is produced by friction, impact or plosion of air in the oral cavity, and its average energy lies between the other two. The waveforms of unvoiced and silent segments are clearly different: the signal in a silent segment changes slowly, while the signal in an unvoiced segment changes sharply in amplitude and crosses the zero level more often. Experience shows that the zero crossing rate of the unvoiced segment is usually the largest. Endpoint detection first decides whether a frame is "sound" or "silence"; if it is sound, it then decides whether it is "unvoiced" or "voiced". To perform endpoint detection correctly, the two features short-time energy and zero crossing rate are generally used together, which is the "double threshold detection method".
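To make the contrast between the three segment types concrete, here is a small sketch that computes energy and zero crossing count for three synthetic frames. Everything in it is an illustrative assumption: the sampling rate, the amplitudes, and in particular the use of pure sinusoids as stand-ins (real unvoiced speech is noise-like, and a high-frequency sinusoid is used only so that the result is deterministic).

```matlab
% Synthetic stand-ins for the three segment types (all values are assumptions)
Fs = 8000;                               % assumed sampling rate in Hz
t  = (0:199)'/Fs;                        % one 200-sample frame
voiced   = 0.8   * sin(2*pi*150*t);      % strong, slowly varying
unvoiced = 0.1   * sin(2*pi*2900*t);     % weaker, rapidly oscillating
silent   = 0.001 * sin(2*pi*50*t);       % almost no energy, slow drift

frames = [voiced, unvoiced, silent];     % one frame per column
E = sum(frames.^2, 1);                   % short-time energy of each frame
Z = sum(frames(1:end-1,:) .* frames(2:end,:) < 0, 1);  % sign changes per frame

names = {'voiced', 'unvoiced', 'silent'};
for k = 1:3
    fprintf('%-9s E = %10.6f  Z = %3d\n', names{k}, E(k), Z(k));
end
```

With these inputs the energy ordering is voiced, then unvoiced, then silent, and the unvoiced stand-in has by far the most zero crossings, which is the pattern the paragraph above describes.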
2. Basic idea: set three thresholds according to the signal: two energy thresholds, TL (low) and TH (high), and a zero crossing rate threshold, ZCR. When the energy of a frame exceeds TL, or its zero crossing rate exceeds ZCR, the frame is treated as a possible starting point of the signal. When the energy exceeds TH, the frame is treated as actual speech. If this state is maintained for a period of time, the signal is confirmed as the required speech signal.
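One possible reading of this idea is a small state machine over the per-frame features. The sketch below is not the script from this post (which scans energy and zero crossings separately); the feature values, the threshold values, the minimum duration minLen and all variable names are assumptions chosen for illustration.

```matlab
% Hypothetical per-frame features (illustrative values, not from real audio)
E = [0.01 0.02 0.06 0.30 0.50 0.40 0.20 0.07 0.02 0.01];  % short-time energy
Z = [   5    6   40   30   20   25   35   45    8    5];  % zero crossing counts
TL = 0.05; TH = 0.10;   % low and high energy thresholds (assumed)
ZT = 38;                % zero crossing threshold (assumed)
minLen = 3;             % minimum number of frames above TH (assumed)

state = 0;              % 0 = silence, 1 = possible start, 2 = speech
segStart = 0; count = 0;
segments = [];          % one row per segment: [startFrame endFrame]
for n = 1:numel(E)
    switch state
        case {0, 1}
            if E(n) > TH                       % clearly speech
                if state == 0, segStart = n; end
                state = 2; count = 1;
            elseif E(n) > TL || Z(n) > ZT      % candidate starting point
                if state == 0, segStart = n; end
                state = 1;
            else                               % back to silence
                state = 0;
            end
        case 2
            if E(n) > TL || Z(n) > ZT          % still inside the segment
                count = count + (E(n) > TH);
            else                               % segment ends at previous frame
                if count >= minLen
                    segments = [segments; segStart, n-1];
                end
                state = 0; count = 0;
            end
    end
end
if state == 2 && count >= minLen               % segment runs to the last frame
    segments = [segments; segStart, numel(E)];
end
disp(segments)   % for the values above: one segment, frames 3 to 8
```

Frame 3 exceeds only TL, so it becomes the tentative start; frames 4 to 7 exceed TH and confirm the segment; frame 8 keeps it alive through TL; frame 9 falls below every threshold and closes it.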
3. Steps:
(1) Preprocessing. Before the speech signal is analyzed and processed, it must be preprocessed with pre-emphasis, framing and windowing. The purpose of these operations is to reduce the effect on signal quality of aliasing, high-order harmonic distortion, high-frequency effects and other factors introduced by the human vocal organs themselves and by the equipment used to record the speech signal. This keeps the signal passed to later processing as uniform and smooth as possible, provides high-quality input for parameter extraction, and improves the quality of speech processing.
(2) Short-time energy. The short-time energy sequence reflects how the amplitude, or energy, of speech changes slowly over time.
(3) Zero crossing rate. Endpoint detection in speech signal processing mainly means automatically detecting the start and end points of speech. Here the double threshold comparison method is used for endpoint detection. It takes the short-time energy E and the short-time average zero crossing rate Z as features. By combining the advantages of Z and E, it makes detection more accurate, effectively reduces the processing time of the system, removes noise interference in the silent segments, and improves speech processing performance.
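The script in this post uses enframe and hamming, which are not part of base MATLAB. As a rough sketch of what step (1) does, the following writes pre-emphasis, framing and windowing out by hand. The test signal and sampling rate are placeholders, and the loop is an assumed equivalent of enframe(x,win,inc)' for full frames only, not a copy of that function.

```matlab
Fs = 8000;                               % assumed sampling rate in Hz
x  = sin(2*pi*200*(0:999)'/Fs);          % synthetic stand-in for a speech signal
x  = filter([1 -0.98], 1, x);            % pre-emphasis, same coefficient as the script

wlen = 200;                              % frame length in samples
inc  = 100;                              % frame shift in samples
win  = 0.54 - 0.46*cos(2*pi*(0:wlen-1)'/(wlen-1));  % Hamming window written out

fn = floor((length(x) - wlen)/inc) + 1;  % number of full frames
X  = zeros(wlen, fn);                    % each column holds one windowed frame
for i = 1:fn
    idx = (i-1)*inc + (1:wlen)';         % sample indices covered by frame i
    X(:, i) = x(idx) .* win;
end
fprintf('%d frames of %d samples\n', fn, wlen);
```

With a frame shift of half the frame length, adjacent frames overlap by 50%, so a 1000-sample signal yields 9 frames of 200 samples.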

% Endpoint detection using short-time energy and zero crossing count
clear all;
file='D:\Big three digital image processing\voice ppt\imut_du\a.wav';
[x,Fs]=audioread(file);
x=x/max(abs(x));              % Normalization
% Threshold setting
amp1 = 0.1;  amp2 = 0.05;     % High and low energy thresholds
zcr1 = 90;   zcr2 = 135;      % Low and high zero crossing thresholds
x=filter([1 -0.98],[1],x);    % Pre-emphasis
wlen=200;                     % Frame length
inc=100;                      % Frame shift
win=hamming(wlen);            % Hamming window
N=length(x);                  % Signal length
time=(0:N-1)/Fs;              % Time scale of the signal
X=enframe(x,win,inc)';        % Framing (enframe is not built in, e.g. VOICEBOX provides it); each column is a frame
fn=size(X,2);                 % Number of frames
frameTime=(((1:fn)-1)*inc+wlen/2)/Fs;  % Time corresponding to each frame
% Short-time energy
E=zeros(1,fn);
for i=1:fn
    y=X(:,i);                 % Data of one frame
    b=0;
    for m=1:wlen              % Samples in one frame
        b=b+y(m).^2;
    end
    E(i)=b;
end
% Short-time zero crossing rate
Z=zeros(1,fn);
for i=1:fn
    y=X(:,i);                 % Data of one frame
    b=0;
    for m=1:wlen-1
        if y(m)*y(m+1)<0      % Sign change between adjacent samples
            b=b+1;
        end
    end
    Z(i)=b;
end
% Use the short-time energy thresholds to find the beginning and end of speech
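q=[];                              % Alternating start/end frame indices found from energy
i1=1;
while i1<=length(E)
    if E(i1)>amp1                  % Energy rises above the high threshold: speech starts
        q=[q max(i1-1,1)];         % Mark the previous frame (clamped to a valid index)
        i2=i1+1;
        while i2<=length(E) && E(i2)>=amp2
            i2=i2+1;               % Advance until energy falls below the low threshold
        end
        if i2<=length(E)
            q=[q min(i2+1,length(E))];  % Mark the frame after the drop (clamped)
        end
        i1=i2+1;                   % Continue searching after this segment
    else
        i1=i1+1;
    end
end
% Zero crossing rate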
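i1=1;
w=[];                              % Alternating start/end frame indices found from zero crossings
while i1<=length(Z)
    if Z(i1)>zcr2                  % Count rises above the high threshold
        w=[w i1];
        i2=i1+1;
        while i2<=length(Z) && Z(i2)>=zcr1
            i2=i2+1;               % Advance until the count falls below the low threshold
        end
        if i2<=length(Z)
            w=[w min(i2+1,length(Z))];  % Mark the frame after the drop (clamped)
        end
        i1=i2+1;                   % Continue searching after this segment
    else
        i1=i1+1;
    end
end
% Drawing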

subplot(311)
plot(time,x);
title('Original signal')
xlabel('time');ylabel('amplitude');
subplot(312)
plot(frameTime(q),E(q),'or'); hold on
plot(frameTime,E);
title('Short-time energy')
xlabel('time');ylabel('amplitude');
subplot(313)
plot(frameTime(w),Z(w),'or'); hold on
plot(frameTime,Z);
title('Zero crossing rate')
xlabel('time');ylabel('count');
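
The index vectors q and w hold alternating start and end frame indices. As a rough sketch, they can be turned into time intervals in seconds with the same frame-centre formula used for frameTime. The values of q below are made up for illustration, and frameTimeOf, segs and segTimes are hypothetical names that do not appear in the script.

```matlab
% Hypothetical values standing in for the script's outputs
Fs = 8000; wlen = 200; inc = 100;        % assumed sampling rate, frame length, frame shift
q  = [12 40 55 90];                      % alternating start/end frame indices (illustrative)

frameTimeOf = @(k) ((k-1)*inc + wlen/2)/Fs;   % centre time of frame k, as in frameTime
nPairs = floor(numel(q)/2);                   % ignore a trailing start without an end
segs = reshape(q(1:2*nPairs), 2, nPairs)';    % one row per segment: [startFrame endFrame]
segTimes = frameTimeOf(segs);                 % same layout, in seconds
for k = 1:nPairs
    fprintf('segment %d: %.3f s to %.3f s\n', k, segTimes(k,1), segTimes(k,2));
end
```

Dropping an unpaired trailing index is a design choice of this sketch; a start without a matching end could instead be closed at the last frame.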


As the figure above roughly shows, endpoint detection determines the beginning and end of speech from the short-time energy and the number of zero crossings per frame. The result varies with the speech signal: this method is suited to signals in which both the zero crossing counts and the energy amplitudes differ markedly. The same program was then run on a different audio file, 'buzz4.wav' (see Annex), with the thresholds reset to amp1 = 1; amp2 = 1; zcr1 = 60; zcr2 = 50;