Python voice signal processing

Posted by phpbeginner0120 on Mon, 20 Dec 2021 23:55:27 +0100


Blog Park address: https://www.cnblogs.com/LXP-Never/p/10078200.html

1. Introduction

Python's standard library already supports reading and writing the WAV format, but real-time sound input and output requires installing pyAudio. Finally, pyMedia can be used to decode and play MP3.

Audio signals are analog and must be converted to digital form before speech algorithms can be applied. WAV is a sound file format developed by Microsoft and is often used to store uncompressed audio data.

A voice signal has four important parameters: the number of channels, the sampling frequency, the quantization bits (bit depth) and the bit rate.

  • Number of channels: mono, stereo, ...
  • Sampling rate: the number of samples taken from the sound signal per second. 44100 Hz means the signal is split into 44100 samples per second; in other words, one sample is stored every 1/44100 seconds. If the sampling rate is high enough, the signal sounds continuous when the audio is played back.
  • Quantization bits: also called bit depth, the number of bits of information in each sample. 1 byte equals 8 bits. Common values are 8 bit, 16 bit, 24 bit, 32 bit, ...
  • Bit rate: how many bits are processed per second. For example, for a mono channel at 44.1 kHz/16 bit, the bit rate is 44100 × 16 × 1 = 705600 bit/s (bps); because the calculated numbers are usually large, kbit/s is normally used, i.e. 705.6 kbit/s (see the sketch after this list). When compressing audio, the bit rate becomes one of our choices: the higher the bit rate, the better the sound quality. Some commonly used bit rates are:
       - 32 kbit/s:  generally only for voice
       - 96 kbit/s:  commonly used for voice or low-quality streaming
       - 128 or 160 kbit/s:  medium bit-rate quality
       - 192 kbit/s:  medium-quality bit rate
       - 256 kbit/s:  commonly used high-quality bit rate
       - 320 kbit/s:  the highest level the MP3 standard supports
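
As a quick sanity check of the bit-rate formula above, a minimal sketch (the parameter values are just an example):

framerate = 44100       # sampling frequency in Hz
sampwidth_bits = 16     # quantization bits (bit depth)
nchannels = 1           # mono
bitrate = framerate * sampwidth_bits * nchannels
print(bitrate)          # 705600 bit/s, i.e. 705.6 kbit/s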

If you need to record and edit sound files yourself, Audacity, an open-source, cross-platform, multi-track recording and editing program, is recommended. In my work I often use Audacity to record sound signals and export them to WAV files for processing in Python.
If you want a quick look at voice waveforms and spectrograms, Adobe Audition is recommended; Adobe develops professional audio-processing software. Follow vposy on Weibo, where the download address is pinned at the top; he has cracked a lot of Adobe's software, including PS, PR...

2. Read audio files

  • wave Library

The wave module is part of the Python standard library. It does not support compression/decompression, but it does support reading mono and stereo audio.

wave_read = wave.open(file,mode="rb")

Parameters:

file: Voice file name or file path
mode: Read or write mode
"rb": read-only mode
"wb": write-only mode
 Returns: a read stream (Wave_read object)

The open() function can also be used in a with statement; when the with block ends, the wave_read.close() or wave_write.close() method is called automatically.
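
For example, a minimal sketch of reading the parameters inside a with block (the file path is only an example):

import wave

with wave.open(r"C:\Windows\media\Windows Background.wav", "rb") as wave_read:
    print(wave_read.getparams())    # the stream is closed automatically when the block ends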

File path:
Suppose the file voice.wav is in the folder C:\Users\Never\Desktop\code for the speech.
The file argument can then be written in three equivalent ways:
    r"C:\Users\Never\Desktop\code for the speech\voice.wav"
    "C:/Users/Never/Desktop/code for the speech/voice.wav"
    "C:\\Users\\Never\\Desktop\\code for the speech\\voice.wav"
The backslash \ is an escape character, so \\ is required to represent a literal \; the r prefix marks a raw string, in which backslashes are not escaped.

wave_read.getparams(): Returns all the audio parameters at once as a tuple (nchannels, sampwidth, framerate, nframes, comptype, compname): number of channels, sample width in bytes, sampling frequency, number of frames, compression type, compression type description. The wave module only supports uncompressed data, so the last two fields can be ignored.

str_data = wave_read.readframes(nframes): Reads at most nframes frames (sample points) and returns the data as a bytes object.

wave_data = np.frombuffer(str_data, dtype=np.int16): Converts the bytes object above into a one-dimensional int16 array (np.fromstring is deprecated).

Now wave_data is a one-dimensional short-type array, but because our sound file has two channels, it consists of the two channels interleaved: LRLRLR...

wave_data.shape = (-1, 2) # -1 means unspecified, divided by the number of other dimensions to get an array of n rows and 2 columns.

wave_read.close()  Close the file stream.
wave_read.getnchannels()  Returns the number of audio channels (1 for mono and 2 for stereo).
wave_read.getsampwidth()  Returns the sample width in bytes
wave_read.getframerate()  Returns the sampling frequency.
wave_read.getnframes()   Returns the number of audio frames.
wave_read.rewind()      Reverse the file pointer back to the beginning of the audio stream.
wave_read.tell()      Returns the current file pointer position. 

Read the Wave file and draw the waveform

# -*- coding: utf-8 -*-
# Read the Wave file and draw the waveform
import wave
import matplotlib.pyplot as plt
import numpy as np

plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # Display minus signs correctly

# Turn on WAV audio
f = wave.open(r"C:\Windows\media\Windows Background.wav", "rb")

# Read format information
# (description of channel number, quantization number, sampling frequency, sampling number, compression type, compression type)
# (nchannels, sampwidth, framerate, nframes, comptype, compname)
params = f.getparams()
nchannels, sampwidth, framerate, nframes = params[:4]
# Number of nchannels channels = 2
# sampwidth quantifier = 2
# framerate sampling frequency = 22050
# nframes sample number = 53395

# Read nframes data, return string format
str_data = f.readframes(nframes)

f.close()

# Convert the bytes object to a one-dimensional short (int16) array
wave_data = np.frombuffer(str_data, dtype=np.short)

# Normalization of assignments
wave_data = wave_data * 1.0 / (max(abs(wave_data)))

# Integrating data from left and right channels
wave_data = np.reshape(wave_data, [nframes, nchannels])
# wave_data.shape = (-1, 2)   # The meaning of -1 is that it is unspecified, divided by the number of other dimensions

# Finally, each sampling time is calculated from the number of points sampled and the sampling frequency.
time = np.arange(0, nframes) * (1.0 / framerate)

plt.figure()
# Left Channel Waveform
plt.subplot(2, 1, 1)
plt.plot(time, wave_data[:, 0])
plt.xlabel("time/s",fontsize=14)
plt.ylabel("Range",fontsize=14)
plt.title("Left channel",fontsize=14)
plt.grid()  # Ruler

plt.subplot(2, 1, 2)
# Right Channel Waveform
plt.plot(time, wave_data[:, 1], c="g")
plt.xlabel("time/s", fontsize=14)
plt.ylabel("Amplitude", fontsize=14)
plt.title("Right channel", fontsize=14)

plt.tight_layout()  # Tight Layout
plt.show()

Read audio signal with 2 channels


If the error "np.fromstring: string size must be a multiple of element size" occurs, see this reference solution: https://blog.csdn.net/veritasalice/article/details/104807415

  • librosa library (recommended)

This is my most commonly used and favourite voice library. librosa is a third-party Python library; before using it, run pip install librosa in a terminal. I wrote a separate blog post specifically about speech signal processing with librosa.

import librosa
y, sr = librosa.load(path, sr=fs)

This function can resample the audio. If sr is left at its default value, librosa.load() reads audio files at a sampling rate of 22050 Hz: files with a higher sampling rate are downsampled, and files with a lower rate are upsampled. Therefore, to read an audio file at its original sampling rate, set sr to None, i.e. y, sr = librosa.load(filename, sr=None).
The audio data y is a floating-point array already normalized to roughly [-1, 1].
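
A minimal sketch of both usages (the file name is just an example):

import librosa

y_16k, sr_16k = librosa.load("voice.wav", sr=16000)  # resample to 16 kHz while loading
y, sr = librosa.load("voice.wav", sr=None)           # keep the original sampling rate
print(sr, y.dtype)                                   # float32 samples, roughly in [-1, 1]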

  • scipy Library
from scipy.io import wavfile
sampling_freq, audio = wavfile.read("***.wav")

audio contains the raw integer samples (e.g. int16 for 16-bit PCM); unlike librosa, it is not normalized, so divide by the maximum absolute value if values in [-1, 1] are needed.
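
A small sketch of reading with scipy and normalizing manually (the file name is assumed):

from scipy.io import wavfile
import numpy as np

sampling_freq, audio = wavfile.read("voice.wav")
print(audio.dtype)                           # e.g. int16 for 16-bit PCM
audio_norm = audio / np.max(np.abs(audio))   # normalize to [-1, 1]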

3. Write audio files

  • wave Library

Before writing the first frame of data, call setnchannels() to set the number of channels, setsampwidth() to set the quantization bits, setframerate() to set the sampling frequency and, if needed, setnframes() to set the number of frames; then writeframes(wave_data.tobytes()) writes the frame data.

wave_write = wave.open(file,mode="wb") 

wave_write is a write file stream

wave_write.setnchannels(n)  Set the number of channels.
wave_write.setsampwidth(n)  Set the sample width (quantization bytes) to n bytes.
wave_write.setframerate(n)  Set the sampling frequency to n.
wave_write.setnframes(n)  Set the number of frames to n.
wave_write.setparams(tuple)  Set all parameters at once as a tuple (nchannels, sampwidth, framerate, nframes, comptype, compname).
wave_write.writeframes(data)  Write audio frames from data; the length is measured in sample points.
wave_write.tell()  Return the current position in the file.
  • Write WAV file
    Write WAV file, method 1
# Author:Ling Ren Battle
# -*- coding:utf-8 -*-
import wave
import numpy as np
import scipy.signal as signal

framerate = 44100   # sampling frequency
time = 10           # Duration

t = np.arange(0, time, 1.0/framerate)

# Call the chirp function in the scipy.signal library to generate a
# 10-second frequency sweep from 100 Hz to 1 kHz at a 44.1 kHz sampling rate
wave_data = signal.chirp(t, 100, time, 1000, method='linear') * 10000

# Because the array returned by the chirp function is float64,
# the astype method is needed to convert it to short.
wave_data = wave_data.astype(np.short)

# Turn on WAV audio for write operations
f = wave.open(r"sweep.wav", "wb")

f.setnchannels(1)           # Configure Channel Number
f.setsampwidth(2)           # Set sample width (quantization bytes)
f.setframerate(framerate)   # Configure sampling frequency
comptype = "NONE"
compname = "not compressed"

# You can also configure all parameters at once with setparams
# outwave.setparams((1, 2, framerate, nframes,comptype, compname))

# Will wav_data converted to binary data write file
f.writeframes(wave_data.tobytes())
f.close()

Write WAV file, method 2

# Author:Ling Ren Battle
# -*- coding:utf-8 -*-
import wave
import numpy as np
import struct

f = wave.open(r"C:\Windows\media\Windows Background.wav", "rb")
params = f.getparams()
nchannels, sampwidth, framerate, nframes = params[:4]
strData = f.readframes(nframes)
waveData = np.frombuffer(strData,dtype=np.int16)
f.close()
waveData = waveData*1.0/(max(abs(waveData)))

# wav file write
# Data to be written to wav, waveData data is still fetched here
outData = waveData
outwave = wave.open("write.wav", 'wb')
nchannels = 1   # Number of channels set to 1
sampwidth = 2   # Quantitative Bit Set to 2
framerate = 8000    # Sampling frequency 8000
nframes = len(outData)    # Sample Points

comptype = "NONE"
compname = "not compressed"
outwave.setparams((nchannels, sampwidth, framerate, nframes,
    comptype, compname))

for i in outData:
        outwave.writeframes(struct.pack('h', int(i * 64000 / 2)))

        # struct.pack(FMT, V1) converts the value of V1 to an FMT format string
outwave.close()
  • librosa library (recommended)
librosa.output.write_wav(path, y, sr, norm=False)

Parameters:

path: str, path of the output wav file
y: np.ndarray, audio time series
sr: sampling rate of y
norm: True/False, whether to enable amplitude normalization

In versions 0.8.0 and later, librosa removed this function. The following function is recommended instead:

import soundfile
soundfile.write(file, data, samplerate)

Parameters:

file: path of the output wav file
data: Audio data
samplerate: sampling rate
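
A minimal sketch combining librosa and soundfile (the file names are just examples):

import librosa
import soundfile

y, sr = librosa.load("voice.wav", sr=None)   # read at the original sampling rate
soundfile.write("voice_copy.wav", y, sr)     # write the float samples back out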
  • scipy Library
from scipy.io.wavfile import write
write(output_filename, freq, audio)
import numpy as np
import matplotlib.pyplot as plt
from scipy.io.wavfile import write

# Define output file to store audio
output_file = 'output_generated.wav'

# Specify parameters for audio generation
duration = 3            # Unit seconds
sampling_freq = 44100   # Unit Hz
tone_freq = 587         # Frequency of tones
min_val = -2 * np.pi
max_val = 2 * np.pi

# Generate audio signal
t = np.linspace(min_val, max_val, duration * sampling_freq)
audio = np.sin(2 * np.pi * tone_freq * t)

# Add noise (duration * sampling_freq random values in [0, 1))
noise = 0.4 * np.random.rand(duration * sampling_freq)
audio += noise

scaling_factor = pow(2,15) - 1  # Convert to 16-bit integer
audio_normalized = audio / np.max(np.abs(audio))    # normalization
audio_scaled = np.int16(audio_normalized * scaling_factor)  # Scale the normalized samples to the 16-bit integer range

write(output_file, sampling_freq, audio_scaled) # Write Output File

audio = audio[:300] # Take the first 300 samples

x_values = np.arange(0, len(audio), 1) / float(sampling_freq)
x_values *= 1000    # Convert the time axis to milliseconds

plt.plot(x_values, audio, color='blue')
plt.xlabel('Time (ms)')
plt.ylabel('Amplitude')
plt.title('Audio signal')
plt.show()

  • Synthesize tuned music
import json
import numpy as np
from scipy.io.wavfile import write
import matplotlib.pyplot as plt          

# Define composite tones
def Synthetic_tone(freq, duration, amp=1.0, sampling_freq=44100):
    # Establish timeline
    t = np.linspace(0, duration, int(duration * sampling_freq))
    # Build audio signal
    audio = amp * np.sin(2 * np.pi * freq * t)
    return audio.astype(np.int16)


# The json file contains some scales and their frequencies
tone_map_file = 'tone_freq_map.json'

# Read Frequency Mapping File
with open(tone_map_file, 'r',encoding='UTF-8') as f:
    tone_freq_map = json.loads(f.read())
    print(tone_freq_map)
# {'A': 440, 'Asharp': 466, 'B': 494, 'C': 523, 'Csharp': 554, 'D': 587, 'Dsharp': 622, 'E': 659, 'F': 698, 'Fsharp': 740, 'G': 784, 'Gsharp': 831}

# Set input parameters to generate G-tone
input_tone = 'G'
duration = 2             # seconds
amplitude = 10000        # amplitude
sampling_freq = 44100    # Hz
# Generate scale
synthesized_tone = Synthetic_tone(tone_freq_map[input_tone], duration, amplitude, sampling_freq)

# Write Output File
write('output_tone.wav', sampling_freq, synthesized_tone)

# Scale and its continuous time
tone_seq = [('D', 0.3), ('G', 0.6), ('C', 0.5), ('A', 0.3), ('Asharp', 0.7)]

# Building an Audio Signal Based on Chord Sequence
output = np.array([])
for item in tone_seq:
    input_tone = item[0]
    duration = item[1]
    synthesized_tone = Synthetic_tone(tone_freq_map[input_tone], duration, amplitude, sampling_freq)
    output = np.append(output, synthesized_tone, axis=0)

# Write Output File
write('output_tone_seq.wav', sampling_freq, output.astype(np.int16))  # convert back to int16 before writing

tone_freq_map.json

{
    "A": 440,
    "Asharp": 466,
    "B": 494,
    "C": 523,
    "Csharp": 554,
    "D": 587,
    "Dsharp": 622,
    "E": 659,
    "F": 698,
    "Fsharp": 740,
    "G": 784,
    "Gsharp": 831
}

Audio playback

The pyaudio library can be used to play WAV files:

p = pyaudio.PyAudio()
stream = p.open(format=p.get_format_from_width(sampwidth), channels=nchannels, rate=framerate, output=True)
stream.write(data)    # play the data

The main parameters for the open() method of the pyaudio object are listed below:

rate: sampling frequency
Channels: number of channels
format: Sample format (paFloat32, paInt32, paInt24, paInt16, paInt8...). In the example below, get_format_from_width() converts the return value of wf.getsampwidth(), which is 2, to paInt16
Input: Input stream flag, open input stream if True
Output: Output stream flag, turn on output stream if True
input_device_index: The number of the device used by the input stream, if not specified, the default device of the system is used
output_device_index: The number of the device used by the output stream, if not specified, the default device of the system is used
frames_per_buffer: The size of each underlying buffer block; the internal buffer consists of N blocks of this size
start: Specifies whether the input and output streams will be opened immediately, with a default value of True

Play wav audio

# -*- coding: utf-8 -*-
import pyaudio
import wave

chunk = 1024
wf = wave.open(r"c:\WINDOWS\Media\Windows Background.wav", 'rb')
p = pyaudio.PyAudio()

# Turn on sound output stream
stream = p.open(format = p.get_format_from_width(wf.getsampwidth()),
                channels = wf.getnchannels(),
                rate = wf.getframerate(),
                output = True)

# Write sound output stream to sound card for playback
while True:
    data = wf.readframes(chunk)
    if data == "":
        break
    stream.write(data)

stream.stop_stream()
stream.close()
p.terminate()   # Close PyAudio

Sound recording

SAMPLING_RATE is the sampling frequency, and each read fetches a block of NUM_SAMPLES samples. When more than COUNT_NUM of the samples in a block are larger than LEVEL, the data is saved to a WAV file; once saving starts, at least SAVE_LENGTH blocks are saved. The WAV file is named after the moment it was saved.

The data read from the sound card is binary, similar to the data read from a WAV file. Since the sample values are stored in paInt16 format (16-bit short type), they are converted to a NumPy array with dtype np.short.

'''
SAMPLING_RATE is the sampling frequency.
Each read fetches a block of NUM_SAMPLES samples.
When more than COUNT_NUM samples in a block are greater than LEVEL,
the sampled data is saved to a WAV file;
once saving starts, at least SAVE_LENGTH blocks are saved.

The data read from the sound card is binary, like the data read from a WAV file.
Because the sample values are stored in paInt16 format (16-bit short type),
they are converted to a NumPy array with dtype np.short.
'''


from pyaudio import PyAudio, paInt16
import numpy as np
import wave

# Save the data in the data to a WAV file named filename
def save_wave_file(filename, data):
    wf = wave.open(filename, 'wb')
    wf.setnchannels(1)          # single channel
    wf.setsampwidth(2)          # Quantization Bits
    wf.setframerate(SAMPLING_RATE)  # Set Sampling Frequency
    wf.writeframes(b"".join(data))  # Write Voice Frame
    wf.close()


NUM_SAMPLES = 2000      # Size of pyAudio internal cache block
SAMPLING_RATE = 8000    # Sampling frequency
LEVEL = 1500           # The threshold for saving sound; below it nothing is recorded
COUNT_NUM = 20 # Record sound if more than 20 samples in a cached block exceed the threshold
SAVE_LENGTH = 8 # Minimum length of a recording: SAVE_LENGTH * NUM_SAMPLES samples

# Turn on sound input
pa = PyAudio()
stream = pa.open(format=paInt16, channels=1, rate=SAMPLING_RATE, input=True,
                frames_per_buffer=NUM_SAMPLES)

save_count = 0  # Number of blocks still to be saved
save_buffer = []    # Buffer for blocks waiting to be written

while True:
    # Read in NUM_SAMPLES Samples
    string_audio_data = stream.read(NUM_SAMPLES)
    # Convert read data to an array
    audio_data = np.frombuffer(string_audio_data, dtype=np.short)
    # Calculate the number of samples larger than LEVEL
    large_sample_count = np.sum( audio_data > LEVEL )
    print(np.max(audio_data))
    # If the number is greater than COUNT_NUM, then save at least SAVE_LENGTH Blocks
    if large_sample_count > COUNT_NUM:
        save_count = SAVE_LENGTH
    else:
        save_count -= 1

    if save_count < 0:
        save_count = 0

    if save_count > 0:
        # Append the block to save_buffer
        save_buffer.append( string_audio_data )
    else:
        # Write the contents of save_buffer to a WAV file
        if len(save_buffer) > 0:
            filename = "recorde" + ".wav"
            save_wave_file(filename, save_buffer)
            print(filename, "saved")
            break

4. Speech signal processing

4.1 Generation and perception of the voice signal

To analyze speech, we must first extract characteristic parameters that can represent the speech; only with these parameters can effective processing be carried out. The quality of speech signal processing depends not only on the processing method but also on choosing appropriate feature parameters.

A voice signal is a non-stationary, time-varying signal. However, speech is produced by glottal excitation pulses passing through the vocal tract, and the muscle movements of the vocal tract (mouth, nose) are slow, so the voice signal can be considered stationary over a "short time" (10-30 ms). This is the basis of the "short-time analysis" of voice signals.

In short-time analysis, the voice signal is divided into a sequence of voice frames, each usually 10-30 ms long. Our analysis is then based on the voice characteristics of each frame.

Different extracted speech feature parameters correspond to different analysis methods: time-domain analysis, frequency-domain analysis, cepstral-domain analysis... Since the most important perceptual characteristics of a voice signal are reflected in its power spectrum, and phase plays only a small role, frequency-domain analysis of speech is the more important.

4.2 Signal windowing

A window is generally needed when truncating and framing a signal, because truncation always leaks energy in the frequency domain and a window function reduces the impact of truncation.
Window functions are provided in the scipy.signal signal-processing toolbox, for example the Hanning window:

import matplotlib.pyplot as plt
import scipy.signal as signal
plt.figure(figsize=(6,2))
plt.plot(signal.windows.hann(512))  # signal.hanning() in older SciPy versions
plt.show()

4.3 Signal framing

When framing, adjacent frames overlap: frame length (wlen) = overlap + frame shift (inc). If adjacent frames did not overlap, then because of the shape of the window function the samples at the edges of each frame would be lost, so an overlap is used. inc is the frame shift, i.e. the offset of each frame relative to the previous one; $f_s$ is the sampling rate and $f_n$ is the number of frames of the voice signal:

$$f_n = \frac{N-\mathrm{overlap}}{\mathrm{inc}} = \frac{N-\mathrm{wlen}+\mathrm{inc}}{\mathrm{inc}}$$

The theoretical basis of signal framing, where $x$ is the voice signal and $w$ is a window function:

$$y(n)=\sum_{m=-\frac{N}{2}+1}^{\frac{N}{2}} x(m)\,w(n-m)$$

Window truncation is similar to sampling. To ensure that adjacent frames do not differ too much, a frame shift (with overlap) is usually used between frames, which effectively acts as interpolation smoothing.
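
A quick numeric check of the frame-count formula, with assumed example values (N = 16000 samples, wlen = 512, inc = 128):

import numpy as np

N, wlen, inc = 16000, 512, 128
overlap = wlen - inc
nf = int(np.ceil((N - wlen + inc) / inc))   # same formula as the enframe function below
print(overlap, nf)                          # 384 overlapping samples, 122 frames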


The implementation mainly uses the NumPy toolkit, in particular the following functions:

np.repeat: repeats each element in place
np.tile: repeats the whole array cyclically

[The original post compares np.repeat and np.tile with figures for the vector case and the matrix case; the figures are omitted here.]
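
In their place, a small sketch of the difference:

import numpy as np

a = np.arange(3)                 # [0 1 2]
print(np.repeat(a, 2))           # [0 0 1 1 2 2]  -> each element repeated in place
print(np.tile(a, 2))             # [0 1 2 0 1 2]  -> the whole array repeated cyclically

b = np.arange(4).reshape(2, 2)   # [[0 1], [2 3]]
print(np.repeat(b, 2, axis=0))   # each row repeated: [[0 1], [0 1], [2 3], [2 3]]
print(np.tile(b, (2, 1)))        # whole matrix stacked: [[0 1], [2 3], [0 1], [2 3]]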

The corresponding framing is implemented in code below.
This is an example without windowing:

  • Voice framing without windowing
import numpy as np
import wave
import os
#import math

def enframe(signal, nw, inc):
   '''Converts an audio signal into frames.
   Parameters:
   signal: the original audio signal
   nw: length of each frame in samples (i.e. the sampling frequency multiplied by the frame duration)
   inc: shift between adjacent frames (as defined above)
   '''
   signal_length=len(signal) #Total Signal Length
   if signal_length<=nw: #If the signal length is less than the length of a frame, the number of frames is defined as 1
       nf=1
   else: #Otherwise, calculate the total length of the frame
       nf=int(np.ceil((1.0*signal_length-nw+inc)/inc))
   pad_length=int((nf-1)*inc+nw) #Total paved length of all frames combined
   zeros=np.zeros((pad_length-signal_length,)) #Insufficient length filled with 0, similar to expanded array operation in FFT
   pad_signal=np.concatenate((signal,zeros)) #The filled signal is marked as pad_signal
   indices=np.tile(np.arange(0,nw),(nf,1))+np.tile(np.arange(0,nf*inc,inc),(nw,1)).T  #Indices of the samples in every frame: an nf x nw matrix
   indices=np.array(indices,dtype=np.int32) #Convert indices to an integer array
   frames=pad_signal[indices] #Get Frame Signal
#    win=np.tile(winfunc(nw),(nf,1))  #Window window function, default here is 1
#    return frames*win   #Return Frame Signal Matrix
   return frames
def wavread(filename):
   f = wave.open(filename,'rb')
   params = f.getparams()
   nchannels, sampwidth, framerate, nframes = params[:4]
   strData = f.readframes(nframes)#Read audio as bytes
   waveData = np.frombuffer(strData,dtype=np.int16)#Convert bytes to int16
   f.close()
   waveData = waveData*1.0/(max(abs(waveData)))#wave amplitude normalization
   waveData = np.reshape(waveData,[nframes,nchannels]).T
   return waveData

filepath = "./data/" #Add Path
dirname= os.listdir(filepath) #Get all the file names under the folder 
filename = filepath+dirname[0]
data = wavread(filename)
nw = 512
inc = 128
Frame = enframe(data[0], nw, inc) 
  • Windowed voice framing
def enframe(signal, nw, inc, winfunc):
    '''Converts an audio signal into frames and applies a window.
    Parameters:
    signal: the original audio signal
    nw: length of each frame in samples (i.e. the sampling frequency multiplied by the frame duration)
    inc: shift between adjacent frames (as defined above)
    winfunc: window function, an array of length nw
    '''
    signal_length=len(signal) #Total Signal Length
    if signal_length<=nw: #If the signal length is less than the length of a frame, the number of frames is defined as 1
        nf=1
    else: #Otherwise, calculate the total length of the frame
        nf=int(np.ceil((1.0*signal_length-nw+inc)/inc))
    pad_length=int((nf-1)*inc+nw) #Total paved length of all frames combined
    zeros=np.zeros((pad_length-signal_length,)) #Insufficient length filled with 0, similar to expanded array operation in FFT
    pad_signal=np.concatenate((signal,zeros)) #The filled signal is marked as pad_signal
    indices=np.tile(np.arange(0,nw),(nf,1))+np.tile(np.arange(0,nf*inc,inc),(nw,1)).T  #Indices of the samples in every frame: an nf x nw matrix
    indices=np.array(indices,dtype=np.int32) #Convert indices to an integer array
    frames=pad_signal[indices] #Get Frame Signal
    win=np.tile(winfunc,(nf,1))  #Expand the window function into an nf x nw matrix
    return frames*win   #Return the windowed frame matrix
  • Overlap and add
    Reassemble the framed speech into the complete signal
def overlap_add(x, window_size, hop_size):
    # x (frames, frame_length)
    frames, frame_length = x.shape
    wav_len = frames * hop_size + window_size - hop_size  # Length of the reconstructed signal
    print("Reconstructed signal length", wav_len)
    wav = np.zeros((wav_len,), dtype=x.dtype)

    for frame in range(frames):
        if frame == frames - 1:
            # Last frame: add the entire frame
            wav[hop_size * frame: hop_size * frame + window_size] += x[frame]
        else:
            # Other frames: add only the first hop_size samples
            wav[hop_size * frame: hop_size * frame + hop_size] += x[frame][:hop_size]

    return wav
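
A small round-trip sketch using the enframe and overlap_add functions above (rectangular window, values assumed):

import numpy as np

x = np.arange(1000, dtype=np.float64)   # a dummy signal
nw, inc = 512, 128
frames = enframe(x, nw, inc)            # the un-windowed enframe defined above
y = overlap_add(frames, nw, inc)
print(np.allclose(y[:len(x)], x))       # True: the signal is recovered exactly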

4.4 Speech signal processing in the short-time domain

Short-time energy and short-time mean amplitude

Main uses of short-term energy and short-term mean amplitude:

  • Distinguish between voiced and unvoiced sounds, because the short-time energy E(i) of voiced sounds is much higher than that of unvoiced sounds.
  • Distinguish the boundary between initials and finals, and between silent and speech segments.

Short-time average zero-crossing rate

For a continuous voice signal, the zero-crossing rate is the rate at which the time-domain waveform crosses the time axis; for a discrete signal, a sign change between adjacent sample values counts as a zero crossing.

Uses:

  • For voiced sounds, the energy is concentrated below 3 kHz, because the glottal wave causes the spectrum to fall off at high frequencies.
  • For unvoiced sounds, most of the energy is concentrated at higher frequencies.

Since high frequencies imply a high short-time average zero-crossing rate and low frequencies imply a low one, voiced sounds have a lower zero-crossing rate and unvoiced sounds a higher one.

1. The short-time average zero-crossing rate can be used to separate the voice signal from background noise.
2. It can be used to determine the starting and ending positions of silent and speech segments.
3. When background noise is low, the short-time average energy is more effective; when background noise is high, the short-time average zero-crossing rate is more effective.
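
A hedged sketch of computing the short-time energy and the short-time average zero-crossing rate per frame, reusing the enframe function above (the frame and window sizes are assumed values):

import numpy as np

def short_time_energy(frames):
    # frames: (nf, nw) matrix returned by enframe
    return np.sum(frames ** 2, axis=1)

def short_time_zcr(frames):
    # count sign changes between adjacent samples in each frame
    signs = np.sign(frames)
    return np.sum(np.abs(np.diff(signs, axis=1)) > 0, axis=1) / frames.shape[1]

# example usage with assumed parameters:
# frames = enframe(waveData[0], 512, 128, np.hamming(512))
# energy = short_time_energy(frames)
# zcr = short_time_zcr(frames)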

Short-time autocorrelation function

The short-time autocorrelation function is mainly used for endpoint detection and pitch extraction. For voiced sounds, peaks appear at integer multiples of the pitch period, and the pitch is usually estimated from the first peak after R(0); the short-time autocorrelation of unvoiced sounds shows no significant peak.

Short-time mean amplitude difference function

It is used to detect the pitch period and is computationally simpler than the short-time autocorrelation function.
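
A minimal sketch of the short-time autocorrelation and average magnitude difference functions for a single frame (an illustration, not the original author's code):

import numpy as np

def short_time_autocorr(frame):
    # R(k) = sum_n x(n) x(n+k), for non-negative lags k
    n = len(frame)
    full = np.correlate(frame, frame, mode='full')
    return full[n - 1:]

def short_time_amdf(frame):
    # F(k) = sum_n |x(n) - x(n+k)|
    n = len(frame)
    return np.array([np.sum(np.abs(frame[:n - k] - frame[k:])) for k in range(n)])

# for a voiced frame, R(k) peaks (and F(k) dips) near multiples of the pitch period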

Short-time frequency-domain processing of voice signals
In speech signal processing, analysis and processing in the frequency domain or another transform domain play an important role. Studying speech in the frequency domain makes features of the signal visible that cannot be seen in the time domain; the nature of an audio signal is determined by its frequency content.

Converting a time-domain signal to a frequency-domain signal generally performs a short-time Fourier transformation of the speech.

fft_audio = np.fft.fft(audio)

Draw the spectrum of voice signal

import numpy as np
from scipy.io import wavfile
import matplotlib.pyplot as plt

sampling_freq, audio = wavfile.read(r"C:\Windows\media\Windows Background.wav")   # read file

audio = audio / np.max(audio)   # Normalization, Standardization

# Applying Fourier Transform
fft_signal = np.fft.fft(audio)
print(fft_signal)
# [-0.04022912+0.j         -0.04068997-0.00052721j -0.03933007-0.00448355j
#  ... -0.03947908+0.00298096j -0.03933007+0.00448355j -0.04068997+0.00052721j]

fft_signal = abs(fft_signal)
print(fft_signal)
# [0.04022912 0.04069339 0.0395848  ... 0.08001755 0.09203427 0.12889393]

# Build the frequency axis (FFT bin indices)
Freq = np.arange(0, len(fft_signal))

# Drawing voice signal
plt.figure()
plt.plot(Freq, fft_signal, color='blue')
plt.xlabel('Freq (in kHz)')
plt.ylabel('Amplitude')
plt.show()

  • Extracting frequency-domain features
    After converting the signal to the frequency domain, it still needs to be turned into a useful form. Mel-frequency cepstral coefficients (MFCC) do this: the power spectrum of the signal is computed first, and features are then extracted using a combination of a filter bank and the discrete cosine transform.

Extracting MFCC Features

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank

# Read Input Audio File
sampling_freq, audio = wavfile.read("input_freq.wav")

# Extracting MFCC and filter bank characteristics
mfcc_features = mfcc(audio, sampling_freq)
filterbank_features = logfbank(audio, sampling_freq)

print('\nMFCC:\n Number of windows =', mfcc_features.shape[0])
print('Length of each feature =', mfcc_features.shape[1])
print('\nFilter bank:\n Number of windows =', filterbank_features.shape[0])
print('Length of each feature =', filterbank_features.shape[1])

# Draw a feature map to visualize MFCC. Transpose the matrix so that the time domain is horizontal
mfcc_features = mfcc_features.T
plt.matshow(mfcc_features)
plt.title('MFCC')
# Visualize filter bank characteristics. Transpose the matrix so that the time domain is horizontal
filterbank_features = filterbank_features.T
plt.matshow(filterbank_features)
plt.title('Filter bank')

plt.show()


Spectrogram

Most signals can be decomposed into several sinusoidal waves of different frequencies.
Of these sinusoids, the lowest frequency is called the fundamental wave of the signal, and the rest are called the harmonics of the signal.
There is only one fundamental wave, which can be called first harmonic. There can be many harmonics, each harmonic frequency is an integer multiple of the fundamental frequency. Harmonics may vary in size.
A bar graph drawn with the frequencies of the harmonics on the x-axis and their amplitudes on the y-axis is called the spectrum, which accurately reflects the internal structure of the signal.

A spectrogram combines the characteristics of the time and frequency domains: it clearly shows how the speech spectrum changes over time. Its horizontal axis is time and its vertical axis is frequency; the intensity of a given frequency component at a given time is shown by the darkness of the colour, with dark meaning a large spectral value and light meaning a small one. The different grey levels form different patterns, called the voiceprint; different speakers have different voiceprints, which can be used for voiceprint recognition.
In fact, once we have the framed signal, transforming it to the frequency domain and taking the magnitude gives the spectrogram. For a quick look, matplotlib.pyplot provides the specgram function:

import wave
import matplotlib.pyplot as plt
import numpy as np

f = wave.open(r"C:\Windows\media\Windows Background.wav", "rb")
params = f.getparams()
nchannels, sampwidth, framerate, nframes = params[:4]
strData = f.readframes(nframes)#Read audio as bytes
waveData = np.frombuffer(strData,dtype=np.int16)#Convert bytes to int16
waveData = waveData*1.0/(max(abs(waveData)))#wave amplitude normalization
waveData = np.reshape(waveData,[nframes,nchannels]).T
f.close()

plt.specgram(waveData[0],Fs = framerate, scale_by_freq = True, sides = 'default')
plt.ylabel('Frequency(Hz)')
plt.xlabel('Time(s)')
plt.show()


matlab Spectrogram

[Y,FS]=audioread('p225_355_wb.wav');

% specgram(Y,2048,44100,2048,1536);
    %Y is the waveform data
    %FFT frame length 2048 points (about 46 ms at 44100 Hz)
    %Sampling frequency 44.1 kHz
    %Window length, usually equal to the frame length
    %Frame overlap length, here 3/4 of the frame length
specgram(Y,2048,FS,2048,1536);
xlabel('time(s)')
ylabel('frequency(Hz)')
title('Spectrogram')

4.5 Speech Recognition

import os
import numpy as np
import scipy.io.wavfile as wf
import python_speech_features as sf
import hmmlearn.hmm as hl


# 1. Read training audio samples in the training folder, one mfcc matrix for each audio, and one category for each mfcc (apple...)
def search_file(directory):
    """
    :param directory: Path to training audio
    :return: Dictionaries{'apple':[url, url, url ... ], 'banana':[...]}
    """
    # Match incoming directory with current operating system
    directory = os.path.normpath(directory)
    objects = {}
    # curdir: current directory
    # subdirs: All subdirectories under the current directory
    # files: All file names in the current directory
    for curdir, subdirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.wav'):
                label = curdir.split(os.path.sep)[-1]  # os.path.sep is the path separator
                if label not in objects:
                    objects[label] = []
                # Add the path to the label list
                path = os.path.join(curdir, file)
                objects[label].append(path)
    return objects


# Reading training set data
train_samples = search_file('../machine_learning_date/speeches/training')

"""
2. Make all categories apple Of mfcc Combine them together to form a training set.
    training set:
    train_x: [mfcc1,mfcc2,mfcc3,...],[mfcc1,mfcc2,mfcc3,...]...
    train_y: [apple],[banana]...
From the above training set samples, you can train one for matching apple Of HMM. """

train_x, train_y = [], []
# Traverse Dictionary
for label, filenames in train_samples.items():
    # [('apple', ['url1,,url2...'])
    # [("banana"),("url1,url2,url3...")]...
    mfccs = np.array([])
    for filename in filenames:
        sample_rate, sigs = wf.read(filename)
        mfcc = sf.mfcc(sigs, sample_rate)
        if len(mfccs) == 0:
            mfccs = mfcc
        else:
            mfccs = np.append(mfccs, mfcc, axis=0)
    train_x.append(mfccs)
    train_y.append(label)

# 3. Train the models: 7 categories, 7 models created
models = {}
for mfccs, label in zip(train_x, train_y):
    model = hl.GaussianHMM(n_components=4, covariance_type='diag', n_iter=1000)
    models[label] = model.fit(mfccs)  # {'apple':object, 'banana':object ...}

"""
4. Read the test samples in the testing folder,
    Test set data:
        test_x  [mfcc1, mfcc2, mfcc3...]
        test_y  [apple, banana, lime]
"""
test_samples = search_file('../machine_learning_date/speeches/testing')

test_x, test_y = [], []
for label, filenames in test_samples.items():
    mfccs = np.array([])
    for filename in filenames:
        sample_rate, sigs = wf.read(filename)
        mfcc = sf.mfcc(sigs, sample_rate)
        if len(mfccs) == 0:
            mfccs = mfcc
        else:
            mfccs = np.append(mfccs, mfcc, axis=0)
    test_x.append(mfccs)
    test_y.append(label)


# 5. Test Model
#    1. score scores were calculated for test samples using seven HMM models.
#    2. Select the category of the model with the highest score among the seven models as the prediction category.
pred_test_y = []
for mfccs in test_x:
    # Determine which HMM model mfccs match better
    best_score, best_label = None, None
    # Traverse 7 models
    for label, model in models.items():
        score = model.score(mfccs)
        if (best_score is None) or (best_score < score):
            best_score = score
            best_label = label
    pred_test_y.append(best_label)

print(test_y)   # ['apple', 'banana', 'kiwi', 'lime', 'orange', 'peach', 'pineapple']
print(pred_test_y)  # ['apple', 'banana', 'kiwi', 'lime', 'orange', 'peach', 'pineapple']

I wrote a separate blog post that further explains and analyzes the code above; readers who want to know more can go to https://www.cnblogs.com/LXP-Never/p/11415110.html, where the voice dataset is also available.

5. References

Web site: Scientific computing with python http://old.sebug.net/paper/books/scipydoc/index.html#

python standard library wave module https://docs.python.org/3.6/library/wave.html

Prateek Joshi, Python Machine Learning Cookbook

Introduction to Fourier Transform: http://www.thefouriertransform.com/

Various scales and their corresponding frequencies http://pages.mtu.edu/~suits/notefreqs.html

Code for this blog https://github.com/LXP-Neve/Speech-signal-processing

This website has a lot of voice signal processing code written with NumPy

Sampling rate, bit depth and bit rate