Voice Recognition Analysis with Mel Frequency Cepstral Coefficient (MFCC)
Combining Mel Frequency Cepstral Coefficient (MFCC) feature extraction with a Convolutional Neural Network (CNN) classifier makes it possible to identify sounds in noisy environments with high accuracy. MFCC is a feature extraction method widely used in speech technology research because it is considered reliable at representing voice signals. Feature extraction is the process of converting a voice signal into a set of parameters, such as cepstral coefficients, or more generally into feature vectors that represent the audio. MFCC is expressed on the mel scale, which mimics the human hearing system: the frequency axis is spaced approximately linearly below 1000 Hz and logarithmically above 1000 Hz, so the features represent sound in a way that is close to how humans perceive it. One advantage of MFCC is that it captures the important information in a voice signal while keeping the amount of data as small as possible, without losing what is needed to recognize the voice.
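The mel scale described above can be sketched in a few lines of Python. The article does not give a formula, so the conversion used here (2595 * log10(1 + f / 700)) is an assumption: it is one commonly used definition, and the function names are illustrative only.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common formula, assumed here)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Doubling the frequency adds fewer and fewer mels as frequency grows,
# which is the compression of high frequencies described in the text.
freqs = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0, 8000.0])
mels = hz_to_mel(freqs)
for f, m in zip(freqs, mels):
    print(f"{f:7.0f} Hz -> {m:7.1f} mel")
```

With this formula, 1000 Hz maps to roughly 1000 mel, and the spacing between higher frequencies shrinks on the mel axis.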
The following are the steps for the voice detection process with MFCC:
Preprocessing
The preprocessing stage is divided into two steps: silence removal and time stretching. Silence removal cleans the dataset of unnecessary pauses so that each recording contains only voice, with no silent gaps. Time stretching changes the duration or speed of an audio signal without affecting its pitch; here it is used to equalize the duration of the recordings after silence removal.
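A minimal sketch of the silence removal step follows, assuming a simple energy threshold: frames whose level falls far below the loudest frame are dropped. The frame length and the -40 dB threshold are illustrative assumptions, not values from the article. Pitch-preserving time stretching needs a phase vocoder or a similar technique and is not shown; libraries such as librosa provide it (for example `librosa.effects.time_stretch`).

```python
import numpy as np

def remove_silence(signal, frame_len=400, threshold_db=-40.0):
    """Keep only frames whose RMS level is above threshold_db,
    measured relative to the loudest frame. A trailing partial
    frame is dropped."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    ref = rms.max() if n_frames > 0 else 0.0
    if ref == 0.0:
        return signal[:0]  # nothing but silence
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-12) / ref)
    return frames[level_db > threshold_db].reshape(-1)

# Synthetic recording: 0.5 s silence, 1 s tone, 0.5 s silence at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)
silence = np.zeros(sr // 2)
recording = np.concatenate([silence, tone, silence])

voiced = remove_silence(recording)
print(len(recording), "->", len(voiced), "samples")
```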
Pre-Emphasis
Pre-emphasis is the first stage in the MFCC model design process. This stage is needed because the signal is often affected by noise, which can reduce the accuracy of the results. Pre-emphasis boosts the high-frequency part of the signal relative to the low-frequency part so that the high-frequency content keeps good signal quality.
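Pre-emphasis is commonly implemented as a first-order high-pass filter, y[n] = x[n] - a * x[n-1]. The sketch below assumes a coefficient of a = 0.97, a typical choice; the article does not state which value was used.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].
    The first sample is passed through unchanged."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# Compare how the filter treats a low and a high tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 100.0 * t)
high = np.sin(2 * np.pi * 4000.0 * t)

gain_low = np.std(pre_emphasis(low)) / np.std(low)
gain_high = np.std(pre_emphasis(high)) / np.std(high)
print(f"gain at 100 Hz:  {gain_low:.3f}")
print(f"gain at 4000 Hz: {gain_high:.3f}")
```

The low tone is strongly attenuated while the high tone is amplified, which is the high-frequency boost described above.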
Frame Blocking
After pre-emphasis, the signal is divided into frames of N samples, with each frame shifted by M samples from the previous one, so adjacent frames overlap when M is smaller than N. N is the frame width, and M is the frame shift. The choice of N is a trade-off: a longer frame gives better frequency resolution, while a shorter frame gives better time resolution.
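Frame blocking can be sketched with NumPy index arithmetic. The values N = 400 and M = 160 (25 ms frames with a 10 ms shift at 16 kHz) are assumptions chosen for illustration; the article does not give the values it used.

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into frames of N = frame_len samples,
    each shifted by M = hop_len samples. Samples at the end that
    do not fill a whole frame are discarded."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds the sample indices i*M .. i*M + N - 1.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

# A ramp makes it easy to see where each frame starts.
signal = np.arange(16000, dtype=float)
frames = frame_signal(signal, frame_len=400, hop_len=160)
print(frames.shape)                   # (number of frames, N)
print(frames[0, 0], frames[1, 0])     # start sample of frames 0 and 1
```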
Windowing
Frame blocking introduces discontinuities at the ends of each frame, so a windowing step is applied to reduce these discontinuities and smooth the spectrum. Commonly used window functions are the Rectangular, Hamming, and Hanning windows. Of the three, the researcher uses the Hanning window because its output is smoother than that of the other functions.
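The Hanning window tapers each frame to zero at both ends, which removes the edge discontinuity. The sketch below writes the window out explicitly and checks it against NumPy's built-in `np.hanning`; the frame size and the all-ones placeholder frames are assumptions for illustration.

```python
import numpy as np

frame_len = 400

# Hanning window: w[n] = 0.5 - 0.5 * cos(2*pi*n / (N - 1))
n = np.arange(frame_len)
window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / (frame_len - 1))

# Placeholder frames (all ones) so the effect of the window is visible.
frames = np.ones((3, frame_len))
windowed = frames * window  # broadcasts the window over every frame

print(windowed[0, 0], windowed[0, frame_len // 2], windowed[0, -1])
```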
Fast Fourier Transform (FFT)
The Fast Fourier Transform is an efficient algorithm for computing the Discrete Fourier Transform (DFT), popularized by Cooley and Tukey, that converts a digital signal from the time domain to the frequency domain. At this stage, each frame is decomposed into sinusoidal components of different frequencies, each expressed as a real part and an imaginary part that together encode the amplitude and phase of that component.
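As a sketch of this step, the code below applies NumPy's real-input FFT to one windowed frame and recovers amplitude and phase from the real and imaginary parts. The 1000 Hz test tone, the 16 kHz sample rate, and the FFT size of 512 are assumptions for illustration.

```python
import numpy as np

sr = 16000
n_fft = 512

# One windowed frame containing a 1000 Hz tone.
t = np.arange(n_fft) / sr
frame = np.sin(2 * np.pi * 1000.0 * t) * np.hanning(n_fft)

spectrum = np.fft.rfft(frame, n=n_fft)    # complex values, n_fft // 2 + 1 bins
magnitude = np.abs(spectrum)              # amplitude of each component
phase = np.angle(spectrum)                # phase of each component
power = magnitude ** 2 / n_fft            # power spectrum used by later stages

freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)  # center frequency of each bin in Hz
peak_hz = freqs[np.argmax(magnitude)]
print("bins:", spectrum.shape[0], "peak at", peak_hz, "Hz")
```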
Mel Frequency Wrapping (MFW)
Mel Frequency Wrapping (MFW) applies a filter bank to measure the energy in each frequency band of the sound signal. The output of the filter bank is called the mel spectrum.
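A mel filter bank is often built from overlapping triangular filters whose centers are evenly spaced on the mel scale. The sketch below assumes that construction, the 2595 * log10(1 + f / 700) mel formula, 26 filters, and a random placeholder power spectrum; none of these specifics come from the article.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale.
    Returns an array of shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: the lower of the two slopes, clipped at zero.
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

sr, n_fft, n_filters = 16000, 512, 26
bank = mel_filter_bank(n_filters, n_fft, sr)

# Placeholder power spectrum of one frame (random values for illustration).
rng = np.random.default_rng(0)
power_spectrum = rng.random(n_fft // 2 + 1)

mel_spectrum = bank @ power_spectrum        # energy in each mel band
log_mel = np.log(mel_spectrum + 1e-10)      # log energies, input to the DCT
print(bank.shape, mel_spectrum.shape)
```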
Discrete Cosine Transform (DCT)
The DCT plays the same role as the inverse Fourier transform. However, its output approximates that of Principal Component Analysis (PCA), a classic statistical method widely used in data analysis and compression. The DCT belongs to the class of sinusoidal unitary transforms and is also widely used in image compression, for example in JPEG files.
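The DCT step can be sketched with a small orthonormal DCT-II written directly in NumPy. Keeping the first 13 coefficients of 26 log mel energies is a common convention and is assumed here; the article does not state how many coefficients were kept, and the input values below are random placeholders.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """Orthonormal DCT-II of a 1-D array, keeping the first n_coeffs terms:
    c[k] = s[k] * sum_m x[m] * cos(pi * k * (m + 0.5) / N)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n)
    scale = np.full(n_coeffs, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (basis @ x)

# Placeholder log mel energies for one frame (random, for illustration).
rng = np.random.default_rng(0)
log_mel = np.log(rng.random(26) + 1e-10)

mfcc = dct_ii(log_mel, n_coeffs=13)   # cepstral coefficients for this frame
print(mfcc.shape)
```

Because the transform is orthonormal, keeping all 26 coefficients would preserve the energy of the input; truncating to 13 keeps the smooth spectral envelope and discards fine detail.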
CNN Model Design
The features extracted by MFCC, in the form of a spectrogram, are used to train and test a CNN. The data is split in a 70:30 ratio, with 70% used for training and 30% for testing. The model learns patterns that let it distinguish three classes: the correct pronunciation of ‘ain, the wrong pronunciation of ‘ain, and pronunciation that is not ‘ain. An accurate model produces a smooth curve that follows the trend of the tested data.
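The 70:30 split can be sketched as a shuffled index split. The dataset below is a random placeholder (100 recordings, each a 13 x 98 MFCC matrix, with labels 0, 1, and 2 standing in for the three classes); these shapes and the function name are assumptions, and the CNN itself is not shown because the article does not describe its architecture.

```python
import numpy as np

def split_70_30(features, labels, seed=0):
    """Shuffle the samples and split them 70% train / 30% test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_train = int(round(0.7 * len(features)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (features[train_idx], features[test_idx],
            labels[train_idx], labels[test_idx])

# Placeholder dataset: 100 MFCC matrices and three class labels.
rng = np.random.default_rng(1)
features = rng.random((100, 13, 98))
labels = rng.integers(0, 3, size=100)

x_train, x_test, y_train, y_test = split_70_30(features, labels)
print(x_train.shape, x_test.shape)
```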
The machine learning model’s performance is evaluated using a confusion matrix, which tabulates the model’s predictions against the actual labels. From the confusion matrix, accuracy is computed as the ratio of correct predictions (both positive and negative) to the total number of samples.
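As a sketch of this evaluation, the code below builds a confusion matrix for three classes and derives accuracy from it. The label arrays are made-up illustrative values, not results from the article.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (actual, predicted) pair
    return cm

def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    return np.trace(cm) / cm.sum()

# Illustrative labels: 0 = correct 'ain, 1 = wrong 'ain, 2 = not 'ain.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
acc = accuracy(cm)
print(cm)
print("accuracy:", acc)
```

Here 7 of the 10 illustrative predictions lie on the diagonal, so the accuracy is 0.7.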