Ifrim Ciprian

Voice Commands - Speech Recognition

Updated: Apr 7, 2022

As I mentioned in one of my previous blog posts, I have moved from the Voice Recognition Module V3 to the Nano 33 BLE Sense, which can run neural networks built with Edge Impulse to understand voice commands.


Edge Impulse is the leading development platform for embedded machine learning, used by over 3,000 enterprises across 62,000+ ML projects worldwide.


The commands I want to recognise, grouped by sample length, are as follows:

  • 2 seconds

  • How is it going to be today?

  • What is the Rainfall Forecast?


  • 1.5 seconds

  • Environment Conditions

  • Cloud Information

  • Discomfort Evaluation


  • 1 second

  • Temperature (Dew Point, Feels Like)

  • Pressure (Barometric Pressure, Sea Level Pressure)

  • Humidity (Humidity, Relative Humidity, Absolute Humidity)

  • Altitude

  • Compass

  • Air Quality

  • Heat Index

  • Health Status (BPM, SpO2, Skin Temperature)

  • Timer

  • Date

  • Current Time

  • Step Counter

The samples have been recorded using the Edge Impulse firmware for the Nano 33 BLE Sense:

Edge Impulse is the leading development platform for machine learning on edge devices, free for developers, and trusted by enterprises worldwide. TinyML enables exciting applications on extremely low-power MCUs. For example, we can detect human motion from just 10 minutes of training data, detect human keywords and classify audio patterns from the environment in real-time.


Once I have all the samples, the Edge Impulse UI looks as follows:

Video:


I have created 50 samples per class, with different intonations and from different distances and angles, including 5 classes of noise: rain, traffic, brown noise, distant chatter and muffled noise.

Then I cropped the samples to keep only the main audio:


These samples are then processed using an MFCC (Mel-Frequency Cepstral Coefficients) block. The coefficients over time look as follows:


In sound processing, the mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Mel-frequency cepstral coefficients are coefficients that collectively make up an MFC.
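As a rough illustration of what the MFCC block computes, here is a minimal Python sketch using librosa (librosa and the file name are assumptions for illustration; Edge Impulse performs the equivalent DSP itself with its own parameters):

```python
# A minimal sketch of MFCC extraction, assuming librosa; the Edge Impulse
# MFCC block performs the equivalent DSP with its own parameters.
import librosa

# "temperature.01.wav" is a hypothetical file name for illustration.
signal, sr = librosa.load("temperature.01.wav", sr=16000)

# 13 coefficients per frame, the same number used in the plots below.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

print(mfcc.shape)  # (13, n_frames): coefficients over time
```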


The samples, simplified using PCA and plotted on a 3D scatter plot, look as follows:

However, we can choose the number of MFCC coefficients to be generated. The plot above uses 13 coefficients per sample.

Here is an example with 16 coefficients:

Here is the scatter plot with only 18 commands and only 1 class of noise (brown noise):
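The feature explorer view above is essentially a PCA projection of each sample's MFCC features down to three dimensions. Here is a rough sketch of the same idea with scikit-learn and matplotlib, using random placeholder arrays in place of the real features and labels:

```python
# Illustrative sketch of the feature-explorer view: reduce each sample's
# flattened MFCC features to three principal components and scatter-plot them.
# The random arrays below are placeholders for the real features and labels.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 13 * 49))   # placeholder MFCC feature matrix
labels = rng.integers(0, 19, size=100)       # placeholder class ids

components = PCA(n_components=3).fit_transform(features)

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(components[:, 0], components[:, 1], components[:, 2], c=labels)
ax.set_xlabel("PC 1"); ax.set_ylabel("PC 2"); ax.set_zlabel("PC 3")
plt.show()
```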

MFCC coefficients contain information about the rate of change in the different spectrum bands.

Traditional MFCC systems use only 8–13 cepstral coefficients. The zeroth coefficient is often excluded, since it represents the average log-energy of the input signal and carries little speaker-specific information.


Then these samples are passed to a Convolutional Neural Network:

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other classification algorithms. While in primitive methods filters are hand-engineered, with enough training, ConvNets have the ability to learn these filters/characteristics.

The architecture of a ConvNet is analogous to that of the connectivity pattern of Neurons in the Human Brain and was inspired by the organization of the Visual Cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the Receptive Field. A collection of such fields overlap to cover the entire visual area.


The network has the following architecture:


The model is trained using TensorFlow and Python:
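For reference, a model of this kind can be expressed in Keras roughly as follows. This is only a sketch in the spirit of Edge Impulse's default 1D-convolutional audio architecture; the layer sizes, the 13 x 49 MFCC input shape and the 19-class output are assumptions, not the exact values from the architecture above:

```python
# Sketch of a small 1D-convolutional keyword-spotting model in Keras, in the
# spirit of the Edge Impulse default audio architecture. Layer sizes, the
# 13 x 49 MFCC input shape and the 19-class output are assumptions.
import tensorflow as tf

NUM_CLASSES = 19          # 18 commands + 1 noise class (assumed)
INPUT_FEATURES = 13 * 49  # 13 MFCCs per frame x 49 frames (assumed)

model = tf.keras.Sequential([
    tf.keras.layers.Reshape((49, 13), input_shape=(INPUT_FEATURES,)),
    tf.keras.layers.Conv1D(8, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Conv1D(16, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.summary()
```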


Video:




Resulting in the following accuracy and confusion matrix:

Here we can see the incorrectly classified samples from the test dataset:

Here we can see the test on the Arduino, with a blinking LED:


Many different runs have been performed with different architectures, layers, neuron counts, dropout rates, learning rates and training epochs, resulting in different levels of performance.

Here are some examples:


1) Training Cycles: 500 - Learning Rate: 0.003 - Validation Set Size: 10% - Data Augmentation: Low Noise, Low Time-Band Masking, Frequency Bands Not Masked:



2) Training Cycles: 800 - Learning Rate: 0.001 - Validation Set Size: 10% - Data Augmentation: High Noise, High Time-Band Masking, Frequency Bands Not Masked:



3) Training Cycles: 600 - Learning Rate: 0.003 - Validation Set Size: 10% - No Data Augmentation:


4) Training Cycles: 500 - Learning Rate: 0.005 - Validation Set Size: 10% - Data Augmentation: High Noise, No Time-Band Masking, Frequency Bands Not Masked:


5) Training Cycles: 500 - Learning Rate: 0.003 - Validation Set Size: 10% - Data Augmentation: High Noise, No Time-Band Masking, Frequency Bands Not Masked:


Many more runs have been done to find the model with the best accuracy, 25 in total.
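For context, the training cycles, learning rate and validation set size listed above map onto a Keras training call roughly like this (the tiny model and random arrays below are stand-ins for the real CNN and the MFCC features/labels exported from Edge Impulse):

```python
# How the hyperparameters above map onto a Keras training call. The model and
# data here are stand-ins: a trivial classifier and random arrays instead of
# the CNN and the MFCC features/one-hot labels exported from Edge Impulse.
import numpy as np
import tensorflow as tf

train_x = np.random.rand(950, 13 * 49).astype("float32")
train_y = tf.keras.utils.to_categorical(np.random.randint(0, 19, 950), 19)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(19, activation="softmax", input_shape=(13 * 49,)),
])

# One of the configurations listed above: 500 training cycles,
# learning rate 0.003, 10% validation set.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.003),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_x, train_y, epochs=500, validation_split=0.10, batch_size=32)
```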

Here is the confusion matrix of the best model:

Here we can see, on a scatter plot, the correctly classified samples as well as the ones that were incorrectly classified:


The on-device processing times and performance are as follows:


Edge Impulse generates a library, and the code looks as follows:

Once modified to output only the classes identified with more than 80% certainty, it performs as follows:
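The gist of the modification, sketched in Python for clarity (the actual change lives in the generated Arduino C++ sketch, so the function and the example outputs here are purely illustrative):

```python
# The gist of the modification: only report a prediction when the classifier
# is more than 80% certain, otherwise treat the window as "no command".
CONFIDENCE_THRESHOLD = 0.80

def report(classification):
    """Return the top label if it clears the threshold, otherwise None."""
    label, value = max(classification.items(), key=lambda kv: kv[1])
    return label if value >= CONFIDENCE_THRESHOLD else None

# Hypothetical inference outputs, for illustration only.
print(report({"temperature": 0.93, "humidity": 0.04, "noise": 0.03}))  # -> temperature
print(report({"temperature": 0.55, "humidity": 0.30, "noise": 0.15}))  # -> None
```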


A live in-browser demo can be seen here:

And here on the Arduino:


It turns out that the Arduino performance is actually lower than the in-browser one.

It also turns out that 50 samples per class is far too low, and that 500-1000 are needed. That is why the performance is lower on the Arduino, and why even in the browser there were some false positives.


However, after some testing, it turns out that changing the sliding window setting from 3 to 1 in the Arduino code, which changes how many milliseconds of audio the device listens to before performing inference, improves the performance.
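A back-of-the-envelope view of why this setting matters, assuming the continuous-inference example splits each model window into a number of slices and runs inference once per slice (the exact constant names in the Arduino sketch may differ):

```python
# With 3 slices per window the classifier runs after every ~333 ms of new
# audio (the rest of the window comes from earlier slices); with 1 slice it
# waits for a full, freshly captured window before classifying.
WINDOW_MS = 1000       # model window for the 1-second commands (assumed)
SAMPLE_RATE = 16000    # microphone sample rate in Hz (assumed)

for slices_per_window in (3, 1):
    slice_ms = WINDOW_MS / slices_per_window
    new_samples = int(SAMPLE_RATE * slice_ms / 1000)
    print(f"{slices_per_window} slice(s) per window: inference every "
          f"{slice_ms:.0f} ms ({new_samples} new samples)")
```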

As can be seen here:


The next step is to improve the on-device accuracy of the neural network classification.


Here are some of the reference links used for the development:

  • https://forum.edgeimpulse.com/t/webiste-accuracy-higher-than-on-nano-33-ble-sense-speech/3987


