Ifrim Ciprian

Voice Recognition - Final Base Version

Updated: May 12, 2022

After the last two trials of the voice recognition system, I have come to the following conclusions:

  1. A CNN is better optimised for on-device performance and offers a faster classification time.

  2. Compared to Transfer Learning, a CNN lets me define exactly the setup I want in terms of commands, command lengths, number of commands, and processing/architecture.

  3. Longer voice commands are easier to identify than a single keyword.

  4. Recording the voice command first and then classifying it uses less RAM and performs far better.

Armed with this knowledge, I went back to the drawing board, created new commands, new samples and more, and used only recorded inferencing (record first, classify afterwards).


I changed the commands used and the outputs from the device to full phrases of 1.5-2 seconds, and increased the number of samples from 50 to over 200.


1) The first test was performed with only 2 commands, with 260 samples each


The 2 commands were "How is it going to be today?" for weather forecasting and "What are the environment conditions?" for sensor array output from the Nicla Sense.

I recorded all audio samples with Audacity to speed up the process, rather than using the Arduino Nano 33 BLE Sense microphone connected to Edge Impulse.

Then I applied a label to every single sample, with each label bounded by silence on both sides.


After this, I exported all the labelled segments as separate MP3 files to be uploaded to Edge Impulse.


I went with 18 coefficients for the MFCC processing block and the feature space looks as follows:

We can note that the samples for each voice command cluster tightly together and sit far away from the other command.


I trained the CNN for 500 training cycles with a 0.002 learning rate and no data augmentation, using the following architecture:

I get the following Confusion Matrix and feature space:

We can note that the classification takes only 5 ms and the model reached 100% accuracy.

Now, it's time to test it on the device.

I set up the device on a "stand":

The void loop() of the library generated by Edge Impulse was edited as follows, so that it reacts to a serial command to start the recording and classification:


void loop()
{
  // Wait for a command character over the serial port
  while (Serial.available() > 0) {
    char incomingCharacter = Serial.read();
    // '1' triggers one record + classify cycle
    if (incomingCharacter == '1') {
      ei_printf("Starting...\n");
      delay(320);   // give the user time to react before the recording starts
      ei_printf("Recording...\n");

      // Record one window of audio from the on-board microphone
      bool m = microphone_inference_record();
      if (!m) {
        ei_printf("ERR: Failed to record audio...\n");
        return;
      }

      ei_printf("Recording done\n");

      // Wrap the recorded buffer as a signal for the classifier
      signal_t signal;
      signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
      signal.get_data = &microphone_audio_signal_get_data;
      ei_impulse_result_t result = { 0 };

      // Run the impulse (MFCC + CNN) over the recorded audio
      EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);
      if (r != EI_IMPULSE_OK) {
        ei_printf("ERR: Failed to run classifier (%d)\n", r);
        return;
      }

      // Print any class whose score is above 0.5
      for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
        if (result.classification[ix].value > 0.5)
          Serial.println("Highest Prediction Classifier is: " + String(result.classification[ix].label) + "\nAccuracy is: " + String(result.classification[ix].value, 2));
      }
    }
  }
}
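To trigger a run, I simply send the character '1' from the Serial Monitor, and the board prints the recording status followed by the winning class. One detail worth noting is that the print loop above only reports classes whose score clears 0.5; a small variation of it (my own sketch, not part of the generated library) that always reports the top-scoring label, whatever its value, would be:

size_t best_ix = 0;   // index of the highest-scoring class
for (size_t ix = 1; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
  if (result.classification[ix].value > result.classification[best_ix].value) {
    best_ix = ix;
  }
}
Serial.println("Highest Prediction Classifier is: " + String(result.classification[best_ix].label) + "\nAccuracy is: " + String(result.classification[best_ix].value, 2));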

And here we can see the performance:

It is working wonderfully and it always identifies the samples correctly.


2) The second test was performed with 7 commands, with 560 samples each.

The voice commands are the following:

  • How is it going to be today? = ml weather classification, ml rainfall regression


  • What are the environment conditions? = Temp, feels like, dew point, humidity, relative humidity, absolute humidity, barometric pressure, sea-level pressure, water vapor pressure, dry air pressure


  • Details about my location = elevation, compass heading, compass roll, gravity vector


  • Count the number of steps done = total steps, since last command steps, calories burned, distance travelled


  • Tell me current time and date! = time, date


  • Present the health report! = Heart rate, skin temperature, heat index, discomfort index, pressure effect, air quality, voc, co2


  • AURI describe yourself! = project description


Once I had created 560 samples per class, as can be seen in the following video:

And labelled all samples:


I proceeded to Edge Impulse with the same parameters as before, only now with more commands and samples, and reached:

That is an incredible 100% accuracy. Once tested on the device it performed well, although there were some misclassifications here and there.


So I decided to use Edge Impulse's data augmentation tools to add noise and mask frequency/time bands, to increase the accuracy and certainty.

Here are the results:

1) 1.5 second window - high noise+high mask frequency band and time band


2) 1.5 second window - high noise+low mask frequency band and time band

3) 1.5 second window - high noise+no mask frequency band + low time band masking


4) 2 second window - high noise+low mask frequency band and time band masking


5) 2 second window - high noise+no mask frequency band + low time band masking


By analysing the different tests, we can notice that a 2-second window results in 100% classification accuracy, but on the device the 1.5-second window is actually better. That is likely because the voice commands are closer to 1.5 seconds than to 2 seconds.

Furthermore, I also noticed that the user only has 50-100 ms of leeway for speaking the command: there should not be silence at the beginning of the audio, as that results in a misclassified output.

This can be resolved by adding a delay in the code. Human reaction time to an audio stimulus is 150-200 ms, which is the delay I added.


I have also noticed that the library starts recording before the visual output appears, so I increased the delay to 300 ms.
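For clarity, here is a minimal sketch of where that delay sits in the loop shown earlier (the listing above uses 320 ms; the exact value is simply whatever covers the user's reaction time plus the library's head start):

ei_printf("Starting...\n");
delay(300);                               // reaction time (~150-200 ms) plus the recording head start
ei_printf("Recording...\n");
bool m = microphone_inference_record();   // audio capture effectively begins around here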


The best model is the "1.5 second window - high noise + no mask frequency band + low time band masking", which results in amazing performance on the Arduino Nano 33 BLE Sense.


3) Armed with all this knowledge, I went for a third test, this time with 1150 samples per class and 2 extra voice commands, totalling 9 voice commands.

The voice commands are the following:

  • How is it going to be today? = ml weather classification, ml rainfall regression


  • What are the environment conditions? = Temp, feels like, dew point, humidity, relative humidity, absolute humidity, barometric pressure, sea-level pressure, water vapor pressure, dry air pressure


  • Details about my location = elevation, compass heading, compass roll, gravity vector


  • Count the number of steps done = total steps, since last command steps, calories burned, distance travelled


  • Tell me the current time and date! = time, date


  • Present the health report! = Heart rate, skin temperature, heat index, discomfort index, pressure effect, air quality, voc, co2


  • AURI describe yourself! = project description


  • Do you know anything about the clouds? = clouds altitude, clouds temp


  • Thank you for the help! = conversation


The reasoning behind the "Do you know anything about the clouds?" command is to separate it from "What are the environment conditions?", as the environment conditions command can be used indoors as well.

The reasoning behind "Thank you for the help!" is to increase the human-like conversation capabilities of the device.
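To turn a classified label into one of the responses listed above, the winning label can simply be routed to a handler. Below is a minimal dispatch sketch; the label strings and handler names are hypothetical placeholders, not the project's actual class names or output routines:

// Hedged sketch: map the winning label to a device response.
// Label strings and handler names are placeholders only.
void reportWeather()     { Serial.println("-> weather forecast");       }  // "How is it going to be today?"
void reportEnvironment() { Serial.println("-> environment conditions"); }  // "What are the environment conditions?"
void reportLocation()    { Serial.println("-> location details");       }  // "Details about my location"
void reportSteps()       { Serial.println("-> step count");             }  // "Count the number of steps done"
void reportTimeAndDate() { Serial.println("-> time and date");          }  // "Tell me the current time and date!"
void reportHealth()      { Serial.println("-> health report");          }  // "Present the health report!"
void describeProject()   { Serial.println("-> project description");    }  // "AURI describe yourself!"
void reportClouds()      { Serial.println("-> cloud altitude/temp");    }  // "Do you know anything about the clouds?"
void acknowledgeThanks() { Serial.println("-> you're welcome");         }  // "Thank you for the help!"

void handleCommand(const char *label) {
  if      (strcmp(label, "weather") == 0)     reportWeather();
  else if (strcmp(label, "environment") == 0) reportEnvironment();
  else if (strcmp(label, "location") == 0)    reportLocation();
  else if (strcmp(label, "steps") == 0)       reportSteps();
  else if (strcmp(label, "time_date") == 0)   reportTimeAndDate();
  else if (strcmp(label, "health") == 0)      reportHealth();
  else if (strcmp(label, "describe") == 0)    describeProject();
  else if (strcmp(label, "clouds") == 0)      reportClouds();
  else if (strcmp(label, "thanks") == 0)      acknowledgeThanks();
}

In the loop shown earlier, handleCommand() would then be called with the winning label instead of (or alongside) the Serial.println call.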


Here I created 25 different versions and performed an in-depth analysis. To skip about 40 images of feature spaces, training performances, etc., here is my written logbook with all the comments per model and the upgrades made between rounds; the black circles mark the good models that were taken further into the next rounds:

Here is a slideshow of the 30 models/versions tested, with their confusion matrices:


After all of these rounds, I settled on the following:

  • MFCC Block for Audio Processing: 9 Coefficients and a Window of 1.6 seconds

  • Best Architecture: 1D Conv Layer - 16 Neurons, 3 kernel size + dropout rate of 0.25, another 1D Conv Layer - 32 neurons, 3 kernel size + dropout rate 0.25, Flatten Layer

  • Best Training Settings: 140 training cycles, learning rate 0.002, test/train split of 20% (possible because of the large dataset of over 4 hours of 1.5 s samples).

  • Data Augmentation: High Noise + Low Time Bands Masking + Low Frequency Bands Masking, no time axis warp.

As shown here:


With the following Feature Space:

And with the following Confusion Matrix:


I was able to reach 99.79% validation accuracy with a loss of 0.0175, and 99.9% test-set accuracy, all while the CNN has a classification time of 8 ms and a peak RAM usage of 6.5 kB. That is a huge improvement over my first version, which needed 1400 ms of processing time and 36 kB of RAM.


I performed about 15 tests, repeating all 9 commands. Here is a collage of 10 of them: not only was the accuracy 100% across all 15 tests, but the certainty was also very high at 98-100%, with rare cases of 89%.


And here is a video demonstration of the device in action:


Very happy that after such a long journey I can say that the model performs wonderfully on the device!


Here is a table with all models trained with their parameters, performance and comments, for this version of the CNN/Commands:



