By Ifrim Ciprian

Transfer Learning - A different Voice Recognition Approach

As can be seen from the last two blog posts on the voice recognition model, the best version reached 99% accuracy and was able to recognise my voice commands perfectly when tested on a PC.

However, once moved to the Arduino Nano 33 BLE Sense, the performance was quite poor.

This prompted me to investigate why, and then to speak to some of the Edge Impulse SDK developers on their forums.


Our discussion can be summarised as follows: the Arduino Nano 33 BLE Sense, and speech recognition CNNs in general, require a very large dataset with over 1000 samples per command; my 50 samples per class are far too few. However, the approach can be switched from a plain CNN to Transfer Learning, which starts from an already-trained model and remains usable on very small datasets.


Transfer Learning is a machine learning method where we reuse a pre-trained model as the starting point for a model on a new task. To put it simply—a model trained on one task is repurposed on a second, related task as an optimization that allows rapid progress when modeling the second task.


Edge Impulse uses a model called "MobileNetV1" with a final layer of 128 neurons and a dropout rate of 0.1 (which helps with the overfitting of the model). MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases. They can be built upon for classification, detection, embeddings and segmentation similar to how other popular large scale models, such as Inception, are used.


The developers at Edge Impulse created an optimised version specifically for the Arduino Nano 33 BLE Sense.
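To make the setup concrete, here is a minimal Keras sketch of the same idea: a pre-trained MobileNetV1 backbone is frozen and only a small new head (128 neurons, 0.1 dropout) is trained on the audio spectrograms. The input shape, width multiplier and ImageNet weights are illustrative assumptions; Edge Impulse ships its own pre-trained keyword-spotting backbone.

```python
import tensorflow as tf

NUM_CLASSES = 18  # 17 voice commands + 1 noise class, as used in this project

# Pre-trained backbone, kept frozen so its learned features are reused.
base = tf.keras.applications.MobileNet(
    input_shape=(96, 96, 3),   # spectrogram tiled to 3 channels for this sketch
    alpha=0.25,                # small width multiplier, suited to low-power devices
    include_top=False,
    weights="imagenet",
    pooling="avg",
)
base.trainable = False

# New classification head: 128 neurons + 0.1 dropout, as described above.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```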


Using the same voice commands as before, but the 18-class version (with only one class for noise), we can try this transfer learning approach. It is important to note that, in this case, Transfer Learning only works with MFE, not MFCC.

MFE extracts a spectrogram from audio signals using Mel-filterbank energy features, great for non-voice audio.

The model also requires a 1-second window for audio recognition and 10 coefficients. These parameters cannot be changed!
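For intuition, this is roughly what an MFE block computes over each 1-second window: log mel-filterbank energies. The frame length, hop size and filter count below are assumptions for illustration, since Edge Impulse fixes its own values for this block.

```python
import numpy as np
import librosa

def mfe_features(wav_path, sample_rate=16000):
    # Load exactly one second of audio and zero-pad shorter clips.
    audio, _ = librosa.load(wav_path, sr=sample_rate, duration=1.0)
    audio = np.pad(audio, (0, max(0, sample_rate - len(audio))))
    # Mel-filterbank energies: a mel-scaled spectrogram of the window.
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sample_rate,
        n_fft=512, hop_length=256, n_mels=40,  # assumed framing parameters
    )
    # Log energies form the image-like feature map fed to the network.
    return librosa.power_to_db(mel)
```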


The MFE feature space looks as follows:



1) Performing all the processing with 100 training cycles, a learning rate of 0.005, and a validation set split of 40%, the model reaches an accuracy of 95.7% with a loss of 0.38.


2) With 100 training cycles, a learning rate of 0.002, and a validation set split of 50%, the model reaches an accuracy of 94.6% with a loss of 0.43.


3) By experimenting with the training settings, I reached the best case of 100 training cycles, a learning rate of 0.1, and a 60% validation set split.

This resulted in 95.9% accuracy with a loss of 0.40.
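The three experiments above map onto ordinary training settings: "training cycles" are epochs, the learning rate goes to the optimiser, and the split decides how much data is held out for validation. A small sketch, assuming X and y hold the processed MFE features and one-hot labels:

```python
from sklearn.model_selection import train_test_split
import tensorflow as tf

def train(model, X, y, learning_rate, val_split, cycles=100):
    # Hold out part of the data for validation, keeping class balance.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=val_split, stratify=y.argmax(axis=1))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model.fit(X_train, y_train,
                     validation_data=(X_val, y_val),
                     epochs=cycles)

# Experiment 1: train(model, X, y, learning_rate=0.005, val_split=0.40)
# Experiment 3: train(model, X, y, learning_rate=0.1,   val_split=0.60)
```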

However, I encountered the same issue once again: the model was very accurate on the PC but performed poorly on the device.


So I tried a different MobileNet model, MobileNetV2 with 128 neurons and 0.1 dropout, which has been trained differently and is more capable of classifying with a very low number of samples.

However, MobileNetV2 is NOT optimised for the Arduino Nano at all, so it has to fit within the RAM available, if possible.
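Switching backbones only changes the base model in the earlier sketch; the frozen base, the 128-neuron head and the 0.1 dropout stay the same. Again, the width multiplier and ImageNet weights are illustrative assumptions:

```python
import tensorflow as tf

# MobileNetV2 backbone in place of the V1 one from the earlier sketch.
base_v2 = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3),
    alpha=0.35,                # smallest width multiplier with pre-trained weights
    include_top=False,
    weights="imagenet",
    pooling="avg",
)
base_v2.trainable = False
```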


After taking the best parameters/training settings from the V1 model and switching to the V2 model, it produced the following output:


We can see that the accuracy is higher and the loss is much better. I then ran this model in the browser mode and it looked flawless. However, after porting it to the Arduino Nano 33 BLE Sense, sadly, because it is not optimised, it requires more RAM than the device has on board.


So I went back and retrained the model, but lowered the final layer to 64 and 32 neurons (two different versions) to reduce the RAM needed.
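As a rough way to compare the three head sizes, the sketch below builds the 128-, 64- and 32-neuron versions and converts each to a quantised TensorFlow Lite model to compare their sizes. On-device RAM also depends on the intermediate activation tensors, so this is only a coarse proxy, not the exact figure the Arduino reports.

```python
import tensorflow as tf

def build_head(width, base, num_classes=18):
    # Same frozen backbone, only the final-layer width changes.
    m = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(width, activation="relu"),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    m.build((None, 96, 96, 3))
    return m

for width in (128, 64, 32):
    m = build_head(width, base_v2)
    converter = tf.lite.TFLiteConverter.from_keras_model(m)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantisation
    tflite_model = converter.convert()
    print(f"{width:>3} neurons: {m.count_params():,} params, "
          f"{len(tflite_model) / 1024:.0f} kB TFLite model")
```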

Here are the parameters:


And here is the confusion matrix:


We can see that even with fewer neurons in the final layer, the model performs exactly the same as the version with 128.

However, once the library was built and uploaded to the board, sadly the TinyML model couldn't be loaded because the classifier needs more RAM than what is available.


To conclude, although Transfer Learning for Keyword Spotting is a far better method for classifying speech, it requires much more RAM, which is a limitation of small edge devices such as the Arduino Nano 33 BLE Sense. Furthermore, Keyword Spotting expects a single word within a window of at most 1 second, so it does not suit a mix of word commands and phrase commands.


Although transforming the samples into single keywords and increasing them from 50 to 200 per class would most probably have produced a very accurate model on the device itself, I decided to abandon this approach, as I was very limited in what I could manually change and I would prefer "human-like conversation commands". This aligns with my research and survey (performed in the Literature Review) showing that users/consumers are more inclined to use a speech recognition/voice assistant system if it sounds human and can hold a conversation.



