Automatic Speech Recognition Pipeline

Engeler Roman
7 min read · May 25, 2021

Motivation

The goal of this post is to share our passion for and knowledge of deep learning. What fascinates us in particular is using these complex systems for the greater good.

The key contribution is an explanation of how to bring the different parts of an Automatic Speech Recognition (ASR) system together. We therefore briefly explain the setup and then focus on the building blocks of the pipeline and their training. The post concludes with our learnings and potential applications for ASR systems.

This is joint work with Andreas Schlaginhaufen.

Challenge

In Senegal, around 50% of the population is illiterate, which is a major problem when, for example, using public transport. The Automatic Speech Recognition in WOLOF challenge addresses this problem by developing an algorithm that recognises speech and outputs text.

Wolof is an official language in a few countries in West Africa, with 5.5M native speakers. The dataset consists of around 6,500 short audio recordings (~10h), mainly covering public transport stops.

Pipeline Building Blocks

The pipeline consists of three distinct building blocks: the pre-processing, a speech to text model and a subsequent post-processing of the probability predictions.

Pipeline for automatic speech recognition with pre-processing, speech to text model and post-processing

Pre-processing

The input audio samples need to be resampled to a 16kHz sampling frequency, as required by the specific speech to text model choice (wav2vec 2.0). When using batches (i.e. in training), the inputs need to be padded to the same length. The text labels are lowercased and special characters are removed.
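
As a minimal sketch, the pre-processing could look as follows (the file path, the example transcript and the exact set of characters to strip are placeholders, not taken from the challenge code):

    import re
    import librosa

    # Resample an audio file to the 16kHz expected by wav2vec 2.0.
    # "sample.wav" is a placeholder path, not a file from the challenge dataset.
    speech, sample_rate = librosa.load("sample.wav", sr=16_000)

    # Lowercase a transcript and strip special characters; the exact set of
    # characters to remove is an assumption and should match the vocabulary
    # built later for the tokenizer.
    CHARS_TO_REMOVE = r"[\,\?\.\!\-\;\:\"]"

    def normalize_transcript(text: str) -> str:
        return re.sub(CHARS_TO_REMOVE, "", text).lower()

    print(normalize_transcript("Arrêt Marché Sandaga!"))  # -> "arrêt marché sandaga"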

Speech to Text Model

In ASR, the state of the art are transformers pretrained in an unsupervised fashion. Facebook’s recently released wav2vec 2.0 model has achieved a breakthrough in terms of minimising the Word Error Rate (WER). It has been pre-trained on ~1,000 hrs of English audio recordings (~50k hrs with augmentation), requiring 64 V100 GPUs for 1.6 days. This poses a major problem for two reasons: (a) for most languages no high-quality dataset of this size exists, and (b) most individuals do not have the necessary hardware resources available.

To address these two problems, Facebook pretrained the same architecture on a collection of over 50 languages in Unsupervised Cross-Lingual Representation Learning for Speech Recognition. The pre-trained models are available here, with variants for different languages. They observe that fine-tuning with as little as 1 hr of data achieves a WER below 10%. For more details about the learning algorithm, check out the official Facebook blog post.

The model architecture starts with a temporal convolutional feature extractor, followed by a transformer with multiple layers and attention heads. Pre-trained weights are available here.
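
For illustration, a pre-trained checkpoint can be restored with the huggingface transformers library and its architecture inspected (the checkpoint name below is the multilingual XLSR variant and is an assumption, not necessarily the exact one we used):

    from transformers import Wav2Vec2Model

    # Restore a pre-trained (multilingual) wav2vec 2.0 checkpoint.
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

    # The config mirrors the architecture described above: a temporal
    # convolutional feature extractor followed by a transformer.
    cfg = model.config
    print(cfg.conv_dim)             # channel sizes of the convolutional feature extractor
    print(cfg.num_hidden_layers)    # number of transformer layers
    print(cfg.num_attention_heads)  # attention heads per layer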

Post-processing

It is a common approach to include a Language Model (LM) to further improve the predictions, i.e. to reduce the WER. The LM is trained on a corpus of the specific language to predict the next word given some context. In this way, the syntax of the language is learned and fused into the predictions.

Training

We discuss the assembly of the pipeline in this section, as it is closely intertwined with training. As a reminder, the pipeline can be divided into the pre-processing, the speech to text model and the post-processing.

Pre-Processing

Pre-processing consisting of sampling of input and tokenization of output

The resampling to 16kHz is performed with the librosa library. For input augmentation, we applied SpecAugment, which uses a combination of time masking, frequency masking and time warping.

For the pre-processing, the huggingface library provides the Wav2Vec2CTCTokenizer to tokenize the text labels (and later decode the predictions). For datasets that fit well into GPU memory, this step can be performed prior to the training loop to allow for fast loading.
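
A rough sketch of how the character vocabulary and tokenizer can be set up (the transcripts below are illustrative placeholders; the special-token names follow the usual huggingface convention):

    import json
    from transformers import Wav2Vec2CTCTokenizer

    # Hypothetical, already normalised transcripts from the training set.
    transcripts = ["ndakaaru", "arrêt marché sandaga"]

    # Build a character-level vocabulary; "|" replaces the space character,
    # as expected by the CTC tokenizer.
    vocab = {c: i for i, c in enumerate(sorted(set("".join(transcripts))))}
    vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)

    with open("vocab.json", "w") as f:
        json.dump(vocab, f)

    tokenizer = Wav2Vec2CTCTokenizer(
        "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
    )
    print(tokenizer("arrêt marché").input_ids)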

Speech to Text Model

Automatic speech recognition model

The vocabulary size determines the output dimension, as the model provides a probability distribution (softmax) over the vocabulary. We have chosen a character-level output for two main reasons. On the one hand, the last layer is easier to learn since the output dimension is considerably smaller (~50 for character output vs ~10k for word output with truncation). On the other hand, with the characters of an alphabet you can, in theory, reconstruct words unseen at training time.

The huggingface library also contains Wav2Vec2ForCTC, which allows restoring pretrained models and training them with a Connectionist Temporal Classification (CTC) loss. We used the huggingface Trainer for the training, a wrapper that takes the model, the dataset and the metric as arguments. For readers interested specifically in the fine-tuning, we can recommend the following post.
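
A condensed sketch of how the pieces are wired together, reusing the tokenizer from the sketch above (the checkpoint name and output directory are placeholders, and train_dataset, eval_dataset, data_collator and compute_metrics are assumed to be prepared beforehand, e.g. as in the post linked above):

    from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC

    # Restore the pre-trained checkpoint and attach a randomly initialised
    # CTC head sized to our character vocabulary.
    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        vocab_size=len(tokenizer),
        pad_token_id=tokenizer.pad_token_id,
    )
    model.freeze_feature_extractor()  # keep the convolutional extractor frozen

    training_args = TrainingArguments(
        output_dir="./wav2vec2-wolof",
        num_train_epochs=20,
        evaluation_strategy="steps",
        report_to="wandb",            # log metrics to wandb (see below)
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,      # assumed to be prepared beforehand
        eval_dataset=eval_dataset,
        data_collator=data_collator,      # pads audio and labels per batch
        compute_metrics=compute_metrics,  # e.g. the word error rate
    )
    trainer.train()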

For the logging, we have opted for wandb, which is a super cool developer tool that tracks all kinds of parameters and metrics and has hyper-parameter search integrated (grid search, Bayesian optimisation, random search).

To achieve optimal performance with the fine-tuning, we performed a random search over the initial learning rate, the dropout probability and the mask time probability. Additionally, we investigated different learning rate schedules, such as constant, linear decay and cosine. The best-performing hyperparameters are as follows:

  • initial learning rate = 4.4e-4
  • dropout probability = 0.024 (attention, hidden, layer)
  • mask time probability = 0.057

with a linearly decaying learning rate.
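
For reference, these values map onto the huggingface configuration and training arguments roughly as follows (field names from Wav2Vec2Config and TrainingArguments; mapping the "layer" dropout to layerdrop is our assumption):

    # Keyword arguments for Wav2Vec2ForCTC.from_pretrained(...)
    model_kwargs = dict(
        attention_dropout=0.024,
        hidden_dropout=0.024,
        layerdrop=0.024,        # the "layer" dropout from the list above
        mask_time_prob=0.057,
    )

    # Keyword arguments for TrainingArguments(...)
    training_kwargs = dict(
        learning_rate=4.4e-4,
        lr_scheduler_type="linear",  # linearly decaying learning rate
    )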

Post-Processing

Post-processing block with beam search, language model and nearest neighbour search in vocabulary

The purpose of this block is to more explicitly encode and exploit the syntax of the language to further improve the predictions. We have experimented with n-grams on both a character and a word level; the word-level n-grams outperformed the character-level ones. Another study has shown that n=3 achieves the best results. There is a trade-off involved: a larger n improves performance but also requires much more data, since a look-up table of dimension vocabulary^n is built, which grows exponentially in n. The language model has been trained directly on the dataset. It should be noted, though, that best practice would be to train it on a large corpus of the language. We decided not to do that because the data is a mix of Wolof and French (the latter is used for the public transport stops), so it was not immediately obvious how to train on both languages simultaneously; besides, there is no large online corpus available for Wolof.
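
To make this concrete, here is a toy word-level trigram language model trained by simple counting (the training sentences are illustrative placeholders; a production setup would typically use a dedicated toolkit such as KenLM):

    import math
    from collections import Counter, defaultdict

    def train_trigram_lm(sentences):
        """Count word trigrams, keyed by the two preceding words."""
        trigram_counts = defaultdict(Counter)
        for sentence in sentences:
            words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
            for i in range(2, len(words)):
                trigram_counts[(words[i - 2], words[i - 1])][words[i]] += 1
        return trigram_counts

    def log_prob(sentence, trigram_counts, floor=1e-6):
        """Log-probability of a sentence under the trigram counts."""
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        score = 0.0
        for i in range(2, len(words)):
            context = (words[i - 2], words[i - 1])
            total = sum(trigram_counts[context].values())
            p = trigram_counts[context][words[i]] / total if total else 0.0
            score += math.log(max(p, floor))  # floor unseen trigrams
        return score

    lm = train_trigram_lm(["ndakaaru dieuppeul", "dieuppeul ndakaaru"])
    print(log_prob("ndakaaru dieuppeul", lm))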

To understand how the language model is used, note that the output of the wav2vec 2.0 model is a probability distribution over the vocabulary at each time step. In a first step, a beam search extracts the N=50 highest-probability beams (i.e. sentences) predicted by the ASR model. The language model is then used to evaluate the probability of each of those beams, and the beam with the highest overall probability is chosen as the prediction.
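
Continuing the toy example from above, rescoring then boils down to combining the acoustic score with the language model score and keeping the best beam (the beams, their scores and the weight alpha are illustrative assumptions):

    # (text, acoustic log-probability) pairs returned by the beam search;
    # in our pipeline there are N = 50 such beams.
    beams = [
        ("ndakaaru dieuppeul", -1.2),
        ("ndakaaru dieu peul", -1.1),
    ]

    def rescore(beams, lm, alpha=0.5):
        # alpha weights the language model against the acoustic model.
        scored = [(text, acoustic + alpha * log_prob(text, lm)) for text, acoustic in beams]
        return max(scored, key=lambda pair: pair[1])[0]

    print(rescore(beams, lm))  # reuses lm and log_prob from the sketch above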

A third component is a nearest neighbour search of the predicted words in the training data. The difflib get_close_matches function has been applied to find the nearest neighbour. This lowered the WER by a further 1%-3%, with less optimal models benefiting even more.
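
A minimal sketch of this step (the training vocabulary and the cutoff value are assumptions):

    import difflib

    # Words extracted from the training transcripts (illustrative placeholder).
    train_vocab = ["ndakaaru", "dieuppeul", "sandaga"]

    def snap_to_vocab(prediction: str) -> str:
        """Replace each predicted word by its closest match in the training vocabulary."""
        words = []
        for word in prediction.split():
            matches = difflib.get_close_matches(word, train_vocab, n=1, cutoff=0.8)
            words.append(matches[0] if matches else word)
        return " ".join(words)

    print(snap_to_vocab("ndakaru dieuppeul"))  # -> "ndakaaru dieuppeul"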

Hardware

The model consists of ~300M parameters, and training requires accumulating gradients over multiple steps. Thus, for training, a GPU with approximately 12GB of memory should be at your disposal.
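
As a rough illustration of what this means in practice (the concrete numbers are assumptions, not the exact values used in the challenge), a small per-device batch combined with gradient accumulation keeps the memory footprint within ~12GB:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./wav2vec2-wolof",   # placeholder output path
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,   # effective batch size of 32
        fp16=True,                       # mixed precision further reduces memory
    )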

Google Colab is a viable option for users without access to a sufficiently powerful GPU. Training on this hardware takes around 3 hrs for 20 epochs with the ~6,500 training samples.

Inference

Disclaimer: we observed a mismatch between the performance on the validation set and the test set. The cause is not clear, but it could be a mismatch in the dataset distributions. On the test set, we achieved a WER of 6.8%, which is around state of the art.

On the validation set, the model without any post-processing achieves a WER of 3.9%. The post-processing block, i.e. beam search, language model and nearest neighbour search over the vocabulary, brings the WER down to around 2.3%. This demonstrates the effectiveness of explicitly exploiting the structure of the language.
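
For completeness, the WER itself can be computed, for example, with the jiwer library (the reference and hypothesis sentences below are illustrative):

    import jiwer

    references = ["ndakaaru dieuppeul", "sandaga"]
    hypotheses = ["ndakaaru dieu peul", "sandaga"]

    # Word error rate = (substitutions + insertions + deletions) / reference words.
    print(f"WER: {jiwer.wer(references, hypotheses):.3f}")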

Conclusion

This article has discussed an automatic speech recognition pipeline for a low-resource language. It outlined the different components:

  • pre-processing by adjusting the sampling frequency and tokenization
  • passing the input through the speech to text model
  • post-processing with a language model in conjunction with beam search
  • post-processing by a nearest neighbour search of the prediction over the vocabulary extracted from the training data

Our key learnings are to divide the pipeline into its subparts for troubleshooting and optimisation, and to perform some initial exploratory data analysis to get a feeling for the dataset and its particularities.

Application areas could, and already do, range from smart home solutions and productivity assistants to support for illiterate or blind people. Probably the most famous examples are Siri, Alexa and Google Assistant.

If you find any typos or have further questions, we encourage you to leave a comment. A link to the Github repo with the full code will follow soon. Stay tuned.
