Benchmarking ASRs - Results from the first 'Out of the Box' Test on an open source dataset
04 Sep 2021
First ‘Out of the box’ results for 5 ASRs: VOSK vs Sphinx vs Coqui vs Google vs Amazon
1. Overview
In this post, the testing procedure is walked through on the initial small dataset that is included with SpeechLoop. Speech systems are updated continuously, so it is sometimes hard to keep track of each system’s latest WER. This tool allows easy benchmarking on a given dataset to answer that question.
Although the test itself has a few limitations (it is very small and the simplest one included), it may be useful for those who want to get up and running quickly and evaluate different speech systems.
2. The Data
There are 78 WAVs of a single female speaker saying very short nouns (mostly single words).
FileCount: 78
TotalLength: 00:01:15
ReadType: read
AudioQualityType: high
AudioNoiseType: none
MicQuality: high
AudioDist: near-field
Persons: solo
Language: EN
Accent: US
Gender: female
Filetype: wav
AudioFile: Signed 16 bit Little Endian, Rate 16000 Hz, Mono
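A quick way to double-check that the files match this spec is with Python’s built-in wave module. This is just a sketch; the data/simple_test/ glob path is an assumption based on the CSV location used in the run command later on.
# Sketch: verify each WAV is 16-bit signed PCM, 16 kHz, mono.
# The data/simple_test/ path is an assumption based on the CSV used later.
import glob
import wave

for path in sorted(glob.glob("data/simple_test/**/*.wav", recursive=True)):
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1, f"{path}: expected mono"
        assert w.getsampwidth() == 2, f"{path}: expected 16-bit samples"
        assert w.getframerate() == 16000, f"{path}: expected 16 kHz"
print("all WAVs match the expected format")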
Although this isn’t a great or comprehensive test on its own, it is another data point that helps paint a picture of how good each ASR is in real-world settings. The details of the dataset can be found here.
Single-word commands can be harder for ASRs when WER is the metric, for two reasons. Firstly, because the utterances are very short, there is less information to evaluate: recognition relies more on acoustic matching than on pure unigram probabilities. Secondly, a single wrong character makes the whole word incorrect, which contributes a very large error to the overall score. This can make it a useful test to run!
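To make the second point concrete, here is a minimal word-level WER sketch (a generic edit-distance implementation, not SpeechLoop’s own scoring code): a single wrong character in a one-word utterance scores 100% WER, while the same mistake inside a longer sentence costs far less.
# Generic word-level WER sketch: (substitutions + deletions + insertions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("oars", "ours"))                                          # 1.0 -> 100% WER
print(wer("the oars are in the boat", "the ours are in the boat"))  # ~0.17 -> 17% WER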
3. Setup
Since this is the first benchmark, every step is documented in detail so that others can follow along.
Since we are planning to use GoogleASR and AWSTranscribe, we need to set up the credentials. The suggested way is in ~/.bashrc:
#AWS
export AWS_ACCESS_KEY_ID=<accesskeyhere>
export AWS_SECRET_ACCESS_KEY=<secret_access_key_here>
#GOOGLE
export GOOGLE_APPLICATION_CREDENTIALS=/home/you/path/to/your/creds.json
Then re-source the bashrc with: source ~/.bashrc
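A quick optional sanity check that the credentials are now visible to Python (just a helper sketch, not part of SpeechLoop):
# Sketch: confirm the cloud credentials are visible before running anything.
import os

for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "GOOGLE_APPLICATION_CREDENTIALS"):
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")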
Next, set up the repo as per the instructions:
git clone https://github.com/robmsmt/SpeechLoop && cd SpeechLoop
python3 -m venv venv/py3
source venv/py3/bin/activate
pip install -r requirements-dev.txt
Run:
cd speechloop
python main.py --input_csv='data/simple_test/simple_test.csv' --wanted_asr=all
4. ASRs
When we run the script we provide --wanted_asr=all, which means (as of writing):
- sp = Sphinx - pocketsphinx 0.1.15
- vs = Vosk - based off of model en-small-us-0.15
- cq = Coqui - based off of model v0.9.3
- gg = GoogleASR - 21-09-06
- aw = AwsTranscribe - 21-09-06
The last two are cloud ASRs, so the only version information we can record is the date of the test.
5. Results
After some minor tidying up (see the Jupyter notebook for details), a results dataframe is generated which we can export to JSON, as seen below.
Interestingly, all ASRs mistranscribed the WAVs “xebec” and “oars”.
There were many words that ALL ASRs got correct, e.g. “class”, “camera”, “nail”, “skateboard”, “windows”, and “hammer”.
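Both the JSON export and the “which words did every ASR get right or wrong” check are standard pandas operations. The snippet below is a minimal sketch on a made-up two-row dataframe; the real column names and values are in the notebook.
# Minimal sketch on a made-up dataframe; the real column names/values are in the notebook.
import pandas as pd

results_df = pd.DataFrame(
    {
        "transcript": ["camera", "oars"],  # reference text
        "vs": ["camera", "ours"],          # one column per ASR hypothesis (illustrative)
        "cq": ["camera", "or"],
    }
)

# Export the results to JSON.
results_df.to_json("simple_test_results.json", orient="records", indent=2)

# Which rows did every ASR get right / wrong?
asr_cols = ["vs", "cq"]
matches = results_df[asr_cols].eq(results_df["transcript"], axis=0)
print(results_df[matches.all(axis=1)])     # transcribed correctly by all ASRs
print(results_df[(~matches).all(axis=1)])  # mistranscribed by all ASRs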
6. Conclusion
In summary, after applying transcription corrections to each ASR’s output, the results are as follows.
Overall the results are somewhat expected. Cloud ASRs tend to dominate, with VOSK being a very good modern alternative. Coqui had surprisingly poor results, but it is expected to do better on longer sentences or with a customized language model.
For the next tests, we will attempt to tune each ASR for the above dataset (where possible), as well as trying everyone’s favourite open-source dataset…!