Speech recognition, custom vocabulary, language modeling with deep learning

Okay, so this is a new post after a hell of a lot of days. The last one was about currency trading. Yeah, my subjects are wild. It has been nearly 10 months, mostly because I was doing nothing. Killing time on weekdays and getting drunk on weekends (the weekend part is still true) were my only activities. Anyway, recently I have been working on speech recognition, getting my hands dirty with both the traditional approach and the deep learning based approach. The first thing that came to mind was Kaldi. The traditional approach to any problem has no glamour these days, but it still has some advantages. For example,

  • You will not get random words as output. In practice this can occasionally be a disadvantage, but it is mostly an advantage, because in most business scenarios people don’t scream random shit. You want to see what I mean? Go to Nervana’s DeepSpeech implementation trained on the WSJ dataset. I’ll also give you a glimpse.
Ground truth: united presidential is a life insurance company
Model output: younited presidentiol is a lefe in surance company

Ground truth: that was certainly true last week
Model output: that was sertainly true last week

Ground truth: we’re not ready to say we’re in technical default a spokesman said
Model output: we’re now ready to say we’re intechnical default a spokesman said

So if you don’t want that, you also have to teach your model to spell English words the way they are written, e.g. it should produce united, not younited. Now you are starting to see the problem with a half-baked deep learning model. Let me go even deeper,

  • A very important criterion for any model is how well it generalizes. If I train my model on audiobook data (say LibriSpeech), will the model know that there is a word called Nvidia?

So the problem is how my model can understand domain-specific terms. The solution is easy: retrain the model. If I’m using a DeepSpeech kind of solution, I need to retrain the whole deep learning network, which will take days at the very least, even with some beasts at your disposal (GTX 1080Ti, Quadro). With the traditional Kaldi approach we just need to recreate the language model. Now let me explain the difference between the Kaldi approach and the DeepSpeech approach.



Here, in Kaldi, the finite state transducer uses the language model to find the meaningful words from phonemes (a phoneme is one of the units of sound that distinguish one word from another in a particular language, as per Wikipedia). The disadvantage is that you need a dictionary of meaningful words with their correct phonetic pronunciation, and the system will never produce a word outside it. But in all practicality it produces better results (unless you are Baidu, who claim a WER near 10% when training on their own data), because a half-baked DeepSpeech model often outputs words that don’t mean anything at all (see the example above).
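WER (word error rate) is the number of word substitutions, insertions and deletions needed to turn the model output into the ground truth, divided by the number of words in the ground truth. Here is a minimal Python sketch of that calculation, assuming plain whitespace tokenization; `word_error_rate` is my own helper name for illustration, not something that ships with Kaldi or DeepSpeech.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Second example pair from earlier in this post: one wrong word out of six.
wer = word_error_rate(
    "that was certainly true last week",
    "that was sertainly true last week",
)
print(f"WER = {wer:.3f}")
```

Note that a single misspelled word counts as a full word error, which is why character-level slips like sertainly hurt a model’s WER so much.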

Fair enough. Now it is clear from the above diagram that when you are using Kaldi you are using two models:

  1. The first one converts your speech to the corresponding phonemes.
  2. The second one maps that phonetic form to written English words. For this stage you need:
    • A dictionary that maps individual words to their phonetic pronunciation, e.g.
      nvidia -> eh n v ih d iy ah

      and similar entries. Ideally, the dictionary should contain every word used in the language. (How will you get this? I’ll explain in a moment.)

    • A language model. This part is important. This model describes how the words in the dictionary are used in the real world. Let’s understand it with an example. The language model used in Kaldi is stored in the $KALDI_HOME/egs/aspire/s5/data/local/lm/3gram-mincount file by default. If you open the file and go to the 3-grams section, you will see something like this,
-0.53045 kinda fit together
-1.47707 kinda a hard
-1.97360 kinda a high

What this means is extremely simple. The number at the start of each line is a probability for those three words in that language, stored as a base-10 logarithm. Strictly speaking, an ARPA-format file stores the conditional probability of the last word given the words before it, so the first line above says that the probability of together following kinda fit in English is 10^-0.53045, i.e. about 0.2948. The same goes for n-grams of every other order (2, 3, 4, 5, 6-grams). If you want to go through the details, see here.
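To make the arithmetic concrete, here is a small Python sketch that reads lines shaped like the three shown above and converts the base-10 log value back to a plain probability. It assumes each line is a log value followed by the words of the n-gram, with no trailing back-off weight (which is how highest-order entries normally look); `parse_ngram_line` is a name I made up for illustration.

```python
def parse_ngram_line(line):
    """Split an n-gram line into (log10 probability, tuple of words)."""
    fields = line.split()
    log10_prob = float(fields[0])
    words = tuple(fields[1:])
    return log10_prob, words


sample_lines = [
    "-0.53045 kinda fit together",
    "-1.47707 kinda a hard",
    "-1.97360 kinda a high",
]

for line in sample_lines:
    log10_prob, words = parse_ngram_line(line)
    probability = 10 ** log10_prob  # undo the base-10 logarithm
    print(f"{' '.join(words):<20} log10 = {log10_prob:9.5f}  p = {probability:.4f}")
```

The first line comes out at roughly 0.2948, matching the number quoted above. Storing logs rather than raw probabilities lets a decoder add scores instead of multiplying many tiny numbers together.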

As you can see, as long as the language is pronounced the same way, you do not need any speech samples at all to retrain a Kaldi model for new words: a new dictionary and a new language model are enough.

Now I will create a fresh language model from a wiki corpus to explain how the whole thing works. But before that, let me give you a fair warning about using a Wikipedia corpus. The way we speak a language and the way we write it are entirely different, so in a production system we should not use a corpus of written articles for speech recognition.

One more tiny thing before we move on: once you download the wiki corpus (from any source), you need to remove all HTML/XML tags from the text and combine all the files (if there are multiple files) into a single text file called corpus.txt.
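One rough way to do that cleanup in Python is sketched below. It assumes the downloaded corpus is a directory of .txt files, and it strips tags with a crude regular expression, which is fine for simple markup but is not a real HTML/XML parser. The function names and the directory layout are my assumptions, not part of any wiki dump tool.

```python
import re
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")  # crude match for anything that looks like a tag


def strip_tags(text):
    """Replace tags with spaces, then squeeze repeated spaces and tabs."""
    without_tags = TAG_RE.sub(" ", text)
    return re.sub(r"[ \t]+", " ", without_tags).strip()


def build_corpus(input_dir, output_path):
    """Clean every .txt file in input_dir and append it to one corpus file."""
    output_path = Path(output_path)
    # Collect the sources first so the output file is never read as input.
    sources = [
        path
        for path in sorted(Path(input_dir).glob("*.txt"))
        if path.resolve() != output_path.resolve()
    ]
    with output_path.open("w", encoding="utf-8") as out:
        for path in sources:
            out.write(strip_tags(path.read_text(encoding="utf-8")) + "\n")
    return len(sources)


print(strip_tags("<doc id='1'>Hello <b>world</b></doc>"))
```

With a hypothetical directory named wiki_dump, calling build_corpus("wiki_dump", "corpus.txt") would write the combined, tag-free text to corpus.txt.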

As explained before, let us first create the dictionary for our speech recognition system. First, let’s get all the unique words that appear in the corpus,

$ grep -oE "[A-Za-z'-]+" corpus.txt | tr '[:lower:]' '[:upper:]' | sort -u > words.txt

This command should create a file words.txt with all the unique words used. Great, now we need to say how all these words are pronounced. The ideal way to do that would be to map every word to its pronunciation manually. To be frank, that work is tedious and boring, and we are smarter than that. So we will use another machine learning tool to generate the pronunciations. Just install a g2p (grapheme-to-phoneme) package. I’ll use this one, installation steps already mentioned here.

This is supposed to take some time, so leave it running for the night. Once it is complete, you will see a words.dic file containing all the unique words and their corresponding pronunciations.
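If you want to sanity-check the result, the Python sketch below loads a pronunciation dictionary and lists the words of a sentence that are missing from it. It assumes one entry per line, with the word first and its phonemes after it separated by whitespace, and upper-case words as in words.txt; the exact layout of your words.dic depends on the g2p tool, so treat the format and both function names as assumptions.

```python
def parse_lexicon(lines):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    lexicon = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and words without a pronunciation
        word, phonemes = fields[0], fields[1:]
        lexicon.setdefault(word, []).append(phonemes)
    return lexicon


def out_of_vocabulary(text, lexicon):
    """Return the words in text that the lexicon cannot produce, sorted."""
    return sorted({word for word in text.upper().split() if word not in lexicon})


# Single entry taken from the nvidia example earlier in this post.
lexicon = parse_lexicon(["NVIDIA EH N V IH D IY AH"])
print(lexicon["NVIDIA"])
print(out_of_vocabulary("nvidia kinda", lexicon))
```

To check the real file, pass an open file handle to parse_lexicon instead of the one-line list. Any word reported by out_of_vocabulary can never appear in the recognizer’s output until it is added to the dictionary.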
Whoa! The initial dictionary for our language is ready. Now let’s move on and create a language model,

That’s it. The new file named lm.arpa is your very own language model. Now, to avoid confusion, I will create a new directory to store our model and copy our dictionary, language model and phones into it. Then I’ll create a graph to be used with OpenFst,

This is also supposed to take some time. Once it is done, your very own automatic speech recognition system is ready. To run online speech recognition, just create a new script called run_new_model.sh in the $KALDI_HOME/egs/aspire/s5 directory.

It’s ready to go. Just run this command,


$ ./run_new_model.sh my_audio.wav

You’ll get your text from the audio. There is only one thing to remember: your audio needs to have an 8 kHz sampling rate and a single channel (mono). That’s a requirement of the aspire model itself. If you are not sure, just run the following command,


$ ffmpeg -i <your_original_file.wav> -acodec pcm_s16le -ac 1 -ar 8000 your_8khz_file.wav

If ffmpeg is not already installed, just install it. It might be a very handy tool if you really want to explore speech and sound,

$ sudo apt-get install ffmpeg
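If you would rather check a file before converting it, Python’s standard wave module can read the header of an uncompressed WAV file. The sketch below reports the sampling rate and channel count and compares them with the 8 kHz, 1 channel requirement mentioned above; `wav_format` and `is_aspire_ready` are helper names I made up for illustration.

```python
import wave


def wav_format(source):
    """Return (sampling rate in Hz, channel count, bytes per sample)."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth()


def is_aspire_ready(source):
    """True if the WAV is 8000 Hz and mono, as the aspire model expects."""
    rate, channels, _sample_width = wav_format(source)
    return rate == 8000 and channels == 1


if __name__ == "__main__":
    import sys

    # Usage: python check_wav.py my_audio.wav
    for name in sys.argv[1:]:
        print(name, wav_format(name), "ok" if is_aspire_ready(name) else "needs conversion")
```

The wave module only handles uncompressed PCM WAV files, so for anything else (mp3, flac and so on) go straight to the ffmpeg command.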

Lastly, you can read more about the aspire model here; some of the code in this post is borrowed from there.
