Representing speech for machine learning models: Spectrogram, MFCC, and feature extraction

Machine Learning is easy, at least on a superficial level. You have some numerical array (maybe you call them tensors, because of the high dimensionality, sometimes). That's your input. Sometimes you throw away some features, if you have too many dimensions. Then you import some model from some library and just call `, y_train)`. That's it! Your much-hyped ML model is ready to predict something.

Everything sounds simple enough! Let me complicate things a little. You want to predict the label in the Iris dataset? That's easy: you can feed the input data to your model as-is. Now think about sound as input. Say I want to decide which emotion the tone of a voice sample carries. How would I give the audio input to my model?

Let's get an idea. Run the following code with your favourite audio file.
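A minimal sketch of what that code looks like; I synthesize a 10-second tone with numpy here so it runs anywhere, but with a real file you would typically use a loader such as `librosa.load` (that loader choice is my assumption, any audio loader gives you the same kind of array):

```python
import numpy as np

# Stand-in for loading a real audio file:
# with librosa you would do something like  y, sr = librosa.load(path, sr=None)
sr = 44_000        # samples per second (44 kHz)
duration = 10      # seconds
t = np.linspace(0, duration, sr * duration, endpoint=False)
y = np.sin(2 * np.pi * 200 * t)   # a pure 200 Hz tone, one sample per time step

print(y.shape)  # (440000,)
```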

For a 10-second file sampled at 44 kHz you will have an array with 440,000 entries. Most of that data is useless, just like most of the pixels in an image file. So although you can feed the raw array directly to your model, chances are high that you will not get anything useful out of it.

This is the reason we use more efficient representations of speech, mostly spectrograms or MFCCs (well, they are almost the same thing in a physical sense).

A few things need to be remembered here. If we are using a compressed feature set, we are obviously discarding some information. But the hope is that a good encoding technique discards most of the noise and keeps most of the useful features. Still, the question remains, at least on a superficial level: if we kept all the features (assuming cost/hardware were not an issue), would that give a better result? If you want to study this in detail, the Curse of Dimensionality is a good place to start.

So let's start with: what is a spectrogram? It's simple:

  1. First, it decomposes a signal into its component frequencies.
  2. Then, it plots those frequencies over time.

For example, take this:


The left part is a wave at 200 Hz, the way we usually see it in a basic science book. The right representation actually shows which frequencies are present at what time. Still, this is not enough, because in real life not all frequencies present in a signal have the same amplitude (that is how we separate foreground sound from background noise). So a spectrogram actually carries three dimensions of information: frequency and amplitude over time. To represent that 3D array on a 2D monitor, we simply color-code the amplitude. For example, take this one:


This is a real example taken from a Kaggle notebook. Here the different colors denote different levels of amplitude. One thing to remember: in most practical cases we use a log scale for amplitude (i.e. for the colors of the spectrogram). That's it. Almost everything we do with sound (or any kind of signal processing) starts from here.
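The Kaggle notebook's code is not reproduced here, but computing such a log-scale spectrogram takes only a few lines with scipy; this sketch uses a synthetic 200 Hz tone in place of a real recording:

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic stand-in for a real audio file: a 200 Hz tone.
sr = 8000
t = np.linspace(0, 2, 2 * sr, endpoint=False)
y = np.sin(2 * np.pi * 200 * t)

# f: frequency bins (Hz), times: frame centres (s), Sxx: power per (f, t) cell
f, times, Sxx = spectrogram(y, fs=sr)

# Log-scale the amplitudes, as most practical spectrograms do.
log_Sxx = 10 * np.log10(Sxx + 1e-10)

# The loudest frequency bin should sit near 200 Hz.
peak_freq = f[Sxx.mean(axis=1).argmax()]
print(peak_freq)
```

Plotting `log_Sxx` with `matplotlib.pyplot.pcolormesh(times, f, log_Sxx)` gives exactly the kind of color-coded picture shown above.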

One prerequisite for generating a spectrogram is getting all the component frequencies of a signal and their corresponding amplitudes. To find those, we need some math. Well, there is at least one useful thing I learned in my first Engineering Math paper (although I didn't realize it at the time): a piece of magic named the Fourier transform. Honestly, you don't really need to know what calculation the Fourier transform does under the hood, because you will always have some helper function to do the heavy lifting of computing the spectrogram.
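To see that magic in action, here is a tiny sketch: mix two known tones, run numpy's FFT helper, and watch the component frequencies fall out:

```python
import numpy as np

# A signal with two component frequencies: 200 Hz and 500 Hz.
sr = 4000
t = np.linspace(0, 1, sr, endpoint=False)
y = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 500 * t)

# The Fourier transform decomposes the signal into its component frequencies.
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1 / sr)

# The two largest peaks land exactly on the frequencies we mixed in.
top_two = freqs[np.argsort(spectrum)[-2:]]
print(sorted(top_two))  # [200.0, 500.0]
```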

So what special job does MFCC do, if a spectrogram explains everything? Let's go a bit deeper.

Normally we express frequencies on the Hertz scale, e.g. the way we plotted the spectrogram above. That is fine sometimes, but our hearing is not equally sensitive to all frequencies. The Mel scale is another scale of measurement, just like Hertz, but by definition the Mel scale takes human perception of the audio signal into account.

What this gibberish means is that numbers expressed on the Mel scale are more representative of human perception. Thus a Mel-scale representation usually makes computers understand speech more like humans and less like machines.
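Concretely, the most common Hz-to-Mel conversion (one standard formula, there are minor variants) can be sketched in two lines:

```python
import math

def hz_to_mel(f_hz):
    """A common Hz-to-Mel conversion: m = 2595 * log10(1 + f/700)."""
    return 2595 * math.log10(1 + f_hz / 700)

# Equal steps in Hz are NOT equal steps in mel: the scale compresses
# high frequencies, mirroring how our hearing loses resolution up there.
print(round(hz_to_mel(1000)))  # by design, 1000 Hz lands near 1000 mel
print(hz_to_mel(4000) - hz_to_mel(3000))  # far less than the 0-to-1000 Hz gap
```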

What happens in MFCC is that we derive our audio encoding from a spectrogram expressed on the Mel scale instead of Hertz (yes, there are more steps involved in the actual pipeline, but this is the step that makes it more human-like).
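In practice you would just call a helper like `librosa.feature.mfcc`, but the core steps (spectrogram, Mel filterbank, log, DCT) can be sketched from scratch; this is a simplified version, not the exact pipeline every library uses (window choices, filter shapes, and liftering vary):

```python
import numpy as np
from scipy.signal import stft
from scipy.fft import dct

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc(y, sr, n_filters=26, n_coeffs=13, nperseg=512):
    # Step 1: spectrogram (power spectrum per frame).
    freqs, _, Z = stft(y, fs=sr, nperseg=nperseg)
    power = np.abs(Z) ** 2                       # shape: (n_freqs, n_frames)

    # Step 2: triangular filters spaced evenly on the MEL scale,
    # i.e. the "more like human" step described above.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((nperseg + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i, k] = (r - k) / max(r - c, 1)

    # Step 3: log of the Mel-filtered energies, then
    # Step 4: DCT to decorrelate them; keep the first few coefficients.
    mel_energies = np.log(fbank @ power + 1e-10)
    return dct(mel_energies, axis=0, norm='ortho')[:n_coeffs]

# One second of a 440 Hz tone -> 13 coefficients per frame.
sr = 8000
t = np.linspace(0, 1, sr, endpoint=False)
coeffs = mfcc(np.sin(2 * np.pi * 440 * t), sr)
print(coeffs.shape)
```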

There are a lot of encoding techniques available besides MFCC. But I tried to explain MFCC because it gives nice results in most cases. And honestly, we don't need to work on signal representation itself unless we are doing Signal Engineering rather than Machine Learning. For ML engineers, MFCC does just fine most of the time.

Anyway, it is already 3 in the morning, and it's time to finish this post. I'll try to write about other things as and when I come across them.
