Isolated Word Recognition

 

Overview:

This project develops a speech recognition system based on the dynamic time warping (DTW) approach.  It is a speaker-dependent system, developed and tested on my own voice.  The system's vocabulary is the numbers 0 through 9.

The data was recorded using a NADY SCM 1000 condenser microphone on the cardioid setting, at 8 kHz with 16-bit linear encoding.  The complicated problem of endpoint detection is removed from the project by doing it manually: each word was simply saved as its own file.  Five versions of each word were recorded; the first four are the test words and the last is the reference word.

Dynamic Time Warping:
To understand DTW, two concepts are needed: features, the information in the signal represented in some manner, and distances, a metric used to find a match path.  In this system, the features are Bark-scale frequency coefficients, which mimic the way the human ear perceives sound.  The local distances are computed using the Euclidean distance metric.  This project uses frame-based feature extraction: the filterbank analysis is performed over 25 ms frames (200 samples at fs = 8 kHz) with 50% overlap, using a Hamming window.  The FFT frequency bins are then grouped into Bark-scale bands and averaged within each band.
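The framing step and the local distance can be sketched as follows. This is a minimal Python illustration rather than the project's MATLAB code; the function names are hypothetical, and the rule for how many frames to emit at the end of a file is an assumption.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def frame_signal(samples, frame_len=200, hop=100):
    """Split samples into overlapping, Hamming-windowed frames.

    frame_len=200 and hop=100 correspond to 25 ms frames with 50% overlap
    at fs = 8 kHz. A short last frame is zero-padded to full length.
    """
    if not samples:
        return []
    window = hamming(frame_len)
    extra = max(0, len(samples) - frame_len)
    n_frames = 1 + math.ceil(extra / hop)  # assumption: just enough frames to cover the file
    frames = []
    for i in range(n_frames):
        chunk = list(samples[i * hop : i * hop + frame_len])
        chunk += [0.0] * (frame_len - len(chunk))  # zero-pad the last frame if needed
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, a 450-sample recording yields four 200-sample frames starting at samples 0, 100, 200 and 300, with the last one padded by 50 zeros.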

One of the difficulties in comparing the test word with the reference is that the two may not have the same timing: one may be longer than the other.  Dynamic Time Warping accommodates these differences by allowing a range of steps in time and finding the path that maximizes the local match between the aligned time frames, that is, the path with the smallest accumulated distance.
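A minimal Python sketch of the idea is shown here. It assumes the textbook step pattern (advance in the test, in the reference, or in both) and no path-length normalization; the step constraints actually used in this project's MATLAB code are not stated, so treat those choices and the name dtw_distance as illustrative.

```python
import math

def euclidean(a, b):
    """Local distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(test, ref):
    """Accumulated distance along the best warping path.

    test and ref are lists of feature vectors (one per frame).
    cost[i][j] holds the cheapest way to align the first i test frames
    with the first j reference frames.
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(test[i - 1], ref[j - 1])
            # Assumed step pattern: vertical, horizontal, or diagonal move.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

short = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(short, stretched))  # 0.0: same shape, different timing
```

The stretched sequence has the same content as the short one spoken more slowly, so the warping path absorbs the timing difference and the distance is zero.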

Method:
The basic idea is that the features are extracted, dynamic time warping is used to align the frames, distances are computed, and the distances are then compared to find a match.  The distance between the reference and the test recordings of the same word is measured; if a recording of any other word has a smaller distance, then the system fails.  To make the test stricter, the largest of the same-word distances is used, so that any incorrect match with another word shows up clearly as a failure.
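The pass/fail rule described above can be sketched in Python as follows. The function name and the distances in the example are made up for illustration; they are not measurements from this project.

```python
def find_incorrect_matches(same_word_distances, other_word_distances):
    """Apply the worst-case threshold rule to one reference word.

    same_word_distances: DTW distances from the reference to recordings
        of the same word.
    other_word_distances: (label, distance) pairs for recordings of
        other words.
    Returns the threshold and every other-word recording that scored
    closer to the reference than the worst same-word recording.
    """
    threshold = max(same_word_distances)
    errors = [(label, d) for label, d in other_word_distances if d < threshold]
    return threshold, errors

# Illustrative numbers only.
threshold, errors = find_incorrect_matches(
    [10.2, 11.5, 9.8, 12.1],
    [("one", 20.4), ("three", 11.9), ("four", 25.0)],
)
print(threshold, errors)  # 12.1 [('three', 11.9)]
```

Using the maximum rather than the minimum same-word distance makes the test pessimistic: a single poorly matching recording of the correct word is enough to let a similar-sounding word register as an error.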

Using Dynamic Time Warping and five versions of each of the 10 words, a total of 490 comparisons take place.  To get better accuracy, more comparisons could be made.  Other features could also be extracted alongside the frequency coefficients to get a more fine-tuned comparison.

Details:
- The files are read in and features are extracted.
- The files are windowed and overlapped using the previously mentioned figures (25 ms frames, 50% overlap, Hamming window).
- The FFT is taken and truncated to usable frequencies.
- The frequency bins are divided and averaged into a Bark scale:
     The following numbers are the highest frequencies (in Hz) in each of the 17 Bark bins
    
bark = [100 200 300 400 510 630 770 920 1080 1270 1480 1720 2000 2320 2700 3150 4000];
- The band averages are saved into a feature matrix with 17 columns and as many rows as are necessary to capture all windows of the current file.
- If there are not enough samples for the last window of the file, zeros are appended so that its FFT length is identical to that of all other frames.
- The features are normalized so that comparison to other feature vectors is possible.
- Now the known correct test files are dynamically warped and compared to the reference.  The maximum of these 5 distances is taken to establish whether the system recognizes any other word as a closer match, as previously discussed.
- After the known correct files, the known incorrect files are tested using the same process.
- If there is an error, it is noted and the incorrect number is identified.
- The accuracy is determined as (1 - (the number of incorrectly identified numbers / total passes)) * 100
- A confusion matrix is constructed showing the predicted and actual outputs of the system.
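The Bark-band averaging and normalization steps above can be sketched in Python as follows. Several details are assumptions rather than facts about the project's MATLAB code: a bin lying exactly on a band edge is assigned to the lower band, the normalization scales each frame's vector to unit length (the write-up does not say which normalization was used), and a naive DFT stands in for the FFT to keep the example dependency-free.

```python
import bisect
import cmath
import math

# Upper edge of each of the 17 Bark bands, in Hz (from the list above).
BARK_EDGES = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
              1480, 1720, 2000, 2320, 2700, 3150, 4000]

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (naive DFT; fine for 200-sample frames)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def bark_features(mags, fs=8000):
    """Average the magnitude bins that fall inside each Bark band."""
    n_fft = 2 * (len(mags) - 1)  # assumes an even frame length
    sums = [0.0] * len(BARK_EDGES)
    counts = [0] * len(BARK_EDGES)
    for k, mag in enumerate(mags):
        freq = k * fs / n_fft
        # First band whose upper edge is >= freq (edge bins go to the lower band).
        band = min(bisect.bisect_left(BARK_EDGES, freq), len(BARK_EDGES) - 1)
        sums[band] += mag
        counts[band] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def normalize(vec):
    """Scale a feature vector to unit Euclidean length (assumed scheme)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# A 1000 Hz tone should land in the band whose upper edge is 1080 Hz.
tone = [math.cos(2 * math.pi * 1000 * t / 8000) for t in range(200)]
features = bark_features(magnitude_spectrum(tone))
print(features.index(max(features)))  # 8, i.e. the 920-1080 Hz band
```

With 200-sample frames at 8 kHz the bins are 40 Hz apart, so even the narrowest band (100 Hz wide) receives at least two bins and no band is left empty.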
 

Results:
After finally getting the MATLAB code working, the accuracy was determined to be 98.57%.  This means that 7 numbers were incorrectly identified out of 490 trials.  The confusion matrix is as follows, with the actual word in each row and the predicted word in each column:

Actual   Zero    One    Two  Three   Four   Five    Six  Seven  Eight   Nine
Zero        4      0      0      0      0      0      0      0      0      0
One         4      0      0      0      0      0      0      0      0      0
Two         0      0      1      3      0      0      0      0      0      0
Three       0      0      0      4      0      0      0      0      0      0
Four        0      0      0      0      4      0      0      0      0      0
Five        0      0      0      0      0      4      0      0      0      0
Six         0      0      0      0      1      0      3      0      0      0
Seven       0      0      0      0      0      0      0      4      0      0
Eight       0      0      0      0      0      0      0      0      4      0
Nine        0      2      0      0      0      0      0      0      0      2
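The accuracy figure and the off-diagonal entries of the matrix can be checked with a few lines of Python. The helper names are hypothetical; the counts are the ones reported in this section. Note that each row of the matrix sums to four, so the matrix tallies the 40 test recordings, while the 98.57% figure refers to the 490 comparisons; the two appear to count different things.

```python
WORDS = ["Zero", "One", "Two", "Three", "Four",
         "Five", "Six", "Seven", "Eight", "Nine"]

# Rows: actual word; columns: predicted word (values from the matrix above).
CONFUSION = [
    [4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 3, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 4, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 4, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 4, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 4, 0],
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 2],
]

def accuracy_percent(n_incorrect, n_total):
    """Accuracy as (1 - incorrect / total) * 100."""
    return (1 - n_incorrect / n_total) * 100

def confusions(matrix, labels):
    """List (actual, predicted, count) for every nonzero off-diagonal cell."""
    return [(labels[i], labels[j], count)
            for i, row in enumerate(matrix)
            for j, count in enumerate(row)
            if i != j and count]

print(round(accuracy_percent(7, 490), 2))  # 98.57
print(confusions(CONFUSION, WORDS))
# [('Two', 'Three', 3), ('Six', 'Four', 1), ('Nine', 'One', 2)]
```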


The errors were concentrated in two numbers.  Although the accuracy was very high, the system mostly failed to identify the number two; it recognized it as three in three of the four test words.  The numbers nine and one were also confused, but not to the extent of three and two.  The matrix also shows a single six recognized as four.

The confusion of number nine and one is understandable since they are indeed very similar.  Over 50% of each word is practically identical.  On the other hand, three and two are not easily confused.

I believe that some of the error is due to the reduced bandwidth of the signal.  My voice is particularly dependent on higher frequencies, and my fricatives are not easy to identify with only 4 kHz of bandwidth.
I also believe there might have been a problem with the recording of the numbers.  When I recorded them, I was close to the microphone, which introduced the proximity effect.  Because I have a deep voice and the bandwidth is low, this increased bass response tends to 'muddy up' my recordings.  Vowels can more easily be confused because the fundamental frequency of my voice covers up the envelope of the vowel sound.

If you download the wav files of the numbers and listen to them, you can hear these possible problems.

Conclusions:
The Dynamic Time Warping method was almost successful.  The accuracy was good, but I was not entirely happy with the results.  I would like it to work better, and I believe that with some tweaking and better recordings it could, but as is true at the end of every semester, I simply do not have enough time to troubleshoot it any further.  Perhaps in the future, I will improve the results.

Click Here to download a zip file with the numbers in wav format.

Email me at m-lester@m-lester.com if you are interested in the MATLAB code.

© Copyright M-lester.com