Isolated Word Recognition
Overview:
This project develops a speech recognition system based on the dynamic time warping (DTW) approach. It is a speaker-dependent system, developed and tested on my own voice. The system's vocabulary is the numbers 0 through 9.
The data was recorded using a NADY SCM 1000 condenser microphone on the cardioid setting, at 8 kHz with 16-bit linear encoding. The complicated problem of endpoint detection is removed from the project by doing it manually: each word is simply saved in its own file. Five versions of each word were recorded; the first four are the test words and the last is the reference word.
Dynamic Time Warping:
To understand DTW, two concepts need to be understood: features, which are the information in the signal represented in some manner, and distances, which are some form of metric used to obtain a match path. In this system, the features are Bark scale frequency coefficients, which mimic the way the human ear perceives these words. The local distances are computed using the Euclidean distance metric. This project uses frame-based feature extraction. The filterbank analysis is performed over 25 ms frames (200 points at fs = 8 kHz) with a 50% overlap, using a Hamming window. The FFT frequency bins of each frame are then averaged into Bark scale bins.
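The framing step can be sketched in a few lines. This is a minimal Python illustration, not the project's MATLAB code: the function names `hamming` and `frame_signal` are hypothetical, and zero-padding the final frame follows the description in the Details section.

```python
import math

FS = 8000                      # sample rate in Hz, as stated in the report
FRAME_LEN = round(0.025 * FS)  # 25 ms -> 200 samples
HOP = FRAME_LEN // 2           # 50% overlap -> 100 samples

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(samples, frame_len=FRAME_LEN, hop=HOP):
    """Split samples into overlapping, Hamming-windowed frames.

    The last frame is zero-padded so every frame has the same length.
    """
    window = hamming(frame_len)
    n_frames = max(1, math.ceil((len(samples) - frame_len) / hop) + 1)
    frames = []
    for f in range(n_frames):
        chunk = list(samples[f * hop:f * hop + frame_len])
        chunk += [0.0] * (frame_len - len(chunk))  # pad a short final frame
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

Each returned frame would then be passed to an FFT before the Bark averaging.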
One of the difficulties in comparing the test and the reference is that the words may not have the same timing: one may be longer than the other. Dynamic Time Warping accommodates these differences by allowing a range of steps in time and finding the path that maximizes the local match between the aligned time frames.
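A minimal Python sketch of the alignment follows. It is not the project's MATLAB code. The report only says that "a range of steps" is allowed, so the three-step pattern used here (advance the test, the reference, or both by one frame) is an assumption, and `dtw_distance` is a hypothetical name. Maximizing the match is expressed as minimizing the accumulated Euclidean distance.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(test, ref):
    """Accumulated distance along the best warping path between two
    sequences of feature vectors (one vector per frame)."""
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning test[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(test[i - 1], ref[j - 1])
            # assumed step pattern: vertical, horizontal, or diagonal move
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

With this step pattern, a word spoken more slowly (frames repeated) still aligns to its reference at zero cost, which is the timing tolerance described above.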
Method:
The basic idea is that features are extracted, dynamic time warping is used to align the frames of two recordings, the distances are computed, and the distances are then compared to find a match. The distance between each known-correct test recording and the reference for that word is measured; if a recording of any other word has a smaller distance, the system fails. To help test the system, the largest distance between the reference and the correct recordings is used, so that a failure will be obvious whenever there is an incorrect match with another word.
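One reading of this pass/fail rule, sketched in Python: the largest distance among the known-correct recordings acts as a threshold, and any recording of a different word that scores below it counts as a false match. The function name and the numbers in the usage example are hypothetical, not data from the project.

```python
def find_false_matches(correct_distances, other_distances):
    """Return the (label, distance) pairs that beat the worst correct match.

    correct_distances: DTW distances from the known-correct test
        recordings to the reference.
    other_distances: (label, distance) pairs for recordings of other
        words compared against the same reference.
    """
    threshold = max(correct_distances)  # worst-case correct match
    return [(label, d) for label, d in other_distances if d < threshold]

# Hypothetical distances, for illustration only
correct = [1.2, 0.9, 1.5, 1.1]
others = [("three", 1.4), ("four", 2.3)]
false_matches = find_false_matches(correct, others)
```

Using the maximum rather than the mean of the correct distances makes the test strict: a single poorly matching correct recording is enough to expose a confusable word.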
Using Dynamic Time Warping and 5 versions of each of the 10 words, a total of 490 comparisons take place. To get better accuracy, more comparisons could be made. In addition, features other than the frequency coefficients could be extracted to get a more fine-tuned comparison.
Details:
- The files are read in and features are extracted
- The files are windowed and overlapped using the previously mentioned figures
- The FFT is taken and truncated to usable frequencies
- The frequency bins are divided and averaged into a Bark scale. The following numbers are the highest frequencies (in Hz) in each of the 17 Bark bins:
bark = [100 200 300 400 510 630 770 920 1080 1270 1480 1720 2000 2320 2700 3150 4000];
- These numbers are saved into a feature vector with 17
columns and the number of rows that is necessary to
capture all windows of the current file.
- If there are not enough samples for the last window of the file, zeros are appended to the file so that its FFT length is identical to that of all other frames.
- The features are normalized so comparison to other
feature vectors is possible
- Now the known correct test files are dynamically warped and compared to the reference. The maximum of these 5 distances is taken to establish whether the system recognizes any other word as a close match, as previously discussed.
- After the known correct files, the known incorrect
files are tested using the same processes.
- If there is an error, it is noted and the incorrect
number is identified.
- The accuracy is determined as (1 - (the number of incorrectly identified numbers / total passes)) * 100
- A confusion matrix is constructed showing the
predicted and actual outputs of the system
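The Bark averaging and normalization steps above can be sketched as follows. This is a Python illustration rather than the project's MATLAB code. The band edges are the ones listed above; the bin spacing assumes a 200-point FFT at 8 kHz (40 Hz per bin), and because the report does not say how the features are normalized, scaling each vector to unit length is an assumption. The function names are hypothetical.

```python
import math

# Highest frequency (Hz) in each of the 17 Bark bins, from the list above
BARK_EDGES = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
              1480, 1720, 2000, 2320, 2700, 3150, 4000]

def bark_features(magnitudes, fs=8000, n_fft=200):
    """Average FFT magnitudes (bins 0..n_fft/2) into the 17 Bark bins."""
    sums = [0.0] * len(BARK_EDGES)
    counts = [0] * len(BARK_EDGES)
    for k, mag in enumerate(magnitudes):
        freq = k * fs / n_fft  # centre frequency of FFT bin k
        for band, edge in enumerate(BARK_EDGES):
            if freq <= edge:   # first band whose upper edge covers freq
                sums[band] += mag
                counts[band] += 1
                break
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def normalize(vec):
    """Scale a feature vector to unit Euclidean length (assumed scheme)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)
```

Applying `bark_features` to every frame of a file gives one row of 17 values per frame, matching the feature matrix described in the list.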
Results:
After finally getting the MATLAB code working, the accuracy was determined to be 98.57%. This means that 7 numbers were incorrectly identified out of 490 trials. The confusion matrix is as follows:
Rows are the actual numbers; columns are the predicted numbers.

Actual | Zero | One | Two | Three | Four | Five | Six | Seven | Eight | Nine |
Zero   |  4   |  0  |  0  |   0   |  0   |  0   |  0  |   0   |   0   |  0   |
One    |  0   |  4  |  0  |   0   |  0   |  0   |  0  |   0   |   0   |  0   |
Two    |  0   |  0  |  1  |   3   |  0   |  0   |  0  |   0   |   0   |  0   |
Three  |  0   |  0  |  0  |   4   |  0   |  0   |  0  |   0   |   0   |  0   |
Four   |  0   |  0  |  0  |   0   |  4   |  0   |  0  |   0   |   0   |  0   |
Five   |  0   |  0  |  0  |   0   |  0   |  4   |  0  |   0   |   0   |  0   |
Six    |  0   |  0  |  0  |   0   |  1   |  0   |  3  |   0   |   0   |  0   |
Seven  |  0   |  0  |  0  |   0   |  0   |  0   |  0  |   4   |   0   |  0   |
Eight  |  0   |  0  |  0  |   0   |  0   |  0   |  0  |   0   |   4   |  0   |
Nine   |  0   |  2  |  0  |   0   |  0   |  0   |  0  |   0   |   0   |  2   |
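The accuracy formula and the off-diagonal entries of the matrix can be checked with a short Python sketch. The error and trial counts are the ones quoted in this section, the matrix is transcribed from the table, and the function names are hypothetical.

```python
LABELS = ["Zero", "One", "Two", "Three", "Four",
          "Five", "Six", "Seven", "Eight", "Nine"]

# Rows: actual number, columns: predicted number (transcribed from the table)
CONFUSION = [
    [4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 3, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 4, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 4, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 4, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 4, 0],
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 2],
]

def accuracy_percent(num_errors, num_trials):
    """Accuracy as a percentage: (1 - errors / trials) * 100."""
    return (1 - num_errors / num_trials) * 100

def confusions(matrix, labels):
    """List (actual, predicted, count) for every non-zero off-diagonal cell."""
    return [(labels[i], labels[j], count)
            for i, row in enumerate(matrix)
            for j, count in enumerate(row)
            if i != j and count > 0]

reported = round(accuracy_percent(7, 490), 2)  # figures quoted in the text
mistakes = confusions(CONFUSION, LABELS)
```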
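The system had trouble with two numbers in particular. Although the overall accuracy was very high, the system mostly failed to identify the number two; it recognized it as three in three of the four trials. The numbers nine and one were also confused, in two of four trials, but not to the extent of two and three. The matrix also shows a single six recognized as four.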
The confusion of nine and one is understandable, since the two words are indeed very similar: over 50% of each word is practically identical. Three and two, on the other hand, are not easily confused.
I believe that some of the error is due to the reduced bandwidth of the signal. My voice is particularly dependent on higher frequencies, and my fricatives are not easy to identify with only 4 kHz of bandwidth.
I also believe there might have been an error in the recording of the numbers. When I recorded them, I was close to the microphone, which introduced the proximity effect. The resulting increase in bass response tends to 'muddy up' my recordings, both because I have a deep voice and because of the low bandwidth. Vowels can more easily be confused because the fundamental frequency in my voice covers up the envelope of the vowel sound.
If you download the wav files of the numbers and listen to them, you can hear these possible errors.
Conclusions:
The Dynamic Time Warping method was almost successful. The accuracy was good, but I was not entirely happy with the results. I believe that with some tweaking and better recordings it could work better, but, as is true at the end of every semester, I simply do not have enough time to troubleshoot it any further. Perhaps in the future I will improve the results.
Click Here to download a zip
file with the numbers in wav format.
Email me at
m-lester@m-lester.com if you are interested in the
MATLAB code