Speech Production Using Concatenated Acoustic Tubes
Overview:
This project experiments with the production of the vowel sounds /a/, /e/, /i/, /o/, and /u/. The production of vowel sounds is a relatively simple process. The glottis produces a glottal pulse by opening slowly and slamming shut much faster. The air rushing through this flap creates a source signal that is filtered by the vocal tract. The vocal tract is like a long tube whose width varies with the distance from the glottis to the lips. These differently sized cavities produce natural resonances and act as filters that color the magnitude response of the glottal pulse. Different vowels have different vocal tract configurations, hence different filters. An approximate synthesis of vowels can be produced by modeling the tract as concatenated tubes, calculating their resonances, and applying the resulting digital filter to a derived glottal pulse. The sampling frequency for this experiment is 8 kHz, which gives enough bandwidth for understandable speech.
Data:
The shape of the vocal tract is difficult to observe, since imaging technologies such as MRI are expensive and hard to use in real time. Note that the different vowels have different vocal tract lengths. The widths of the vocal tract, sampled every 0.5 cm, for the five vowels are as follows:
A=[5 5 5 5 6.5 8 8 8 8 8 8 8 8 6.5 5 4 3.2 1.6 2.6 2.6 2
1.6 1.3 1 0.65 0.65 0.65 1 1.6 2.6 4 1 1.3 1.6 2.6];
E=[8 8 5 5 4 2.6 2 2.6 2.6 3.2 4 4 4 5 5 6.5 8 6.5 8
10.5 10.5 10.5 10.5 10.5 8 8 6.5 6.5 6.5 6.5 1.3 1.6 2
2.6];
I=[4 4 3.2 1.6 1.3 1 0.65 0.65 0.65 0.65 0.65 0.65 0.65
1.3 2.6 4 6.5 8 8 10.5 10.5 10.5 10.5 10.5 10.5 10.5
10.5 10.5 8 8 2 2 2.6 3.2];
O=[3.2 3.2 3.2 3.2 6.5 13 13 16 13 10.5 10.5 8 8 6.5 6.5
5 5 4 3.2 2 1.6 2.6 1.3 0.65 0.65 1 1 1.3 1.6 2 3.2 4 5
5 1.3 1.3 1.6 2.6];
U=[0.65 0.65 0.32 0.32 2 5 10.5 13 13 13 13 10.5 8 6.5 5
3.2 2.6 2 2 2 1.6 1.3 2 1.6 1 1 1 1.3 1.6 3.2 5 8 8 10.5
10.5 10.5 2 2 2.6 2.6];
Concatenated Tubes:
In order to approximate the vocal tract using concatenated tubes, you must choose the number of formants that you want to model, which in turn determines the number of tube sections. A good approximation is one formant per kHz of bandwidth. In digital terms, this takes 2 tubes per kHz of bandwidth, because each formant needs a complex conjugate pair of poles in the filter. The following equation describes the number of tubes needed:
N = (2 * Fs * L) / (1000 * c)
Fs = sampling frequency;
L = total length;
c = speed of sound;
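As a rough check of the tube count, here is a minimal sketch using the standard lossless-tube relation N = 2 * L * Fs / c. It assumes L in cm, Fs in Hz, and c = 35000 cm/s; the value of c and the function name `tube_count` are assumptions, not taken from this page. The factor of 1000 in the formula above depends on units that the text does not spell out, so it is absorbed into the units here.

```python
def tube_count(length_cm, fs_hz, c_cm_per_s=35000.0):
    """Number of equal-length tube sections, N = 2 * L * Fs / c, rounded.

    c_cm_per_s = 35000 is an assumed speed of sound, not a value from the page.
    """
    return round(2.0 * length_cm * fs_hz / c_cm_per_s)


# The /a/ data has 35 samples spaced 0.5 cm apart, i.e. a 17.5 cm tract.
n_tubes = tube_count(35 * 0.5, 8000)
print(n_tubes)  # 8
```

Under these assumptions an 8 kHz sampling rate gives 8 tubes for the /a/ tract, which matches the one-formant-per-kHz rule (4 formants in 4 kHz of bandwidth, 2 tubes each).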
The following graph shows the actual widths of the vocal tract in blue circles and the calculated concatenated tubes in red x's:

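The page does not say how the measured widths were reduced to the tube values. One plausible approach, shown here only as a sketch, is to average the samples that fall inside each equal-length section. The function name `average_into_tubes` and the averaging rule itself are assumptions.

```python
def average_into_tubes(areas, n_tubes):
    """Average consecutive samples into n_tubes equal-length sections.

    This is an assumed reduction rule, not necessarily the one used for the graph.
    """
    if n_tubes < 1 or n_tubes > len(areas):
        raise ValueError("n_tubes must be between 1 and len(areas)")
    groups = [[] for _ in range(n_tubes)]
    for i, area in enumerate(areas):
        # Map the centre of sample i onto a section index in 0..n_tubes-1.
        k = min(int((i + 0.5) * n_tubes / len(areas)), n_tubes - 1)
        groups[k].append(area)
    return [sum(g) / len(g) for g in groups]


print(average_into_tubes([1, 1, 2, 2, 3, 3, 4, 4], 4))  # [1.0, 2.0, 3.0, 4.0]
```

Interpolating the samples at the section centres would be another reasonable choice; the two give similar results when the profile changes slowly.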
A fictitious tube can be added to the end of the tube model to approximate the reflection at the lips, where the sound radiates into the surrounding space. A typical reflection coefficient value is 0.7; however, I found that a better coefficient is around 0.9. This means that the areas (in cm squared) of the fictitious tubes would be as follows:
A = 34.7434
E = 40.78949
I = 52.7535
O = 42.2009
U = 74.86
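The link between a lip reflection coefficient and a fictitious tube area can be sketched as follows. This assumes the usual junction definition r = (A_next - A_this) / (A_next + A_this); the page does not state which convention was used, and the function names are illustrative.

```python
def reflection_coefficients(areas):
    """r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) at each junction, glottis to lips."""
    return [(b - a) / (b + a) for a, b in zip(areas, areas[1:])]


def fictitious_area(last_area, r_lips):
    """Area of an extra tube that yields reflection r_lips at the last junction.

    Obtained by solving r = (A_f - A_last) / (A_f + A_last) for A_f.
    """
    return last_area * (1.0 + r_lips) / (1.0 - r_lips)


# Hypothetical last-tube area of 2 cm^2, for illustration only.
print(round(fictitious_area(2.0, 0.9), 3))  # 38.0
```

Under this convention a coefficient of 0.9 makes the fictitious tube 19 times the area of the last real tube, which is in line with the magnitudes listed above for last-tube areas of a few cm squared.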
Reflection Coefficients:
The following graph shows the reflection coefficients for A through O:

The following are the magnitude responses of the filters that create the vowels, with and without the fictitious tube:
Vowel A, no fictitious tube
Vowel E, no fictitious tube
Vowel I, no fictitious tube
Vowel O, no fictitious tube
Vowel U, no fictitious tube
Vowel A, with fictitious tube, reflection coefficient of 0.9
Vowel E, with fictitious tube, reflection coefficient of 0.9
Vowel I, with fictitious tube, reflection coefficient of 0.9
Vowel O, with fictitious tube, reflection coefficient of 0.9
Vowel U, with fictitious tube, reflection coefficient of 0.9
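One standard way to go from reflection coefficients to a magnitude response is the lossless-tube step-up recursion, sketched below. It is not necessarily the procedure used for the plots above. The sketch assumes a fully reflecting glottis, puts the lip coefficient last in the list, and leaves out the overall gain and delay; the function names are illustrative.

```python
import cmath
import math


def denominator_from_reflections(r):
    """Step-up recursion: D_k(z) = D_{k-1}(z) + r_k * z^-k * D_{k-1}(1/z)."""
    d = [1.0]
    for rk in r:
        padded = d + [0.0]
        d = [p + rk * q for p, q in zip(padded, reversed(padded))]
    return d


def magnitude_db(d, freq_hz, fs_hz):
    """Magnitude of 1 / D(e^{jw}) in dB at one frequency (gain constant omitted)."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    value = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(d))
    return -20.0 * math.log10(abs(value))


# Uniform 8-section tube: no internal reflections, lip coefficient 0.9.
d = denominator_from_reflections([0.0] * 7 + [0.9])
print(round(magnitude_db(d, 500.0, 8000.0), 1))  # 20.0
```

For this uniform tube the peaks fall at 500, 1500, 2500, and 3500 Hz, the familiar quarter-wavelength resonances, which is a useful sanity check before feeding in the vowel data.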
Adding the radiation effect, which relates the pressure at the lips to the volume velocity at the glottis, enhances the quality of the vowel production by high-pass filtering the output. Although a person's voice might have a fundamental frequency of 120 Hz, in general the voice does not sound that bassy. The following are the filter magnitude graphs including the radiation losses:
Vowel A with radiation losses
Vowel E with radiation losses
Vowel I with radiation losses
Vowel O with radiation losses
Vowel U with radiation losses
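The page does not give the radiation filter that was used. A common simple model, assumed here, is a first difference, which acts as a high-pass filter. The function name `apply_radiation` and the parameter `alpha` are illustrative.

```python
def apply_radiation(signal, alpha=1.0):
    """First-difference high-pass: y[n] = x[n] - alpha * x[n-1].

    alpha = 1.0 is a pure differentiator; values slightly below 1 are also common.
    This is an assumed radiation model, not the page's own filter.
    """
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - alpha * prev)
        prev = x
    return out


print(apply_radiation([1, 1, 1, 1]))  # [1.0, 0.0, 0.0, 0.0]
```

A constant (0 Hz) input is removed after the first sample, which illustrates how low frequencies are attenuated relative to high ones.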
Glottal Pulse:
The glottal pulse is the excitation to the system. This plot shows 6 periods of the pulse at 120 Hz (male speaker):

The magnitude of this signal from 0 Hz to the sampling frequency is as follows:

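The exact pulse shape used here is not given. As a sketch, a Rosenberg-style pulse has the property described in the overview: a slow opening followed by a much faster closing. The opening and closing fractions below are illustrative assumptions, as are the function names.

```python
import math


def rosenberg_pulse(period, open_frac=0.4, close_frac=0.16):
    """One period of a Rosenberg-style glottal pulse, `period` samples long.

    open_frac and close_frac are assumed fractions of the period, not page values.
    """
    n1 = max(1, round(open_frac * period))   # slow opening phase
    n2 = max(1, round(close_frac * period))  # fast closing phase
    pulse = []
    for n in range(period):
        if n <= n1:
            pulse.append(0.5 * (1.0 - math.cos(math.pi * n / n1)))
        elif n <= n1 + n2:
            pulse.append(math.cos(math.pi * (n - n1) / (2.0 * n2)))
        else:
            pulse.append(0.0)  # glottis closed
    return pulse


def pulse_train(f0_hz, fs_hz, n_periods):
    """Repeat the pulse n_periods times; the period is rounded to whole samples."""
    period = round(fs_hz / f0_hz)
    return rosenberg_pulse(period) * n_periods


excitation = pulse_train(120, 8000, 6)
print(len(excitation))  # 402
```

Because the period is rounded to 67 samples, the actual fundamental is about 119.4 Hz rather than exactly 120 Hz.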
Output Vowels:
Here are 6 periods of the time-domain graphs of the synthesized vowels:
The corresponding magnitudes are as follows:

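Applying the vocal tract filter to the excitation amounts to running an all-pole difference equation. The sketch below assumes the filter is given as denominator coefficients `d` of 1 / D(z); the coefficients in the usage line are hypothetical, not one of the vowel filters.

```python
def all_pole_filter(d, excitation):
    """Filter by 1 / D(z): y[n] = (x[n] - sum_{k>=1} d[k] * y[n-k]) / d[0]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, len(d)):
            if n - k >= 0:
                acc -= d[k] * y[n - k]
        y.append(acc / d[0])
    return y


# Hypothetical one-pole filter driven by a unit impulse.
print(all_pole_filter([1.0, -0.5], [1.0, 0.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25, 0.125]
```

In the synthesis chain, the excitation would be the glottal pulse train and the output would then pass through the radiation stage.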
Output Speech:
The following are wav files of synthesized vowels:
Male Voice speaking /A/
Male Voice speaking /E/
Male Voice speaking /I/
Male Voice speaking /O/
Male Voice speaking /U/
Extrapolation of Female Voice:
The next experiment is to change the fundamental frequency of the glottal pulse to mimic a female speaker. However, unlike a true female speaker, the size of the vocal tract is not changed. This is similar to a male breathing helium before speaking. The following are the wav files of the synthesized vowels:
Female Glottal Pulse, Male Vocal Tract speaking /A/
Female Glottal Pulse, Male Vocal Tract speaking /E/
Female Glottal Pulse, Male Vocal Tract speaking /I/
Female Glottal Pulse, Male Vocal Tract speaking /O/
Female Glottal Pulse, Male Vocal Tract speaking /U/
In order to better estimate the female voice, the vocal tract must also be shrunk. I shrank the areas down to 60% of the male values. Doing this made no difference in the output sounds, which is expected when every area is scaled by the same factor, since the reflection coefficients depend only on the ratios of adjacent areas. However, when changing the number of tubes used to model the vocal tract from 8 to 6, a different vowel was produced. Here are the female vowels:
Female Glottal Pulse, Female Vocal Tract speaking /A/
Female Glottal Pulse, Female Vocal Tract speaking /E/
Female Glottal Pulse, Female Vocal Tract speaking /I/
Female Glottal Pulse, Female Vocal Tract speaking /O/
Female Glottal Pulse, Female Vocal Tract speaking /U/
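The observation that scaling the areas to 60% changed nothing can be checked numerically. The sketch below assumes the usual junction definition of the reflection coefficient and uses made-up tube areas for illustration only.

```python
def reflection_coefficients(areas):
    """r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) at each junction."""
    return [(b - a) / (b + a) for a, b in zip(areas, areas[1:])]


# Illustrative 8-tube areas in cm^2, not the values computed on this page.
male = [5.0, 8.0, 6.5, 2.6, 1.0, 0.65, 2.6, 1.6]
female = [0.6 * a for a in male]

r_male = reflection_coefficients(male)
r_female = reflection_coefficients(female)
max_diff = max(abs(p - q) for p, q in zip(r_male, r_female))
print(max_diff < 1e-12)  # True
```

The scale factor cancels in every ratio, so the filter is unchanged. Changing the number of tubes is different: under the assumed relation N = 2 * L * Fs / c with c = 35000 cm/s, going from 8 to 6 tubes at 8 kHz corresponds to shortening the tract from 17.5 cm to about 13.1 cm, which moves the resonances.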