Speech Production Using Concatenated Acoustic Tubes

 

Overview:

This project experiments with the production of the vowel sounds /a/, /e/, /i/, /o/ and /u/.  The production of vowel sounds is a relatively simple process.  The glottis produces a glottal pulse by opening slowly and slamming shut much faster.  The air rushing through this flap creates a frequency source which is filtered in the vocal tract.  The vocal tract is like a long tube whose width varies with the distance from the glottis to the lips.  These different sizes of cavities produce natural resonances and filters which color the magnitude response of the glottal pulse.  The different vowels have different vocal tract configurations, hence different filters.  An approximate synthesis of vowels can be produced by using concatenated tubes, calculating their resonances, and applying digital filtering to a derived glottal pulse.  The sampling frequency for this experiment is chosen to be 8 kHz, which gives a 4 kHz bandwidth, large enough for understandable speech.

 

Data:

The shape of the vocal tract is difficult to observe, since technologies such as MRI are expensive and hard to use for real-time viewing.  It is interesting to note that the different vowels have different vocal tract lengths.  The widths of the vocal tract, sampled every 0.5 cm, for the 5 vowels are as follows:


A=[5 5 5 5 6.5 8 8 8 8 8 8 8 8 6.5 5 4 3.2 1.6 2.6 2.6 2 1.6 1.3 1 0.65 0.65 0.65 1 1.6 2.6 4 1 1.3 1.6 2.6];


E=[8 8 5 5 4 2.6 2 2.6 2.6 3.2 4 4 4 5 5 6.5 8 6.5 8 10.5 10.5 10.5 10.5 10.5 8 8 6.5 6.5 6.5 6.5 1.3 1.6 2 2.6];


I=[4 4 3.2 1.6 1.3 1 0.65 0.65 0.65 0.65 0.65 0.65 0.65 1.3 2.6 4 6.5 8 8 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 8 8 2 2 2.6 3.2];


O=[3.2 3.2 3.2 3.2 6.5 13 13 16 13 10.5 10.5 8 8 6.5 6.5 5 5 4 3.2 2 1.6 2.6 1.3 0.65 0.65 1 1 1.3 1.6 2 3.2 4 5 5 1.3 1.3 1.6 2.6];

U=[0.65 0.65 0.32 0.32 2 5 10.5 13 13 13 13 10.5 8 6.5 5 3.2 2.6 2 2 2 1.6 1.3 2 1.6 1 1 1 1.3 1.6 3.2 5 8 8 10.5 10.5 10.5 2 2 2.6 2.6];
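
As a quick check of the remark that the vowels have different tract lengths, the sketch below multiplies the number of values listed for each vowel by the 0.5 cm spacing. The counts are taken from the arrays as listed above; the function and variable names are only illustrative.

```python
# Spacing between samples, as stated in the text.
SPACING_CM = 0.5

# Number of values listed for each vowel in the arrays above.
SAMPLE_COUNTS = {"A": 35, "E": 34, "I": 34, "O": 38, "U": 40}


def tract_length_cm(n_samples, spacing_cm=SPACING_CM):
    """Approximate tract length: sample count times sample spacing."""
    return n_samples * spacing_cm


LENGTHS_CM = {vowel: tract_length_cm(n) for vowel, n in SAMPLE_COUNTS.items()}

if __name__ == "__main__":
    for vowel, length in LENGTHS_CM.items():
        print(vowel, length, "cm")
```

Under this reading, /a/ is 17.5 cm long while /u/ is 20 cm long.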

 

Concatenated Tubes:

In order to approximate the vocal tract using concatenated tubes, you must choose the number of formants that you want to model.  This determines the number of tube sections to use.  A good approximation is one formant per kHz of bandwidth.  In digital terms, this takes 2 tubes per kHz because the poles that create the envelope of the filter occur in complex conjugate pairs.  The following equation gives the number of tubes needed:  N = (2 * Fs * L) / (1000 * c)
Fs = sampling frequency;
L = total length;
c = speed of sound;
The factor of 1000 appears to be a unit conversion: with L in mm and c in m/s, a 17.5 cm tract at Fs = 8 kHz and c = 350 m/s gives N = 8.
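
The tube-count equation can be evaluated directly. The sketch below is only an illustration: it assumes L is given in millimeters and c in meters per second (which is one way to account for the factor of 1000), and it assumes c = 350 m/s, a value the text does not state.

```python
def tube_count(fs_hz, length_mm, c_m_per_s=350.0):
    """Evaluate N = 2*Fs*L / (1000*c).

    Assumes L in mm and c in m/s; c = 350 m/s is an assumed default.
    """
    return 2.0 * fs_hz * length_mm / (1000.0 * c_m_per_s)


# A 17.5 cm (175 mm) tract sampled at 8 kHz.
n_exact = tube_count(8000, 175.0)
n_tubes = round(n_exact)

if __name__ == "__main__":
    print(n_exact, n_tubes)
```

With these assumed values the result is 8 tubes, that is, 4 formants across the 4 kHz bandwidth.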

 

The following graph shows the actual widths of the vocal tract in blue circles and the calculated concatenated tubes in red x's:
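
The text does not say how the measured values were reduced to tube sections. One plausible method, shown here purely as an assumption, is to split the tract into N equal-length sections and take the length-weighted mean of the samples in each section. The function name is hypothetical; the U array is copied from the data above.

```python
def resample_to_tubes(samples, n_tubes):
    """Average equally spaced samples into n_tubes equal-length sections.

    Sample i is treated as covering the interval [i, i + 1). A tube that
    covers only part of a sample weights it by the overlapping length.
    """
    seg = len(samples) / n_tubes
    tubes = []
    for k in range(n_tubes):
        start, end = k * seg, (k + 1) * seg
        total = 0.0
        for i, value in enumerate(samples):
            overlap = min(end, i + 1) - max(start, i)
            if overlap > 0:
                total += overlap * value
        tubes.append(total / seg)
    return tubes


U = [0.65, 0.65, 0.32, 0.32, 2, 5, 10.5, 13, 13, 13, 13, 10.5, 8, 6.5, 5,
     3.2, 2.6, 2, 2, 2, 1.6, 1.3, 2, 1.6, 1, 1, 1, 1.3, 1.6, 3.2, 5, 8, 8,
     10.5, 10.5, 10.5, 2, 2, 2.6, 2.6]

U_TUBES = resample_to_tubes(U, 8)

if __name__ == "__main__":
    print([round(a, 3) for a in U_TUBES])
```

For /u/ the 40 samples divide evenly into 8 tubes of 5 samples each, so each tube is a plain mean of 5 values.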

A fictitious tube can be added to the end of the graph which approximates the reflections from the lips to the surrounding space where the sound is being produced.  A typical reflection coefficient value is 0.7; however, I found that a better coefficient is around 0.9.  This means that the areas (in cm squared) of the 'fictitious tubes' would be as follows:
A = 34.7434
E = 40.78949
I = 52.7535
O = 42.2009
U = 74.86
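
The area of a fictitious tube follows from the chosen reflection coefficient. The sketch below assumes the common convention r = (A_next - A_last) / (A_next + A_last), which the text does not spell out; solving for A_next gives A_next = A_last * (1 + r) / (1 - r). The function name is illustrative.

```python
def fictitious_area(last_area, r_lips):
    """Area of the extra tube that yields reflection coefficient r_lips.

    Assumes r = (A_next - A_last) / (A_next + A_last), solved for A_next.
    """
    if not -1.0 < r_lips < 1.0:
        raise ValueError("reflection coefficient must lie strictly in (-1, 1)")
    return last_area * (1.0 + r_lips) / (1.0 - r_lips)


if __name__ == "__main__":
    # A last tube of 3.94 cm^2 with r = 0.9 is 19 times larger.
    print(fictitious_area(3.94, 0.9))
    # The more typical r = 0.7 gives a much smaller fictitious tube.
    print(fictitious_area(3.94, 0.7))
```

With r = 0.9 the fictitious tube is 19 times the area of the last real tube, so a last tube of 3.94 cm squared gives about 74.86, consistent with the value listed for U.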

Reflection Coefficients:

The following graph shows the reflection coefficients for A through O:
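
A reflection coefficient can be computed at each junction between adjacent tubes. This is a minimal sketch assuming the convention r_k = (A_(k+1) - A_k) / (A_(k+1) + A_k), with tubes ordered from glottis to lips; the example areas are made up for illustration and are not the project's data.

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tubes.

    Assumes r_k = (A_next - A_current) / (A_next + A_current).
    """
    return [(b - a) / (b + a) for a, b in zip(areas, areas[1:])]


# Hypothetical areas in cm^2, ordered from glottis to lips.
example_areas = [1.0, 3.0, 1.0]
example_refl = reflection_coefficients(example_areas)

if __name__ == "__main__":
    print(example_refl)
```

A widening junction gives a positive coefficient and a narrowing junction a negative one, and N tubes produce N - 1 junction coefficients before the lip termination is added.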

The following are the magnitude responses of the filters that create the vowels with and without the fictitious tube:

Vowel A, no fictitious tube
Vowel E, no fictitious tube
Vowel I, no fictitious tube
Vowel O, no fictitious tube
Vowel U, no fictitious tube

Vowel A, with fictitious tube, reflection coefficient of 0.9
Vowel E, with fictitious tube, reflection coefficient of 0.9
Vowel I, with fictitious tube, reflection coefficient of 0.9
Vowel O, with fictitious tube, reflection coefficient of 0.9
Vowel U, with fictitious tube, reflection coefficient of 0.9
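
Magnitude responses like these can be computed from the reflection coefficients. The sketch below uses the standard step-up recursion for a lossless tube model, D_k(z) = D_(k-1)(z) + r_k z^-k D_(k-1)(1/z), and assumes a completely closed glottis; whether the project used exactly this formulation is not stated. The last entry of the list is taken to be the lip (fictitious tube) coefficient.

```python
import cmath
import math


def denominator(refl):
    """Coefficients of D(z) in powers of z^-1, built by step-up recursion."""
    d = [1.0]
    for k, r in enumerate(refl, start=1):
        ext = d + [0.0]
        # Coefficient j of z^-k * D(1/z) is the old coefficient k - j.
        d = [ext[j] + r * ext[k - j] for j in range(k + 1)]
    return d


def magnitude_response(refl, freqs_hz, fs_hz):
    """|G / D(e^jw)| at each frequency, with G the product of (1 + r_k)."""
    d = denominator(refl)
    gain = 1.0
    for r in refl:
        gain *= 1.0 + r
    mags = []
    for f in freqs_hz:
        w = 2.0 * math.pi * f / fs_hz
        dz = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(d))
        mags.append(gain / abs(dz))
    return mags


# Uniform 8-section tube: no internal reflections, lip coefficient 0.9.
uniform = [0.0] * 7 + [0.9]
mags = magnitude_response(uniform, [0.0, 500.0, 1000.0], 8000.0)

if __name__ == "__main__":
    print(mags)
```

For a uniform tube this puts the resonances at 500, 1500, 2500 and 3500 Hz, the familiar quarter-wavelength pattern, which is a useful sanity check before trying real vowel shapes.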

Adding the radiation effect relating the pressure at the lips to the volume velocity at the glottis enhances the quality of the vowel production by high-pass filtering the output.  Although a person's voice might have a fundamental frequency of 120 Hz, in general the voice does not sound that bassy.  The following are the filter magnitude graphs including the radiation losses:

Vowel A with radiation losses
Vowel E with radiation losses
Vowel I with radiation losses
Vowel O with radiation losses
Vowel U with radiation losses
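
The text does not say which radiation model was used. A common simple choice, assumed here, is a first-order difference R(z) = 1 - alpha * z^-1, which has a zero near 0 Hz and rises toward the Nyquist frequency. The function names and the default alpha are illustrative.

```python
import cmath
import math


def radiate(signal, alpha=1.0):
    """Apply y[n] = x[n] - alpha * x[n-1], a simple high-pass filter."""
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - alpha * prev)
        prev = x
    return out


def radiation_magnitude(f_hz, fs_hz, alpha=1.0):
    """Magnitude of 1 - alpha * e^(-jw) at frequency f_hz."""
    w = 2.0 * math.pi * f_hz / fs_hz
    return abs(1.0 - alpha * cmath.exp(-1j * w))


if __name__ == "__main__":
    print(radiate([1.0, 1.0, 1.0, 0.0]))
    for f in (0.0, 120.0, 1000.0, 4000.0):
        print(f, radiation_magnitude(f, 8000.0))
```

With alpha = 1 the gain is zero at 0 Hz and 2 at 4 kHz, so the low harmonics of a 120 Hz source are attenuated relative to the higher formants.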

Glottal Pulse:

The glottal pulse is the excitation to the system.  This plot shows 6 periods of the pulse at 120 Hz (male speaker):

The magnitude of this signal from 0 Hz to the sampling frequency is as follows:
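
The exact pulse shape used is not given. As a stand-in, the sketch below generates a Rosenberg-style pulse, a widely used glottal model with a slow raised-cosine opening and a faster cosine closing. The opening and closing fractions (40% and 16% of the period) are assumed values, not the project's.

```python
import math


def rosenberg_pulse(period, open_frac=0.4, close_frac=0.16):
    """One period of a Rosenberg-style glottal pulse (slow open, fast close)."""
    n1 = int(round(open_frac * period))   # opening phase length in samples
    n2 = int(round(close_frac * period))  # closing phase length in samples
    pulse = []
    for n in range(period):
        if n < n1:
            pulse.append(0.5 * (1.0 - math.cos(math.pi * n / n1)))
        elif n < n1 + n2:
            pulse.append(math.cos(math.pi * (n - n1) / (2.0 * n2)))
        else:
            pulse.append(0.0)  # glottis closed for the rest of the period
    return pulse


def pulse_train(f0_hz, fs_hz, n_periods):
    """Repeat the pulse; the period is rounded to a whole number of samples."""
    period = int(round(fs_hz / f0_hz))
    return rosenberg_pulse(period) * n_periods


train = pulse_train(120.0, 8000.0, 6)

if __name__ == "__main__":
    print(len(train), max(train))
```

At 8 kHz a 120 Hz pulse lasts about 66.7 samples, so the period is rounded to 67 samples and the actual fundamental is slightly below 120 Hz.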

 

Output Vowels:

Here are 6 periods of the synthesized vowels in the time domain:

The corresponding magnitudes are as follows:
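
Producing the output amounts to passing the glottal pulse train through the all-pole vocal tract filter. This is a minimal sketch of that filtering step as a direct-form difference equation; `denom` stands for the denominator coefficients of the tract filter in powers of z^-1, and the short example values are hypothetical.

```python
def all_pole_filter(excitation, denom, gain=1.0):
    """Filter with H(z) = gain / (d0 + d1*z^-1 + ... + dN*z^-N)."""
    out = []
    for n, x in enumerate(excitation):
        acc = gain * x
        for k in range(1, len(denom)):
            if n - k >= 0:
                acc -= denom[k] * out[n - k]  # feedback from past outputs
        out.append(acc / denom[0])
    return out


# Impulse response of a hypothetical one-pole filter 1 / (1 + 0.5*z^-1).
impulse_response = all_pole_filter([1.0, 0.0, 0.0, 0.0], [1.0, 0.5])

if __name__ == "__main__":
    print(impulse_response)
```

Feeding in several periods of the excitation instead of a single impulse gives the periodic vowel waveform.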

Output Speech:

The following are wav files of synthesized vowels:

Male Voice speaking /A/
Male Voice speaking /E/
Male Voice speaking /I/
Male Voice speaking /O/
Male Voice speaking /U/
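
Files like these can be written with Python's standard `wave` module. The sketch below normalizes a list of samples and encodes it as 16-bit mono PCM at 8 kHz; the 0.9 headroom factor and the function name are assumptions for illustration.

```python
import io
import struct
import wave


def to_wav_bytes(samples, fs_hz=8000):
    """Encode samples as a 16-bit mono WAV file and return its bytes."""
    peak = max(abs(s) for s in samples) or 1.0  # avoid dividing by zero
    scale = 0.9 * 32767.0 / peak                # leave a little headroom
    frames = b"".join(
        struct.pack("<h", int(round(scale * s))) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(fs_hz)
        wav.writeframes(frames)
    return buf.getvalue()


if __name__ == "__main__":
    data = to_wav_bytes([0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5] * 1000)
    print(len(data), "bytes")
```

Writing the returned bytes to a file opened in binary mode produces a playable file.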

Extrapolation of Female Voice:

The next experiment is to change the fundamental frequency of the glottal pulse to mimic a female speaker.  However, unlike a true female speaker, the size of the vocal tract is not changed.  This is similar to a male breathing helium before speaking.  The following are the wav files of the synthesized vowels:

Female Glottal Pulse, Male Vocal Tract speaking /A/
Female Glottal Pulse, Male Vocal Tract speaking /E/
Female Glottal Pulse, Male Vocal Tract speaking /I/
Female Glottal Pulse, Male Vocal Tract speaking /O/
Female Glottal Pulse, Male Vocal Tract speaking /U/

In order to better estimate the female voice, the vocal tract must also be shrunk.  I shrunk the areas down to 60% of the male values.  Doing this made no difference in the output sounds, because the reflection coefficients depend only on the ratios of adjacent areas, and scaling every area by the same factor leaves those ratios unchanged.  However, when changing the number of tubes used to model the vocal tract from 8 to 6, a different vowel was produced.  Here are the female vowels:

Female Glottal Pulse, Female Vocal Tract speaking /A/
Female Glottal Pulse, Female Vocal Tract speaking /E/
Female Glottal Pulse, Female Vocal Tract speaking /I/
Female Glottal Pulse, Female Vocal Tract speaking /O/
Female Glottal Pulse, Female Vocal Tract speaking /U/
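
The observation that shrinking the areas to 60% made no difference can be checked numerically. The sketch assumes the usual definition r = (A_next - A_current) / (A_next + A_current); the tube areas are hypothetical values chosen for the example.

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tubes."""
    return [(b - a) / (b + a) for a, b in zip(areas, areas[1:])]


# Hypothetical male tube areas in cm^2 and the same tract scaled to 60%.
male_areas = [5.0, 8.0, 3.2, 1.0, 2.6]
female_areas = [0.6 * a for a in male_areas]

male_refl = reflection_coefficients(male_areas)
female_refl = reflection_coefficients(female_areas)

if __name__ == "__main__":
    for m, f in zip(male_refl, female_refl):
        print(round(m, 6), round(f, 6))
```

The common factor cancels in every ratio, so both lists agree up to rounding and the resulting filter is the same. Changing the number of tubes, in contrast, changes the filter order and therefore the formants.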

 


  

 

   © Copyright M-lester.com