Expression control using synthetic speech.
Brian Wyvill and David R. Hill
Department of Computer Science, University of Calgary.
2500 University Drive N.W.
Calgary, Alberta, Canada, T2N 1N4
Abstract
This tutorial paper presents a practical guide to animating facial expressions synchronised to
a rule-based speech synthesiser. Speech synthesis by rules is described, together with the
derivation of the set of parameters that drives both the speech synthesis and the graphics. An
example animation is described, along with the outstanding problems.
Key words: Computer Graphics, Animation, Speech Synthesis, Face-Animation.
© ACM, 1989. This is the authors’ version of the work. It is posted here by permission of ACM
for your personal use. Not for redistribution. The definitive version was published as Course #
22 of the Tutorial Section of ACM SIGGRAPH 89, Boston, Massachusetts, 31 July - 4 August
1989. DOI unknown.
Note (drh 2008): Appendix A was added, after publication of these tutorial notes by the ACM,
to flesh out some details of the parameter synthesis, and to provide a more complete acoustic
parameter table (the original garbled table headings have been corrected in the original paper
text that follows, but the data was still incomplete—intended for discussion in a tutorial).
Fairly soon after the animation work using the formant synthesiser was finished, a completely
new articulatory speech synthesis system was developed by one of the authors and his colleagues.
This system uses an acoustic tube model of the human vocal tract, with associated new posture
databases—cast in terms of tube radii and excitation, new rules, and so on. Originally a
technology spin-off company product, the system was the first complete articulatory real-time
text-to-speech synthesis system in the world and was described in [Hill 95]. All the software
is now available from the GNU project gnuspeech under a General Public Licence. Originally
developed on the NeXT computer, much of the system has since been ported to the Macintosh
under OS/X, and work on a GNU/Linux version running under GNUstep is well under way.
[Hill 95] David R. Hill, Leonard Manzara, Craig-Richard Schock. Real-time articulatory
speech-synthesis-by-rule. Proc. AVIOS '95, the 14th Annual International Voice Technologies
Applications Conference of the American Voice I/O Society, San Jose, September 11-14 1995,
AVIOS: San Jose, 27-44.
1 Motivation
In traditional hand animation synchronisation between graphics and speech has been achieved
through a tedious process of analysing a speech sound track and drawing corresponding mouth
positions (and expressions) at key frames. To achieve a more realistic correspondence, a live
actor may be filmed to obtain the correct mouth positions. This method produces good results,
but it must be repeated for each new speech, is time consuming, and requires a great deal of
specialised skill on the part of the animator. A common approach to computer animation uses a
similar analysis to derive key sounds, from which parameters to drive a face model can be found
(see [Parke 74]). Such an approach to animation is more flexible than the traditional hand method
since the parameters to drive such a face model correspond to the key measurements available
from the photographs directly, rather than requiring the animator to design each expression
as needed. However, the process is not automatic, requiring tedious manual procedures for
recording and measuring the actor.
In our research we were interested in finding a fully automatic way of producing an animated
face to match speech. Given a recording of an actor speaking the appropriate script, it might
seem possible to design a machine procedure to recognise the individual sounds and to use
acoustic-phonetic and articulatory rules to derive sets of parameters to drive the Parke face
model. However, this would require a more sophisticated speech recognition program than is
currently available.
The simplest way for a computer animator to interact with such a system would be to type
in a line of text and have the synthesised speech and expressions automatically generated. This
was the approach we decided to try.
Given the still incomplete state of knowledge concerning speech synthesis by rules, we wanted
to allow some audio editing of the initial output to improve its quality, with the corresponding
changes to the expressions being made automatically. Synthetic speech by rules was the most
appropriate choice: since it can be generated from keyboard input, it is a very general approach
which lends itself to the purely automatic generation of speech animation. The major drawback
is that speech synthesised in this manner is far from natural.
2 Background
2.1 The Basis for Synthesis by Rules
Acoustic-phonetic research into the composition of spoken English during the 1950s and 60s led
to the determination of the basic acoustic cues associated with forty or so sound classes. This
early research was conducted at the Haskins Laboratories in the US and elsewhere worldwide. The
sound classes are by no means homogeneous, and we still do not have complete knowledge of
all the variations and their causes. However, broadly speaking, each sound class can be identified
with a configuration of the vocal organs in making sounds in the class. We shall refer to this as
a speech posture. Thus, if the jaw is rotated a certain amount, and the lips held in a particular
position, with the tongue hump moved high or low, and back or forward, a vowel-like noise can be
produced that is characterised by the energy distribution in the frequency domain. This distribution
contains peaks, corresponding to the resonances of the tube-like vocal tract, called formants. As
the speaker articulates different sounds (the speech posture is thus varying dynamically and
continuously), the peaks will move up and down the frequency scale, and the sound emitted will
change. Figure 1 shows the parts of the articulatory system involved with speech production.
Figure 1: The Human Vocal Apparatus
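The resonance mechanism just described can be illustrated with a small numerical sketch. This is not the PAT synthesiser itself: it simply passes a 100 Hz glottal pulse train through a cascade of second-order resonators, one per formant, using illustrative centre frequencies and bandwidths for an /ɑ/-like vowel.

```python
import math

def resonator(signal, fc, bw, fs):
    """Two-pole IIR resonator: one formant at centre frequency fc (Hz),
    bandwidth bw (Hz), sample rate fs (Hz)."""
    r = math.exp(-math.pi * bw / fs)                  # pole radius sets bandwidth
    c = 2.0 * r * math.cos(2.0 * math.pi * fc / fs)   # pole angle sets frequency
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = (1.0 - r) * x + c * y1 - r * r * y2
        y2, y1 = y1, y
        out.append(y)
    return out

fs = 8000
# 100 Hz glottal pulse train, 0.1 s long (the voiced energy source)
glottal = [1.0 if n % (fs // 100) == 0 else 0.0 for n in range(fs // 10)]
vowel = glottal
for fc, bw in [(730, 60), (1090, 90), (2440, 120)]:   # illustrative /a/-like formants
    vowel = resonator(vowel, fc, bw, fs)
```

Speech-like sound arises when the formant centre frequencies are moved over time by the rules, rather than held fixed as here.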
2.2 Vowel and Consonant Sounds
The movements are relatively slow during vowel and vowel-like articulations, but are often much
faster in consonant articulations, especially for plosive sounds like /b, d, g, p, t, and k/ (these
are more commonly called the stop consonants). The nasal sounds /m, n/ and the sound at the
end of “running”—/ŋ/, are articulated very much like the plosive sounds, and not only involve
quite rapid shifts in formant frequencies but also a sudden change in general spectral quality
because the nasal passage is very quickly connected and disconnected for nasal articulation by
the valve towards the back of the mouth that is formed by the soft palate (the velum)—hence the
phrase “nasal sounds”. Various hiss-like noises are associated with many consonants because
consonants are distinguished from vowels chiefly by a higher degree of constriction in the vocal
tract (completely stopped in the case of the stop consonants). This means that either during,
or just after the articulation of a consonant, air from the lungs is rushing through a relatively
narrow opening, in turbulent flow, generating random noise (sounds like /s/, or /f/). Whispered
speech also involves airflow noise as the sound medium, but, since the turbulence
occurs early in the vocal flow, it is shaped by the resonances and assumes many of the qualities
of ordinarily spoken sounds.
2.3 Voiced and Voiceless
When a sound is articulated, the vocal folds situated in the larynx may be wide open and relaxed,
or held under tension. In the second case they will vibrate, imposing a periodic flow pattern on
the rush of air from the lungs (and making a noise much like a raspberry blown under similar
conditions at the lips). However, the energy in the noise from the vocal folds is redistributed
by the resonant properties of the vocal and nasal tracts, so that it doesn’t sound like a raspberry
by the time it gets out. Sounds in which the vocal folds are vibrating are termed voiced. Other
sounds are termed voiceless, although some further qualification is needed.
It is reasonable to say that the word cat is made up of the sounds /k æ t/. However, although a
sustained /æ/ can be produced, a sustained /k/ or /t/ cannot. Although stop sounds are articulated
as speech postures, the cues that allow us to hear them occur as a result of their environment.
When the characteristic posture of /t/ is formed, no sound is heard at all: the stop gap, or silence, is
only heard as a result of noises either side, especially the formant transitions (see 2.4 below).
The sounds /t/ and /d/ differ only in that the vocal folds vibrate during the /d/ posture, but not
during the /t/ posture. The /t/ is a voiceless alveolar stop, whereas the /d/ is a voiced alveolar
stop, the alveolar ridge being the place within the vocal tract where the point of maximum
constriction takes place, known as the place of articulation. The /k/ is a voiceless velar stop.
2.4 Aspiration
When a voiceless stop is articulated in normal speech, the vocal folds do not begin vibrating
immediately on release. Thus, after the noise burst associated with release, there is a period
when air continues to rush out, producing the same effect as whispered speech for a short time
(a little longer than the initial noise burst of release). This whisper noise is called aspiration,
and is much stronger in some contexts and situations than others. At this time, the articulators
are moving, and, as a result, so are the formants. These relatively rapid movements are called
formant transitions and are, as the Haskins Laboratories researchers demonstrated, a powerful
cue to the place of articulation. Again, these powerful cues fall mainly outside the time range
conventionally associated with the consonant posture articulation (the quasi-steady state, or
QSS) itself.
2.5 Synthesis by Rules
The first speech synthesiser that modelled the vocal tract was the so-called Parametric Artificial
Talker (PAT), invented by Walter Lawrence of the Signals Research and Development
Establishment (SRDE), a government laboratory in Britain, in the 1950s. This device modelled
the resonances of the vocal tract (only the lowest three needed to be variable for good modelling),
plus the various energy sources (periodic or random), and the spectral characteristics of the noise
bursts and aspiration.
Other formulations can serve as a basis for synthesising speech (for example, Linear
Predictive Coding—LPC), but PAT was not only the first, but is more readily linked to the real
vocal apparatus than most, and the acoustic cue basis is essentially the same for all of them. It
has to be, since the illusion of speech will only be produced if the correct perceptually relevant
acoustic cues are present in sufficient number.
Speech may be produced from such a synthesiser by analysing real speech to obtain
appropriate parameter values, and then using them to drive the synthesiser. This is merely
a sophisticated form of compressed recording. It is difficult to analyse speech automatically for
the parameters needed to drive synthesisers like PAT, but LPC compression and resynthesis
is extremely effective, and serves as the basis of many modern voice response systems. It is
speech by copying, however: it always requires preknowledge of what will be said, and contains
all the variability of real speech. More importantly, it is hard to link directly to articulation. A full
treatment of speech analysis is given in [Witten 82].
2.6 Speech Postures and the Face
It is possible, given a specification of the postures (i.e. sound classes) in an intended utterance,
to generate the parameters needed to drive a synthesiser entirely algorithmically, i.e. by rules,
without reference to any real utterance. This is the basis of our approach. The target values of the
parameters for all the postures are stored in a table (see Table 1), and a simple interpolation
procedure is written to mimic the course of variation from one target to the next, according to the
class of posture involved. Appropriate noise bursts and energy source changes can also be
computed. It should be noted that the values in the table are relevant to the Hill speech structure
model (see [Hill 78]).
Figure 2: Upper lip control points
Since the sounds and sound changes result directly from movements of the articulators, and
some of these movements are what cause changes in facial expression (e.g. lip opening, jaw
rotation, etc.), we felt that our program for speech synthesis by rule could easily be extended by
adding a few additional entries for each posture to control the relevant parameters of Parke's
face model.
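As a rough sketch of this table-driven interpolation (the posture names, parameter counts, and target values below are invented for illustration, not the actual table data):

```python
def interpolate_postures(postures, targets, durations_ms, step_ms=2):
    """Piecewise-linear parameter track between successive posture targets.

    postures: sequence of posture names, e.g. ["s", "ee"]
    targets: dict mapping posture name -> tuple of parameter target values
    durations_ms: transition duration for each adjacent posture pair
    Returns one parameter tuple per step_ms interval.
    """
    track = []
    for (a, b), dur in zip(zip(postures, postures[1:]), durations_ms):
        steps = max(1, dur // step_ms)
        ta, tb = targets[a], targets[b]
        for i in range(steps):
            f = i / steps  # 0.0 at posture a, approaching 1.0 at posture b
            track.append(tuple(x + f * (y - x) for x, y in zip(ta, tb)))
    track.append(targets[postures[-1]])  # land exactly on the final target
    return track

# Hypothetical 2-parameter targets (say, jaw rotation and first formant):
targets = {"s": (0.0, 1700.0), "ee": (0.1, 300.0)}
track = interpolate_postures(["s", "ee"], targets, [8])
```

Because the speech and face parameters would sit in the same target tuples, one pass of such an interpolation can drive both the synthesiser and the face model, which is what guarantees synchronisation.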
2.7 Face Parameters
The parameters for the facial movements directly related to speech articulation are currently
those specified by Fred Parke. They comprise: jaw rotation; mouth width; mouth expression;
lip protrusion; /f/ and /v/ lip tuck; upper lip position; and the x, y and z co-ordinates of one
of the two mouth corners (assuming symmetry, which is an approximation). The tongue is not
represented, nor are other possible body movements associated with speech.
The parameters are mapped onto a group of mesh vertices with appropriate scale factors
which weight the effect of the parameter. An example of the polygon mesh representing the
mouth is illustrated in Figure 2.
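A minimal sketch of this weighted mapping (the vertex ids, axes, and weights are invented for illustration, not Parke's actual data):

```python
def apply_parameters(base_vertices, mapping, values):
    """Displace mesh vertices by weighted parameter values.

    base_vertices: dict vertex_id -> [x, y, z] rest position
    mapping: list of (param_index, vertex_id, axis, weight) entries
    values: current parameter vector for this frame
    """
    verts = {vid: list(pos) for vid, pos in base_vertices.items()}
    for param, vid, axis, weight in mapping:
        verts[vid][axis] += weight * values[param]
    return verts

# Invented data: parameter 0 ("jaw rotation") pulls a mouth-corner vertex down in y
base = {"corner": [1.0, 0.0, 0.0]}
mapping = [(0, "corner", 1, -0.5)]
posed = apply_parameters(base, mapping, [1.0])
```

One parameter typically appears in many mapping entries, so a single value moves a whole neighbourhood of the mesh by graded amounts.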
3 Interfacing Speech and Animation
The system is shown diagrammatically in Figure 3. Note that in the diagram the final output is to
film; modifications for direct video output are discussed below.
The user inputs text which is automatically translated into phonetic symbols defining the
utterance(s). The system also reads various other databases. These are: models of spoken
rhythm and intonation; tabular data defining the target values of the parameters for various
articulatory postures (both for facial expression and acoustic signal); and a set of composition
rules that provide appropriate modelling of the natural movement from one target to the next.
The composition program is also capable of taking account of special events like bursts of noise,
or suppression of voicing.
Figure 3: Animated speech to film—system overview
3.1 Parametric Output
The output of the system comprises sets of 18 values defining the 18 parameters at successive 2
millisecond intervals. The speech parameters are sent directly to the speech synthesiser which
produces synthetic speech output. This is recorded to provide the sound track. The ten face
parameters controlling the jaw and lips are described in [Hill 88] and are taken directly from the
Parke face model. Table 2 shows the values associated with each posture.
3.2 Face Parameters
The facial parameter data is stored in a file and processed by a converter program. The purpose
of this program is to convert the once per two millisecond sampling rate to a once per frame
time sampling rate, based on the known number of frames determined from the magnetic film
sound track. This conversion is done by linear interpolation of the parameters and resampling.
The conversion factor is determined by relating the number of two millisecond samples to the
number of frames recorded in the previous step. This allows for imperfections in the speed
control of our equipment. In practice, calculations based on measuring lengths on the original
audio tape have proved equivalent and repeatable for the short lengths we have dealt with so far.
Production equipment would be run to standards that avoided this problem for arbitrary lengths.
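The 2 ms-to-frame-time conversion can be sketched as a linear resample; in this illustrative version the conversion factor is implied by mapping the run of 2 ms samples onto the measured frame count:

```python
def resample_to_frames(samples, n_frames):
    """Linearly resample per-2ms parameter vectors to one vector per frame.

    samples: list of parameter vectors, one every 2 ms
    n_frames: frame count measured from the recorded sound track
    """
    out = []
    last = len(samples) - 1
    for f in range(n_frames):
        # Position of this frame along the original 2 ms sample track
        t = f * last / (n_frames - 1) if n_frames > 1 else 0.0
        i = int(t)
        j = min(i + 1, last)
        frac = t - i
        out.append([a + frac * (b - a) for a, b in zip(samples[i], samples[j])])
    return out

frames = resample_to_frames([[0.0], [1.0], [2.0], [3.0], [4.0]], 3)
```

With film at 24 frames/sec, roughly every 20.8th 2 ms sample falls on a frame time, so almost every output frame needs interpolation between two neighbouring samples.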
The resampled parameters are fed to a scripted facial rendering system (part of the Graphicsland
animation system [Wyvill 86]). The script controls object rotation and viewpoint parameters
whilst the expression parameters control variations in the polygon mesh in the vicinity of the
mouth, producing lip, mouth and jaw movements in the final images, one per frame. A sequence
of frames covering the whole utterance is rendered and stored on disc. Real-time rendering of
the face would be possible with this scheme, given a better workstation.
3.3 Output to film
The sound track, preferably recorded at 15 ips on a full track tape (e.g. using a NAGRA tape
recorder), is formatted by splicing to provide a level-setting noise (the loudest vowel sound, as
in gore) for a few feet; a one frame-time 1000 Hz tone burst for synchronisation; a 23 frame-time
silence; and then the actual sound track noises. The sound track is transferred to magnetic
film ready for editing. The number of frames occupied by the speech is determined for
use in dealing with the facial parameters. Getting the 1000 Hz tone burst exactly aligned within
a frame is a problem. We made the tone about 1.5 frames in length to allow for displacement in
transferring to the magnetic film. The stored images are converted to film, one frame at a time.
After processing, the film and magnetic film soundtrack are edited in the normal way to
produce material suitable for transfer. The process fixes edit synch between the sound track and
picture based on synch marks placed ahead of the utterance material. The edited media are sent
for transfer which composes picture and sound tracks onto standard film.
3.4 Video Output
Many animators have direct workstation to video output devices. Experience has shown that
the speed control on our equipment is better than anticipated so that our present procedure is
based on the assumption that speed control is adequate for separate audio and video recordings,
according to the procedures outlined above, and for straight dubbing to be carried out onto the
video, in real time, once the video has been completed. With enough computer power to operate
in real time, it would be feasible to record sound and image simultaneously.
4 Sampling Problems
Some form of temporal antialiasing might seem desirable, since the speech parameters are
sampled at 2ms intervals but the facial parameters at only 41.67ms (film at 24 frames/sec) or
33.33ms (video at 30 frames/sec). In practice antialiasing does not appear to be needed. Indeed,
the wrong kind of antialiasing could have a very negative effect, by suppressing facial movements
altogether. Possibly it would be better to motion-blur the image directly, rather than antialiasing
the parameter track definitions. However, this is not simple as the algorithm would have to keep
track of individual pixel movements and processing could become very time consuming. The
simplest approach seems quite convincing, as may be seen in the demonstration film.
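The danger of the wrong kind of antialiasing can be shown numerically with an invented parameter track: a brief lip closure survives direct per-frame sampling, but a frame-width smoothing filter largely flattens it.

```python
def box_filter(track, width):
    """Moving-average smoothing over `width` samples (a crude low-pass)."""
    half = width // 2
    out = []
    for i in range(len(track)):
        window = track[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

# A 20 ms closure (10 samples at 2 ms) in an otherwise open-lip track
track = [0.0] * 50 + [1.0] * 10 + [0.0] * 50
smoothed = box_filter(track, 21)   # ~42 ms window, about one film frame

raw_at_frame = track[55]           # direct sampling at a frame time
smoothed_at_frame = smoothed[55]
```

Here the closure reads at full strength when sampled directly, but at under half strength after smoothing, which on screen would look like the lips never quite closing.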
5 A practical example
5.1 The animated speech process
The first step in manufacturing animated speech with our system is to enter text into the
speech program, which computes the parameters from the posture target-values table.
Text entered at keyboard: speak to me now bad kangaroo
Input as: s p ee k t u m i n ah uu b aa d k aa ng g uh r uu
(Phonetic Representation)
Continuous parameters are generated automatically from discrete target data to drive the
face animation and synthesized speech. The speech may be altered by editing the parameters
interactively until the desired speech is obtained. This process requires a good degree of skill
and experience to achieve human-like speech. At this point all of the parameters are available for
editing. Figure 4 shows a typical screen from the speech editor. The vertical lines represent the
different parts of the diphones (transition between two postures). Three of the eight parameters
(the three lowest formants) are shown in the diagram. Altering these with the graphical editor
alters the position of the formant peaks which in turn changes the sound made by the synthesiser
or the appearance of the face. Although the graphical editor facilitates this process, obtaining the
desired sound requires some skill on the part of the operator. It should be noted that the editing
facility is designed as a research tool.
On the diagram seven postures have been shown—giving six diphones. In fact the system
will output posture information every 2ms, and these values will then be resampled at each frame
time. This parameter set is particular to the Parke face model. A full list of parameters
corresponding to the phonetic symbols is given in table 2, as noted above.
Figure 4: Three of the speech parameters for “Speak to me now, bad kangaroo.”
5.2 Historical Significance of “Speak to me now ...”
The phrase “Speak to me now, bad kangaroo” was chosen for an initial demonstration of our
system for historical reasons. It was the first utterance synthesised by rule by David Hill at
the Edinburgh University Department of Phonetics and Linguistics in 1964. It was chosen
because it was unusual (and therefore hard to hear), and incorporated a good phonetic variety,
especially in terms of the stop sounds and nasals which were of particular interest. At that time
the parameters to control the synthesiser were derived from analog representations in metal ink
that picked up voltages from a device resembling a clothes wringer (a “mangle”), in which one
roller was wound with a resistive coil of wire which impressed track voltages proportional to
the displacement of silver-ink tracks on a looped mylar sheet. The tracks were made continuous
with big globs of the silver ink that also conveyed the voltages through perforations to straight
pick-off tracks for the synthesiser controller that ran along the back of the sheet. When these
blobs ran under the roller, a violent perturbation of the voltages on all tracks occurred. However,
the synthesiser was incapable of a non-vocalic noise so, instead of some electrical cacophony,
the result was a most pleasing and natural belch.
6 Conclusion
This paper has presented a practical guide to using the speech by rule method to produce input
to the Graphicsland animation system. Our approach to automatic lip synch is based on artificial
speech synthesised by simple rules, extended to produce not only the varying parameters needed
for acoustic synthesis, but also similar parameters to control the visual attributes of articulation
as seen in a rendered polygon mesh face (face courtesy of Fred Parke). This joint production
process guarantees perfect synchronisation between the lips and other components of facial
expression related to speech, and the sound of speech. The chief limitations are: the less than
perfect quality of the synthesised speech; the need for more accurate and more detailed facial
data; and the need for a more natural face, embodying the physical motion constraints of real
faces, probably based on muscle-oriented modelling techniques (e.g. [Waters 87]). Future
work will tackle these and other topics including the extension of parameter control to achieve
needed speech/body motion synchrony as discussed in [Hill 88].
7 Acknowledgements
The following graduate students and assistants worked hard in the preparation of software and
video material shown during the presentation: Craig Schock, Corine Jansonius, Trevor Paquette,
Richard Esau and Larry Kamieniecki. We would also like to thank Fred Parke who gave us his
original software and data, and Andrew Pearce for his past contributions.
This research is partially supported by grants from the Natural Sciences and Engineering
Research Council of Canada.