DRAGON SYSTEMS RESOURCE MANAGEMENT BENCHMARK RESULTS FEBRUARY 1991
DRAGON SYSTEMS RESOURCE MANAGEMENT
BENCHMARK RESULTS FEBRUARY 1991

James Baker, Janet Baker, Ousmane Ba, Paul Bamberg, Richard Benedict,
Larry Gillick, Lori Lamel, Robert Roth, Francesco Scattone, Dean Sturtevant

Dragon Systems, Inc.
320 Nevada Street
Newton, Massachusetts 02160
DRAGON@A.ISI.EDU
TEL: (617) 965-5200
FAX: (617) 527-0372

ABSTRACT

In this paper we present preliminary results obtained at Dragon Systems on the Resource Management benchmark task. The basic units of our speaker-dependent continuous speech recognition system are phonemes-in-context (PICs), which are represented as hidden Markov models, each of which is expressed as a sequence of phonetic elements (PELs). The PELs corresponding to a given phoneme constitute a kind of alphabet for spelling the PICs of that phoneme. For the tests, two basic methods of training the acoustic models were investigated. The first method is to re-estimate the models for each test speaker from that speaker's training data, keeping the PEL spellings of the PICs fixed. The second approach is to use the models from the first method to derive a segmentation of the training data, and then to respell the PICs in a largely speaker-dependent manner in order to model each speaker's characteristics more closely. A full explanation of these methods is given, as are the recognition results obtained with each method on the RM1 development test data and on the Feb91 evaluation data. In addition to reporting on the two training strategies, we discuss an N-Best algorithm. This algorithm is a modification of the one proposed by Soong and Huang at the June 1990 workshop; it runs as a post-processing step and uses an A*-search (an algorithm also known as a 'stack decoder').

1. INTRODUCTION

In this paper we report on some preliminary work done at Dragon Systems on the Resource Management task. Dragon Systems' work so far has focused primarily on the development of speaker-dependent continuous speech recognition, so the standard speaker-dependent training and test material was used. A brief conceptual overview of the overall system is given in Section 2, the modifications that were necessary to run the Resource Management task are described in Section 3, and the training algorithms are presented in detail in Section 4. Recognition results for the different system configurations, together with a comparative discussion of the training strategies, are given in Section 5. Since the system is still changing and we believe we are on a quite steep part of the learning curve, our primary goal in presenting these results is to evaluate the modifications to the system by monitoring the corresponding performance differences.

2. OVERVIEW OF THE DRAGON CSR SYSTEM

Dragon Systems' continuous speech recognition system has been described in earlier reports [1,2,3]. A real-time version of the recognizer was demonstrated at the June 1990 DARPA meeting, running on a 486-based PC with an additional TMS32010-based signal processing board. The input speech is sampled, and a short-term spectral representation of the signal -- a set of spectral parameters and an energy term -- is computed every 20 ms and used as input to the HMM-based recognizer. The system is capable of near real-time performance on this task.

The basic modeling unit used throughout the system is the "phoneme-in-context", or PIC.

This work was sponsored by the Defense Advanced Research Projects Agency and was monitored by the Space and Naval Warfare Systems Command under Contract N00039-86-C-0307.
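The PIC/PEL spelling scheme described above can be illustrated with a small sketch. This is not Dragon's implementation: the phoneme, the contexts, the PEL labels, and the spellings below are all invented for illustration.

```python
# Toy illustration of "phonemes-in-context" (PICs) spelled as sequences of
# PELs (phonetic elements). All names and spellings here are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class PIC:
    phoneme: str      # the phoneme being modeled
    left: str         # identity of the preceding phoneme
    right: str        # identity of the succeeding phoneme
    stressed: bool    # whether the phoneme is stressed

# Each phoneme has its own small "alphabet" of PELs; a PIC model is an HMM
# whose state sequence is a spelling in that alphabet.
PEL_ALPHABET = {"ah": ["ah.1", "ah.2", "ah.3"]}

# Spellings for two PICs of the same phoneme: they may share PELs.
SPELLINGS = {
    PIC("ah", "k", "t", True):  ["ah.1", "ah.2", "ah.3"],
    PIC("ah", "s", "n", False): ["ah.1", "ah.3"],   # shares ah.1 and ah.3
}

def hmm_states(word_pics):
    """Concatenate the PEL spellings of a word's PICs into one HMM state chain."""
    states = []
    for pic in word_pics:
        states.extend(SPELLINGS[pic])
    return states

chain = hmm_states([PIC("ah", "k", "t", True), PIC("ah", "s", "n", False)])
print(chain)  # → ['ah.1', 'ah.2', 'ah.3', 'ah.1', 'ah.3']
```

Because PELs are shared among the PICs of a phoneme, re-estimating a PEL from one context's data also refines every other PIC that spells with it.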
A PIC represents a phoneme in a "context" determined by the identity of the preceding and succeeding phonemes, the degree of stress (phonemes in the lexicon are labeled as stressed, unstressed, or syllabic), and whether the phoneme is prepausally lengthened. Each PIC is modeled as a hidden Markov model whose states are drawn from a set of PELs (phonetic elements); PELs may be shared among the PIC models for the same phoneme, so that the PELs corresponding to a given phoneme constitute a kind of alphabet for "spelling" PICs. A detailed description of the function and training of PICs, and of how they are spelled in terms of PELs, may be found in [2]. Modifications made to the PIC pronunciation models are described in Section 4.

Pronunciations for the Resource Management vocabulary were obtained from the entries in Dragon's standard lexicon, related to the corresponding entries in the standard SNOR lexicon; any words not found in Dragon's lexicon were added by hand. The lexicon includes multiple pronunciations and alternative stress patterns for some words. The set of legal PICs was then determined by finding which PICs can occur in word sequences allowed by the word-pair grammar; approximately 30,000 legal PICs arise on this task.

Recognition uses a frame-synchronous dynamic programming matcher. Beam pruning is used to eliminate poor paths, and a rapid match component [3] limits the number of word candidates that must be considered at any point in time. The language model used for the benchmark was the standard word-pair grammar: the log probability of a word was set to a fixed value if the word is allowed by the grammar, with a flag indicating that all other transitions are impermissible.

3. MODIFICATIONS TO THE SYSTEM FOR USE WITH THE RM TASK

In order to be able to run the RM task on the Dragon speaker-dependent continuous speech recognition system, several modifications were necessary. These changes concerned primarily the lexicon, the signal processing, and the data acquisition.

Since the RM training and test material was recorded through hardware different from Dragon's standard signal processing board, the signal processing performed by the TMS32010-based hardware had to be emulated in software, so that models built from the distributed training data would conform to the representation expected by the recognizer. A small set of utterances was recorded and processed through both versions; the recognition error rates obtained were comparable, indicating that the software emulation approximates the hardware reasonably well.

We have not focused on the issue of processing time in the course of this research. The pruning and rapid match parameter settings were chosen conservatively, so as to insure that only a small proportion of the errors would be due to rapid match mistakes, and the parameters of the system were not highly tuned to the task.

4. TRAINING ALGORITHMS FOR THE SPEAKER-DEPENDENT MODELS

Dragon's strategy for phoneme-based training was outlined in an earlier report [2]. We have used fully automatic training, starting from a set of reference models built from a large amount of data recorded by a single reference speaker.
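The frame-synchronous dynamic programming search with beam pruning described above can be sketched in miniature. The left-to-right model, its costs, and the beam width below are invented for illustration; this is not Dragon's matcher.

```python
import math

# Toy frame-synchronous Viterbi search with beam pruning over a single
# left-to-right HMM. Observation and transition costs are negative
# log-likelihoods, so smaller is better. All numbers are invented.

def viterbi_beam(obs_costs, trans_cost, beam=5.0):
    """obs_costs[t][s]: cost of frame t in state s; trans_cost: cost of
    advancing one state (staying put is free). Each frame, hypotheses
    worse than (frame best + beam) are pruned."""
    n_states = len(obs_costs[0])
    INF = math.inf
    score = [INF] * n_states
    score[0] = obs_costs[0][0]              # must start in state 0
    for t in range(1, len(obs_costs)):
        new = [INF] * n_states
        for s in range(n_states):
            stay = score[s]
            advance = score[s - 1] + trans_cost if s > 0 else INF
            best = min(stay, advance)
            if best < INF:
                new[s] = best + obs_costs[t][s]
        best_now = min(new)
        # beam pruning: discard paths far from this frame's best path
        new = [c if c <= best_now + beam else INF for c in new]
        score = new
    return score[-1]                        # cost of ending in the last state

obs = [[0.1, 9.0, 9.0],
       [0.2, 0.1, 9.0],
       [9.0, 0.2, 0.1],
       [9.0, 9.0, 0.1]]
print(viterbi_beam(obs, trans_cost=0.5))    # best path cost, about 1.4
```

In a full recognizer the same frame-synchronous update runs over the concatenated PEL chains of all active word hypotheses, with the rapid match module deciding which words are allowed to become active at all.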
The reference speaker recorded approximately 9000 isolated words and 6000 sentences and phrases, from which the reference models were built. Although less than 10% of this data was drawn from the Resource Management task, most of the legal PICs are represented somewhere in the training material. Legal PICs missing from it are typically ones, like the sequence "ah-uh-ee" that would occur in "WICHITA EAST", that for the most part do not occur in English words and seem unlikely to occur in other data.

The reference speaker's models are used in three distinct ways:

1. The spectral distributions of the PELs are estimated from the reference speaker's data.

2. The durations for the states in each Markov model for a PIC depend on the reference speaker's durations.

3. The sequence of PELs used in the Markov model for a PIC depends on which spelling best fits the reference speaker's productions of the phoneme in the given context.

We report on two techniques for creating speaker-dependent PICs consistent with a new speaker's data. The first is one in which the new speaker's training sentences are segmented into PICs and PELs using a set of base models, and the segments are then used to re-estimate the parameters of the PELs and the duration distributions of the models. This is the usual run of the training algorithm. This approach is very effective in dealing with the PELs, since the 600 training sentences include examples of almost all of the PELs. It is less effective in dealing with PICs, since only about 6000 of the 30000 legal PICs occur in the 600 sentences; and it does nothing to change the spelling of each PIC in terms of PELs.

The second technique, "respelling", begins with the models produced by the first and proceeds in stages.

In the first stage, the data from all of the RM1 speakers were used to adapt the reference speaker's models, with three passes of adaptation performed on these data. Since Dragon's adaptation algorithm does not yet use spectral normalization, this has the effect of averaging models for male and female talkers and "washing out" distinctions, for instance in the formants of vowels. The resulting "multiple speaker" models are not good enough to do speaker-independent recognition, but they serve as a better basis for segmenting a new speaker's data than do the reference speaker's models.

In the second stage, for a given test speaker, adaptation passes are carried out starting from the multiple-speaker models, and the resulting models are used to segment that speaker's training data into PICs. At this point we have a good speaker-dependent set of PEL models (SD-PELs) and a set of segmentations from which to proceed.

In the final stage, respelling is performed for each of the RM1 speakers to produce a new set of PIC models (SD-PICs) -- with new PEL spellings and new duration distributions. The respelling algorithm is as follows:

Step 1: For each phoneme in the training data, the corresponding tokens are extracted from the segmentation. For each PIC involving the phoneme, an appropriately weighted average of these data is computed to estimate the spectra expected for each frame of the PIC. Details of this process may be found in our earlier report[2], but the key idea is to take a weighted average of phoneme tokens taken either from the PIC to be modeled or from closely related PICs.

The number of PICs to be constructed for each phoneme is generally of the same order of magnitude as the number of examples of that phoneme in the 600 training sentences. Since there are examples of only about 6000 PICs in the RM1 training data, for most PICs the models must be based on data with either the left or the right context incorrect. This is frequently the case when a PIC corresponds to a diphone unlikely to occur in the word-pair sentences.

Step 2: Dynamic programming is used to find the sequence of PELs that best matches the model for each PIC, thereby respelling the PIC in terms of PELs. In the course of this process, duration distributions for each PEL in each PIC are also computed.

Step 3: Step 2 results in respelled PICs for those PICs for which sufficient data are available. For the remaining PICs, the PIC models of the reference speaker are used (as in the first technique), resulting in a model for every legal PIC in the grammar. These models serve as a better basis for modeling a new speaker than do the reference speaker's models.
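The weighted averaging in Step 1 can be sketched as follows. The contexts, the spectral values, and the relatedness weights below are invented for illustration; the paper's actual weighting of "closely related" PICs is described in the earlier report it cites.

```python
# Toy version of estimating a PIC's expected spectrum as a weighted average
# of phoneme tokens, down-weighting tokens whose context does not match.
# The weights and the relatedness measure are invented, not Dragon's.

def relatedness(target, token_context):
    """1.0 for an exact context match, smaller when left/right differ."""
    w = 1.0
    if target["left"] != token_context["left"]:
        w *= 0.3
    if target["right"] != token_context["right"]:
        w *= 0.3
    return w

def adapted_mean(target_pic, tokens):
    """tokens: list of (context, spectral_value) pairs for one phoneme."""
    num = den = 0.0
    for ctx, value in tokens:
        w = relatedness(target_pic, ctx)
        num += w * value
        den += w
    return num / den

tokens = [({"left": "k", "right": "t"}, 10.0),   # exact-context token
          ({"left": "k", "right": "n"}, 20.0),   # right context differs
          ({"left": "s", "right": "n"}, 40.0)]   # both contexts differ
target = {"left": "k", "right": "t"}
print(adapted_mean(target, tokens))              # dominated by the exact token
```

The point of the scheme is exactly what the text describes: rare PICs with no exact-context examples still get an estimate, pulled from tokens of the same phoneme in the most closely related contexts available.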

Table 1: Comparison of word error rates (%) for the RM1 speakers under two methods of speaker-dependent training (SD-PELs and respelled SD-PICs), reported for the RM1 development test data and the Feb91 evaluation data. Cells that could not be recovered from the source are marked "-".

Speaker   Development   Development   Evaluation
          (SD-PELs)     (SD-PICs)
BEF       6.3           6.9           6.8
CMR(f)    2.9           -             -
DAS(f)    4.3           4.1           -
DMS(f)    3.4           3.6           -
DTB       7.6           7.2           3.6
DTD(f)    5.6           4.4           7.8
ERS       -             -             -
HXS(f)    2.5           5.6           -
JWS       6.3           4.7           4.5
PGH       5.3           5.5           -
RKM       9.8           9.9           -
TAB       3.6           4.3           5.3
Average   7.0           5.4           7.5
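The Average row of Table 1 implies the relative gain from respelling on the development test data; a quick check of the arithmetic (using only the two development-set averages from the table):

```python
# Relative word-error-rate reduction implied by Table 1's Average row:
# 7.0% (SD-PELs) vs 5.4% (respelled SD-PICs) on the development test data.
sd_pels, sd_pics = 7.0, 5.4
reduction = (sd_pels - sd_pics) / sd_pels
print(f"{100 * reduction:.1f}% relative reduction")  # prints "22.9% relative reduction"
```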
Step 4: The final pass of training consists of resegmenting the training data into PELs and then re-estimating the parameters of the models. In this pass the spellings of the PICs are unchanged.

The steps above give rise to the two sets of speaker-dependent models with which we have experimented. The first set is referred to as the RM SD-PELs. The second set is the output of the respelling algorithm and is referred to as the RM SD-PICs. Both sets of models may contain PICs unchanged from the reference speaker, used when no training data was available -- mainly rare PICs -- while most PELs are used in a variety of PICs.

5. RECOGNITION EXPERIMENTS AND DISCUSSION

In this section we present recognition results making use of the two sets of speaker-dependent models, as well as results on post-processing with the N-Best algorithm.

5.1 Comparison of two methods for speaker-dependent training

The word error rates for the two training strategies are shown in Table 1. In the table we display word error rates on the development test sentences for each of the speakers, and we also display the error rates on the Feb91 evaluation data for the final speaker-dependent models.
Table 2: Cumulative percentage of sentences for which the correct transcription appears among the top N choices returned by the N-Best algorithm, as a function of the choice number N (1 through 15). [The individual cell values could not be recovered from the source.]
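The N-Best idea behind Table 2 -- a best-first A*-style ("stack decoder") search that pops whole-sentence hypotheses in score order -- can be sketched on a toy word lattice. The lattice and its costs are invented; with a zero heuristic, as here, the search reduces to uniform-cost enumeration, which still returns hypotheses in exact score order.

```python
import heapq

# Toy N-best enumeration over a word lattice with a best-first ("stack")
# search. Arc costs are invented negative log scores; the heuristic is 0,
# so partial hypotheses pop strictly by accumulated cost and complete
# hypotheses emerge in nondecreasing full-sentence score order.

def n_best(lattice, start, goal, n):
    """lattice: {node: [(word, cost, next_node), ...]}. Returns up to n
    (cost, word_sequence) pairs, best first."""
    stack = [(0.0, start, ())]          # priority queue of partial hypotheses
    results = []
    while stack and len(results) < n:
        cost, node, words = heapq.heappop(stack)
        if node == goal:
            results.append((cost, list(words)))
            continue
        for word, c, nxt in lattice.get(node, []):
            heapq.heappush(stack, (cost + c, nxt, words + (word,)))
    return results

lattice = {0: [("what", 1.0, 1), ("watt", 1.5, 1)],
           1: [("is", 0.5, 2), ("was", 0.9, 2)]}
for cost, words in n_best(lattice, start=0, goal=2, n=3):
    print(cost, words)
```

Run on this lattice the search returns "what is", then "what was", then "watt is". A production implementation would, as the paper notes, reuse scores saved from the forward recognition pass rather than rescore the lattice from scratch.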
On the development test data, respelling reduced the average word error rate from 7.0% to 5.4%, a relative reduction of about 22%.

An analysis of the errors is enlightening. Approximately 38% of the errors involve function words, which are often short and unstressed. Word deletions are more common than insertions, and the observed confusions may be symmetric ("and" -> "in" roughly as frequent as "in" -> "and") or asymmetric ("a" -> "the" occurs, but the reverse does not). Other common errors involve contractions: "what is" -> "what+s" and "when will" -> "when+ll". Still others involve homophones and close pronunciation variants (such as "ships+s" for "ships") or confusable content words, for instance "SPS-40" misrecognized as "SPS-48".

5.2 N-Best Algorithm Test

A second recognition pass using an N-Best algorithm has been implemented. The algorithm is an extension of the one proposed by Soong and Huang[4]. It runs as a post-processing step and uses an A*-search (a 'stack decoder'), with scores saved from the forward pass of the recognizer used to guide the search in the reverse direction. A full description of the implementation is beyond the scope of this paper, but we note a key difference from Soong and Huang: the scores used in the reverse pass are only approximations in our implementation. Although running the N-Best pass increases the amount of computation, the additional cost is only a small fraction of the total processing time.

The algorithm was run on the 1200 development test sentences, returning multiple alternative transcriptions for each. The correct transcription almost always appeared as one of the top choices, and the cumulative percentage correct as a function of choice number is given in Table 2.

7. REFERENCES

1. P. Bamberg, Y.L. Chow, L. Gillick, R. Roth, and D. Sturtevant, "The Dragon Continuous Speech Recognition System: A Real-Time Implementation," Proceedings of the DARPA Speech and Natural Language Workshop, June 1990, pp. 78-81.

2. P. Bamberg and L. Gillick, "Phoneme-in-Context Modeling for Dragon's Continuous Speech Recognizer," Proceedings of the DARPA Speech and Natural Language Workshop, June 1990, pp. 163-169.

3. L. Gillick and R. Roth, "A Rapid Match Algorithm for Continuous Speech Recognition," Proceedings of the DARPA Speech and Natural Language Workshop, June 1990, pp. 170-172.

4. F.K. Soong and E.-F. Huang, "A Tree-Trellis Based Fast Search for Finding the N-Best Sentence Hypotheses in Continuous Speech Recognition," Proceedings of the DARPA Speech and Natural Language Workshop, June 1990, pp. 12-19.

5. R. Schwartz et al., "Context-Dependent Modeling for Acoustic-Phonetic Recognition of Continuous Speech," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1985.

6. L. Bahl et al., "Large Vocabulary Natural Language Continuous Speech Recognition," Proceedings of ICASSP 89, Glasgow, May 1989.

7. K.F. Lee et al., "The Sphinx Speech Recognition System," Proceedings of ICASSP 89, Glasgow, May 1989.

8. L. Bahl, R. Bakis, P. de Souza, and R. Mercer, "Matrix Fast Match: A Fast Method for Identifying a Short List of Candidate Words for Decoding," Proceedings of ICASSP 89, Glasgow, May 1989.

9. X. Aubert, "Fast Look-Ahead Pruning Strategies in Continuous Speech Recognition," Proceedings of ICASSP 89, Glasgow, May 1989.

10. L. Bahl, P. Gopalakrishnan, D. Kanevsky, and D. Nahamoo, "Obtaining Candidate Words by Polling in a Large Vocabulary Speech Recognition System," Proceedings of ICASSP 88, New York City, 1988.