Audio-visual automatic speech recognition (AV-ASR)
Rainer Stiefelhagen
Lecture "Visuelle Perzeption für Mensch-Maschine Schnittstellen" (Visual Perception for Human-Machine Interfaces), WS 2009/2010
February 8, 2010
Interactive Systems Laboratories, Universität Karlsruhe (TH)
Computer Vision for Human-Computer Interaction Research Group (cv:hci), Universität Karlsruhe (TH)

Overview
• Introduction
  - Motivation, McGurk effect
• Visual feature extraction
  - Appearance-based features
  - Model-based features
• AV-Speech recognition
  - Basic building blocks of ASR systems
  - Visemes vs. phonemes
  - AV-Fusion approaches
• Recent work at ISL
  - AV-ASR from multiple views

McGurk Experiment
[Figure: McGurk experiment, shown on three consecutive slides]

McGurk Effect
• People fuse visual and acoustic information
• Visual information complements acoustic information
• The effect works in almost all languages
  - Weaker in some (Japanese, Chinese)
• Appears to work particularly well for visible phones
• Bateson experiments
  - In conversation, random eye gaze is reduced under noise
  - Visual information becomes more important in noise

What is automatic audio-visual speech recognition (AV-ASR)?
• Conventional ASR systems use only audio (speech) data as input.
• Audio-visual ASR systems use audio and visual (video) data.
  - Images of the lip region are mainly used as visual data.
• Audio-visual speech recognition is also called bimodal speech recognition.

What is the motivation?
• Humans use both audio and visual information to communicate smoothly with each other.
  - People can compensate for insufficient speech information with visual information.
• Visual cues are often complementary to audio cues
  - "ma" vs. "na" (easier from vision)
  - "pa" vs. "ba" (easier from audio)
• Can we improve the performance of ASR systems by using both audio and visual information?

Are visual cues useful for human perception? (Potamianos et al., Proc. Eurospeech, Sep. 2001)
[Figure: human perception results with and without visual cues]
• Humans improve their performance by using visual information!

Basic processing blocks
• Audio path: Audio Data → Audio Feature Extraction → Audio vector
• Visual path: Visual Data → Face Detection → Lip Detection → Visual Feature Extraction → Visual vector
• Audio vector + Visual vector → Audio-Visual ASR

Mouth Localization Approaches
• Early work: manual/semi-automatic approaches
  - Use a fixed window / no head movement
  - Use lipstick with easy-to-extract colors
• Automatic approaches
  - Simple templates (very problematic)
  - Integral images (see lecture 6 on head pose)
  - Haar-filter cascades (see lecture 3)
  - Deformable models: snakes, active contours, active shape models, active appearance models

Visual feature extraction
• Appearance (image) based features
  - Pixel values of a region of interest (ROI), such as a lip image, are used directly.
  - Easier, more robust extraction
  - High dimensionality (→ PCA, LDA, FFT, DCT, differences between adjacent frame images)
• Model-based features
  - Assume that most information is in the shape of the lips
  - Model parameters are used for recognition
  - Lower dimensionality
  - More difficult to obtain
  - Example: Active Shape Model (ASM)
• Hybrid approaches
  - Active Appearance Model

Appearance-based features
• Pixel values of a region of interest (ROI), such as a lip image, are used directly
  - ROI → feature vector
• Advantage: easier, more robust extraction
• Disadvantages:
  - Illumination variations → histogram normalization, etc.
  - High dimensionality of the feature vector → PCA, LDA

Histogram Normalization
• Use a normalized greyscale image of the mouth
• Gray-value modification (example: histogram): $f'(p) = T(f(p))$
  - $f(p)$: original gray value
  - $T$: modification function
  - $f'(p)$: new gray value
[Figure: mouth images and histograms before and after normalization]
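
The slide leaves the modification function $T$ unspecified. As a minimal sketch, assuming histogram equalization as one common choice of $T$ and an 8-bit greyscale ROI (the function name `equalize_histogram` is hypothetical), the mapping $f'(p) = T(f(p))$ could look like this:

```python
import numpy as np

def equalize_histogram(image: np.ndarray) -> np.ndarray:
    """Apply f'(p) = T(f(p)) with T built from the cumulative histogram.

    Assumes an 8-bit greyscale image (dtype uint8). Histogram equalization
    is only one possible choice of T; the slide does not fix it.
    """
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf_min = cdf[np.nonzero(hist)[0][0]]  # CDF at the lowest occupied gray value
    denom = cdf[-1] - cdf_min
    if denom == 0:
        return image.copy()  # constant image: nothing to stretch
    # Lookup table T: one new gray value per original gray value.
    T = np.round((cdf - cdf_min) / denom * 255.0)
    T = np.clip(T, 0, 255).astype(np.uint8)
    return T[image]

img = np.array([[50, 50], [100, 150]], dtype=np.uint8)
out = equalize_histogram(img)
```

The lookup table is monotonic, so the ordering of gray values is kept while the occupied range is stretched to 0..255.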

FFT
• Transform the image of the mouth region using the FFT
  - Transformation to the frequency domain
  - Invariant to translation
  - Frequency-based features are known to be helpful for ASR
• Lower-frequency components contain the most relevant information for visual speech recognition
  - Too many high-frequency components in the feature vector are not useful (they contain information about wrinkles etc.)
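
A minimal sketch of such a low-frequency FFT feature, assuming the feature is the magnitude spectrum around the zero frequency (the function name `fft_lowfreq_features` and the `radius` parameter are hypothetical, not from the lecture). Taking magnitudes discards phase, which is what makes the feature insensitive to (circular) shifts of the ROI:

```python
import numpy as np

def fft_lowfreq_features(roi: np.ndarray, radius: int = 2) -> np.ndarray:
    """Return magnitudes of the spatial frequencies within +/- radius of DC."""
    spectrum = np.fft.fft2(roi.astype(np.float64))
    # fftshift moves the zero-frequency term to index (rows // 2, cols // 2).
    magnitude = np.fft.fftshift(np.abs(spectrum))
    cy, cx = roi.shape[0] // 2, roi.shape[1] // 2
    block = magnitude[cy - radius: cy + radius + 1, cx - radius: cx + radius + 1]
    return block.ravel()

rng = np.random.default_rng(0)
roi = rng.random((16, 24))                      # stand-in for a mouth ROI
shifted = np.roll(roi, shift=(3, 5), axis=(0, 1))
f_orig = fft_lowfreq_features(roi)
f_shift = fft_lowfreq_features(shifted)
```

Here `f_orig` and `f_shift` agree up to floating-point error, while the high-frequency coefficients (wrinkles, noise) are simply not copied into the feature vector.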

FFT-based feature
• Processing chain: normalization of the illumination condition → FFT → (smoothing) → feature vector

Discrete Cosine Transform (DCT)
• Transform the mouth image by DCT
• Easy and fast implementation
• Compact representation
• The number of DCT coefficients is too high
  - only coefficients with high energy are used as elements of the feature vector
  - the extracted coefficients are usually in the low-frequency range

$$D_{u,v} = C_u \, C_v \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I_{m,n} \, \cos\!\left[\frac{(2m+1)\,u\,\pi}{2M}\right] \cos\!\left[\frac{(2n+1)\,v\,\pi}{2N}\right]$$

Here $I_{m,n}$ is the $M \times N$ mouth image and $C_u$, $C_v$ are normalization factors.
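
The DCT formula and the selection of high-energy coefficients can be sketched as follows. This assumes the orthonormal scaling $C_0 = \sqrt{1/M}$, $C_u = \sqrt{2/M}$ for $u > 0$, and assumes the coefficient positions are chosen once from training images; the function names are hypothetical and the lecture does not specify either detail.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Row u holds C_u * cos((2m + 1) * u * pi / (2n)) for m = 0..n-1."""
    m = np.arange(n)
    u = m.reshape(-1, 1)
    basis = np.cos((2 * m + 1) * u * np.pi / (2 * n))
    scale = np.full((n, 1), np.sqrt(2.0 / n))
    scale[0, 0] = np.sqrt(1.0 / n)
    return scale * basis

def dct2(image: np.ndarray) -> np.ndarray:
    """2D DCT: D[u, v] = sum_m sum_n A_M[u, m] * I[m, n] * A_N[v, n]."""
    rows, cols = image.shape
    return dct_matrix(rows) @ image.astype(np.float64) @ dct_matrix(cols).T

def high_energy_positions(training_images, k: int) -> np.ndarray:
    """Flat indices of the k coefficients with the largest summed energy."""
    energy = sum(dct2(img) ** 2 for img in training_images)
    return np.argsort(energy.ravel())[::-1][:k]

def dct_features(image: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Feature vector: the DCT coefficients at the selected positions."""
    return dct2(image).ravel()[positions]

flat_image = np.ones((4, 6))
positions = high_energy_positions([flat_image], k=1)
features = dct_features(flat_image, positions)
```

For the constant test image all energy sits in the DC coefficient, which illustrates why the selected coefficients tend to lie in the low-frequency range.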

Model-based approaches
• Deformable templates
  - Use a priori knowledge about the shape and appearance of the object
  - Hand-tuned parametric model and energy function
  - Fitting by minimizing the energy function
• Model parameters can be used for audio-visual speech recognition

Model-based approaches (2)
• Active Shape Models
  - Statistical model
  - Trained on sample data
  - Fitting mainly based on shape
• Shape and intensity parameters can be used for visual speech recognition

Hybrid Approaches
• Active Appearance Model (AAM)
  - Statistical model
  - The AAM learns the correlation between shape and appearance
  - Parameters are optimized so as to minimize the difference between a synthesized image and the target image
  - Fitting is based on the whole appearance of the face
• Model parameters are used for visual speech recognition
  - The parameters model shape and texture

Summary of visual feature extraction
• In experiments on small databases, shape-based methods outperform appearance-based ones.
  - Relies on good lip tracking
• In experiments on large databases, appearance-based methods seem to be superior to shape-based ones.
  - More robust than shape-based features

Joint audio-visual speech recognition

Basic Processing Blocks
• Audio path: Audio Data → Audio Feature Extraction
• Visual path: Visual Data → Face Detection → Lip Tracking → Visual Feature Extraction
• Both feature streams → Audio-Visual ASR

The fundamentals of ASR
1. Build HMMs of all phonemes from feature vectors (training)
   - Speech, e.g. /a/ → feature extraction → training → HMM /a/
   - Each state has an output probability for feature vectors
2. Recognize input speech with the trained HMMs (test)
   - Input speech → feature extraction → trained HMMs → recognition result (text)

Speech Recognition (System Components)
Recognizer components:
• Analog speech → Front End → observation sequence O1 O2 … OT → Decoder → best word sequence W1 W2 … WT
• The decoder draws on the acoustic model, the dictionary and the language model

Continuous Speech Recognition
• Goal: given observed features $O = o_1, o_2, \ldots, o_k$, find the word sequence $W = w_1, w_2, \ldots, w_n$ such that $P(W \mid O)$ is maximized
• Bayes rule:
$$P(W \mid O) = \frac{P(O \mid W) \cdot P(W)}{P(O)}$$
  - $P(O \mid W)$: acoustic model (HMMs)
  - $P(W)$: language model
  - $P(O)$ is a constant for a complete sentence
• In the case of audio-visual speech recognition: maximise $P(W \mid O_a, O_v)$
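
Because $P(O)$ is the same for every candidate word sequence, the search only needs to compare $P(O \mid W) \cdot P(W)$. A minimal sketch in the log domain, with made-up hypotheses and probabilities chosen purely for illustration (`best_hypothesis` is a hypothetical name):

```python
import math

def best_hypothesis(acoustic_logp: dict, language_logp: dict) -> str:
    """Return argmax_W [log P(O|W) + log P(W)]; log P(O) is dropped."""
    return max(acoustic_logp, key=lambda w: acoustic_logp[w] + language_logp[w])

# Illustrative numbers only: the acoustic model slightly prefers the second
# hypothesis, but the language model strongly prefers the first.
acoustic = {"recognize speech": math.log(0.02), "wreck a nice beach": math.log(0.03)}
language = {"recognize speech": math.log(0.01), "wreck a nice beach": math.log(0.001)}
winner = best_hypothesis(acoustic, language)
```

In the audio-visual case the acoustic term is replaced by a score that depends on both observation streams.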

Phoneme and viseme
• A phoneme is the basic linguistic unit and is acoustically distinguishable.
  - English can be described with about 35-70 phonemes. ASR usually uses about 40 to 50.
• A viseme is a visually distinguishable speech unit.
  - Several phonemes can correspond to the same viseme.
  - The number of visemes is much smaller than the number of phonemes: typically around 15.
  - There is no universal agreement about the exact mapping between phonemes and visemes.
  - The mapping depends strongly on speakers and speaking style.
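
The many-to-one relation between phonemes and visemes can be sketched as a lookup table. The grouping below is an illustrative assumption based on place of articulation, not a mapping from the lecture; as noted above, there is no universally agreed mapping.

```python
# Hypothetical, heavily simplified phoneme-to-viseme table (illustration only).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
}

def visually_confusable(phoneme_a: str, phoneme_b: str) -> bool:
    """True if both phonemes fall into the same viseme class."""
    return PHONEME_TO_VISEME[phoneme_a] == PHONEME_TO_VISEME[phoneme_b]

pa_vs_ba = visually_confusable("p", "b")  # same lip closure: needs audio
ma_vs_na = visually_confusable("m", "n")  # different lip shape: vision helps
```

Under this assumed grouping the table reproduces the earlier examples: "pa" vs. "ba" look alike, whereas "ma" vs. "na" differ visually.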

An example of visemes in ASR (Neti et al., Final Workshop 2000 at The Johns Hopkins Univ.)
[Table: phoneme-to-viseme grouping]
• The phonemes on each line belong to the same viseme.

Audio-Visual Speech Modeling for ASR
• How should we model audio and visual features for ASR?
• What is the relation between audio and visual features?

Characteristics of audio and visual features
• Audio and visual phonetic events happen synchronously, but with a time lag
• Example: speech "aida"
[Figure: audio and lip signals for "aida", showing the time lag]
  - After the lips open, the voice is uttered.
  - After the utterance is finished, the lips close.

Techniques for integrating audio and visual information
• Feature fusion
  - combines audio and visual information at the feature vector level.
  - one classifier is used.
• Decision fusion
  - integrates audio and visual information at the classifier level.
  - two classifiers, an audio and a visual classifier, are used.

Feature fusion
• Feature fusion uses a single classifier to model the concatenated vector of time-synchronous audio and visual features.
1. Simple concatenation
2. Hierarchical LDA feature fusion
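
Simple concatenation can be sketched as below. The sketch assumes that both streams cover the same time span and that the slower visual stream is brought to the audio frame rate by repeating frames; the lecture only requires the features to be time-synchronous, so this alignment step and the function name are assumptions.

```python
import numpy as np

def concatenate_streams(audio: np.ndarray, visual: np.ndarray) -> np.ndarray:
    """Concatenate per-frame audio and visual features.

    audio:  shape (T_a, D_a), one row per audio frame
    visual: shape (T_v, D_v), one row per video frame, T_v <= T_a
    Each audio frame is paired with the video frame covering the same time.
    """
    n_audio, n_visual = audio.shape[0], visual.shape[0]
    idx = np.minimum((np.arange(n_audio) * n_visual) // n_audio, n_visual - 1)
    return np.hstack([audio, visual[idx]])

audio_feats = np.arange(24, dtype=np.float64).reshape(8, 3)   # 8 audio frames
visual_feats = np.array([[0.1, 0.2], [0.3, 0.4]])             # 2 video frames
fused = concatenate_streams(audio_feats, visual_feats)
```

The fused vectors are then modeled by a single classifier, which is what distinguishes feature fusion from decision fusion.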

Hierarchical LDA feature fusion (Potamianos et al., ICASSP, 2001)
• Audio feature vector → concatenation of adjacent frame vectors → LDA
• Visual feature vector → LDA → concatenation of adjacent frame vectors → LDA
• Concatenation of audio & visual vectors → LDA → audio-visual feature vector

Overview of IBM's system (Potamianos et al., 2004)
[Figure: system overview]

Decision fusion
• Classifier integration at the hidden state level
  - Synchronous multi-stream HMMs
  - Intermediate integration
• Classifier integration at the phone or word level
  - Asynchronous product HMM
  - Intermediate integration
• Classifier integration at the utterance level
  - Late integration

A scheme of classifier integration at the hidden state level
[Figure: integration scheme ending in recognition (audio-visual)]

Synchronous multi-stream HMMs
• Audio HMM: states 1, 2, 3
• Visual HMM: states 1, 2, 3
• Output probability of an audio-visual feature at state j = $P_{a,j}(O_a(t))^{\lambda_a} \times P_{v,j}(O_v(t))^{\lambda_v}$
  - $P_{a,j}(O_a(t))$: output probability of an audio feature
  - $P_{v,j}(O_v(t))$: output probability of a visual feature
  - $\lambda_a$, $\lambda_v$: stream weights which represent the reliabilities of audio and visual information
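
In the log domain the weighted product becomes a weighted sum. A minimal sketch, assuming diagonal-covariance Gaussian state densities for each stream (the lecture does not fix the density type; all names and numbers here are hypothetical):

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def multistream_logprob(logp_audio, logp_visual, lam_a, lam_v):
    """log of P_a^lam_a * P_v^lam_v = lam_a * log P_a + lam_v * log P_v."""
    return lam_a * logp_audio + lam_v * logp_visual

# One state j, one time step t, with made-up parameters.
logp_a = gaussian_logpdf([0.2, -0.1], mean=[0.0, 0.0], var=[1.0, 1.0])
logp_v = gaussian_logpdf([1.5], mean=[1.0], var=[0.5])
state_score = multistream_logprob(logp_a, logp_v, lam_a=0.7, lam_v=0.3)
```

Setting `lam_v` to 0 reduces the state score to a scaled audio-only score, which is how the weights express stream reliability.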

A scheme of classifier integration at the phone or word level (intermediate integration)
[Figure: integration scheme]

Asynchronous Product HMM
[Figure: product HMM state grid]
• Output probability at state ij:
$$b_{ij}(O) = b_i^{(a)}(O^{(a)})^{\lambda_a} \times b_j^{(v)}(O^{(v)})^{\lambda_v}$$

Re-training Product HMM (asynchronous event)
[Figure: re-training of the product HMM]

A typical scheme of classifier integration at the utterance level (late integration)
[Figure: integration scheme]

Late integration (LI)
• Integration at the utterance level:
$$L_{result} = (L_{audio})^{\lambda_a} \times (L_{visual})^{\lambda_v}$$
• Output probability of an audio utterance:
$$L_{audio} = \prod_{t=t_{start}}^{t_{end}} P_{a,j(t)}(O_a(t))$$
• Output probability of a visual utterance:
$$L_{visual} = \prod_{t=t_{start}}^{t_{end}} P_{v,j(t)}(O_v(t))$$
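
A sketch of late integration over a small set of competing hypotheses, working with log-likelihoods so that the per-frame product becomes a sum. The hypothesis labels, scores and function names are invented for illustration; the two recognizers are assumed to score the same candidate list.

```python
def utterance_loglik(frame_logprobs):
    """log L = sum over t of log P_{j(t)}(O(t)) along the best state path."""
    return sum(frame_logprobs)

def late_integration(audio_scores, visual_scores, lam_a, lam_v):
    """Combine utterance log-likelihoods: lam_a * log L_a + lam_v * log L_v."""
    combined = {
        hyp: lam_a * audio_scores[hyp] + lam_v * visual_scores[hyp]
        for hyp in audio_scores
        if hyp in visual_scores
    }
    return max(combined, key=combined.get), combined

audio_scores = {"yes": utterance_loglik([-4.0, -6.0]), "no": utterance_loglik([-5.0, -7.0])}
visual_scores = {"yes": utterance_loglik([-3.0, -5.0]), "no": utterance_loglik([-1.0, -2.0])}

best_equal, scores_equal = late_integration(audio_scores, visual_scores, 0.5, 0.5)
best_audio, _ = late_integration(audio_scores, visual_scores, 1.0, 0.0)
```

With equal weights the visual evidence tips the decision to "no"; with all weight on audio the result is "yes". Note that both recognizers have to be run in full before the scores can be combined.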

Summary
• Synchronous multi-stream HMMs
  - decide phoneme durations based on audio labels → cannot sufficiently represent visual features.
• Late integration
  - processes audio and visual data independently → ignores the synchronization between audio and visual features
  - runs two processes when recognizing speech → increases computation (a serious problem in large-vocabulary speech recognition)

Discussion
Advantages of intermediate integration
• Asynchronous (AS) vs. synchronous multi-stream HMMs
  - AS HMMs allow audio and visual events to occur asynchronously → they can represent the relationship between audio and visual features.
• Intermediate integration (II) vs. late integration
  - A one-pass (Viterbi) algorithm is available → II doesn't need to run two processes
A disadvantage of intermediate integration
• It uses a lot of memory.

Results of a word recognition experiment (audio SNR -5 dB)
[Figure: recognition results, including synchronous multi-stream HMMs]

How can we decide which information is reliable?
• Estimating which information is more reliable → improves the recognition performance
  - Strong acoustic noise → visual information is more reliable
  - Strong image noise → audio information is more reliable

Estimating Stream Weights
• Output probability at state ij:
$$b_{ij}(O) = b_i^{(a)}(O^{(a)})^{\lambda_a} \times b_j^{(v)}(O^{(v)})^{\lambda_v}$$
• Purpose: automatically optimize the audio stream weight $\lambda_a$ and the visual stream weight $\lambda_v$
• Which measure is appropriate for estimating them?

What is the confidence measure to estimate stream exponents?
• Based on the minimum classification error criterion
  - adjusts weights during the training phase
  - keeps weights fixed during testing!
• Use stream entropy
  - a strong peak in the log-likelihood of the HMMs (entropy close to zero) indicates strong confidence
• N-best output score dispersion
  - If the competing candidates score close to the first one, that modality is considered unreliable.
• Based on the audio signal-to-noise ratio (SNR)
  - The worse the audio SNR gets, the higher the weight of the video stream (does not consider any video noise!)
• Train something (e.g. ANNs) to learn the best weights
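
Two of these confidence measures can be sketched per stream as follows. The entropy is computed over normalized class likelihoods (a uniform prior is assumed), the dispersion is simplified to the mean gap between the best score and the remaining N-best scores, and the final mapping from confidences to weights is a hypothetical normalization; none of these details are specified in the lecture.

```python
import math

def posterior_entropy(log_likelihoods):
    """Entropy of the normalized likelihoods; close to 0 for a sharp peak."""
    top = max(log_likelihoods)
    exps = [math.exp(l - top) for l in log_likelihoods]
    total = sum(exps)
    return -sum((e / total) * math.log(e / total) for e in exps if e > 0.0)

def nbest_dispersion(log_likelihoods):
    """Mean gap between the best score and the other N-best scores."""
    ranked = sorted(log_likelihoods, reverse=True)
    return sum(ranked[0] - s for s in ranked[1:]) / (len(ranked) - 1)

def weights_from_confidence(conf_audio, conf_visual):
    """Hypothetical mapping: normalize two positive confidences to sum to 1."""
    lam_a = conf_audio / (conf_audio + conf_visual)
    return lam_a, 1.0 - lam_a

audio_nbest = [-10.0, -13.0, -16.0]    # clear winner: reliable stream
visual_nbest = [-10.0, -10.5, -11.0]   # candidates close together: unreliable
lam_a, lam_v = weights_from_confidence(
    nbest_dispersion(audio_nbest), nbest_dispersion(visual_nbest)
)
```

Unlike the SNR-based measure, both quantities can be computed for the visual stream as well, so image noise also lowers the corresponding weight.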

Comparison of the confidence measures (Potamianos et al., ICSLP 2000)

System                             Word accuracy
Audio-only                         50.38%
Visual-only                        28.34%
Stream entropy                     54.44%
N-best output score dispersion     55.19%
Average of N-best output scores    55.05%
Minimum classification error       59.88%

Experimental conditions: context-independent GMM with 5 mixtures

Word Error Rate for Audio SNR (Potamianos et al., MIT Press)
[Figure: word error rate as a function of audio SNR]

Summary
• Humans implicitly and unconsciously use both modalities, speech and visual appearance
• Using both modalities improves speech recognition
  - both for humans and for automatic computer systems
  - in particular under noisy audio conditions
• Video features
  - appearance-based: a transformed image of the lip region is used for recognition
    - normalized greyscale image, FFT, DCT (plus LDA, PCA)
  - model-based: a lip model is extracted, recognition is based on (transformed) model parameters
    - active shape models, active contours, snakes
  - hybrid approach: active appearance models

Summary (2)
• Phonemes and visemes
  - Visemes are classes of visually distinguishable sounds
• Classification typically with HMMs
• Fusion is possible on various levels
  - Early feature integration
  - Late integration (word or phoneme/viseme level)
  - Intermediate integration seems to work best
    - at sub-phone/viseme level (HMM states)
    - synchronous, asynchronous multi-stream HMMs

References
Gerasimos Potamianos, Chalapathy Neti, Juergen Luettin, Iain Matthews, "Audio-Visual Automatic Speech Recognition: An Overview," in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Eds., MIT Press, 2004.