Media Coordination in SmartKom
Norbert Reithinger
Dagstuhl Seminar “Coordination and Fusion in Multimodal Interaction”
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Stuhlsatzenhausweg 3, Geb. 43.1, 66123 Saarbrücken
Tel.: (0681) 302-5346
Email: [email protected]
www.dfki.de/~bert
Overview
• Situated Delegation-oriented Dialog Paradigm
• More About the System Software
• Media Coordination Issues
• Media Processing: The Data Flow
• Processing the User's State
• Media Fusion
• Media Design
• Conclusion
The SmartKom Consortium
[Map of the consortium: partners include MediaInterface, European Media Lab, Univ. of Munich, Univ. of Stuttgart, Univ. of Erlangen, and DFKI Saarbrücken (main contractor), with sites in Saarbrücken, Aachen, Dresden, Berkeley, Stuttgart, Munich, Heidelberg, and Ulm]

Project budget: € 25.5 million
Project duration: 4 years (September 1999 – September 2003)
Situated Delegation-oriented Dialog Paradigm
[Diagram: the user specifies a goal and delegates the task to the personalized interaction agent Smartakus; user and agent cooperate on problems; the agent asks questions, presents results, and accesses the IT services (Service 1, Service 2, Service 3) on the user's behalf]
More About the System
• Modules realized as independent processes
• Not all modules must be running (critical path: speech or graphic input to speech or graphic output)
• (Mostly) independent of display size
• Pool Communication Architecture (PCA) based on PVM for Linux and NT
• Modules know about their I/O pools
• Literature:
  – Andreas Klüter, Alassane Ndiaye, Heinz Kirchmann: Verbmobil From a Software Engineering Point of View: System Design and Software Integration. In: Wolfgang Wahlster (ed.): Verbmobil: Foundations of Speech-to-Speech Translation. Springer, 2000.
• Data exchanged as M3L documents (see the sketch below)
• All modules and pools are visualized here: C:\Documents and Settings\bert\Desktop\SmartKom-Systeminfo\index.html
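A minimal sketch of the pool idea, not the actual PCA/PVM API: modules run as independent workers that only know the names of their input and output pools and exchange time-stamped, M3L-style XML documents through them. Pool names and the publish/consume helpers are my own assumptions.

```python
# Sketch only: pool-mediated exchange of time-stamped XML documents.
import queue
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

pools = {"speech.words": queue.Queue(), "fusion.intentions": queue.Queue()}

def publish(pool_name, payload_xml):
    """Wrap a payload in a time-stamped document and put it into a pool."""
    doc = ET.Element("m3l", timestamp=datetime.now(timezone.utc).isoformat())
    doc.append(payload_xml)
    pools[pool_name].put(ET.tostring(doc, encoding="unicode"))

def consume(pool_name, timeout=1.0):
    """Blocking read from an input pool; a module polls only its own pools."""
    return pools[pool_name].get(timeout=timeout)

# Example: a recognizer module writes a word hypothesis ...
hyp = ET.Element("wordHypothesis")
hyp.text = "zeige mir das Kino-Programm"
publish("speech.words", hyp)

# ... and a downstream analysis module reads it from its input pool.
print(consume("speech.words"))
```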
Media Coordination Issues
• Input:
  – Speech
    • Words
    • Prosody: boundaries, stress, emotion
  – Mimics: neutral, anger
  – Gesture:
    • Touch-free (public scenario)
    • Touch-sensitive screen
• Output:
  – Display objects
  – Speech
  – Agent: posture, gesture, lip movement
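As an illustration only, one possible typing of the channels listed above; the names are my own, not SmartKom's M3L schema.

```python
# Hypothetical data model for the input and output modalities.
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

class UserState(Enum):
    NEUTRAL = "neutral"
    ANGER = "anger"

@dataclass
class SpeechInput:
    words: List[str]               # word hypotheses from the recognizer
    boundaries: List[int]          # prosodic boundary positions
    stressed: List[int]            # indices of stressed words
    emotion: UserState             # prosody-based emotion estimate

@dataclass
class GestureInput:
    kind: str                      # "pointing", "tarrying", "encircling"
    location: Tuple[float, float]  # screen coordinates
    touch: bool                    # touch screen vs. touch-free

@dataclass
class OutputPlan:
    display_objects: List[str]     # object IDs to render
    speech: Optional[str]          # prompt to synthesize
    agent_behaviour: List[str]     # posture/gesture/lip-movement schemata
```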
Media Processing: The Data Flow
[Data-flow diagram: speech (words plus prosody/emotion), gesture, and mimics (neutral or anger) feed media fusion and interaction modeling; the dialog core exchanges user state, domain information, and system state with presentation (media design), which produces display objects with reference IDs and locations, speech output, and the agent's posture and behaviour]
The Input/Output Modules
Processing the User‘s State
• User states: neutral and anger
• Recognized using mimics and prosody
• In case of anger, the dynamic help in the dialog core engine is activated (see the sketch below)
• Elmar Nöth will hopefully tell you more about this in his talk "Modeling the User State – The Role of Emotions"
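A hedged sketch of this step: fusing the two user-state estimates (mimics and prosody) and triggering dynamic help on anger. The scores, threshold, and the trigger_dynamic_help() hook are assumptions, not SmartKom internals.

```python
# Combine channel-wise anger scores and map them to a discrete user state.
from dataclasses import dataclass

@dataclass
class StateHypothesis:
    anger_score: float   # 0.0 = clearly neutral, 1.0 = clearly angry

def fuse_user_state(mimics: StateHypothesis,
                    prosody: StateHypothesis,
                    threshold: float = 0.6) -> str:
    """Average the two channel scores and threshold them."""
    combined = 0.5 * (mimics.anger_score + prosody.anger_score)
    return "anger" if combined >= threshold else "neutral"

def trigger_dynamic_help(dialog_core) -> None:
    dialog_core.activate("dynamic_help")   # hypothetical dialog-core hook

class DummyCore:
    def activate(self, what): print("activated:", what)

if fuse_user_state(StateHypothesis(0.7), StateHypothesis(0.8)) == "anger":
    trigger_dynamic_help(DummyCore())
```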
Media Fusion
Gesture Processing
• Objects on the screen are tagged with IDs
• Gesture input:
  – Natural gestures recognized by SIVIT
  – Touch-sensitive screen
• Gesture recognition:
  – Location
  – Type of gesture: pointing, tarrying, encircling
• Gesture analysis:
  – Reference objects on the display described as XML domain-model (sub-)objects (M3L schemata)
  – Bounding box
  – Output: gesture lattice with hypotheses (see the sketch below)
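An illustrative sketch: resolving a recognized gesture against the ID-tagged display objects via their bounding boxes and emitting ranked hypotheses, as a stand-in for the M3L gesture lattice. All names and the scoring heuristic are assumptions.

```python
# Hit-test a gesture against tagged display objects -> ranked hypotheses.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DisplayObject:
    ref_id: str
    bbox: Tuple[float, float, float, float]   # x_min, y_min, x_max, y_max

@dataclass
class GestureHypothesis:
    ref_id: str
    gesture_type: str     # "pointing", "tarrying", "encircling"
    score: float
    start: float          # time stamps, needed later for media fusion
    end: float

def resolve_gesture(x: float, y: float, gesture_type: str,
                    t0: float, t1: float,
                    objects: List[DisplayObject]) -> List[GestureHypothesis]:
    hyps = []
    for obj in objects:
        x0, y0, x1, y1 = obj.bbox
        if x0 <= x <= x1 and y0 <= y <= y1:
            # crude heuristic: smaller boxes are more specific targets
            score = 1.0 / ((x1 - x0) * (y1 - y0) + 1e-6)
            hyps.append(GestureHypothesis(obj.ref_id, gesture_type, score, t0, t1))
    return sorted(hyps, key=lambda h: h.score, reverse=True)

screen = [DisplayObject("movie_42", (100, 100, 300, 200)),
          DisplayObject("list_films", (50, 50, 400, 600))]
print(resolve_gesture(150, 150, "pointing", 3.20, 3.45, screen))
```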
Speech Processing
• The speech recognizer produces a word lattice
• Prosody inserts boundary and stress information
• Speech analysis creates intention hypotheses with markers for deictic expressions (see the sketch below)
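A sketch under assumptions: a toy "word lattice" as a list of time-stamped word hypotheses, enriched with prosodic boundaries and deictic-expression markers that media fusion can later bind to gestures. The deictic word list and field names are illustrative only.

```python
# Annotate word hypotheses with prosodic boundaries and deictic markers.
from dataclasses import dataclass
from typing import List

DEICTIC = {"dieser", "diese", "dieses", "hier", "da", "dort"}  # illustrative

@dataclass
class WordHyp:
    word: str
    start: float
    end: float
    boundary_after: bool = False   # prosodic phrase boundary
    stressed: bool = False
    deictic: bool = False          # placeholder to be filled by a gesture

def analyze(words: List[WordHyp], boundaries: List[float]) -> List[WordHyp]:
    for w in words:
        w.boundary_after = any(abs(b - w.end) < 0.05 for b in boundaries)
        w.deictic = w.word.lower() in DEICTIC
    return words

utterance = [WordHyp("zeige", 0.0, 0.4), WordHyp("mir", 0.4, 0.6),
             WordHyp("dieses", 0.6, 1.0), WordHyp("Kino", 1.0, 1.4)]
for w in analyze(utterance, boundaries=[1.4]):
    print(w)
```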
Media Fusion
• Integrates the gesture hypotheses into the intention hypotheses from speech analysis
• Information restriction is possible from both media
• Correspondence between gestures and placeholders (deictic expressions/anaphora) in the intention hypothesis is possible but not necessary
• Necessary: time coordination of gesture and speech information
• Time stamps in ALL M3L documents!!
• Output: sequence of intention hypotheses (see the sketch below)
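A hedged sketch of the fusion step: binding deictic placeholders from the speech intention hypothesis to gesture hypotheses by comparing their time stamps. This is a simplified stand-in for SmartKom's lattice-based fusion, not its API; the data types and the one-second window are assumptions.

```python
# Time-coordinated binding of deictic placeholders to gesture hypotheses.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Deictic:               # placeholder found by speech analysis
    word: str
    start: float
    end: float

@dataclass
class Gesture:               # entry from the gesture lattice
    ref_id: str
    score: float
    start: float
    end: float

def temporal_gap(a_start, a_end, b_start, b_end) -> float:
    """<= 0 means overlap; positive values are the gap in seconds."""
    return max(a_start, b_start) - min(a_end, b_end)

def fuse(deictics: List[Deictic], gestures: List[Gesture],
         max_gap: float = 1.0) -> List[Optional[str]]:
    """For each placeholder, pick the best gesture close enough in time."""
    result = []
    for d in deictics:
        candidates = [g for g in gestures
                      if temporal_gap(d.start, d.end, g.start, g.end) <= max_gap]
        best = max(candidates, key=lambda g: g.score, default=None)
        result.append(best.ref_id if best else None)
    return result

print(fuse([Deictic("dieses", 0.6, 1.0)],
           [Gesture("movie_42", 0.9, 0.7, 0.9),
            Gesture("list_films", 0.2, 2.5, 2.7)]))
```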
Media Design (Media Fission)
• Starts with action planning
• Definition of an abstract presentation goal
• Presentation planner:
  – Selects presentation style, media, and the agent's general behaviour
  – Activates the natural language generator, which activates speech synthesis, which returns audio data and a time-stamped phoneme/viseme sequence
• Character animation realizes the agent's behaviour
• Synchronized presentation of audio and visual information (see the sketch below)
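An illustrative pipeline sketch of the steps listed above, from abstract presentation goal to synchronized speech, graphics, and agent behaviour. Function names, the plan structure, and the dummy synthesis output are assumptions, not SmartKom's interfaces.

```python
# Toy media-design pipeline: plan -> generate/synthesize -> present.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PresentationPlan:
    display_objects: List[str]   # graphics to show, with reference IDs
    prompt: str                  # text to be spoken by Smartakus
    behaviour: str               # selected agent behaviour schema

def plan_presentation(goal: str) -> PresentationPlan:
    # stand-in for the presentation planner choosing style, media, behaviour
    return PresentationPlan(["list_films"], f"Hier sehen Sie: {goal}", "point_at_list")

def generate_and_synthesize(prompt: str) -> Tuple[bytes, List[Tuple[str, float]]]:
    # stand-in for NL generation + synthesis: returns audio data and a
    # time-stamped phoneme/viseme sequence
    return b"\x00" * 16000, [("z", 0.00), ("ii", 0.08), ("aa", 0.20)]

def present(goal: str) -> None:
    plan = plan_presentation(goal)
    audio, visemes = generate_and_synthesize(plan.prompt)
    # character animation would play `audio` while driving the agent's lips
    # and body from `visemes` and plan.behaviour, synchronized by time stamps
    print(plan, len(audio), visemes)

present("das Kino-Programm")
```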
Lip Synchronization with Visemes
• Goal: present a speech prompt as naturally as possible
• Visemes: elementary lip positions
• Correspondence of visemes and phonemes
• Examples (see the mapping sketch below)
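A hedged illustration of the correspondence: a toy phoneme-to-viseme mapping and the mouth-shape schedule derived from the time-stamped phoneme sequence returned by synthesis. The mapping is a generic example, not SmartKom's actual viseme set.

```python
# Map a time-stamped phoneme sequence to a viseme track for lip animation.
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_teeth",   "v": "lip_teeth",
    "o": "rounded",     "u": "rounded",
    "a": "open",        "e": "spread",     "i": "spread",
}

def viseme_track(phonemes):
    """phonemes: list of (phoneme, start_time) -> list of (viseme, start_time)."""
    return [(PHONEME_TO_VISEME.get(p, "neutral"), t) for p, t in phonemes]

# e.g. for a prompt starting "bitte ..."
print(viseme_track([("b", 0.00), ("i", 0.06), ("t", 0.14), ("e", 0.20)]))
```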
Behavioural Schemata
• Goal: Smartakus is always active to signal the state of the system
• Four main states (see the sketch below):
  – Waiting for the user's input
  – User's input
  – Processing
  – System presentation
• Current body movements:
  – 9 vital, 2 processing, 9 presentation (5 pointing, 2 movements, 2 face/mouth)
  – About 60 basic movements
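A minimal sketch of the idea, using my own simplified state and behaviour names: a four-state machine that keeps playing behaviour schemata so the agent is never idle.

```python
# Four-state machine driving which behaviour schemata Smartakus plays.
import random

STATES = {
    "wait_for_input": ["breathe", "blink", "shift_weight"],   # "vital" moves
    "user_input":     ["look_at_user", "nod"],
    "processing":     ["tap_foot", "look_at_display"],
    "presentation":   ["point_left", "point_right", "speak"],
}

TRANSITIONS = {
    ("wait_for_input", "input_started"):   "user_input",
    ("user_input", "input_finished"):      "processing",
    ("processing", "result_ready"):        "presentation",
    ("presentation", "presentation_done"): "wait_for_input",
}

def step(state: str, event: str) -> str:
    return TRANSITIONS.get((state, event), state)

state = "wait_for_input"
for event in ["input_started", "input_finished", "result_ready", "presentation_done"]:
    print(state, "->", random.choice(STATES[state]))
    state = step(state, event)
```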
Conclusion
• Three implemented systems (Public, Home, Mobile)
• Media coordination implemented
• "Backbone" uses declarative knowledge sources and is rather flexible
• A lot remains to be done:
  – Robustness
  – Complex speech expressions
  – Complex gestures (shape and timing)
  – Implementation of all user states
  – ...
• Reuse of modules in other contexts, e.g. in MIAMM