
Media Coordination in SmartKom

Norbert Reithinger

Dagstuhl Seminar “Coordination and Fusion in Multimodal Interaction”

 

Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Stuhlsatzenhausweg 3, Geb. 43.1, 66123 Saarbrücken

Tel.: (0681) 302-5346
Email: bert@dfki.de
www.smartkom.org
www.dfki.de/~bert


Overview

• Situated Delegation-oriented Dialog Paradigm
• More About the System Software
• Media Coordination Issues
• Media Processing: The Data Flow
• Processing the User's State
• Media Fusion
• Media Design
• Conclusion


The SmartKom Consortium

[Map of consortium partner sites: MediaInterface, European Media Lab, Univ. of Munich, Univ. of Stuttgart, Univ. of Erlangen, and further sites in Saarbrücken, Aachen, Dresden, Berkeley, Stuttgart, Munich, Heidelberg, and Ulm]

Main Contractor: DFKI Saarbrücken

Project Budget: € 25.5 million
Project Duration: 4 years (September 1999 – September 2003)


Situated Delegation-oriented Dialog Paradigm

[Diagram: the user specifies a goal and delegates the task to the personalized interaction agent Smartakus; the agent asks questions, cooperates with the user on problems, presents results, and accesses the IT services (Service 1, Service 2, Service 3) on the user's behalf.]


More About the System

• Modules realized as independent processes
• Not all modules must be running (critical path: speech or graphic input to speech or graphic output)
• (Mostly) independent of display size
• Pool Communication Architecture (PCA) based on PVM for Linux and NT
• Modules know about their I/O pools
• Literature:
  – Andreas Klüter, Alassane Ndiaye, Heinz Kirchmann: Verbmobil From a Software Engineering Point of View: System Design and Software Integration. In Wolfgang Wahlster (ed.): Verbmobil: Foundations of Speech-to-Speech Translation. Springer, 2000.
• Data exchanged using M3L documents (see the sketch after this list)
• All modules and pools are visualized here: C:\Documents and Settings\bert\Desktop\SmartKom-Systeminfo\index.html ...
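The following is a minimal sketch of this pool-based exchange, assuming a toy in-process publish/subscribe pool and invented element names; the real system runs modules as separate PVM processes and exchanges documents that follow the M3L schemata.

```python
# Minimal sketch of pool-based data exchange; the Pool class and the element
# names are invented for illustration (the real PCA runs on PVM and the real
# documents follow the M3L schemata).
import time
import xml.etree.ElementTree as ET

def make_m3l_like_hypothesis(words, begin, end):
    """Build an illustrative, M3L-flavoured XML document."""
    doc = ET.Element("intentionHypothesis")
    ET.SubElement(doc, "timeStamp", begin=str(begin), end=str(end))
    ET.SubElement(doc, "utterance").text = " ".join(words)
    return ET.tostring(doc, encoding="unicode")

class Pool:
    """Stand-in for a PCA data pool: a module only knows its own I/O pools."""
    def __init__(self, name):
        self.name = name
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, document):
        for callback in self.subscribers:
            callback(document)

# A "speech analysis" module writes to its output pool; "media fusion" reads it.
analysis_out = Pool("speech.analysis.out")
analysis_out.subscribe(lambda doc: print("media fusion received:", doc))
t0 = time.time()
analysis_out.publish(make_m3l_like_hypothesis(["show", "me", "this", "film"], t0, t0 + 1.2))
```

In the real architecture a module only declares which pools it reads and writes; the pool infrastructure, not the module, knows which other modules are connected.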


Media Coordination Issues

• Input:
  – Speech
    • Words
    • Prosody: boundaries, stress, emotion
  – Mimics: neutral, anger
  – Gesture:
    • Touch-free (Public scenario)
    • Touch-sensitive screen
• Output:
  – Display objects
  – Speech
  – Agent: posture, gesture, lip movement


Media Processing: The Data Flow

[Data flow diagram: speech, prosody (emotion), gesture, and mimics (neutral or anger) flow via Media Fusion and Interaction Modeling into the Dialog Core, which maintains the user state, domain information, and system state; Presentation (Media Design) produces speech, display objects with reference IDs and locations, and the agent's posture and behaviour.]


The Input/Output Modules


Processing the User‘s State

• User states: neutral and anger
• Recognized using mimics and prosody (sketched below)
• In case of anger, activate the dynamic help in the Dialog Core Engine
• Elmar Nöth will hopefully tell you more about this in his talk "Modeling the User State - The Role of Emotions"
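One possible reading of this step as code, assuming each recognizer delivers a label with a confidence score; the scoring, the threshold, and the activate_dynamic_help call are illustrative stand-ins, not the SmartKom interfaces.

```python
# Illustrative fusion of mimics and prosody results into a user state.
def classify_user_state(mimics, prosody, anger_threshold=0.5):
    """mimics/prosody: ("neutral"|"anger", confidence) pairs from the recognizers."""
    scores = {"neutral": 0.0, "anger": 0.0}
    for label, confidence in (mimics, prosody):
        scores[label] += confidence
    total = sum(scores.values()) or 1.0
    return "anger" if scores["anger"] / total >= anger_threshold else "neutral"

def on_user_state(state, dialog_core):
    # Hypothetical hook: in case of anger the Dialog Core activates its dynamic help.
    if state == "anger":
        dialog_core.activate_dynamic_help()

print(classify_user_state(("anger", 0.7), ("neutral", 0.4)))  # -> anger
```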


Media Fusion


Gesture Processing

• Objects on the screen are tagged with IDs
• Gesture input:
  – Natural gestures recognized by SIVIT
  – Touch-sensitive screen
• Gesture recognition:
  – Location
  – Type of gesture: pointing, tarrying, encircling
• Gesture analysis:
  – Reference objects in the display described as XML domain model (sub-)objects (M3L schemata)
  – Bounding box
  – Output: gesture lattice with hypotheses (see the sketch below)
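A minimal sketch of the analysis step for a pointing gesture, assuming display objects arrive with their ID and bounding box; the scoring is invented, and the ranked list merely stands in for the gesture lattice.

```python
# Illustrative hit test of a pointing gesture against tagged display objects.
from dataclasses import dataclass

@dataclass
class DisplayObject:
    ref_id: str   # ID the presentation attached to the object
    bbox: tuple   # (x_min, y_min, x_max, y_max) bounding box

def gesture_hypotheses(x, y, objects):
    """Return a ranked list of candidate referents: objects whose bounding box
    contains (x, y), scored by distance to the box centre (closer = more plausible)."""
    hits = []
    for obj in objects:
        x0, y0, x1, y1 = obj.bbox
        if x0 <= x <= x1 and y0 <= y <= y1:
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            score = 1.0 / (1.0 + ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5)
            hits.append((obj.ref_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)

screen = [DisplayObject("movie_17", (100, 100, 300, 200)),
          DisplayObject("movie_18", (120, 120, 280, 260))]
print(gesture_hypotheses(150, 150, screen))  # both objects match, ranked by plausibility
```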


Speech Processing

• Speech recognizer produces a word lattice
• Prosody inserts boundary and stress information
• Speech analysis creates intention hypotheses with markers for deictic expressions


Media Fusion

• Integrates gesture hypotheses into the intention hypotheses of speech analysis
• Information restriction possible from both media
• Possible, but not necessary, correspondence of gestures and placeholders (deictic expressions/anaphora) in the intention hypothesis
• Necessary: time coordination of gesture and speech information
• Time stamps in ALL M3L documents!!
• Output: sequence of intention hypotheses (fusion sketched below)
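A sketch of the time coordination, assuming every hypothesis carries begin/end time stamps as its M3L document does; the slot structure and field names are illustrative, not the actual M3L layout.

```python
# Illustrative fusion: bind deictic placeholders to temporally overlapping gestures.
def overlaps(a_begin, a_end, b_begin, b_end):
    return a_begin <= b_end and b_begin <= a_end

def fuse(intention_hypothesis, gesture_lattice):
    """intention_hypothesis: dict with time-stamped 'slots', possibly deictic.
    gesture_lattice: list of dicts with 'ref_id', 'score', 'begin', 'end'."""
    fused = dict(intention_hypothesis)
    for slot in fused["slots"]:
        if slot.get("deictic"):
            candidates = [g for g in gesture_lattice
                          if overlaps(slot["begin"], slot["end"], g["begin"], g["end"])]
            if candidates:  # correspondence is possible, but not required
                best = max(candidates, key=lambda g: g["score"])
                slot["referent"] = best["ref_id"]
    return fused

hypothesis = {"act": "requestInfo",
              "slots": [{"role": "object", "deictic": True, "begin": 2.1, "end": 2.4}]}
gestures = [{"ref_id": "movie_17", "score": 0.8, "begin": 2.0, "end": 2.3},
            {"ref_id": "movie_18", "score": 0.6, "begin": 5.0, "end": 5.2}]
print(fuse(hypothesis, gestures))  # the placeholder is bound to movie_17
```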


Media Design (Media Fission)


Media Design

• Starts with action planning
• Definition of an abstract presentation goal
• Presentation planner:
  – Selects presentation, style, media, and the agent's general behaviour
  – Activates the natural language generator, which in turn activates speech synthesis, which returns audio data and a time-stamped phoneme/viseme sequence
• Character animation realizes the agent's behaviour
• Synchronized presentation of audio and visual information (pipeline sketched below)
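A rough, illustrative ordering of that pipeline; every function below is a simplified stub standing in for the corresponding SmartKom component, not its actual interface.

```python
# Illustrative ordering of the media design pipeline (all stubs).
def presentation_planner(goal):
    # Selects presentation style, media, and the agent's general behaviour.
    return {"goal": goal, "media": ["speech", "graphics"], "behaviour": "present"}

def natural_language_generator(plan):
    return "Here is the information you asked for."

def speech_synthesis(text):
    audio = b"\x00" * 16                            # placeholder audio data
    phoneme_viseme_seq = [("h", "viseme_A", 0.00), ("i:", "viseme_B", 0.08)]
    return audio, phoneme_viseme_seq

def character_animation(plan, phoneme_viseme_seq):
    # Lip keyframes derived from the time-stamped viseme sequence.
    return [("lip", viseme, t) for _, viseme, t in phoneme_viseme_seq]

def media_design(presentation_goal):
    plan = presentation_planner(presentation_goal)
    text = natural_language_generator(plan)          # planner activates the generator
    audio, seq = speech_synthesis(text)              # generator activates synthesis
    animation = character_animation(plan, seq)
    return audio, animation                          # presented synchronously

print(media_design("inform:cinemaProgram"))
```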


Lip Synchronization with Visemes

• Goal: present a speech prompt as naturally as possible
• Visemes: elementary lip positions
• Correspondence of visemes and phonemes
• Examples:
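For instance, a toy phoneme-to-viseme conversion might look as follows; the mapping table is an invented fragment, not the SmartKom viseme inventory.

```python
# Illustrative phoneme-to-viseme conversion for lip synchronization.
# Real inventories group phonemes with similar lip positions under one viseme.
PHONEME_TO_VISEME = {
    "p": "closed_lips", "b": "closed_lips", "m": "closed_lips",
    "f": "lower_lip_on_teeth", "v": "lower_lip_on_teeth",
    "o:": "rounded", "u:": "rounded",
    "a:": "wide_open",
}

def viseme_track(phoneme_seq, default="rest"):
    """phoneme_seq: list of (phoneme, start_time_sec); returns lip keyframes
    that the character animation plays back in sync with the audio."""
    return [(PHONEME_TO_VISEME.get(ph, default), t) for ph, t in phoneme_seq]

print(viseme_track([("m", 0.00), ("a:", 0.09), ("m", 0.21), ("a:", 0.30)]))
```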


Behavioural Schemata

• Goal: Smartakus is always active to signal the state of the system (sketched below)
• Four main states:
  – Waiting for user input
  – User input
  – Processing
  – System presentation
• Current body movements:
  – 9 vital, 2 processing, 9 presentation (5 pointing, 2 movements, 2 face/mouth)
  – About 60 basic movements
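A sketch of the idea as a small per-state lookup; the four state names follow the slide, while the movement names and the random selection are invented for illustration.

```python
# Illustrative mapping from system state to an always-active movement.
import random

BEHAVIOURS = {
    "wait_for_input": ["blink", "shift_weight", "look_around"],   # vital movements
    "user_input":     ["nod", "lean_forward"],
    "processing":     ["tap_foot", "look_at_watch"],
    "presentation":   ["point_left", "point_right", "open_hands"],
}

def next_movement(system_state):
    """Smartakus never freezes: pick a basic movement for the current state."""
    return random.choice(BEHAVIOURS[system_state])

for state in ("wait_for_input", "processing", "presentation"):
    print(state, "->", next_movement(state))
```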


Conclusion

• Three implemented systems (Public, Home, Mobile)
• Media coordination implemented
• The "backbone" uses declarative knowledge sources and is rather flexible
• A lot remains to be done:
  – Robustness
  – Complex speech expressions
  – Complex gestures (shape and timing)
  – Implementation of all user states
  – ....
• Reuse of modules in other contexts, e.g. in MIAMM
