
EEG-based Speech Recognition: Impact of Experimental Design on Performance

Studienarbeit at the Institut für Algorithmen und Kognitive Systeme (Prof. Dr. T. Schultz)
Fakultät für Informatik, Universität Karlsruhe (TH)

by

cand. inform. Anne K. Porbadnigk

Advisor: Prof. Dr. T. Schultz

Date of registration: 15.02.2008
Date of submission: 15.05.2008


I hereby declare that I have written this thesis independently and used no sources or aids other than those specified.

Karlsruhe, 15.05.2008


Abstract

Electroencephalography (EEG) has proven to be a valuable method of communication for speech-impaired persons. Recently, attempts have been made to use EEG data for the recognition of unspoken speech. Unspoken speech means that a subject thinks a given word without using the facial muscles and without uttering any audible sound. In spite of promising first results [40], the question has been raised in [9] whether the system may have recognized temporal patterns in the data instead of actual words, since the words were presented in blocks.

The contribution of this thesis is an in-depth investigation of the impact of experimental design on the recognition rate of unspoken speech. For this study, 23 subjects were recorded in 71 sessions in total, using a high-density EEG cap with 128 electrodes, out of which 16 were recorded. The vocabulary domain consisted of 5 words, each of which was repeated 20 times. Between sessions, the order in which the words were presented was varied, as well as the length of the breaks. Besides the previously used orders blocks, sequential and random, short blocks consisting of blocks of 5 words were tested. It could be shown that except for the block mode, which yielded an average recognition rate of 45.50%, all other modes had recognition rates at chance level. This may be partially explained by the fact that block data contains less noise, and by the fact that this mode facilitates the task of thinking words in a consistent way. However, it could be shown that temporal artifacts indeed superimpose the signal of interest in block recordings.


Acknowledgements

I want to thank my advisor, Prof. Tanja Schultz, for her guidance and support throughout this thesis. Many thanks also to Kristina Schaaff for the valuable discussions on EEG and her constant help. I also want to express my gratitude to Michael Wand, who was of invaluable help in handling the technical aspects of the previously implemented system. Thanks a lot to Marek Wester, who also took the time to provide me with more information about the previously run experiments. Finally, I want to thank my numerous subjects for their time and interest, their challenging questions and new ideas. Without them, this work would not have been possible. Special thanks to my family and my boyfriend who, apart from having to stand my frustration at times, also helped me as subjects.


Contents

1 Introduction
  1.1 Motivation
  1.2 Ethical Considerations
  1.3 Purpose of this Work
  1.4 Structure of this Thesis

2 Background
  2.1 Anatomical Basics
    2.1.1 Electrical Activity of the Brain
    2.1.2 Speech Processing in the Brain
  2.2 EEG
    2.2.1 Introduction to EEG
    2.2.2 Assessment of EEG for the given Task
  2.3 Related Work
    2.3.1 Brain Computer Interfaces
    2.3.2 Work directly linked to this Thesis

3 Experimental Setup
  3.1 Recording Setup
  3.2 EEG Recording Hardware
    3.2.1 EEG Cap
    3.2.2 VarioPort™
  3.3 Processing and Classification Software
    3.3.1 UKA EMG/EEG Studio
    3.3.2 JANUS
  3.4 Corpus Production
    3.4.1 Vocabulary Domain
    3.4.2 Data Acquisition
    3.4.3 Variation of Presentation Mode
      3.4.3.1 Variation of Breaks
      3.4.3.2 Variation of Word Order
  3.5 Recording Procedure
    3.5.1 Subjects
    3.5.2 Preliminary Procedure
    3.5.3 Recording

4 Experiments and Results
  4.1 Variation of Break Length
    4.1.1 Initial Experiments with Break Length
    4.1.2 Break Length Revisited
  4.2 Variation of Word Order
    4.2.1 Blocks, Randomized and Sequential
    4.2.2 Blocks, Randomized and Short Blocks
    4.2.3 Reordered Blocks
    4.2.4 Overview: Impact of Word Order
  4.3 Cross Session Experiments
    4.3.1 Cross Mode Testing
    4.3.2 Cross Session Testing with Blocks
  4.4 Variation of the HMM
    4.4.1 Variation of Gaussians per State
    4.4.2 Variation of the Number of HMM States
    4.4.3 Possible Reasons for Differences to previous Findings
  4.5 Examination of the recorded Data
    4.5.1 Eliminating Useless Recordings
    4.5.2 Length of the Utterances
  4.6 Impact of Handedness on Recognition Rate
  4.7 Comments on the Vocabulary Domain

5 Analysis of the Results

6 Summary and Future Work
  6.1 Summary
  6.2 Future Work

A Documents for Recordings

B Data

Bibliography


List of Figures

2.1 Wernicke-Geschwind-Model
2.2 Schema of EEG measurement
3.1 Setting of the experiment
3.2 Layout of the high-density EEG cap
3.3 Subject with EEG cap on canvas chair
3.4 Phases of a recording step
3.5 Overview of the subjects
3.6 List of recordings
4.1 Variation of break length
4.2 Break length revisited
4.3 Variation of word order (I)
4.4 Variation of word order (II)
4.5 Impact of temporal closeness
4.6 Word order overall
4.7 Cross mode testing: gain by training on data recorded in blocks
4.8 Cross session testing with data recorded in blocks, subject 04
4.9 Cross session testing with data recorded in blocks, subject 05
4.10 Recordings chosen for experiments with HMM
4.11 Variation of GMM for word order blocks
4.12 Variation of GMM for word order sequential
4.13 Variation of HS for word order blocks
4.14 Influence of the data subset (choice based on Viterbi score) on recognition rate
4.15 Average number of frames per utterance
4.16 Influence of handedness on recognition rate
A.1 Instruction handed out to the subjects
A.2 Protocol for the recording sessions
B.1 Recognition rates for varying break types
B.2 Recognition rates for varying word orders (I)
B.3 Recognition rates for varying word orders (II)
B.4 Overview of recognition rates for all recordings
B.5 Summed confusion matrices for blocks, reorderedBlocks and randomized
B.6 Data of cross session experiments
B.7 Recognition results for different HMMs
B.8 Recognition rates for subsets of data (choice based on Viterbi score)
B.9 Average number of frames and standard deviation


List of Tables

3.1 Technical specifications of the VarioPort™
3.2 Total number of sessions in the database for different word orders and different break types
3.3 Total number of subjects and sessions in the database for different word orders
3.4 Total number of subjects and sessions in the database for different break types
3.5 Codes for subjects' state concerning drugs
4.1 Overview of the recognition rates (%) depending on the break type
4.2 Overview of the recognition rates (%) depending on the word order
4.3 Data chosen for cross mode experiments
4.4 Overview of the best HMM configurations when either the number of Gaussians per state (GMM) or the number of HMM states (HS) is varied
4.5 Data chosen for the Viterbi experiment
4.6 Definition of data subsets for the Viterbi experiment
4.7 Average amount of 0 Viterbi scores for the different word orders
4.8 Details for the number of frames for session 09-02 (blocks), depending on the word that was uttered


1. Introduction

Electroencephalography (EEG) has proven to be useful for a multitude of new methods of communication, besides the well-known clinical applications. In recent years, speech recognition has made a lot of progress and has facilitated our lives by providing a new, more natural means of communication with machines. Previous research has proved that it is feasible to link these two ideas, that is, to use EEG for the recognition of normal speech ([40]). This has been taken one step further by investigating whether it is feasible to recognize unspoken speech based on the recorded EEG signals ([9, 40]). Unspoken speech, as opposed to normal speech, is defined as follows:

Definition. Unspoken speech means that a subject thinks a given word without moving the articulatory muscles and without uttering any audible sound.

The contribution of this thesis is to investigate the impact that experimental design has on the recognition rate for unspoken speech. This information is necessary in order to decide whether the promising first results were overestimated.

1.1 Motivation

Investigating the EEG-based recognition of unspoken speech is a worthwhile topic because it enables us to enhance both speech recognition and our knowledge of the functionality of the brain.

Speech recognition is helpful in everyday life, but up to now it has usually depended on audible spoken utterances. However, we can think of two cases in which unspoken speech is preferable. First, there are situations where using spoken speech is undesirable or even infeasible, for instance in quiet settings or environments where uttering speech is impossible. If we want to recognize speech under these circumstances, using unspoken speech seems ideal.

Second, there are people who are not able to utter speech due to a physical disability. For instance, locked-in patients have extremely limited possibilities for communicating with their environment. So far, systems have been developed that enable these


people to give basic commands to a computer [4]. However, it would be more natural and faster to have their actual thoughts recognized.

Finally, investigating EEG-based recognition of unspoken speech will enhance our knowledge about how the brain works. Specifically, it may shed some light on the question of how a 'thought' is processed in the brain.

1.2 Ethical Considerations

I hereby declare explicitly that I will not misuse my research to invade another person's privacy against his or her will, that is, to read his or her thoughts without his or her cooperation. The purpose of this work and its use is solely to enhance our knowledge about how the brain works and to develop a new technique of communication, mainly for the benefit of speech-impaired people.

Some may argue that, apart from this, the results of this research could be misused by others with obscure aims, for example for building a lie detector. However, there are technical reasons which would most probably prevent this. First of all, a willing subject is needed to produce data which the system can recognize. Our current efforts focus on recognizing a clearly defined small set of five words, artificially separated by eye blinks. The subject needs to concentrate on the words and try to imagine speaking them. The normal everyday situation, by contrast, involves a multitude of simultaneous thoughts which are not segmented at all. This would also be the case for a subject who is not willing to cooperate.

All subjects who took part in this study gave their informed consent that their data may be used for investigating EEG-based speech recognition.

1.3 Purpose of this Work

Two theses have previously been written on the question of whether EEG-based recognition of unspoken speech is feasible. Whereas the author of the first of these [40] came to the conclusion that the recognition of unspoken speech was feasible, the author of the second thesis [9] hypothesized that temporal patterns were recognized instead of words. For more details on both studies, refer to section 2.3.2.

The two contradictory positions can be formulated as follows:

Hypothesis A. Unspoken speech can be recognized based on EEG signals.

Hypothesis B. The recognition results reported in [40] were overestimated due to temporal brain activity patterns that were recognized instead of words.

The aim of this study is to find out which of these hypotheses is correct. In [40], the words were mostly presented to the subjects in blocks. However, this presentation mode is vulnerable to temporal artifacts. Therefore, it was necessary to produce a significant amount of training data in modes other than the potentially misleading blocks presentation mode. Specifically, it was important to produce data that was recorded in the same recording session (that is, without removing the EEG cap) but with varying experimental setups, such that these sessions can


be compared. The study was run with 23 subjects, a number that is significantly higher than in previous studies (refer to section 2.3.2).

As described in section 2.3.2, previous studies only used the word orders blocks, sequential and random. Apart from studying these in more depth, experiments were conducted with new presentation modes which had not been used before (short blocks, reordered blocks). Furthermore, experiments were run with varying pause lengths, which had not been done before either.

1.4 Structure of this Thesis

The thesis is organized as follows.

• First, a general introduction to the anatomical basics and EEG measurements is given in section 2, as well as a description of related work.

• Section 3 contains a detailed description of how the EEG data is obtained. Both the setup of the EEG recordings and the protocol of the recording process are described in detail.

• Section 4 describes the experiments with the recorded data, some of which are based on the experimental setup, while others were conducted subsequently to the recordings.

• The results of these recordings are analyzed in section 5.

• Finally, a summary of the study is provided in section 6, which also encompasses ideas for future work.


2. Background

The following two sections about anatomical basics and EEG are mainly based on information gathered from [37] and [28]. A good and comprehensive introduction to EEG can be found in [15], and summaries in [9] and [40].

2.1 Anatomical Basics

The brain can be separated into three primary divisions: the brainstem, the cerebellum and the cerebrum. The outer part of the cerebrum, which is called the cerebral cortex, contains a high concentration of nerve cells, the so-called neurons (about 10^10). The cortex is believed to be the structure that generates all electrical potentials that can be measured at the scalp, with the exception of those with the lowest magnitude. While the cortex is composed mainly of cell bodies (the so-called gray matter), the structure underneath is composed mainly of axons and is called white matter. This white matter contains a high number of association fibers, which connect different regions of the brain.

2.1.1 Electrical Activity of the Brain

EEG activity is controlled by subcortical structures. Although nerve cells are the main source of the variation of the potential, glia cells play a role as well. Surface EEG can be described as the summation of the synchronous activity of a large number of neurons that have a similar spatial orientation.

2.1.2 Speech Processing in the Brain

Brain Areas involved in Speech Processing

Broca's area, Wernicke's area and the areas for motor control in the primary motor cortex are widely considered to be of importance for speech production. These can be seen in Figure 2.1.

Both Broca's and Wernicke's areas are usually located on the left side of the brain. Although the role of Broca's area in speech production is still uncertain, there are


Figure 2.1: Graphical representation of the Wernicke-Geschwind-Model, showing regions of the brain which are assumed to be of importance for the production of speech (Broca's area, Wernicke's area and the primary motor cortex) (from [13]).

several strong indications that it is involved in the articulation of words [30]. Wernicke's area is assumed to be necessary for language comprehension. The primary motor cortex is responsible for planning and executing movements. The lower part of the motor strip (called the orofacial motor cortex) is of specific interest, since it is responsible for the mouth, tongue and lips.

How the production of speech is actually processed in the brain is still a question of ongoing research. However, these three regions are widely considered to be involved in speech production. This is described, for instance, by the Wernicke-Geschwind-Model ([31]). This model traces the path of the neural signal when a person listens to a word and then repeats it (see Figure 2.1). After the word is processed in the primary auditory area, the semantics are extracted in Wernicke's area. Afterwards, the signal reaches Broca's area, where a plan for the motor cortex is formed, which is then executed by the motor cortex.

Though we know today that this model is oversimplified, it is still the basis for more sophisticated models of how language might be processed. As shown in [40], these three regions are of major importance for recognizing unspoken speech.

As pointed out in [9], the precise locations and sizes of these brain areas depend on the person and may vary slightly between people. However, the differences are assumed to be relatively small.

Lateralization of Speech Processing

In general, the left hemisphere is assumed to be dominant for language processing([30]). However, this so-called lateralization depends on factors such as gender ([12])


and handedness [19].

A left-lateralized activation could be shown in males, while females exhibited more bilateral activity during semantic processing of speech. This effect was shown by [14] for the semantic processing of words which were read by the subject, and by [17] using dichotic listening tasks.

In [19], a correlation between handedness and hemispheric language dominance was shown. The authors found that the relation between handedness and right-hemispheric language dominance can be described by the following formula:

right-hemispheric language dominance (%) = 15% − handedness (%) / 10

where the variable handedness is calculated according to the Edinburgh Inventory ([29]). The potential impact of handedness on our recognition results is discussed in section 4.6.
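For illustration, the relation from [19] can be turned into a small calculation (a minimal sketch in Python; the function name and input convention are my own, only the formula itself is from [19]):

    def right_hemispheric_dominance(handedness: float) -> float:
        """Estimated share (%) of right-hemispheric language dominance.

        `handedness` is the laterality quotient from the Edinburgh
        Inventory [29], ranging from -100 (strongly left-handed)
        to +100 (strongly right-handed).
        """
        # Relation reported in [19]: 15% minus a tenth of the handedness score
        return 15.0 - handedness / 10.0

    # Worked examples: +100 (strongly right-handed) gives 5%,
    # -100 (strongly left-handed) gives 25%.
    assert right_hemispheric_dominance(100.0) == 5.0
    assert right_hemispheric_dominance(-100.0) == 25.0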

2.2 EEG

2.2.1 Introduction to EEG

Electroencephalography (EEG) measures the electrical activity of the brain with electrodes. In the following, we will only discuss scalp EEG, meaning that the electrodes are applied extracranially such that the signals are recorded at the scalp (as opposed to measuring inside the brain).

To be more precise, EEG measures the summation of the synchronous activity of cortical neurons; specifically, their post-synaptic potentials are recorded. Scalp EEG is believed to be generated by large dipole layers of these synchronized cortical neurons extending over large areas of the cortical surface. The measured scalp potentials are characterized by spatial and temporal features which depend not only on the nature and location of the sources, but also on the electrical and geometrical properties of brain and skull. The signals recorded from a healthy brain typically range between 0 Hz and 80 Hz in frequency, with an amplitude between 0 µV and 80 µV ([33]).
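In practice, such a frequency band is often enforced by filtering. The thesis does not describe an explicit software filtering step, so the following is only an illustrative sketch (using the open-source SciPy library and the 300 Hz sampling rate from section 3.2.2):

    import numpy as np
    from scipy.signal import butter, filtfilt

    FS = 300  # sampling rate in Hz, as used for the recordings (section 3.2.2)

    def bandpass_eeg(x, low=0.5, high=80.0, fs=FS):
        """Restrict a single EEG channel to the typical band described
        above (a small non-zero lower cutoff removes the DC offset)."""
        b, a = butter(4, [low, high], btype="bandpass", fs=fs)
        return filtfilt(b, a, x)  # zero-phase filtering, no time shift

    x = np.random.randn(10 * FS)  # 10 seconds of dummy data
    x_band = bandpass_eeg(x)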

As mentioned before, we concentrate on scalp electrodes as opposed to intracranial electrodes. The positions of the EEG electrodes should be chosen such that all cortex regions whose EEG signal might be of interest are covered. There are various methods of placing the EEG scalp electrodes, the most common of which is the 10-20 system ([18]) for a standard EEG cap of 16 electrodes. However, this does not apply to our cap, since we are using a high-density cap with 128 electrodes. A conductive gel is applied between the electrode and the scalp surface in order to reduce both the contact impedance and noise.

EEG is measured differentially, i.e. the difference between the potentials of two electrodes is measured (a schema of EEG measurements is shown in Figure 2.2). There are two ways in which the EEG electrodes can be connected to the amplifier: the bipolar electrode montage and the common-reference electrode montage. In case of the bipolar


Figure 2.2: Schema of an EEG measurement (taken from [28]): (a) human brain, (b) section of the cerebral cortex with two nerve cells generating extracellular potential, (c) signal measured between two scalp electrodes (potential difference and corresponding power spectrum).


electrode montage, the potential difference between neighboring electrodes is measured. In contrast, a common-reference electrode montage measures the potential difference between each electrode and a reference point.

The latter montage was used for this study. Thus, the signal measured by an electrode is the potential difference between its area of the head and the reference electrodes. There is a variety of possible locations for these reference electrodes. We adopted the reference points from [9, 40], where both ear lobes were used.
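The two montages can be summarized in a few lines of code (an illustrative sketch, not taken from the recording software; the channel indices are hypothetical):

    import numpy as np

    def common_reference(raw, ref_indices=(16, 17)):
        """Common-reference montage: subtract the average of the reference
        electrodes (here: both ear lobes) from every channel.
        `raw` has shape (n_channels, n_samples)."""
        ref = raw[list(ref_indices)].mean(axis=0)
        return raw - ref

    def bipolar(raw, pairs):
        """Bipolar montage: potential difference between neighboring
        electrode pairs, e.g. pairs=[(0, 1), (1, 2), ...]."""
        return np.stack([raw[a] - raw[b] for a, b in pairs])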

The first EEG recordings were conducted in the 1920s. Today, EEG analysis is typically applied to clinical tasks, where it is successfully used as a diagnostic tool for illnesses affecting the brain. For instance, clinical EEG is used for the detection of brain tumors, epileptic conditions, and mental retardation, among others. EEG signals are known to vary with age, gender, handedness, alertness, fatigue, habituation, level of autonomic arousal, and the use of alcohol, caffeine, nicotine, or other drugs [36].

2.2.2 Assessment of EEG for the given Task

The main advantage of EEG, and the reason for choosing it for our study, is its high temporal resolution. This is absolutely necessary for detecting patterns in brain activity during a fast-paced task such as speech production. With EEG, a resolution on the order of milliseconds can be achieved, as opposed to other brain imaging techniques such as fMRI, which yields about 10 frames per second. Only magnetoencephalography (MEG) has a comparable temporal resolution. However, MEG devices are expensive and bulky, whereas EEG devices can be applied easily and are relatively cheap compared to other brain monitoring techniques. Moreover, EEG measures the brain activity directly.

However, there are several limitations to EEG which have to be taken into account when conducting experiments.

The fundamental question of EEG is how to relate the potentials measured on the surface of the head to the underlying physiological processes in the brain. One of the most important limitations is the low spatial resolution of EEG recordings. An exact localization of the source is not possible due to two factors:

• the distance of the electrodes to the source (neurons in the cortex or other structures even deeper in the brain)

• the way electric potentials are propagated in the brain

EEG is most sensitive to potentials which are generated in the outermost layers of the cortex and radial to the skull.

Unfortunately, there are many potential sources of electrical activity on the scalp which can have a higher amplitude than the EEG signal we want to record. These additional sources cause artifacts which corrupt the signal of interest. These artifacts can be divided into two categories: biological and technical artifacts. Biological artifacts are caused by the subject and are often the result of other dipoles in the body which are stronger than the EEG dipole. Examples of biological artifacts are


• facial muscle contractions

• eye or tongue movements

• EKG, i.e. electrical activity of the heart

• electrical activity of the skin, hair, scalp, etc.

Furthermore, the psychological situation of the subject and his/her mental state have a strong influence on his/her EEG data [15].

In contrast, technical artifacts are caused by the EEG recording devices or the environment. These include, for instance:

• electrodes (shift of the electrodes, noise voltage)

• amplifier (noise, stability)

• movement of cables

• noise from the supply voltage (electromagnetic induction)

• electrostatic artifacts

• environmental conditions (heat, humidity, ...)

These sources of artifacts need to be taken into account and eliminated as far as possible when recordings are conducted.

2.3 Related Work

2.3.1 Brain Computer Interfaces

Investigating EEG-based brain computer interfaces (BCIs) has evolved into an increasingly active strand of research. Good overviews can be found in [11] and [42], while [21] provides a review of classification algorithms. The aim of BCIs is to translate the thoughts or intentions of a given subject into a control signal for operating devices such as computers, wheelchairs or prostheses. For instance, locked-in patients have extremely limited possibilities for communicating with their environment, which could be facilitated substantially by BCIs. Although one of the main motivations for BCI research is to create a communication channel for severely disabled patients, BCIs could also provide healthy subjects with an additional human-computer interface. Using a BCI usually requires the user to explicitly manipulate his/her brain activity, which is then used as a control signal for the device [26]. For example, the case study presented in [25] describes how a completely paralyzed patient learned to use an EEG-based BCI for verbal communication by producing two distinct EEG patterns. This required a learning process over several months. By contrast, we focus on the direct recognition of mentally uttered speech, with the aim of developing a more intuitive interface which provides the user with a more natural communication channel.


In the following, we will first describe systems whose basis for communication is not linked to thinking certain words, and then turn to systems which attempt to recognize words directly.

A variety of EEG-based brain computer interfaces which do not attempt to recognize words directly has been developed. One approach is to recognize different user states with EEG. For instance, in [16], a system is described that uses EEG data to discriminate between six different mental states of the user. For this task, four electrodes proved to be sufficient. A similar idea is applied in [10], where the data from four EEG electrodes is used to classify three different mental commands. This classification system aims at enhancing a powered wheelchair for patients with spinal cord injury.

A different approach is directed towards people who are not able to utter speech due to a physical disability. So far, brain computer interfaces (BCIs) have been developed that enable these people to give basic commands to a computer, for instance the Berlin Brain Computer Interface (BBCI) [7] or the Thought Translation Device (TTD) [4]. The idea of the latter is to train locked-in patients to control the slow cortical potentials of their EEG such that letters, words or pictograms can be selected in a program [4]. The BBCI, on the other hand, is characterized, among other features, by the fact that no training is needed [7].

Several systems have been developed which couple a spelling application to a BCI for mental text entry ([5], [32], [43]). The speed at which this can be done depends on the system used and the training of the user. A maximum speed of 7.6 char/min was reached during the presentation of the system 'Hex-o-Spell' at CeBIT 2006 [6]. None of these systems attempts to recognize words directly, though.

Apart from the systems described so far, there has also been research on systems which attempt to recognize words directly.

First, systems for speech recognition based on electromyographic (EMG) data have been developed. Specifically, it has been shown that silent speech can be recognized using EMG data ([22, 23]). Silent speech is defined as speech where only the facial muscles are moved but no actual sound is uttered. This is definitely an extremely valuable new tool for communication, solving some of the problems that were mentioned in section 1.1 and which motivate this thesis. However, this technique cannot be used by locked-in patients or by people with diseases that prevent articulatory muscle movement.

Second, it has been shown in [35] that isolated words can be recognized based on EEG and MEG recordings, proving that substantial information about the word being processed is encoded in brain waves. In one of their experimental conditions, called internal speech, the subjects were shown a single word on a screen and asked to utter this word 'silently' without using any articulatory muscles. In the work described in this thesis, we used a similar task. However, our approach differs: [35] used a set of 12 words repeated 100 times each, while we limited the amount of training and test data to five words, each repeated 20 times. Furthermore, we did not use MEG recordings. It should also be taken into account that a subject-independent model was used in [35], which was built by averaging over half of the data. In a later study [34], it was shown that averaging over subjects improves the recognition rates of sentences.


As a short sidenote on the processing of EEG data: wavelet transforms were used in [24] for preprocessing in the analysis of EEG data, and [27] describes the use of Hidden Markov Models (HMMs) for modelling EEG data, showing that HMMs are well suited for the automatic detection of sleep apnea. However, those HMMs were applied to long biological signals, whereas we deal with very short signals of a few seconds.

2.3.2 Work directly linked to this Thesis

This thesis builds on the work of Marek Wester, Jan Calliess, and Michael Wand, all of whom worked on different aspects of EEG-based speech recognition at interACT¹ [9, 39, 40]. Therefore, this thesis frequently refers to their work.

Marek Wester investigated several modalities for EEG-based speech recognition: normal speech, whispered speech, silent speech, mumbled speech and unspoken speech. The focus of his thesis was on the latter modality. Unspoken speech means that the subject thinks a given word without moving the articulatory muscles and without uttering any audible sound.

Working with a standard EEG cap, he showed that EEG-based speech recognition is possible, even at a rate four to five times higher than chance, with a slightly worse outcome for unspoken speech. He could also locate the regions important for unspoken speech, which seemed to be the homunculus, Broca's area, and Wernicke's area. Although his work was ground-breaking as one of the first studies on the recognition of unspoken speech based on EEG, there were some drawbacks making further investigations necessary. First, the number of subjects was relatively small: only 6 subjects were recorded, one of whom had a very high number of recording sessions with a total recording time of roughly 770 minutes, compared to the standard 30 minutes. Although different vocabulary domains were used, it can be assumed that there was a tremendous training effect for this subject. Second, the words were presented in blocks. However, Jan Calliess showed that this presentation mode produces far better results than any other presentation mode he experimented with, raising the question of whether the blocks mode was affected by temporal artifacts [9].

In his work, Jan Calliess restricted the number of modalities to spoken and unspoken speech. His main focus was on comparing the results that Marek Wester had produced with a standard EEG cap with new recordings produced with a high-density EEG cap. Though this high-density cap provided 128 electrodes, only 16 of them could be recorded due to constraints imposed by the amplifier; these were chosen above the orofacial motor cortex. However, no significant difference between the recognition results produced with these two cap layouts could be found.

Besides these findings, he experimented with different presentation modes, as opposed to Marek Wester, who had almost exclusively worked with block recordings. Jan Calliess introduced what he called mixed blocks (sequential) and mixed blocks (randomized). In the first case, the words were ordered alphabetically and repeated 30 times in that order, forming a list of the following form: (alpha, bravo, charlie, delta, echo, alpha, bravo, ...).

¹InterACT is the International Center for Advanced Communication Technologies, a joint center between the University of Karlsruhe, Germany, and Carnegie Mellon University, Pittsburgh, PA, USA. Information can be found at http://interact.ira.uka.de.


In the latter case, the words were completely randomized, with the constraint that each word had to be uttered 30 times. Whereas for the homogeneous blocks mode the recognition results were comparable to those of Marek Wester, the results were inconclusive in the case of mixed blocks (sequential) and mixed blocks (randomized). This was partly due to the small number of recordings which were produced with those presentation modes. Furthermore, Jan Calliess assumed that a wrong temporal concept was learned by the recognizer when it was trained with homogeneous blocks.

Michael Wand investigated the potential of the wavelet transform for EEG signal preprocessing for the recognition of words [39]. In his work, he could show that the Dual-Tree Complex Wavelet Transform (DTCWT) considerably outperforms pure spectral features, whereas the Fast Fourier Transform did not seem to be suitable. Therefore, the DTCWT was used for preprocessing the data recorded for this thesis.


3. Experimental Setup

The intention of this work was to further investigate EEG-based speech recognition and to continue the work initiated by Marek Wester, Jan Calliess, and Michael Wand.

3.1 Recording Setup

All of the experiments were recorded within one month in the same office at the University of Karlsruhe, at different times of the day. Besides the test subject, a supervisor was present during all of the experiments; except for subject 15, the recordings were all supervised by me. The subject was seated at a desk in front of an empty wall, facing a CRT display which showed the instructions and was connected to a laptop. The supervisor was seated in front of this laptop, on the right-hand side of the subject, at a desk out of sight of the subject. The laptop was used both for the online control of the experiments and for the actual data recording. The setting for the experiment can be seen in Figure 3.1.

The subject was told that he/she could quit the experiment at any time without consequences. Furthermore, he/she could take as many breaks as he/she wanted. Neither option was usually used. Also, the subject was told to report immediately whenever something went wrong during the recording of a word, such that this specific recording could be deleted and the recording procedure for this word repeated.

The first trials with an office chair and a canvas chair with a high backrest showed that the noise in the signals could be reduced significantly if the subject could rest his/her head. This is presumably due to the fact that the subject otherwise has to use the neck muscles to keep the head in position, which results in artifacts. Therefore, the canvas chair was used for the experiments starting with the 5th subject.

Initially, it was planned to conduct each of the recordings at approximately the same time of the day. This was dismissed since it proved to be impractical due to the limited availability of the subjects and the recording hardware. Instead, every subject was asked about his/her condition at the time of the recording, i.e.


Figure 3.1: Data Recording Setup.

whether he/she was awake, tired or sleepy. It turned out that the condition did not necessarily depend on the time of the day, since all subjects except for one were students and followed different schedules. This was particularly true since the recordings were made during the examination period, such that subjects were sometimes more exhausted in the early afternoon (after an exam) than in the evening (after they had rested).

However, there were acoustic disturbances which were related to the time of the day. Since the office was right next to a heavy fire door that was used rather frequently, many recorded words had to be repeated when the door slammed shut during the recording. This happened much more often during the day than in the evening or at night. The same was true for people chatting or walking in the corridor. Depending on the time at which the experiment was conducted, the light had to be switched on. However, the subject was seated between the two fluorescent tubes on the ceiling, such that the artifacts caused by them were minimized.

3.2 EEG Recording Hardware

The hardware used for the experiments was essentially the same as for the work described in [40] and [9], respectively, the difference being that for these recordings, a high-density electrode cap was used exclusively.

3.2.1 EEG Cap

For the experiments, an elastic EEG cap was used into which Ag/AgCl electrodes are sewn. For the initial experiments conducted by Marek Wester, an EEG cap with 20 electrodes was used (fabricated by Electro-Cap International, Inc.) [40].


Figure 3.2: Layout of the high-density EEG cap used. The subset of electrode positions actually recorded is marked by squares. Positions 8 and 13 correspond to C3 and C4 (according to the 10-20 system), respectively (taken from [9]).

The subsequent study [9] used both the above-mentioned cap and a so-called high-density cap by the same manufacturer. For the current study, the latter cap was used exclusively. The cap is equipped with 128 Ag/AgCl electrodes, out of which only 16 could be recorded simultaneously, since we only had an amplifier with 16 channels. The selection of the electrodes was based on the experience gained in [9], following the layout shown in Figure 3.2.

As described in [9], the main focus of recording was the area around the orofacial motor cortex, with two exceptions. One electrode was placed above the left eyebrow for blink detection (electrode 1 in Figure 3.2). The other electrode was located as far away from the motor strip as possible while still picking up a large part of the signals from Broca's area of the left cortical hemisphere (electrode 2 in Figure 3.2). The left hemisphere was chosen as it is reported to be dominant in general for speech production. However, this so-called lateralization depends on factors such as gender and handedness; thus, this electrode placement was not ideal for all subjects. More information about the relation between handedness and lateralization can be found in section 2.1.2, whereas the impact of handedness in our study is discussed in section 4.6.

Using the high-density electrode cap, we faced the same challenges as described in [9]: it was mainly the extensive wiring inside the cap that made it difficult to establish a secure contact between the scalp and the electrodes. Also, there were still no sponge discs available in the right size for fixing the cap at the forehead of the subject. Furthermore, the weight of the cables emanating from the electrode cap pulled the head of the subject backwards. This caused muscle fatigue in the neck after some time, as well as a dislocation of the cap.


Figure 3.3: Subject with EEG cap on a canvas chair. The amplifier and the VarioPort™ can be seen on the left-hand side of the chair.

A/D Conversion        12 Bit
Input Range           ±450 µV
Resolution            0.22 µV/Bit
Frequency Range       0.9-60 Hz
Amplification Factor  2775
Input Channels        16

Table 3.1: Technical specifications of the VarioPort™.
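As a plausibility check (my own arithmetic, not stated in the specifications): the resolution follows directly from the input range and the A/D conversion, since a total range of 900 µV divided into 2^12 = 4096 quantization steps gives 900 µV / 4096 ≈ 0.22 µV per bit, matching the table entry.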

Therefore, a system of rods was initially used to support the cables. The problem was finally solved by using a canvas chair on whose high backrest the cables could be rested (see Figure 3.3).

3.2.2 VarioPort™

For the recordings, the VarioPort™ was used as amplifier/ADC and recording device. The specifications of the amplifier can be found in Table 3.1 (for more details, refer to [3]). All recordings were conducted at a sampling rate of 300 Hz. The VarioPort™ was connected to a laptop via an interface which in turn was linked to one of the laptop's USB ports using an RS-232 to USB adaptor. In order to minimize interference, the interface and the device were linked with a fiber-optic cable. All recordings were done under Windows XP.


3.3 Processing and Classification Software

3.3.1 UKA EMG/EEG Studio

For the recordings, the UKA EMG/EEG Studio software was used. The purpose of this software is to record the amplified and digitized data, to show instructions to the subject, and to provide control options for the supervisor. This program had been used before for the work described in [9] and [40] and was adapted for the current recordings.

The control options were augmented such that the supervisor could select not only the word list which should be presented to the subject but also a so-called pause list. This was needed for the variation of the variable 'Length of Breaks'. The first idea was to show the subject the length of the succeeding break after each recorded word. However, initial trials showed that this was counterproductive, as the subjects' concentration was disturbed by the announcement of the break. Therefore, in the subsequent recordings, the length of the break was only visible to the supervisor.

This concentration issue also gave rise to further changes to the recording software which were not related to the recording mode. First, the previously used software showed the words 'Inhale and Exhale' between displaying the word to be thought and the start of the recording of that word. This also disturbed the concentration of the subjects. Therefore, instead of displaying those words, a blue screen was shown to the subject as an indicator that the recording was about to start in two seconds.

Second, the older version of the recording software showed a small window to the subject in which the words to be thought were shown in the upper part, while a big control button and a button for repeating the last word were shown in the lower part. The supervisor had to click on these buttons in order to control the recording, and this cursor movement was visible to the subject. Furthermore, the control button changed color during the different phases of the recording. These factors distracted the subjects significantly.

Due to the feedback of the first subjects, the buttons were made visible only in the control window of the supervisor. At first, this resulted in confused subjects, since they lacked feedback on whether the recording had already started or when it had stopped; when using the previous software, they had used the color change of the button for orientation. Therefore, the different phases of the recording were color-coded (see Figure 3.4): a blue screen indicated the concentration phase, a white screen was shown only during the recording, and a succeeding black screen signaled a short break between words.

For further recordings, it should be taken into consideration that concentrating would be easier for the subjects if a cross or a dot were shown instead of a white screen during the recording of a word. This was suggested independently by several subjects at the end of the recordings.

3.3.2 JANUS

The classification of the data was done with the Janus Recognition Toolkit (JRTk), a speech recognition system developed at the University of Karlsruhe and Carnegie


Mellon University [20, 38]. For the current work, a recognition system was used which had been developed by Marek Wester and Michael Wand in the course of their work. This system was derived from a state-of-the-art speech recognizer and adapted to the needs of EEG-based silent speech recognition. Word segmentation was based on the eye blinks with which the subjects marked the beginning and the end of the thinking process for each word. These eye blinks resulted in high spikes in the EEG signal that could be distinguished from the signal during thinking or silence, both of which had rather low amplitudes. For segmentation, the automatic eye blink detector was used which had been developed by Marek Wester and refined by Jan Calliess (see [40] and [9] for more details on eye blink detection).
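The following sketch illustrates the principle of such a threshold-based blink detection (my own minimal version for illustration; the actual detector of [40, 9] is more refined, and the threshold and refractory period below are hypothetical values):

    import numpy as np

    def detect_blinks(eog, fs=300, threshold=60.0, refractory_s=0.3):
        """Return sample indices of detected eye blinks on the forehead
        channel (electrode 1 in Figure 3.2), given in µV. Samples above
        the amplitude threshold that lie close together are merged into
        a single blink event."""
        above = np.flatnonzero(np.abs(eog) > threshold)
        blinks, last = [], -np.inf
        for idx in above:
            if idx - last > refractory_s * fs:  # gap large enough: new blink
                blinks.append(int(idx))
            last = idx
        return blinks

    # Consecutive blink pairs then delimit one word segment each:
    # segments = list(zip(blinks[0::2], blinks[1::2]))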

In Janus, every word was modeled by a left-to-right Hidden Markov Model (HMM). Both the number of states and the number of Gaussians per state were varied during the experiments, for which only a characteristic subset of the data was chosen (see section 4.4). The standard HMM that was used for training on all of the data had five states and one Gaussian per state. This work was restricted to offline recognition. For training, the Expectation Maximization algorithm was used with four iterations. After the Viterbi path had been computed for each word, the one with the best score was chosen. For training and testing, a round-robin scheme was used, with as many rounds as there were recordings of each single word (i.e. 20 in this work). In each round i (i ∈ {1, ..., 20}), one sample of each word was left out of the training procedure and used for testing. Thus, the training set for each round consisted of 95 samples and the test set of 5 samples (one per word in the vocabulary). Each round resulted in a percentage ci

of how many of these 5 test samples were recognized correctly (ci = 100% if all 5 samples were recognized correctly, ci = 0% if none of them was). After all rounds were finished, the final recognition rate was computed by summing up the percentages from all rounds and dividing the result by 20. Thus, the recognition rate is the average likelihood that the system recognizes a word correctly from the given EEG data. This can be written as follows:

recognition rate (%) = (c1 + c2 + ... + c20) / 20,   with i ∈ {1, ..., 20} and ci ∈ {0%, 20%, 40%, 60%, 80%, 100%}
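The round-robin scheme can be sketched as follows (illustrative Python; `train` and `classify` are hypothetical stand-ins for the Janus recognizer, not the actual toolkit interface):

    import numpy as np

    def round_robin_rate(samples, train, classify, n_rounds=20):
        """`samples` maps each of the 5 words to its 20 recorded instances.
        In round i, the i-th instance of every word is held out for
        testing (5 test samples, 95 training samples per round)."""
        rates = []
        for i in range(n_rounds):
            train_set = {w: xs[:i] + xs[i + 1:] for w, xs in samples.items()}
            test_set = {w: xs[i] for w, xs in samples.items()}
            model = train(train_set)
            correct = sum(classify(model, x) == w for w, x in test_set.items())
            rates.append(100.0 * correct / len(test_set))  # c_i in {0, 20, ..., 100} %
        return float(np.mean(rates))  # final recognition rate (%)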

For the cross session experiments, a different training/testing scheme was used (refer to section 4.3).

The preprocessing was based on the findings of Michael Wand [39], who showed that the use of the Dual-Tree Complex Wavelet Transform (DTCWT) leads to a significant improvement of the word recognition rate for EEG signals. For the experiments, all 16 input channels were used, and a Linear Discriminant Analysis (LDA) was applied to the feature vectors. According to [39], a decomposition level of 3 led to the best recognition results, so this level was also chosen here. This yielded an initial feature vector with 96 dimensions (= 3 · 2 · 16).
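As a rough sketch of this feature extraction step (using the open-source `dtcwt` Python package rather than the Janus implementation of [39]; how exactly the two values per level and channel were defined in [39] is not stated here, so mean magnitude and log-energy serve as stand-ins):

    import numpy as np
    import dtcwt  # open-source package, not the implementation used in [39]

    def dtcwt_features(eeg, nlevels=3):
        """eeg has shape (n_channels, n_samples); n_samples should be a
        multiple of 2**nlevels. Returns n_channels * nlevels * 2 values
        (16 * 3 * 2 = 96), to which an LDA would then be applied."""
        transform = dtcwt.Transform1d()
        feats = []
        for channel in eeg:
            pyramid = transform.forward(channel, nlevels=nlevels)
            for coeffs in pyramid.highpasses:  # complex coefficients per level
                mag = np.abs(coeffs)
                feats.append(mag.mean())                      # value 1 per level
                feats.append(np.log(np.sum(mag**2) + 1e-12))  # value 2 per level
        return np.asarray(feats)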

3.4 Corpus Production

3.4.1 Vocabulary Domain

As in [9], the so-called alpha vocabulary domain was used, consisting of the first five words of the international radiotelephony spelling alphabet (NATO alphabet), i.e.


{alpha, bravo, charlie, delta, echo}. These were also used in [40], where a variety of other domains was tried as well. Since this was the only vocabulary domain used in [9] and we wanted the recordings of the current work to be comparable to the recordings made for [9], we chose the same vocabulary domain. As pointed out in [9], this vocabulary domain has the following advantages. The words

• are easily distinguished when spoken,

• do not have a familiar semantic meaning,

• do not fall into different semantic categories.

However, this was not always the case (see section 4.7).

Besides the actual vocabulary, 'silence' was recorded 5 times per session as a baseline. 'Silence' means that the subject was shown the characters '[...]' and was asked to try to think of nothing while this was displayed, or at least not to think of any of the five words of the alpha domain. In the experiments described in [40], the word 'silence' itself was shown on the screen. During the recordings done for this thesis, this proved to be unsuitable: the subjects reported that they tended to think the word 'silence', which was not intended. For this reason, '[...]' was chosen in order not to evoke associations with a certain word.

3.4.2 Data Acquisition

During each of the recordings, a test protocol was filled out by the supervisor. In this protocol, both data about the subject and information about the recording session were listed: unusual signals were noted, as well as changes in the circumstances such as disturbances, changes of light, etc.

As in [40] and [9], each recording followed a certain protocol. It is important to take into consideration that the protocol of the recordings of this work varies in some respects from those of previous works. The basic building block of each recording is a so-called word list, which is defined as follows.

Definition. A word list W(B) of length n is defined as an ordered set of pairs

((wi, bi))i∈1,..,n

where wi ∈ {alpha, bravo, charlie, delta, echo, [...]} and bi ∈ {2 sec, 10 sec} is the length of the succeeding break. The type of the break is determined by B ∈ {short, long, shortORlong} for all breaks bi of the word list:

W(B) =

((wi, 2sec))i∈1,..,n if B = short((wi, 10sec))i∈1,..,n if B = long((wi, {2sec ∨ 10sec}))i∈1,..,n if B= shortORlong

This definition differs from the definition given in [9] since the modality m_i of how the word w_i should be uttered was replaced by the length b_i of the break.

Whereas Jan Calliess [9] used the modalities spoken and unspoken and thus varied m_i, we concentrated on unspoken speech but varied the pause in between the words and the order of the words.


Unless otherwise stated, the value of n was set to 105, since each of the 5 words was repeated 20 times and 'silence', represented by '[. . .]', was recorded 5 times as baseline. Therefore, a session can be seen as a sequence of n = 105 steps, where a session is defined as follows.

Definition. A session S is defined as a tuple

S = (W(B), O)

where W(B) is a word list of length n with break type B and O ∈ {blocks, shortBlocks, sequential, randomized} is the order in which the words from W(B) are presented.

Finally, a recording is composed of various sessions.

Definition. A recording R(p) of a subject with ID p is defined as an ordered set of size m

R(p) = (S1, S2, ..., Sm)

where S_i is a session (i ∈ {1, ..., m}).

A standard recording R(p) for the current thesis consisted of three sessions (m = 3, i.e. R(p) = (S1, S2, S3)). This was only changed in case technical problems prevented the recording of further sessions or in case the subject asked to stop the recordings. Between the sessions, the subject was given a break of at least 10 minutes. The length of the rest was determined by the subject, but usually did not exceed 30 minutes. Each subject p did only one recording R(p) to avoid training effects. The only exception was subject 05, for whom three recordings were taken with the aim to vary both break length and word order. For this subject, the sessions are numbered (S1, S2, ..., S9) but were taken in three recordings, each consisting of three consecutive sessions.

To sum this up, a standard recording for subject p would look like the following:

R(p) = (S1, S2, S3)

= ((W1(B1), O1), (W2(B2), O2), (W3(B3), O3))

For each of the three sessions, a word list of length 105 was used. This in turn means that each word of the alpha vocabulary domain {alpha, bravo, charlie, delta, echo} was repeated 20 times per session and 'silence' was recorded 5 times as a baseline.

3.4.3 Variation of Presentation Mode

Two variables were varied during the recording sessions (with i ∈ {1, 2, 3}):

• the length B_i of the breaks between the words and

• the word order O_i of the word list.


Word Order              Short   Long   Short OR Long
Blocks (alphabetical)     5       2         4
Reordered Blocks          5       0         6
Short Blocks              5       0         5
Randomized                8       2        10
Sequential                6       3         6

Table 3.2: Total number of sessions in the database for different word orders and different break types.

Word Order              # Subjects   # Sessions
Blocks (alphabetical)        7           11
Reordered Blocks            11           11
Short Blocks                10           10
Randomized                  16           20
Sequential                   9           15

Table 3.3: Total number of subjects and sessions in the database for different word orders.

In table 3.2, an overview is given of all the sessions that were recorded. Additionally, the number of subjects per word order can be found in table 3.3, whereas the number of subjects per break type is given in table 3.4. The database only includes the recordings that were actually used for the experiments (excluding subjects 01 and 19; refer to section 3.5.1 for reasons).

The detailed results of these recordings can be found in Appendix B.

A counterbalanced within-subjects design was used for the experiments. This means that the order in which the different modes (either variation of breaks or variation of word order) were presented was varied in order to avoid order effects, e.g. effects due to a certain mode only being presented to subjects after they had already been trained on other modes.

3.4.3.1 Variation of Breaks

During the first part of the experiments, the words were presented to a given subject (subject ID p) in a constant word order during each session (p ∈ {1, ..., 6}).

O1 = O2 = O3 ∈ {blocks, sequential, randomized}

However, the length b_i of the breaks was varied between sessions, such that for each subject a session with breaks of 2 seconds (short), 10 seconds (long) or with breaks of either 2 or 10 seconds (shortORlong) was recorded. In the latter case, each of the breaks had the same probability (equal overall distribution), such that half of the breaks were short and half of them were long. For an example of a recording protocol, see below:

R(p) = ((W1(shortORlong), O1), (W2(short), O2), (W3(long), O3))

with O1 = O2 = O3.


Break Type       # Subjects   # Sessions
Short                13           29
Long                  5            7
Short OR Long        13           31

Table 3.4: Total number of subjects and sessions in the database for different break types.

3.4.3.2 Variation of Word Order

During the second part of the experiments, the words were presented in different orders while the length of the breaks stayed the same. The break type long was dismissed since it would have resulted in recordings that were too long for the subjects. This resulted in recordings of the following form, for instance:

R(p) = ((W1(B1), randomized), (W2(B2), blocks), (W3(B3), sequential))

with B1 = B2 = B3 ∈ {short, shortORlong}

For the second part of the experiments, the following word orders were used: {blocks, blocksReordered, shortBlocks, sequential, randomized}. Furthermore, 'silence' was only recorded 4 times per session. Since the different word orders are decisive for the understanding of the experiments conducted for this work, they are described in the following.

The word order blocks was used both in [9] and [40]. In [9], other experiments are described with sequential or randomized word order, which are called mixed blocks (sequential) and mixed blocks (randomized) there.

If the words were presented in blocks, they were presented to the subjects in blocks of 20 words, i.e. the list looked as follows:

∀i ∈ {1, ..., 20} : (w_i, b_i) = (alpha, b_i)
(w_21, b_21) = ([. . .], b_21)
∀i ∈ {22, ..., 41} : (w_i, b_i) = (bravo, b_i)
(w_42, b_42) = ([. . .], b_42)
∀i ∈ {43, ..., 62} : (w_i, b_i) = (charlie, b_i)
(w_63, b_63) = ([. . .], b_63)
∀i ∈ {64, ..., 83} : (w_i, b_i) = (delta, b_i)
(w_84, b_84) = ([. . .], b_84)
∀i ∈ {85, ..., 104} : (w_i, b_i) = (echo, b_i)

The size of the blocks used for these experiments differs from the size of the blocks recorded by Jan Calliess and Marek Wester, who used 30 words per block. This was not possible for our recordings, since we recorded 3 sessions per subject in a row in order to keep the sessions comparable. 30 repetitions would have resulted in a significantly longer overall recording time.


In order to prevent side effects due to the alphabetical order of these blocks, the order of the blocks was randomized for some subjects. This is referred to as blocksReordered. An example can be seen below:

∀i ∈ {1, ..., 20} : (w_i, b_i) = (delta, b_i)
(w_21, b_21) = ([. . .], b_21)
∀i ∈ {22, ..., 41} : (w_i, b_i) = (alpha, b_i)
(w_42, b_42) = ([. . .], b_42)
∀i ∈ {43, ..., 62} : (w_i, b_i) = (charlie, b_i)
(w_63, b_63) = ([. . .], b_63)
∀i ∈ {64, ..., 83} : (w_i, b_i) = (echo, b_i)
(w_84, b_84) = ([. . .], b_84)
∀i ∈ {85, ..., 104} : (w_i, b_i) = (bravo, b_i)

ShortBlocks means that the words were presented in a similar way as in blocks, i.e. blocks of the same words were presented. However, the blocks were smaller and consisted only of 5 repetitions of each word. Furthermore, the blocks were repeated in alphabetical order. This resulted in the following list of words:

(w_1, b_1) = ([. . .], b_1)
∀i ∈ {2, ..., 6} : (w_i, b_i) = (alpha, b_i)
∀i ∈ {7, ..., 11} : (w_i, b_i) = (bravo, b_i)
∀i ∈ {12, ..., 16} : (w_i, b_i) = (charlie, b_i)
∀i ∈ {17, ..., 21} : (w_i, b_i) = (delta, b_i)
∀i ∈ {22, ..., 26} : (w_i, b_i) = (echo, b_i)
(w_27, b_27) = ([. . .], b_27)
∀i ∈ {28, ..., 32} : (w_i, b_i) = (alpha, b_i)
∀i ∈ {33, ..., 37} : (w_i, b_i) = (bravo, b_i)
. . .
∀i ∈ {95, ..., 99} : (w_i, b_i) = (delta, b_i)
∀i ∈ {100, ..., 104} : (w_i, b_i) = (echo, b_i)

If words were presented in sequential word order, they were listed in alphabetical order {alpha, bravo, charlie, delta, echo}. This quintuple was then repeated 20 times.

Randomized means that the words were shown to the subject in random order, subject only to the constraint that each word occurred 20 times.
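
For illustration, the four presentation modes can be generated as in the following minimal sketch, in which the placement of the '[. . .]' silence trials is simplified (in the actual protocol they separate the blocks in the blocks mode and occur a fixed number of times per session):

import random

WORDS = ["alpha", "bravo", "charlie", "delta", "echo"]
SILENCE = "[...]"

def word_list(order, reps=20, seed=0):
    # Build the word sequence of one session for the given presentation mode.
    rng = random.Random(seed)
    if order == "blocks":
        seq = []
        for word in WORDS:
            seq.extend([word] * reps)
            seq.append(SILENCE)                  # silence separates the blocks
        return seq[:-1]
    if order == "shortBlocks":                   # 4 passes of 5-repetition mini-blocks
        return [w for _ in range(reps // 5) for w in WORDS for _ in range(5)]
    if order == "sequential":                    # alphabetical quintuple, repeated
        return WORDS * reps
    if order == "randomized":                    # each word exactly reps times
        seq = WORDS * reps
        rng.shuffle(seq)
        return seq
    raise ValueError("unknown word order: " + order)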


3.5 Recording Procedure

3.5.1 Subjects

Data was obtained from 23 subjects, of which 7 were female and 16 were male. However, only the data from 21 subjects was actually used (resulting in 6 female and 15 male subjects in the database): since subject 01 did not have the same conditions as the other subjects, this data was excluded. Similarly, the data from subject 19 was not used either, since the subject had to take antibiotics after a small surgery, resulting in EEG data which could not be used. Therefore, I will only refer to these 21 subjects and their recognition rates in the rest of the thesis. Data about the subjects of this study is provided in Figure 3.5. The number codes for their state at the beginning of the recording are given in table 3.5. An overview of the experimental setup for each of the subjects and their corresponding recognition rates is provided in Figure 3.6. For subjects 01 to 07, the break type was varied, whereas for all other subjects it was kept fixed and the word order was changed between sessions.

Current State   Code   Meaning
Drugs             0    none
                  1    evening before
                  2    a few hours before the recording
Health            0    healthy
                  1    small impairment
                  2    huge impairment
Alertness         0    excited
                  1    awake
                  2    sleepy
Nervousness       0    not nervous
                  1    a little nervous
                  2    very nervous

Table 3.5: Codes for subjects' state in Figure 3.5. The term drugs includes alcohol and medication.

The subjects were all graduate students between 21 and 26 years old, except for one who was significantly older (53 years). The average age was 24.5 years. For the females, the average was 28.3 years, ranging from 21 to 53. For the males, the average age was 22.9 years, with a range from 21 to 26. Except for two people (subjects 05 and 11), all test subjects studied a technical subject.

The subjects were mainly native German speakers, except for two Hungarians and one Bulgarian. One subject had spent their early childhood in Romania but can be considered a native German speaker. All of the subjects were fluent in German and pursued their studies in German. All subjects had normal or corrected-to-normal vision. Participation in the experiment was voluntary and not paid. All subjects gave informed consent before the experiment was started.

In [19], a correlation between handedness and hemispheric language dominance was shown, based on the Edinburgh Index Inventory [29]. Since we wanted to use


this finding in our work, we used the same inventory for assessing the handedness of the subjects. The subjects were asked to fill out the 10-item Edinburgh Inventory Questionnaire online [1] and report the result.

Though the Edinburgh Inventory Index is a widely used standard to assess the degree of handedness of subjects, other standard inventories such as the Annett Inventory could have been used. As mentioned, we wanted to keep our results comparable to those of [19] and therefore chose the Edinburgh Inventory. However, it has to be taken into account that the Edinburgh Handedness Inventory has been shown to yield more either-hand and fewer left-hand responses than the Annett questionnaire [41].

3.5.2 Preliminary Procedure

First, each subject was given a detailed, written instruction for the upcoming experiment (see Appendix A) in order to guarantee at least a similar initial state of information. After having read the instruction, each subject had the option to ask questions before and during the preparation of the experiment. The subjects were told that they could interrupt or quit the experiment at any given point during the recording in case of further questions, fatigue or other concerns. After these initial instructions, the supervisor collected statistics as a part of the test protocol (see Appendix A). In contrast to the experiments described in [9], these steps were conducted before the subject was outfitted with the cap, in order to minimize the actual time the subject had to wear the cap. This was essential since the experiments were much longer than the ones in [9], such that the cap had to be worn for a longer time, eventually resulting in discomfort due to the pressure on the scalp.

Then, the subject was outfitted with both the cap and the ear electrodes. The electrodes were filled with a conductive paste while the supervisor monitored the signal quality.

After the signals had reached a satisfactory level, the subjects were asked to blink while watching their EEG signals, such that they could get direct visual feedback on how to produce a 'correct' blinking signal. Subsequently, a short test session was recorded, consisting of one recording for each word, such that the subject could get used to the recording procedure.

Before the start of the recording, each subject signed an informed consent form. With their signature, they confirmed that they had read and understood the information carefully and that their questions had been answered to their full satisfaction, agreeing that the data recorded during the experiment may be used and published anonymously for scientific purposes.

3.5.3 Recording

Each recording step consisted of four phases, as shown in Figure 3.4, and had the following structure (description for step i). In phase 1, the word w_i was shown to the subject for two seconds (black font on white background). Subsequently, the screen turned blue without showing any word (phase 2). When the screen turned white again after 2 seconds, the recording phase started (phase 3). The subject had the instruction to do the following in this phase:


[Written instruction sheet given to the subjects (original in German, see Appendix A), translated:]

"Dear participant! This experiment investigates how the human brain processes speech. Since every movement during the recording corrupts the data, you should try to move as little as possible during the experiment. This applies above all during the recording itself, i.e. while the respective word is being thought (phase 3, see below). In the course of the experiment, the following words are presented and are to be thought: alpha, bravo, charlie, delta, echo. Each of these words is repeated 20 times. In addition, '[. . .]' is displayed from time to time, during which you should think of nothing if possible (at least none of the five words mentioned above). The sequence for each word looks as follows:

In phase 1, the word is displayed first, then a blue screen appears (phase 2). When the white screen appears afterwards in phase 3, please do the following:

1. first blink
2. think the previously displayed word
3. then blink again

The blinking serves the later segmentation of the words and is cut out of the recordings. You should therefore only start thinking the word once the blinking is completed. When you think the word, it is important that you imagine pronouncing it. However, you should not move your tongue or your facial muscles while doing so. Besides body movements, it is also important to avoid eye movements. The recording runs as long as the screen is white. As soon as it is stopped, the screen turns black and there is a short break of a few seconds (phase 4). If the screen turns black too early (e.g. before the second blink has occurred) or too late (when the word has long been thought and you may already have moved), please tell the supervisor. The same applies if you made a mistake while thinking the word. The recording of that word can then be deleted and repeated. If you need a longer break in between, please also let us know. Before the experiment begins, there is a short practice run which serves to familiarize you with the procedure of the following part. In total, there are 3 recording blocks with longer breaks of about 10 minutes between them. In each recording block, the five words are presented to you in different orders. Should you have any further questions about the experiment, please contact the supervisor. Thank you very much for your help with this experiment!"

[Figure: timeline of one recording step for the word 'alpha' — phase 1: read (2 sec), phase 2: concentrate (2 sec), phase 3: blink – think – blink (recording), phase 4: break.]

Figure 3.4: Phases of one recording step.

• Blink with both eyes

• Think the word that had been shown in phase 1

• Blink again with both eyes

The subjects were asked to imagine speaking the word without moving the tongue or the facial muscles. Also, it was pointed out that eye movements should be avoided and, of course, any other movement of the body. Furthermore, the subjects were reminded that it was important to replicate the thinking process for each of the words as accurately as possible. The supervisor had to make sure that the subject blinked exactly twice during each step. As long as data was recorded, the screen stayed white. When the supervisor detected the second eye blink in the signals, she stopped the recording immediately. As a visual feedback for the subject, the screen turned black (phase 4), signaling the end of the recording. The length of phase 4 was determined by b_i, the length of the break.


[Figure: table of subject data with columns Speaker ID, Sex, Age, Handedness, Edinburgh Index, Nationality, Studies, and the current-state codes Drugs, Health, Alertness, Nervousness.]

Figure 3.5: Overview of the subjects. Handedness is given as L (left-handed), R (right-handed) or A (ambidextrous). These were the answers given by the subjects before they took the Edinburgh test. The explanation for the number codes (in the column 'current state') can be found in table 3.5. The data from subjects 01 and 19 was not used (for explanations, refer to section 3.5.1).


[Figure: table listing every recording in the database with speaker ID, word order, session ID, break type, and the recognition rate achieved per session.]

Figure 3.6: List of all recordings in the database. The letters behind 'blocks' provide the order in which the blocks were presented, e.g. 'ECBDA' signifies the order (echo, charlie, bravo, delta, alpha). These special blocks are called blocksReordered in the rest of the thesis. If nothing is written after the word 'blocks', they were recorded in alphabetical order. The data from subjects 01 and 19 was not used (for explanations, refer to section 3.5.1).


4. Experiments and Results

Let us recall the aim of this thesis, which is to verify which one of the following hypotheses is true:

Hypothesis A. Unspoken speech can be recognized based on EEG signals.

Hypothesis B. Unspoken speech cannot be recognized based on EEG signals. The good recognition results of [40] were due to temporal patterns that were recognized instead of words.

The succeeding experiments were conducted in order to answer this question.

The experiments can be divided into multiple parts, the first of which was based on variations in the presentation mode during recordings (see also section 3.4.3). As mentioned before, in [40] a presentation mode was used that is vulnerable to temporal artifacts. Though Jan Calliess hypothesized in his thesis that the presentation mode influences the recognition rates, he did not have enough data to investigate this idea in depth. Thus, it was crucial to produce a significant amount of data. The aim was to record three sessions per subject without removing the EEG cap, but with a different word order per session. In this way, the sessions of one subject could be compared directly in order to investigate the influence of the word order. If all word orders except for blocks delivered low recognition rates, this would strengthen hypothesis B.

Second, cross session experiments were conducted, i.e. we trained the HMM on one session and tested on a different session from the same subject. The good recognition rates for the block presentation mode could be a result of the fact that this word order made it simpler for the subjects to concentrate and think the words in a similar manner. However, if this was true, the recognition results should become much better if we trained on the recordings of a session with block word order and tested on a different session of the same subject where a different word order was used. This would then support hypothesis A.

The third part of the experiments focused on variations of the word modeling in JANUS. This was not linked directly to either of the hypotheses. However, our


recognition results for recordings with block word order were worse than those presented in [9, 40]. At the same time, we had made changes to both the recording software and the preprocessing compared to [9, 40]. Thus, we wanted to make sure that we used an appropriate word model in JANUS and therefore varied the HMM.

Subsequently, we examined the recorded EEG data more closely, trying to eliminate the noise and analysing the length of the recorded utterances. One reason for the better performance on block recordings could be that this data simply contains less noise and has fewer variations in length, for the reasons mentioned above. So if we eliminated utterances which were presumably corrupted, the sessions with word orders other than blocks should benefit more than blocks. This would then support hypothesis A.

Finally, we investigated the impact of handedness on the recognition rate. This was again linked to the better recognition results which were yielded in [9, 40], but not to the two hypotheses. Since fewer subjects were used in those two studies, we suspected that there may have been a factor other than word order (such as handedness) which those subjects had in common and which influenced the studies.

All the data plotted here in graphs can be found in Appendix B. For more details on the training and recognition itself, refer to section 3.3.2.

4.1 Variation of Break Length

4.1.1 Initial Experiments with Break Length

The initial experiments with varying break length showed no significant difference in the recognition rate, which is why we ended these experiments after some recordings. In total, we recorded 21 sessions (7 recordings) with 5 subjects, one of whom was available for 3 recordings. The recognition rates which were yielded can be found in Figure 3.6 (subjects 01 to 07). Across subjects, the word order was varied between blocks, randomized and sequential, which resulted in a wide range of recognition rates (between 12% and 56%), as can be seen in Figure 4.1. These differences depending on the word order are addressed in section 4.2.

In 4 recordings, of which 3 were from the same subject, short breaks yielded the best recognition result. For the others, short breaks delivered the worst results. It seems that this variation reflects individual differences between subjects instead of following a consistent pattern. However, when using short breaks, the recognition rate was 27.39% on average, which is indeed slightly higher than for shortORlong (23.92%) or long breaks (23.11%).

There was a general tendency for subjects to like short breaks, followed by breaks with shortORlong length. In contrast, they did not like the break type long. These breaks were perceived as being too long, thus making it difficult to keep up concentration. Also, subjects tended to become bored while waiting for the next word.

For these reasons, and since no significant difference in the recognition rate could be detected, this break type was dismissed for later recordings. Another reason was that this break type simply leads to very long recording sessions, and it would hardly have been feasible to record 3 sessions in a row with break type long: each would


[Figure: bar chart of the recognition rates listed below; x-axis: break type, y-axis: recognition rate.]

subject        short     shortORlong   long
03 (random)    20.00%    22.00%        22.50%
05 (random)    26.00%    20.25%        21.00%
02 (sequ)      19.20%    14.00%        13.00%
06 (sequ)      12.00%    15.20%        14.00%
05 (sequ)      27.00%    20.00%        18.00%
04 (blocks)    31.50%    34.00%        37.00%
05 (blocks)    56.00%    42.00%        36.25%
average        27.39%    23.92%        23.11%

Figure 4.1: Recognition rate depending on break length for subjects p ∈ {02, 03, 04, 05, 06}.

have lasted about 40 minutes, totalling a recording time of at least 140 minutes (3 × 40 min sessions + 2 × 10 min breaks), which is simply too long for the subjects.

Thus, for the experiments on word order, only the break types short and shortORlong were used.

4.1.2 Break Length Revisited

When all the recorded data was compared, it showed that short breaks yielded slightly better results than shortORlong break lengths for all word orders. This effect was most significant for the word order shortBlocks. However, the effects of the word order on the recognition rate were the same for both short and shortORlong break lengths. This is different for long breaks (see Figure 4.2). An overview of the recognition rates depending on the break type is given in table 4.1.

It has to be taken into account, though, that 32 sessions were recorded with short breaks and 31 sessions with shortORlong break length, whereas only 7 sessions were recorded with long breaks. The average recognition rates of long breaks are based on the data of 3 subjects only, which is not representative. More data would be needed to assess this in more depth.

Break Type     Average   Min     Max
short           30.91    12.00   61.75
long            23.16    13.00   37.00
shortORlong     26.61    10.00   68.00

Table 4.1: Overview of the recognition rates (%) depending on the break type.


[Figure: bar chart of recognition rate per word order (blocks, randomized, sequential, shortBlocks), with one bar each for short, long and shortORlong breaks.]

Figure 4.2: Recognition rate depending on break length, averaged over all the recordings.

4.2 Variation of Word Order

In the second part of the experiments, the word order was varied within subjects while the break length was varied between subjects. An overview of the recordings made with varying word order is given in table 3.3.

4.2.1 Blocks, Randomized and Sequential

At first, for each subject, words were recorded with the word orders blocks, sequential and randomized. For these recordings, 6 subjects were chosen and for each of them, 3 sessions were recorded. Half of the subjects had a short break and half of them a shortORlong break. The results can be seen in Figure 4.3.

Results

Overall, it can be said that the word order blocks yielded by far the best results, followed by randomized and sequential. On average, blocks yielded a word accuracy rate R of 52.96%, while the other two types only led to word accuracies of 19.22% and 19.83%, respectively. However, these average recognition rates are influenced by the fact that the recognition rates for subject 10 were quite different from the others. While for all the other subjects it can be said that R_randomized ≥ R_sequential, it was the other way round for subject 10, for whom R_sequential was significantly higher (R_randomized = 18%, R_sequential = 34%). This may have been due to the fact that the randomized word order was recorded last and the subject was quite tired. In any case, the recognition rates of randomized and sequential were basically at chance level. In general, it can be said that

R_blocks > R_randomized ≈ R_sequential

The length of the break in between the words seemed to make little difference: in the case of sequential and randomized word orders, the recognition rate was slightly


[Figure: bar chart of recognition rate per word order (blocks, randomized, sequential); one bar series per subject: 07, 09, 10 (short) and 08, 11, 12 (shortORlong), plus averages per break type.]

Figure 4.3: Variation of word order (I).

higher for short breaks compared to shortORlong breaks on average, while it was slightly worse in the case of blocks.

Feedback from the Subjects

It turned out that the sequential word order was an unfortunate combination of boring the subjects while still catching them unprepared. The subjects were bored since they always had to repeat the words in the same order. However, this cycle of words made them less alert, such that they tended to look at the screen less carefully. This in turn resulted in them not being totally sure which word had been displayed before. For instance, a subject might still know that the list had just started again but not be sure whether to think 'bravo' or 'charlie' now. This insecurity led to more requests by the subjects to delete and repeat a word. Moreover, it can be assumed that the number of unreported cases was even higher than for the other word orders. This may partly explain why the recognition rates were the worst for this word order.

For the word order randomized, the situation was different. This was perceived as a more difficult task than the others, requiring more concentration, which resulted in subjects reporting that they were generally more alert.

The word order blocks was perceived quite differently by the subjects. While some reported that this word order simply put them to sleep, others said that they found this presentation mode to be ruminative and comfortable. Unfortunately, both effects led to subjects who were half asleep during the recording, especially when the subjects were already tired before the recordings had started. However, this 'peace of mind' might also have contributed to the good recognition results, since the subjects were usually totally relaxed, thus having few secondary thoughts in mind and relaxed muscles, resulting in fewer artifacts.


4.2.2 Blocks, Randomized and Short Blocks

Eleven subjects were recorded using the word orders blocks, randomized and the new word order shortBlocks. Again, about half of the subjects were recorded with short breaks between the words, while the other half had breaks of randomized (shortORlong) length. Unfortunately, for two subjects only blocks and shortBlocks could be recorded (both had short breaks) due to technical issues and time constraints. Furthermore, the data from subject 19 had to be dismissed, since the subject had to take antibiotics after a small surgery, resulting in EEG data which could not be used.

In addition to the introduction of shortBlocks, the word order blocks was changed slightly. In the previous experiments, the blocks were presented in alphabetical order. This was changed by randomizing the order of the blocks (for reasons and results see section 4.2.3).

Introducing shortBlocks

The main reason for introducing the word order shortBlocks was to investigate why the recognition rates of blocks, sequential and randomized differed that much. ShortBlocks shares one characteristic with blocks in that the subject is presented the same word n times in a row before switching to the next word, the difference being that n was set to 20 for blocks while it was set to 5 for shortBlocks. This can be seen as a compromise between blocks and randomized: in blocks, the number of times m that a word w_i was immediately followed by the same word w_i was m = 19, while it was 4 for shortBlocks and ranged between 0 and 4 for randomized. In contrast, m was zero for the sequential word order.
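
The quantity m can be read off a word list directly; a small sketch, reusing the hypothetical word_list() helper from section 3.4.3.2:

from itertools import groupby

def repetition_runs(seq):
    # For each run of identical consecutive words, return the number m of
    # immediate repetitions in that run (run length minus 1): 19 per block
    # in blocks, 4 per mini-block in shortBlocks, 0 throughout sequential.
    return [(word, len(list(group)) - 1) for word, group in groupby(seq)]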

Another (unintended) difference between shortBlocks and blocks was that the subjects reported having an orientation in time, since they tended to internally count the number of times that a word was repeated, such that they knew beforehand when the system would switch to presenting a new word. Many said that this was more comfortable than in other modes. However, shortBlocks were still more similar to blocks than to other word orders in that respect, since in blocks, the blocks of words were separated by 'silence'. This also gave the subject a clear hint that the block would be switched. In blocks, though, the subjects usually lost track of how many words had been recorded so far, which is a difference to shortBlocks.

We chose short blocks of 5 words since this allowed us to split the 20 repetitions of each word into 4 short blocks. One reason for not using smaller blocks was that these have a higher probability of occurring in the randomized word order. Second, blocks that were even smaller would have made it harder for subjects to concentrate, the hypothesis being that blocks yielded better results since long blocks of the same word made it easier for subjects to concentrate. Alternatively, we could have used short blocks of 10 words, but this would still have been fairly vulnerable to temporal artifacts.

Results

The results can be seen in Figure 4.4. The picture remains similar to the one of the previously described experiment in that blocks of words yielded by far the best recognition results (44.65%). Blocks was followed by shortBlocks and randomized (22.10% and 17.30% on average, respectively).

However, there were differences concerning the length of the word breaks. First, for those subjects with short breaks, the recognition rates were clearly higher


[Figure: bar chart of recognition rate per word order (blocks, shortBlocks, randomized); one bar series per subject: 13, 16, 17 (short) and 14, 18, 21, 22, 23 (shortORlong), plus averages per break type.]

Figure 4.4: Variation of word order (II).

for the word orders blocks and shortBlocks, while it did not matter for the word order randomized. Second, if the breaks were short, shortBlocks always yielded better results than randomized words. It was the other way round for 3 out of the 5 subjects with shortORlong break length.

Although this is not true for some subjects, it may still be said in general that

R_blocks > R_shortBlocks > R_randomized

This may be seen as a hint that the recording of words as blocks, even if they are short, yields better results than other recording modes.

4.2.3 Reordered Blocks

It was hypothesized in [9] that the word order blocks only yielded such good results since the temporal closeness of words was classified and not their intrinsic pattern, induced by the process of thinking the word. Our recordings also seemed to show that words from neighbouring blocks were confounded more often than words from distant blocks, i.e. 'alpha' and 'bravo' were confounded more frequently than 'alpha' and 'delta'. However, this could also have been due to some characteristics of the words, e.g. due to the fact that 'alpha' and 'bravo' share the vowel 'a' in the first syllable, which 'delta' does not. We wanted to investigate this hypothesis further and therefore reordered the blocks of words by randomizing their order. This was done for the subjects who were presented words in blocks, shortBlocks and randomized word order (see section 4.2.2). For all of the other subjects, the alphabetical blocks were used. Thus, we had 11 sessions with alphabetical blocks and 11 sessions with reordered blocks. We calculated two separate confusion matrices, one over all the reordered blocks and one over all the alphabetical blocks. In both cases, we ordered the matrix such that reference 1 was the first reference timewise and reference 5 was the last one. These matrices can be found in the appendix (Figure B.5).


Results

The results are very clear and support hypothesis B. There is indeed a correlation between the temporal neighborhood of two words and the likelihood that these words were confounded by the recognizer. As can be seen in parts (a) and (b) of Figure 4.5, words of the first or the last block were recognized correctly more often than the other words, and they were more likely to be confounded with a different word that was in a neighboring block. The latter was also true for the words in the 3 blocks in the middle of the session. This can be summed up as follows: the more distant a block B is from a reference block A timewise, the less likely it is that a word from block B is confounded with a word from block A. Since this is true for both blocks (in alphabetical order) and blocksReordered, it cannot be explained by neighboring words sharing certain characteristics and therefore being confounded more often.

The picture is quite different if we have a look at words which were presented in randomized order, as can be seen in part (c) of Figure 4.5. In order to have a data set comparable to those in blocks / reordered blocks, only the data from 11 subjects was chosen (subject ID p ∈ {07, ..., 17}).

These results suggest that hypothesis B is indeed true: the recognizer may have only recognized different states of mind over time, such as alertness in the beginning of the recording and decreasing concentration towards the end. It has been shown in [16] that the state of the subject and their mental task demand can be determined from EEG data. Our findings indicate that this may be what has taken place during our recordings.

It also makes sense that the first and the last block are recognized more often than the others. The subjects tended to think the first block more quickly than the others. Furthermore, they were probably the most concentrated during the first block. The last block was special insofar as it was clear to the subject that the session was almost over, such that (s)he was prepared for a final spurt.

4.2.4 Overview: Impact of Word Order

It could be seen during the experiments that there was a clear correlation between the order in which the words were presented and the recognition rate. In summary, the picture was as follows:

R_blocks > R_shortBlocks > R_sequential > R_randomized

This can also be seen in Figure 4.6. The overall recognition rate for blocks recordings was 45.50% (averaged over the recordings with alphabetical and reordered blocks), while the rates of the other word orders were at chance level, as can be seen in table 4.2.

4.3 Cross Session Experiments

4.3.1 Cross Mode Testing

The main result of section 4.2 was that there are significant differences in the recognition rate depending on the order in which words are presented, with blocks


[Figure: three 5×5 confusion plots (hypothesis vs. reference), one per condition (a), (b), (c).]

Figure 4.5: Impact of temporal closeness for (a) blocks recorded in alphabetical order, (b) blocksReordered (i.e. blocks that were recorded in random order) and (c) words recorded in randomized word order.


Word Order              Average   Min     Max
Blocks (alphabetical)    45.95    31.50   68.00
BlocksReordered          45.05    30.48   64.00
ShortBlocks              22.10    13.00   31.00
Randomized               19.48    10.00   26.00
Sequential               18.09    12.00   34.00

Table 4.2: Overview of the recognition rates (%) depending on the word order.

[Figure: bar chart of recognition rate per word order (sequential, blocks, randomized, shortBlocks); one bar series per subject (07–23) with their break types.]

Figure 4.6: Recognition rate depending on word order for all word orders used.

yielding the best result. One possible reason for this could be that the subjects were simply more concentrated when blocks were recorded and therefore able to produce thoughts that had less variation for each word. In contrast, during the recording of the other word orders, there may have been more 'silence' in the recording, i.e. recording time when nothing was actually thought, or at least not the word that was intended to be thought. Therefore, we ran cross mode experiments. This means that we trained the HMM with data recorded in blocks, since this yielded the best results. Then, we chose a different session of the same subject as a test set, which was recorded using a different word order but without having moved the EEG cap. For these experiments, we chose the data from two subjects with word orders blocks, randomized and sequential and from two other subjects with word orders blocks, randomized and shortBlocks (see table 4.3).

The results clearly show that training and testing on different sessions with different word orders does not work out. The recognition rates are all around chance level, i.e. 20%.

However, there were differences depending on the type of the test set (see Figure 4.7). As was said, the HMM was always trained with the blocks recordings. When the HMM was tested with data recorded in randomized order, the recognition rate improved slightly compared to the rate yielded with an HMM trained via round-robin on the same data. The same was true when the test set was composed of


Subject ID   Blocks   Randomized   Sequential   ShortBlocks
08             x          x            x            -
09             x          x            x            -
16             x          x            -            x
17             x          x            -            x

Table 4.3: Data chosen for cross mode experiments.

data recorded in sequential order. However, for shortBlocks the rate deteriorated. Nevertheless, the recognition rates with cross mode testing were all at chance level.

[Figure: two bar charts per test mode (randomized, sequential, shortBlocks) and subject (08, 09, 16, 17); panel I: absolute change in recognition rate (percentage points), panel II: relative change of recognition rate.]

Figure 4.7: Cross mode testing: gain by training on data recorded in blocks compared to training on the data itself via round-robin.

4.3.2 Cross Session Testing with Blocks

There are two facts which could have led to the bad recognition results described in section 4.3.1:

• the data was recorded in different sessions

• the sessions were recorded with different word orders

In order to clarify this, the same experiment was run with the data of subjects 04 and 05. Both were recorded using blocks only, but with varying break length. As was shown in section 4.1, this had only minor effects on the recognition rate. Since each of the sessions per subject was recorded in blocks, we used each of them for training the HMM and subsequently tested the other two on it.

As can be seen in Figures 4.8 and 4.9, the results were almost as bad as for the cross mode experiments. Although some recognition rates were higher than chance, most of them were not. Given that the recognition rates had been much higher


before, especially for subject 05, this shows that cross session testing does not work out, even when the same word order is used. This means, at least, that the differences between sessions are bigger than the similarities of the recorded words.
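
The train-on-one-session, test-on-the-others scheme can be sketched as follows, with train() and recognition_rate() again as hypothetical stand-ins for the JANUS calls:

from itertools import permutations

def cross_session_rates(sessions, train, recognition_rate):
    # sessions: dict mapping session ID to its utterance data. Returns the
    # recognition rate for every ordered (training session, test session) pair.
    rates = {}
    for train_id, test_id in permutations(sessions, 2):
        model = train(sessions[train_id])
        rates[(train_id, test_id)] = recognition_rate(model, sessions[test_id])
    return rates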

These results make it very unclear whether EEG-based speech recognition could potentially be used for a real-life application. In such an application, a training session would be needed in order to produce data with which the HMM is trained. Subsequently, the so-trained HMM would have to work on 'new' data from a new session.

[Figure: bar chart of the recognition rates listed below; x-axis: session tested on, y-axis: recognition rate, one bar per training session.]

Subject 04         Train on session 01   Train on session 02   Train on session 03
Test session 01          37.00%                22.00%                20.00%
Test session 02          30.30%                31.50%                19.19%
Test session 03          20.00%                22.00%                34.00%

Figure 4.8: Cross session testing with data recorded in blocks, subject 04.

[Figure: bar chart of the recognition rates listed below; x-axis: session tested on, y-axis: recognition rate, one bar per training session.]

Subject 05         Train on session 01   Train on session 02   Train on session 03
Test session 01          56.00%                17.00%                18.00%
Test session 02          20.00%                42.00%                32.00%
Test session 03          20.20%                22.22%                36.25%

Figure 4.9: Cross session testing with data recorded in blocks, subject 05.

The data of both the cross mode and the cross session testing experiments can be found in Figure B.6 in the appendix.

4.4 Variation of the HMM

As described in section 3.3.2, a left-to-right Hidden Markov Model was used as a classifier for the recognizer. The standard HMM that was used for training on all

Page 55: EEG-based Speech Recognition: Impact of Experimental

4.4. Variation of the HMM 43

of the data had five HMM states (HS) and a Gaussian mixture model (GMM) with one Gaussian per state. It had been shown in [40] that variations of the HMM can have an effect on the recognition rate. Therefore, we varied both the number of states and the number of Gaussians per state.
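
As an illustration of this topology, the following sketch builds a left-to-right word model with hmmlearn, which merely stands in here for JANUS, and classifies an utterance by the highest-scoring word model; the helper names are ours:

import numpy as np
from hmmlearn.hmm import GMMHMM   # stand-in for the JANUS HMMs

def left_to_right_hmm(n_states=5, n_mix=1):
    # One word model: n_states states, n_mix Gaussians per state.
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag",
                   init_params="mcw",   # keep the start/transition matrices below
                   params="stmcw")
    model.startprob_ = np.eye(n_states)[0]        # always start in the first state
    trans = np.zeros((n_states, n_states))
    for s in range(n_states):                     # self-loop or advance one state
        trans[s, s] = 0.5
        trans[s, min(s + 1, n_states - 1)] += 0.5
    model.transmat_ = trans                       # zero entries stay zero in training
    return model

def recognize(models, utterance):
    # models: dict word -> trained HMM; utterance: (n_frames, n_features) array.
    return max(models, key=lambda w: models[w].score(utterance))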

Due to time restrictions, only a subset of the recorded data was chosen for these experiments (all three recording sessions from subjects 08, 09 and 16; see Figure 4.10 for an overview of the subjects and sessions). Both subjects 09 and 16 were female. For 08 and 09, the word order was varied in the same way (starting with sequential, followed by blocks and randomized); only the break length differed. The break length stayed the same for subjects 09 and 16. However, different word orders were recorded for subject 16, using the order shortBlocks instead of sequential and reordering the word blocks in blocks. Additionally, we used the session data 18-01, 13-02 and 12-03 in order to have 3 sessions for each word order.

The complete data of the experiments can be found in Appendix B.7.

Speaker ID   Word Order     Session ID   Break Type     Recognition Rate
08           sequential         01       shortORlong        13.00%
08           blocks             02       shortORlong        68.00%
08           randomized         03       shortORlong        18.00%
09           sequential         01       short              18.00%
09           blocks             02       short              55.00%
09           randomized         03       short              21.25%
12           sequential         03       shortORlong        17.00%
13           shortBlocks        02       short              21.00%
16           blocks EDABC       01       short              53.00%
16           randomized         02       short              16.00%
16           shortBlocks        03       short              31.00%
18           shortBlocks        01       long               27.00%

Figure 4.10: Recordings chosen for experiments with HMM.

4.4.1 Variation of Gaussians per State

In [40], the best results for unspoken speech could be gained by using 4 Gaussians, followed by the results using 32 Gaussians (experiments with GMM ∈ {4, 8, 16, 32, 64}). For 64 Gaussians, the recognition rates deteriorated significantly. For these reasons, we experimented with 1, 4 and 32 GMMs.

In the following, different recordings of the same word order were compared. A relatively clear trend could be detected for the word order blocks (see Figure 4.11): 1 GMM and 32 GMMs always yielded better results than 4 GMMs, and the results of 1 and 32 GMMs were almost always comparably good. This is a bit different for 1 HS, where 32 GMMs yield the best result. For better readability, only the data for 3, 5 and 15 HS are shown in the diagram. These findings contradict the results of Marek Wester, who found that 4 GMMs were best.

For the word order sequential, 1 GMM yielded the best results for 3 and 5 HS. For 1 HS, 4 GMMs led to better results. For the other conditions (10, 15 HS), there was no clear trend, which is why they were left out of the diagram (see Figure 4.12).

For the word orders randomized and shortBlocks, no pattern could be detected.
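
The configurations examined here and in section 4.4.2 form a grid over the two hyperparameters. A minimal sketch of such a sweep, reusing the hypothetical left_to_right_hmm() and recognize() helpers from section 4.4 (note that with only 20 examples per word, large mixtures are used here purely to mirror the configurations tested):

import numpy as np

def sweep_hmm_configs(train_data, test_utterances):
    # train_data: dict word -> list of (n_frames, n_features) sequences;
    # test_utterances: list of (sequence, word) pairs.
    results = {}
    for n_states in (1, 3, 5, 10, 15):
        for n_mix in (1, 4, 32):
            models = {}
            for word, seqs in train_data.items():
                model = left_to_right_hmm(n_states, n_mix)
                model.fit(np.concatenate(seqs), lengths=[len(s) for s in seqs])
                models[word] = model
            correct = sum(recognize(models, x) == y for x, y in test_utterances)
            results[(n_states, n_mix)] = 100.0 * correct / len(test_utterances)
    return results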


[Figure: bar chart of recognition rate for word order blocks with 1, 4 and 32 Gaussians per state; bars for sessions 08-02, 09-02 and 16-01 at 3, 5, 10 and 15 HMM states.]

Figure 4.11: Variation of GMM for word order blocks.

[Figure: bar chart of recognition rate for word order sequential with 1, 4 and 32 Gaussians per state; bars for sessions 08-01, 09-01 and 12-03 at 3 and 5 HMM states.]

Figure 4.12: Variation of GMM for word order sequential.

4.4.2 Variation of the Number of HMM States

When it comes to the variation of the number of HMM states (HS), no significant difference in the overall performance across different modalities could be detected in [40], experimenting with HS ∈ {3, 4, 5, 6, 7}. Nevertheless, unspoken speech seemed to perform best with 3 states, and the more the number of states increased, the more the recognition rate deteriorated. Since our standard value was 5, we experimented with 1, 3, 5, 10 and 15 HMM states. Also, experiments were run using 100 HMM states, which are dealt with at the very end of this section.

Again, different recordings with the same word order were examined. For the block word order, 1 HS generally yielded the best results, with the exception of 16-01 (1 GMM) and 09-02 (1 GMM), where 3 HS were better (see Figure 4.13). This result is broadly consistent with [40], where a small number of HMM states (3) also worked best.

Concerning the sequential word order, most sessions were recognized best if 10 HMM states were used.


When it comes to the randomized word order, no clear trend could be seen. For shortBlocks, 1 HS worked best.

[Figure: bar charts of recognition rate over the number of HMM states (1, 3, 5, 10, 15); one panel for shortBlocks (sessions 16-03, 18-01, 13-02 with 1 and 32 GMMs) and one for blocks (sessions 08-02, 09-02, 16-01 with 1, 4 and 32 GMMs).]

Figure 4.13: Variation of HS for word order blocks.

An overview of which combinations of HS and GMM yielded the best results is provided in Table 4.4.

         Blocks        Sequential   Randomized   ShortBlocks
1 HS     32 GMM        4 GMM        -            -
3 HS     1 GMM         1 GMM        4 GMM        -
5 HS     1 / 32 GMM    1 GMM        -            4 GMM
10 HS    -             -            1 GMM        -
15 HS    1 / 32 GMM    -            -            32 GMM

         Blocks        Sequential   Randomized   ShortBlocks
1 GMM    1 / 3 HS      10 HS        -            1 HS
4 GMM    1 / 3 HS      10 HS        -            -
32 GMM   1 HS          -            -            1 HS

Table 4.4: Overview of the best HMM configurations when either the number of Gaussians per state (GMM) or the number of HMM states (HS) is varied.

4.4.3 Possible Reasons for Differences to Previous Findings

It has to be taken into account that the experiments in [40] were done using a different type of preprocessing: a windowed Fourier transformation with a window size of 26.6 ms and a window shift of 4 ms, which decomposed the input signal into 12 subbands. Due to the findings in [39], the preprocessing in the present work was changed to the Dual-Tree Complex Wavelet Transform (DTCWT) (see also section 2.3.2). Thus, the results are not directly comparable to each other.
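For illustration, a windowed Fourier preprocessing of the kind described for [40] could look like the following sketch; the sampling rate and the equal-width grouping into 12 subbands are assumptions, not the original implementation:

```python
# Hedged sketch of a windowed Fourier preprocessing (26.6 ms windows, 4 ms
# shift, 12 subbands); the sampling rate fs and the equal-width subband
# grouping are assumptions.
import numpy as np

def stft_features(signal, fs=300.0, win_ms=26.6, shift_ms=4.0, n_subbands=12):
    win = int(round(fs * win_ms / 1000.0))    # window length in samples
    hop = int(round(fs * shift_ms / 1000.0))  # window shift in samples
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        # Magnitude spectrum of one Hann-windowed frame.
        spectrum = np.abs(np.fft.rfft(signal[start:start + win] * np.hanning(win)))
        # Pool the spectrum into n_subbands equally wide bands.
        frames.append([band.mean() for band in np.array_split(spectrum, n_subbands)])
    return np.asarray(frames)
```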

Also, in [40], the words were always presented to the subjects in blocks, which was only the case for 3 of the recordings examined here. However, even for those, the results differed from those of Marek Wester.

Finally, we also ran some experiments using 100 HMM states. However, this led to a very strong decrease of the recognition rates, down to less than 10%. Interestingly, this did not depend on the word order and levelled the differences between the recognition rates.

4.5 Examination of the Recorded Data

4.5.1 Eliminating Useless Recordings

Motivation

In (spoken) speech recognition, it is a well-known issue that some subjects have problems using the push-to-talk button properly when speech is to be recorded. Some start speaking too early, so that the recording is corrupted (in case just one word or a short sequence is recorded). Others start too late or wait too long before releasing the button. Thus, silence is recorded before and/or after the word.

It can be assumed that these problems also occur when unspoken speech is recorded, even more so because the task is much less familiar than speaking. The same is true for the segmentation of the words by blinking with both eyes. Unfortunately, the supervisor can only check whether the blink used for segmentation looks alright; it is impossible to assess whether the EEG data is fine just by looking at it. Only if the recording of an utterance takes more time than usual can it be guessed that silence was recorded. But this also depends on the subject and varies greatly between and within subjects (see section 4.5.2).

Experience shows that about 20% to 30% of the data recorded for (spoken) speech recognition using a push-to-talk system is useless due to the issues described above. We assumed that the percentage of useless EEG recordings of unspoken speech is at least as high, if not higher. Since we had recorded 20 repetitions of each word, 20% to 30% corresponded to 4 to 6 utterances that were potentially useless. The aim was to separate these utterances such that we could train and test on the valuable data only.

This refers again to hypothesis A. If A is right, blocks should contain less garbage than the data recorded with the other modes, simply because this mode facilitated the task for the subjects. Therefore, we hypothesized that the recognition rate of blocks would deteriorate, since valuable training data would be lost, whereas the other modes would probably gain from a smaller but more valuable subset of data.

Experiments and Results

For this experiment, a small number of sessions was chosen, one for each word order (see Table 4.5).


SubjectID-SessionID   Word Order
09-01                 sequential
09-02                 block
09-03                 random
13-02                 shortBlocks

Table 4.5: Data chosen for the Viterbi experiment.

The idea was to separate the utterances based on their Viterbi score. We examined the accumulated Viterbi score that was averaged over the number of frames for each utterance. As it turned out, some utterances had an invalid score of 0, probably because the beam was too small and all the paths for backtracing had been pruned away. Therefore, the first subset was called 'no 0 scores' and contained only data with valid Viterbi scores. Based on this subset, we built two other subsets: best 16 and best 14 (see Table 4.6).

Name of the (Sub)Set   Data (selected per word)
all data               original session
no 0 scores            only those utterances with a valid Viterbi score
best 16                the 16 utterances with the best Viterbi scores
best 14                the 14 utterances with the best Viterbi scores

Table 4.6: Definition of data subsets for the Viterbi experiment.

The results can be seen in Figure 4.14. The recognition rate of the blocks session does indeed deteriorate when the amount of data is reduced. This is probably because block data contains little noise to begin with, so the reduction mainly removes valuable training data. The other modes profit from the 'better' subset, although 14 utterances are obviously too few, since there seems to be an overtraining effect. It is unclear, though, why blocks yields a better result for 'best 14' than for the other reduced subsets. In general, these results support hypothesis A. However, more experiments would be needed to make a definite statement; this was not feasible due to time constraints.

However, when screening the Viterbi scores of the other data, it turned out that there were two data sets which were special:

• Session 02 from subject 21 had 0-scores for all but one recording of the word 'delta' (word order: shortBlocks).

• Session 03 from subject 18 had 0-scores for 16 of the 'alpha' and 13 of the 'bravo' utterances (word order: blocks).

After the recording, subject 21 reported that she was working for an organization named 'delta', so it seems that something was triggered in her brain every time she was asked to think of this word. In any case, there appears to have been some feature in the EEG data that was specific to the word 'delta' (or to the words 'alpha' and 'bravo' in the case of subject 18). This may be seen as support for hypothesis A insofar as it shows that the measured EEG data is indeed related to the words.


Averaged over all recordings, there were 2.62 zero scores per session, so the cases reported above are exceptions. The average values can be found in Table 4.7.

Word Order    Average Number of 0 Viterbi Scores
shortBlocks   4.6
blocks        2.77
sequential    2.53
random        1.55

Table 4.7: Average number of 0 Viterbi scores for the different word orders.

[Figure 4.14: Influence of the data subset used (choice based on Viterbi score) on the recognition rate, shown for sessions 09-01 (sequential), 09-02 (blocks), 09-03 (randomized) and 13-02 (shortBlocks); the underlying numbers are listed in Figure B.8.]

4.5.2 Length of the Utterances

The same data subset (see Table 4.5) was examined more closely with respect to the length of the utterances in frames (after preprocessing). The average numbers of frames can be seen in Figure 4.15. It showed that the standard deviation was tremendous for the recordings of subject 09 and much lower for subject 13; the data is provided in Figure B.9 in the appendix. For the block recording 09-02, the standard deviation of the number of frames was very high (164 on average) given an average number of frames of 317. However, despite these variations, the recognition rate is fairly good (55%). More details for this data set can be found in Table 4.8.

When the rest of the data was screened, it showed that there were both subjects whose utterances were consistently very short (below 100 frames) and subjects whose utterances were consistently very long (more than 300 frames). In general, it can be said that the length of the recording, that is, the number of frames, seems to depend strongly on the subject.


Word        Average # Frames   Standard Deviation   Min   Max
alpha       342.79             168.02                12   555
bravo       323.53             170.82                 9   571
charlie     323.74             168.55                 6   476
delta       280.05             162.91                36   516
echo        314.68             140.12                94   534
all words   316.96             163.78                 6   571

Table 4.8: Details for the number of frames for session 09-02 (blocks) depending on the word that was uttered.
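Statistics of this kind can be reproduced with a few lines; the following sketch assumes a hypothetical mapping from each word to the list of frame counts of its utterances:

```python
# Per-word frame statistics as in Table 4.8 (hypothetical input format).
import statistics

def frame_stats(frames_per_word):
    """frames_per_word: dict word -> list of frame counts (>= 2 entries each)."""
    for word, counts in frames_per_word.items():
        print(f"{word}: avg={statistics.mean(counts):.2f} "
              f"sd={statistics.stdev(counts):.2f} "
              f"min={min(counts)} max={max(counts)}")
```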


[Figure 4.15: Average number of frames per utterance (averaged for each session) and the corresponding standard deviation, per word (alpha, bravo, charlie, delta, echo, all) for sessions 09-01 (sequential), 09-02 (blocks), 09-03 (randomized) and 13-02 (shortBlocks).]

4.6 Impact of Handedness on Recognition Rate

As mentioned before (see section 3.5.1), the handedness of the subjects was assessed using the Edinburgh Inventory [29]. When asked, the main subject of the experiments described in [40] also filled out the questionnaire, yielding a value of +100 (strongly right-handed).

Figure 4.16 shows the influence of handedness on the recognition rate, depending on the word order used. As described before, the Edinburgh index ranges from -100 (strongly left-handed) to +100 (strongly right-handed); the lower the absolute value of the index, the more ambidextrous a person is. There were two sinistrals among the subjects (10, 13), which is a normal percentage given that a total of 23 subjects were recorded. Furthermore, two persons were ambidextrous (15, 18), one of whom (18) yielded a rather high Edinburgh index (55.6). Moreover, one subject (22) reported being right-handed but yielded a rather low Edinburgh index of 68.4. These five subjects are the leftmost subjects in the diagrams.
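For reference, the Edinburgh laterality quotient of [29] is computed from the right- and left-hand preference ticks over the ten inventory items; the tick pattern in the example below is hypothetical, chosen only so that the result lands at about 55.6:

```python
# Edinburgh laterality quotient LQ = 100 * (R - L) / (R + L), following
# Oldfield [29]; R and L sum the right- and left-hand ticks over the 10 items.
def edinburgh_index(right_ticks, left_ticks):
    r, l = sum(right_ticks), sum(left_ticks)
    return 100.0 * (r - l) / (r + l)

# Hypothetical tick pattern yielding an index of about +55.6:
print(edinburgh_index([2, 1, 2, 1, 1, 2, 1, 1, 2, 1],
                      [0, 1, 0, 1, 1, 0, 0, 1, 0, 0]))
```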


As can be seen in diagram (a) of Figure 4.16, handedness seems to have an effect on the recognition rate of recordings with the blocks word order: the five leftmost subjects, who are either left-handed or more or less ambidextrous (as described above), yielded a worse recognition rate than the others.

This is not the case for the randomized word order (see diagram (b) of Figure 4.16): almost all recognition rates are at chance level, independent of handedness. The subjects who were more strongly right-handed yielded slightly, but not significantly, better results.

No clear trend can be detected when it comes to shortBlocks (see diagram (c) of Figure 4.16).

Possible Reasons for the Impact of Handedness

As mentioned before (section 3.2.1), the electrodes were distributed equally over the scalp in the areas of interest, as can be seen in Figure 3.2, except for two electrodes: electrode 1 was used for picking up the blink signal, and electrode 2 was located as far away as possible from the motor strip while still picking up a great part of the signals from Broca's area in the left cortical hemisphere. However, this works only if speech is indeed processed in the left hemisphere. As described in section 2.1.2, this so-called lateralization depends on handedness, among other factors: the more left-handed a person, the more likely it is that the right hemisphere is dominant in speech processing. This may have led to the differences that were found for the recognition rate of block recordings. It does not explain, however, why there is no apparent difference for the other modes examined.

4.7 Comments on the Vocabulary Domain

As pointed out in [9], the vocabulary domain alpha was chosen because the words were assumed to have certain characteristics, among others that they are easily distinguished when spoken and have no familiar semantic meaning. However, it is not guaranteed that words whose audio signals are easily distinguished also produce easily distinguishable brain signals when thought. Furthermore, many subjects reported that they indeed had various and often similar associations when thinking the words. For instance, some associated the word 'bravo' with applause, 'charlie' with Charlie Chaplin, and 'delta' with a student-led consultancy at the University of Karlsruhe. The latter was even visible in the EEG data, since all of the utterances of 'delta' from this subject had an accumulated averaged Viterbi score of 0 (see section 4.5.1).


[Figure 4.16: Influence of handedness on the recognition rate of recordings with the word orders blocks (a), random (b) and shortBlocks (c): recognition rate plotted against the Edinburgh index (-100 to +100).]


5. Analysis of the Results

The main purpose of this thesis was to find out which one of the following hypotheses is true:

Hypothesis A. Unspoken speech can be recognized based on EEG signals.

Hypothesis B. Unspoken speech cannot be recognized based on EEG signals. The good recognition results of [40] were due to temporal patterns that were recognized instead of words.

The first experiment was to vary the break length between the words while keeping the word order fixed. The intention was to eliminate the potential temporal pattern that may have been the reason for the good recognition results for block recordings. So if hypothesis B was right, the recognition results for blocks would deteriorate for long breaks. While the results were not really clear when comparing within subjects, a correspondence could be found when the data of all recordings was taken into account at the end of the study: the recognition rate for block recordings was best with short breaks, followed by random breaks (with a very marginal difference), and much worse when a long break was used. In contrast, long breaks yielded better results for the sequential word order, whereas break length had no effect on the random word order (see Figure 4.2). This supports hypothesis B, but it has to be taken into account that only 3 subjects were recorded with long breaks, so more recordings would be necessary for a definite answer to this question.

Result 1. The recognition rates for block recordings suffer from long breaks, in contrast to other word orders. The longer the break, the more the recognition rates decrease.

It has to be taken into account, though, that the longer breaks could also have led to subjects being bored and thus less concentrated (see hypothesis A1).

Since the initial experiments with breaks did not seem to yield conclusive results, we started a new set of recordings where the break type was fixed and the word order Oi was changed between the recording sessions (Oi ∈ {blocks, random, sequential}).


The purpose was to record each subject with different word orders in order to be able to directly compare the recognition rates with each other. If hypothesis B was right, random and sequential would yield worse results than blocks. This was the case. In concordance with hypothesis B, only blocks yielded results above chance level (45.50% averaged over all recordings), and this is exactly the word order that is vulnerable to temporal artifacts.

Result 2. Only block recordings yield recognition rates significantly above chance level (average over all recordings: 45.50%).
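As a sanity check on "significantly above chance": with 5 equiprobable words, chance level is 20%, and a one-sided binomial test over a session of, say, 100 utterances (an assumed session size of 5 words times 20 repetitions) shows that 45.50% correct is far above it. A sketch, assuming SciPy 1.7 or newer:

```python
# One-sided binomial test against the 20% chance level of a 5-word vocabulary.
from scipy.stats import binomtest

n_utterances = 100   # assumption: 5 words x 20 repetitions per session
k_correct = 46       # roughly the 45.50% average block recognition rate
result = binomtest(k_correct, n_utterances, p=0.2, alternative="greater")
print(result.pvalue)  # far below 0.05, i.e. clearly above chance
```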

However, the feedback from the subjects hinted at an alternative explanation: the recognition results for blocks may be much better than the others because the subjects can concentrate better when the words are presented in blocks. This word order facilitates producing thoughts in a similar way, since the words are repeated like a mantra. So hypothesis A can be amended with the following:

Hypothesis A1. The good recognition rates for block recordings are caused by the fact that this mode facilitates thinking the words in a consistent way.

In order to evaluate this new idea, we experimented with shortBlocks of 5 words, which share some of the properties of blocks. If A1 was right, shortBlocks should yield much better results than random, but worse than blocks. This was more or less the case, as can be seen in Figure 4.4 and Table 4.2. However, the recognition results were very close to chance level (average recognition rate: 22.10%).

Result 3. Although shortBlocks share some important characteristics with blocks, the average recognition rate (22.10%) is much lower than for blocks (45.50%).

The next step was to use reordered blocks in the recordings. Reordered means that the recording still consisted of blocks of 20 words, but the blocks were arranged in a different order. We then calculated two separate confusion matrices, one over all the reordered blocks and one over all the alphabetical blocks. In both cases, we ordered the matrix such that reference 1 was the first reference timewise and reference 5 was the last one. These two matrices showed the same characteristic pattern (compare diagrams (a) and (b) of Figure 4.5). Furthermore, it can be seen that the more distant a block B is from a reference block A timewise, the less likely it is that a word from block B is confused with a word from block A. Since this is true both for blocks (in alphabetical order) and for blocksReordered, it can have nothing to do with neighboring words sharing certain characteristics and therefore being confused more often. These facts are a strong indication that hypothesis B is correct. At least, this result shows that temporal artifacts definitely superimpose our signal of interest for block recordings. This proves the second part of hypothesis B, which says that the recognition results of block recordings were overestimated due to temporal patterns.
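A sketch of how such a temporally ordered confusion matrix could be computed (hypothetical data structures): instead of indexing by word identity, references and hypotheses are re-indexed by the temporal position of their block within the session, so that time-based confusions become visible:

```python
# Confusion matrix indexed by temporal block position (0 = earliest block)
# rather than by word identity. results is a list of (reference, hypothesis)
# word pairs; block_position maps each word to its block index 0..4.
import numpy as np

def temporal_confusion(results, block_position):
    conf = np.zeros((5, 5), dtype=int)
    for ref, hyp in results:
        conf[block_position[ref], block_position[hyp]] += 1
    return conf
```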

Result 4. Temporal artifacts superimpose the signal of interest in blocks recordings.

However, this does not yet answer our fundamental question whether it is possible to recognize unspoken speech based on EEG signals, since it does not necessarily mean that there is no speech-related signal to be identified.


We then turned to cross mode testing to examine an alternative reason why blocks yield better results. This might be caused by the fact that more valuable data is available for training, since the words are thought in a more consistent way (as proposed by hypothesis A1). Thus, the blocks model is better trained and then yields higher recognition rates. So we amended hypothesis A in the following way:

Hypothesis A2. Block recordings lead to more valuable data containing less noise and showing less variance in the length of the utterances.

In order to examine this, we trained an HMM with blocks data and tested it on data from the same subject that was recorded using different word orders. If hypothesis A was right, the recognition rate should improve. Instead, it stayed at chance level (see Figure B.6). Before drawing any conclusions from this result, we ran the same experiment on different sessions from the same subject, the difference to the previous experiment being that these sessions were all recorded with the word order blocks. The recognition result deteriorated significantly down to chance level (see Figure 4.8), as had been the case for different word orders. This can be summarized as follows:

Result 5. Cross session training with the current system does not yield recognition rates above chance level, even if the recording is done with the same word order.
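The cross session experiment itself is structurally simple; a sketch in terms of the hypothetical helpers train_word_models and recognize sketched in section 4.4.2:

```python
# Train on one session, test on another session of the same subject.
def cross_session_rate(train_session, test_session):
    """Both sessions: dict word -> list of (n_frames, n_dims) feature arrays."""
    models = train_word_models(train_session)   # e.g. a blocks session
    hits = sum(recognize(models, utt) == word
               for word, utts in test_session.items()
               for utt in utts)
    total = sum(len(utts) for utts in test_session.values())
    return hits / total                         # ~0.2 corresponds to chance
```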

It can be inferred that building a real-life system with biofeedback would be extremely challenging: for biofeedback, a training session needs to be recorded first such that the system can be trained. If the differences between sessions are that big, however, it is doubtful whether such a system is feasible. On the other hand, it should be taken into account that [35] showed that subject-independent predictions are possible, although a much larger database was used and the subject-independent model was built by averaging over half of the data.

Since cross session training did not work regardless of whether the same or a different mode was used, we cannot infer whether this result contradicts hypothesis A2.

Therefore, the recorded data was examined in more detail. The aim was to check whether there is a correlation between the mode and the length of each utterance (between the two eye blinks). This length can be described by the number of frames per utterance after preprocessing.

If blocks had shown a more regular pattern (similar average number of frames, smaller standard deviation) than the other modes, hypothesis A2 would have been supported. Due to time constraints, only a small subset of the data was examined in detail, for which this assumption could not be confirmed (see Figure 4.15). However, it could be seen that block recordings yield a good recognition rate even though the standard deviation is fairly high. Moreover, screening the rest of the data showed that the variance between subjects is very large (averages from below 100 frames up to more than 300 frames). Some subjects were able to think the words more consistently timewise when the block mode was used, but this was not the case for all subjects. A more detailed analysis would be needed.

Finally, we tried to select the utterances which had yielded high Viterbi scores in order to reduce the noise in the data. The training was then repeated on this smaller subset. If hypothesis A2 was right, blocks should suffer from having less valuable data, whereas the recognition performance of the other modes should improve. This is exactly what happened.

Result 6. Data recorded in blocks contains less noise than data recorded with other word orders.

Besides these experiments, we also assessed the correlation between handedness and recognition performance. It was shown that the subjects who had a lower Edinburgh index than others (meaning that they were more left-handed or ambidextrous) yielded worse recognition rates for the blocks mode. This was not the case for the other modes, though. According to [19], a lower Edinburgh index means a higher probability that language is processed in the right hemisphere, whereas our recording was set up for left-hemispheric language processing (see section 3.2.1). So it can be argued that, since an influence of handedness shows in the blocks data, this data must contain signals which are somehow related to speech processing.

Finally, we ran experiments with varying hidden Markov models in order to make sure that we were using an appropriate model for the words. We varied both the number of HMM states and the number of Gaussians per state. The ideal parameters depended on the word order used, but no clear tendency could be detected (for an overview of the results, refer to Table 4.4). This may be another hint that actually different things were recorded when using different word orders, and may be seen as supporting hypothesis B. Furthermore, it can be seen that HMMs with just one state work amazingly well. Thus, it seems that the data was fairly constant in pattern.

Result 7. The ideal parameters of the HMM depend on the word order used. The relevant information seems to be contained in very few frames.


6. Summary and Future Work

6.1 Summary

The main purpose of this thesis was to find out whether silent speech can indeed be recognized based on EEG signals. While some promising first results had been shown in [40], new questions and doubts were raised by [9], arguing that temporal patterns were recognized instead of words. These two hypotheses, as stated in section 1.3, were refined during the course of this thesis:

Hypothesis A. Silent speech can be recognized based on EEG signals. The fact that other word orders yield worse recognition rates can be explained by two factors:

• A1: Block recordings facilitate thinking the words in a consistent way.

• A2: Block recordings lead to more valuable data containing less noise and showing less variance in the length of the utterances.

Hypothesis B. Unspoken speech cannot be recognized based on EEG signals. The good recognition results of [40] were due to temporal patterns that were recognized instead of words.

In order to shed light on these hypotheses, we recorded a huge set of data from 21 subjects. In the optimal case, each subject was recorded 3 times with varying presentation modes. There were two variables that could be altered: the word order Oi and the length of the break between the words Bj. Thus, a recording R(s) looked as follows (Wi being a word list):

R(s) = (S1, S2, S3)

= ((W1(B1), O1), (W2(B2), O2), (W3(B3), O3))

Oi ∈ {blocks, random, sequential, shortBlocks}
Bj ∈ {short, long, randomized}
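For illustration, the four word orders Oi can be generated as follows (a sketch, not the original presentation software; for shortBlocks it is assumed that a block consists of five consecutive repetitions of the same word):

```python
# Generation of the word orders O_i for the 5-word vocabulary (20 repetitions
# per word). The shortBlocks layout (runs of five identical words) is an
# assumption about the block structure.
import random

WORDS = ["alpha", "bravo", "charlie", "delta", "echo"]

def word_order(mode, reps=20):
    if mode == "blocks":        # 20x alpha, then 20x bravo, ...
        return [w for w in WORDS for _ in range(reps)]
    if mode == "shortBlocks":   # runs of 5 identical words, in 4 rounds
        return [w for _ in range(reps // 5) for w in WORDS for _ in range(5)]
    if mode == "sequential":    # alpha, bravo, ..., echo, cycled 20 times
        return WORDS * reps
    if mode == "random":        # fully shuffled
        seq = WORDS * reps
        random.shuffle(seq)
        return seq
    raise ValueError(f"unknown mode: {mode}")
```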

Experiments were run on different aspects and delivered results that supported either hypothesis. These are analysed in detail in section 5. In the following, the results are summarized.


The main question that we want to answer is whether EEG-based recognition of unspoken speech is feasible. This leads inevitably to the question of why block recordings yield promising results whereas other word orders do not. We could show that the data recorded in blocks contains less noise than that recorded with other word orders (supporting A2). Also, the feedback from the subjects showed that it is easier to think words in a similar way when they were asked to do so in blocks. The same was reported for shortBlocks, which share some important characteristics with blocks and were introduced in the experiments for exactly this reason. However, the average recognition rate for shortBlocks (22.10%) is basically at chance level and thus much lower than for blocks (average over all recordings: 45.50%), which contradicts hypothesis A.

In general, only block recordings yielded recognition rates significantly above chance level in our experiments, whereas all other word orders were just at chance level. It has been hypothesized that temporal patterns were recognized in the block data instead of words. This was supported by two findings. First, the recognition rates for block recordings suffered from long breaks, in contrast to other word orders; the longer the break, the more they deteriorated, which may be a result of the temporal pattern being influenced. Second, the ideal parameters for the HMM depend on the word order used, which could be a hint that different things were recognized. However, the most important result is that it could be shown that temporal artifacts superimpose the signal of interest in block recordings (by comparison with reorderedBlocks).

This does not yet mean that extracting information about the processing of unspoken speech from EEG signals with our system is infeasible in general. However, it does show that the experimental setup used for our experiments is not suitable. Furthermore, our experiments revealed some feasibility issues: cross session training (within subjects) only yields recognition rates at chance level with the current system, even if the recording is done with the same word order. This makes it questionable whether a system for real-life applications including biofeedback could be built.

Overall, it can be said that although our results contain indications for both hypotheses, it seems by far more likely that hypothesis B, put forward by the author of [9], is correct. Thus, we can confirm, to the best of our knowledge, that temporal patterns are recognized when using block recordings, raising the question whether a different experimental setup might be more successful.

6.2 Future Work

Variance in the Length of the Utterance

Although we started evaluating the quality of the recorded EEG data (section 4.5), this could not be done in depth due to time constraints. While we only chose a small subset of the data for our examinations, this could be done for all the data. An important question would be to assess the correlation between the variance of the length of the recordings and the word order in more depth. Furthermore, it would make sense to develop a method for normalizing the data and to examine whether this improves the recognition rate; given the high standard deviations, this would be valuable. One option is sketched below.
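A simple candidate is to resample every utterance to a fixed number of frames by linear interpolation (a sketch under the assumption that utterances are given as frame-by-dimension feature arrays):

```python
# Length normalization by linear resampling of the feature trajectory.
import numpy as np

def normalize_length(features, target_frames=200):
    """features: (n_frames, n_dims) array -> (target_frames, n_dims) array."""
    n_frames, n_dims = features.shape
    old_t = np.linspace(0.0, 1.0, n_frames)
    new_t = np.linspace(0.0, 1.0, target_frames)
    return np.column_stack(
        [np.interp(new_t, old_t, features[:, d]) for d in range(n_dims)])
```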


Modeling of the Words

During the experiments with the parameters of the hidden Markov model, it turned out that HMMs with just one state yielded fairly good results. A one-state HMM, however, does not model temporal structure anymore. This can be seen as a hint that a different word model for the EEG data might be more suitable and may yield better results.

Different Vocabulary Domains

As mentioned before (section 4.7), the words of the vocabulary were not free of semantic meaning for the subjects. So far, no vocabulary domain has been tried where words were intentionally chosen because of their semantic meaning, since the purpose had been to show that speech signals could be recognized, and not other brain activity. However, in the current study it did not seem to make a huge difference whether the subjects had associations with the words or not. So it could be worthwhile to experiment with vocabulary domains which explicitly carry semantic meaning.

Handedness

Since it only emerged during the evaluation of the current study that handedness may play a role in EEG-based speech recognition, no particular attention was paid to finding subjects covering a wide range of Edinburgh indices. Thus, we had only few subjects with a negative index (i.e., few subjects who were left-handed to some degree) and even fewer who were close to being ambidextrous. In order to examine the potential correlation in more depth, a study would be needed with subjects distributed equally across the Edinburgh index.

Feedback

The good results of [40] may also have been influenced by the fact that the main subject took part in the experiment many times in a row, leading to a fairly well-trained subject.

For this reason, it would be interesting to study this training effect by providing the subject with feedback. First, a training session would be needed, with a sufficiently high number of repetitions of each word; during this first session, no feedback would be provided. Afterwards, the system would be trained with this data in order to enable online recognition. In the ensuing feedback session, the subject would interact with the online system, which would present the result of the recognition procedure after each thought word, thus providing the subject with feedback. In this way, it could be tested whether subjects can adapt their brain waves such that they are more easily recognized by the system. It has been shown in [4] that subjects can indeed be trained to modify their brain waves for a brain-computer interface based on EEG.

A major issue for a neurofeedback system is the session dependency of the HMMs. As can be seen in section 4.3, it is almost infeasible to train the system on the data of one session and test it on a different session. However, this is only true for a system without feedback.


A. Documents for Recordings

As described in section 3.5.2, a written instruction for the upcoming experiment was handed out to each subject. This instruction can be seen in Figure A.1. Since all subjects were fluent in German, the instructions were in German as well; an English translation is given in Figure A.1.

The supervisor of the experiment collected statistics concerning both the experiment and the subject. The protocol used for this purpose can be seen in Figure A.2.


Dear participant!

This experiment investigates how the human brain processes language. Since every movement during the recording corrupts the data, you should try to move as little as possible during the experiment. This applies above all during the recording itself, i.e. while the respective word is being thought (phase 3, see below). In the course of the experiment, the following words will be presented and are to be thought: alpha, bravo, charlie, delta, echo. Each of these words is repeated 20 times. In addition, "[…]" is displayed from time to time; in this case nothing should be thought, if possible (at least none of the five words above). The procedure for each word is as follows:

In phase 1 the word is displayed first, then a blue screen appears (phase 2). When the white screen appears in phase 3, please do the following:

1. blink first
2. think the previously displayed word
3. then blink again

The blinking serves the later segmentation of the words and is cut out of the recordings. You should therefore only start thinking the word once the blink is completed. When you think the word, it is important that you imagine pronouncing it; however, you should not move your tongue or your facial muscles. Besides body movements, it is also important to avoid eye movements. As long as the screen is white, the recording is running. As soon as it is stopped, the screen turns black and there is a short pause of a few seconds (phase 4). If the screen turns black too early (e.g. before the second blink) or too late (when the word has long been thought and you may already have moved), please tell the experimenter. The same applies if you made a mistake while thinking the word; the recording of that word can then be deleted and repeated. If you need a longer break in between, please also say so. Before the experiment starts, there is a short practice run so that you can familiarize yourself with the procedure of the following part. In total there are 3 recording blocks, with longer breaks of about 10 minutes between them. In each recording block, the five words are presented to you in different orders. If you have any questions about the experiment, please ask the experimenter. Thank you very much for your help with this experiment!

[Timeline graphic: phase 1 - word displayed, e.g. 'alpha' (2 sec, read); phase 2 - blue screen (2 sec, concentrate); phase 3 - white screen (blink - think the word - blink; recording); phase 4 - pause (2 sec).]

Figure A.1: Instructions as they were handed out to the subjects (English translation of the German original)


[Figure A.2 reproduces the (German) recording protocol form. Its fields, translated: experimenter name, date, time, lighting (on/off); subject name, speaker ID, domain, break type; sex (female/male), date of birth / age, session ID, word order, start, end; handedness (right/left), visual acuity (corrected?); nationality, field of study, alcohol/drugs/medication in the last 24 h (none / yes: how much, what); relation to languages; relation to music; other; experimenter remarks; subject feedback; alertness (excited/awake/sleepy); nervousness (none/some/much); health impairments (none/some/strong).]

Figure A.2: Protocol filled out by the supervisor before the start of the recording


B. Data

In this part of the appendix, the recognition rates are given for all the recordings and experiments that were made. The recognition rates were obtained using our standard 5-state left-to-right HMM with 1 Gaussian per state, if not stated otherwise.

Word Order                   Break Type: short   long            randomized
randomized     03-03 20.00%   03-01 22.50%   03-02 22.00%
               05-05 26.00%   05-06 21.00%   05-04 20.25%
sequential     02-01 19.20%   02-02 13.00%   02-03 14.00%
               06-03 12.00%   06-02 14.40%   06-01 15.20%
               05-08 27.00%   05-07 18.00%   05-09 20.00%
blocks         04-02 31.50%   04-01 37.00%   04-03 34.00%
               05-01 56.00%   05-03 36.25%   05-02 42.00%
               01-01 72.80%

Figure B.1: Recognition rates for varying break types while the word order was fixed during the recordings of one subject

Word Order
randomized     07-02 25.25%   08-03 18.00%
               09-03 21.25%   11-01 19.00%
               10-03 18.00%   12-01 18.00%
sequential     07-03 18.00%   08-01 13.00%
               09-01 18.00%   11-02 18.95%
               10-02 34.00%   12-03 17.00%
blocks         07-01 52.00%   08-02 68.00%
               09-02 55.00%   11-03 52.73%
               10-01 41.00%   12-02* 49.00%

Figure B.2: Recognition rates for varying word orders (blocks, randomized, sequential) with a break type that was fixed during the recordings of one subject


Word Order
randomized        13-03 15.38%   14-01 24.00%
                  16-02 16.00%   18-02 10.00%
                  17-03 18.00%   21-03 17.00%
                  19-01 16.00%   22-01 18.00%
                  23-03 20.00%
shortBlocks       13-02 21.00%   14-02 17.00%
                  15-03 22.00%   18-01 27.00%
                  16-03 31.00%   21-02 18.00%
                  17-01 27.00%   22-03 13.00%
                  19-02 14.00%   23-01 15.00%
                  20-01 30.00%
blocksReordered   13-01 44.00%   14-03 41.00%
                  15-02 33.00%   18-03 35.00%
                  16-01 53.00%   21-01 64.00%
                  17-02 61.75%   22-02 34.29%
                  19-03 20.00%   23-02 30.48%
                  20-02 50.00%

Figure B.3: Recognition rates for varying word orders (blocksReordered, randomized, shortBlocks) with a break type that was fixed during the recordings of one subject


Word Order                   Break Type: short   long            randomized
random         03-03 20.00%   03-01 22.50%   03-02 22.00%
               05-05 26.00%   05-06 21.00%   05-04 20.25%
               07-02 25.25%   08-03 18.00%
               09-03 21.25%   11-01 19.00%
               10-03 18.00%   12-01 18.00%
               13-03 15.38%   14-01 24.00%
               16-02 16.00%   18-02 10.00%
               17-03 18.00%   21-03 17.00%
               19-01 16.00%   22-01 18.00%
               23-03 20.00%
sequential     02-01 19.20%   02-02 13.00%   02-03 14.00%
               06-03 12.00%   06-02 14.40%   06-01 15.20%
               05-08 27.00%   05-07 18.00%   05-09 20.00%
               07-03 18.00%   08-01 13.00%
               09-01 18.00%   11-02 18.95%
               10-02 34.00%   12-03 17.00%
shortBlocks    13-02 21.00%   14-02 17.00%
               15-03 22.00%   18-01 27.00%
               16-03 31.00%   21-02 18.00%
               17-01 27.00%   22-03 13.00%
               19-02 14.00%   23-01 15.00%
               20-01 30.00%
blocks         04-02 31.50%   04-01 37.00%   04-03 34.00%
               05-01 56.00%   05-03 36.25%   05-02 42.00%
               01-01 72.80%
               07-01 52.00%   08-02 68.00%
               09-02 55.00%   11-03 52.73%
               10-01 41.00%   12-02 49.00%
               13-01 44.00%   14-03 41.00%
               15-02 33.00%   18-03 35.00%
               16-01 53.00%   21-01 64.00%
               17-02 61.75%   22-02 34.29%
               19-03 20.00%   23-02 30.48%
               20-02 50.00%

Figure B.4: Overview of the recognition rates of all recordings. In the original color figure, blue marked the recordings where the break type was varied, while yellow and orange marked the variation of word orders (yellow: blocks, randomized, sequential; orange: blocksReordered, randomized, shortBlocks).


Temporal Closeness and Recognition

Blocks Alphabetical
         Hypo. 1   Hypo. 2   Hypo. 3   Hypo. 4   Hypo. 5
Ref. 1   124       49        19        12        7
Ref. 2   59        70        51        17        12
Ref. 3   16        24        94        34        40
Ref. 4   10        26        46        88        41
Ref. 5   8         12        41        49        101

Blocks Reordered
         Hypo. 1   Hypo. 2   Hypo. 3   Hypo. 4   Hypo. 5
Ref. 1   119       57        21        14        10
Ref. 2   42        96        37        19        25
Ref. 3   26        58        70        41        26
Ref. 4   20        26        30        80        66
Ref. 5   2         24        14        49        133

Randomized
         Hypo. 1   Hypo. 2   Hypo. 3   Hypo. 4   Hypo. 5
Ref. 1   51        63        78        81        56
Ref. 2   63        64        62        78        57
Ref. 3   62        72        67        74        52
Ref. 4   58        65        56        77        72
Ref. 5   53        68        72        72        63

Figure B.5: Summed confusion matrices for all sessions with (alphabetical) blocks, reorderedBlocks and randomized word order (the latter only for subjects s ∈ {07, ..., 17}). Hypo. 1 and Ref. 1 were the first hypothesis and reference timewise, while Hypo. 5 and Ref. 5 were the last.


Cross Mode Testing (rows: session on which was tested; columns: session on which was trained)

Subject 04
              blocks (01)   blocks (02)   blocks (03)
blocks (01)   37.00%        22.00%        20.00%
blocks (02)   30.30%        31.50%        19.19%
blocks (03)   20.00%        22.00%        34.00%

Subject 05
              blocks (01)   blocks (02)   blocks (03)
blocks (01)   56.00%        17.00%        18.00%
blocks (02)   20.00%        42.00%        32.00%
blocks (03)   20.20%        22.22%        36.25%

Subject 08
              blocks   randomized   sequential
blocks        68.00%   -            -
randomized    24.00%   18.00%       -
sequential    20.00%   -            13.00%

Subject 09
              blocks   randomized   sequential
blocks        55.00%   -            -
randomized    21.15%   21.25%       -
sequential    22.00%   -            18.00%

Subject 16
              blocks   randomized   shortBlock
blocks        53.00%   -            17.00%
randomized    17.00%   16.00%       19.00%
shortBlock    22.00%   -            31.00%

Subject 17
              blocks   randomized   shortBlock
blocks        61.75%   -            -
randomized    21.00%   18.00%       -
shortBlock    24.00%   -            27.00%

Figure B.6: Data of cross session experiments. The HMM was trained on one session and tested on another. For subjects 08, 09, 16 and 17, the HMM was trained with the block recordings and tested on other word orders. For subjects 04 and 05, only blocks had been recorded (3 sessions), which is why the HMM was always trained and tested on blocks, but still across sessions.


session   1HS.1GMM    3HS.1GMM    5HS.1GMM    10HS.1GMM    15HS.1GMM
08-01     16.00%      17.00%      13.00%      18.00%        9.00%
08-02     72.00%      69.00%      68.00%      66.00%       71.00%
08-03     22.00%      17.00%      18.00%      22.00%       24.00%
09-01     19.00%      20.00%      18.00%      19.00%       19.00%
09-02     55.00%      58.00%      55.00%      46.00%       48.00%
09-03     19.00%      18.00%      21.25%      22.00%       22.00%
16-01     47.00%      52.00%      53.00%      42.00%       44.00%
16-02     27.00%      18.00%      16.00%      25.00%       18.00%
16-03     38.00%      31.00%      31.00%      27.00%       29.00%
18-01     32.00%      34.00%      27.00%      16.00%       27.00%
13-02     23.00%      21.00%      21.00%      27.00%       19.00%
12-03     11.00%      20.00%      17.00%      14.00%       18.00%

session   1HS.4GMM    3HS.4GMM    5HS.4GMM    10HS.4GMM    15HS.4GMM
08-01     18.00%      17.00%      11.25%      17.00%       14.00%
08-02     73.00%      67.00%      62.67%      65.00%       61.00%
08-03     22.00%      22.00%      19.00%      18.00%       17.00%
09-01     23.00%      17.00%      17.00%      24.00%       17.00%
09-02     55.00%      51.00%      49.00%      52.00%       43.00%
09-03     18.00%      19.00%      22.00%      17.00%       18.00%
16-01     51.00%      42.00%      48.00%      45.00%       40.00%
16-02     23.00%      21.00%      15.00%      20.00%       23.00%
16-03     29.00%      25.00%      35.00%      29.00%       29.00%
18-01     26.00%      31.00%      28.00%      27.00%       27.00%
13-02     25.00%      20.00%      26.00%      27.00%       20.00%
12-03     19.00%      12.00%      13.00%      21.00%       15.00%

session   1HS.32GMM   3HS.32GMM   5HS.32GMM   10HS.32GMM   15HS.32GMM
08-01     17.00%      12.00%      11.67%      16.00%       17.00%
08-02     74.00%      67.00%      70.00%      66.00%       70.00%
08-03     20.00%      22.00%      18.00%      19.00%       24.00%
09-01     18.00%      15.00%      17.00%      24.00%       19.00%
09-02     59.00%      57.00%      59.00%      51.00%       52.00%
09-03     13.00%      16.00%      18.00%      13.00%       17.00%
16-01     53.00%      46.00%      51.00%      43.00%       44.00%
16-02     27.00%      20.00%      22.00%      23.00%       14.00%
16-03     36.00%      23.00%      23.00%      30.00%       29.00%
18-01     30.00%      23.00%      22.00%      29.00%       30.00%
13-02     22.00%      23.00%      17.00%      21.00%       21.00%
12-03     19.00%      17.00%      15.00%      11.00%       11.00%

Figure B.7: Recognition rates for different HMMs. HS stands for the number of HMM states, GMM stands for the number of Gaussians per state. In the original figure, the best recognition rate achieved for a given session was marked in red, while bold font marked the best recognition rate achieved for the given GMM.


Session (word order)   all data   no 0 scores   best 16   best 14
09-01 (sequential)     18.00%     19.47%        20.00%    20.00%
09-02 (blocks)         55.00%     49.00%        40.00%    52.86%
09-03 (randomized)     21.25%     -             23.75%    18.57%
13-02 (shortBlocks)    21.00%     27.37%        23.75%    22.86%
21-02 (shortBlocks)    18.00%     31.25%
18-03 (block)          35.00%     53.33%

Figure B.8: Recognition rates for subsets of data (choice based on Viterbi score). 'All data' means that the whole recording was used, 'no 0 scores' means that only utterances with valid Viterbi scores were used, 'best 16' and 'best 14' mean that the best 16 and 14 utterances were chosen, respectively (based on the Viterbi score).

Session (word order)   Recognition Rate   Std Dev (frames)   Average # Frames
09-01 (sequential)     18.00%             155.57             257.49
09-02 (blocks)         55.00%             163.78             316.96
09-03 (randomized)     21.25%             147.65             277.97
13-02 (shortBlocks)    21.00%              46.26             183.83
21-02 (shortBlocks)    18.00%             -                  -
18-03 (block)          35.00%             -                  -

Standard Deviation
          09-01 (sequential)   09-02 (blocks)   09-03 (randomized)   13-02 (shortBlocks)
alpha     143.90               168.02           147.91               41.68
bravo     183.99               170.82           165.16               42.19
charlie   146.25               168.55           124.94               25.92
delta     139.60               162.91           154.28               42.28
echo      143.18               140.12           137.71               63.62
all       155.57               163.78           147.65               46.26

Average Number of Frames
          09-01 (sequential)   09-02 (blocks)   09-03 (randomized)   13-02 (shortBlocks)
alpha     231.37               342.79           275.63               189.00
bravo     285.16               323.53           284.21               174.16
charlie   305.37               323.74           267.95               198.00
delta     240.11               280.05           255.47               166.63
echo      225.47               314.68           306.58               191.37
all       257.49               316.96           277.97               183.83

Figure B.9: Average number of frames and standard deviation for the different words (sessions 09-01, 09-02, 09-03, 13-02)


Bibliography

[1] 10-item Edinburgh Inventory Questionnaire. http://homepage.rub.de/Martin.Dresler/edinburgh.html.

[2] Condor Cluster. http://www.cs.wisc.edu/condor/.

[3] Becker, K.: VarioPort™ Gebrauchsanweisung, 2004.

[4] Birbaumer, N.: The Thought Translation Device (TTD) for Completely Paralyzed Patients. IEEE, 2000.

[5] Birbaumer, N., N. Ghanayim, T. Hinterberger, I. Iversen, B. Kotchoubey, A. Kübler, J. Perelmouter, E. Taub and H. Flor: A spelling device for the paralysed. Nature, 398:297–298, 1999.

[6] Blankertz, B.: Universal Access in HCI, Part II, volume 4555 of LNCS, chapter A note on brain actuated spelling with the Berlin Brain-Computer Interface, pages 759–768. Springer, Berlin Heidelberg, 2007.

[7] Blankertz, B., G. Dornhege, M. Krauledat, K.-R. Mueller, V. Kunzmann, F. Losch and G. Curio: The Berlin Brain-Computer Interface: EEG-based communication without subject training. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2):147–152, 2006.

[8] Brennan, N.M. Petrovich, S. Whalen, D. de Morales Branco, J.P. O'Shea, I.H. Norton and A.J. Golby: Object naming is a more sensitive measure of speech localization than number counting: Converging evidence from direct cortical stimulation and fMRI. Neuroimage, 37 Suppl 1:100–108, 2007.

[9] Calliess, J. P.: Further Investigations on Unspoken Speech. Interactive Systems Laboratories, Carnegie Mellon University, Pittsburgh, PA, USA and Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany, 2006.

[10] Craig, D.A. and H.T. Nguyen: Adaptive EEG Thought Pattern Classifier for Advanced Wheelchair Control. In 29th Annual International Conference of the IEEE, pages 2544–2547. Engineering in Medicine and Biology Society, August 2007.

[11] Dornhege, G., J. del R. Millan, T. Hinterberger, D. McFarland and K.-R. Mueller (editors): Towards Brain-Computer Interfacing. MIT Press, 2007.


[12] Harasty, J., K. Double, G.M. Halliday, J.J. Kril and D.A. McRitchie: Language-associated cortical regions are proportionally larger in the female brain. Archives of Neurology, 54.

[13] Hickok, G.: The Neuroscience of Language. Lecture notes.

[14] Hill, H., F. Ott, C. Herbert and M. Weisbrod: Response Execution in Lexical Decision Tasks Obscures Sex-specific Lateralization Effects in Language Processing: Evidence from Event-related Potential Measures during Word Reading. Cerebral Cortex, 16:978–989, 2006.

[15] Honal, M.: Determining User State and Mental Task Demand From Electroencephalographic Data. Master's thesis, 2005.

[16] Honal, M. and T. Schultz: Identifying User State using Electroencephalographic Data. In Proceedings of the International Conference on Multimodal Interfaces (ICMI), Trento, Italy, October 2005.

[17] Ikezawa, S., K. Nakagome, M. Mimura, J. Shinoda, K. Itoh, I. Homma and K. Kamijima: Gender differences in lateralization of mismatch negativity in dichotic listening tasks. Int J Psychophysiol., 68(1):41–50, 2008.

[18] Jasper, H. H.: The Ten-Twenty Electrode System of the International Federation. Electroencephalography and Clinical Neurophysiology (EEG Journal), (10):371–375, 1958.

[19] Knecht, S., B. Draeger, M. Deppe, L. Bobe, H. Lohmann, A. Floeel, E.-B. Ringelstein and H. Henningsen: Handedness and Hemispheric Language Dominance in Healthy Humans. Brain, 123:2512–2518, 2000.

[20] Lavie, A., A. Waibel, L. Levin, M. Finke, D. Gates, M. Gavalda, T. Zeppenfeld and P. Zhan: JANUS-III: Speech-to-Speech Translation in Multiple Languages. In ICASSP '97: Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), Volume 1, page 99, Washington, DC, USA, 1997. IEEE Computer Society.

[21] Lotte, F., M. Congedo, A. Lecuyer, F. Lamarche and B. Arnaldi: A review of classification algorithms for EEG-based brain-computer interfaces. Journal of Neural Engineering, 4:R1–R13, 2007.

[22] Maier-Hein, L.: Speech Recognition Using Surface Electromyography. Master's thesis, Universität Karlsruhe (TH), Germany, 2005.

[23] Maier-Hein, L., F. Metze, T. Schultz and A. Waibel: Session Independent Non-Audible Speech Recognition Using Surface Electromyography. In Proc. ASRU, 2005.

[24] Malina, T., A. Folkers and U. G. Hofmann: Real-time EEG processing based on Wavelet Transformation. In 12th Nordic Baltic Conference on Biomedical Engineering and Medical Physics, Reykjavik, June 2002.


[25] Neuper, C., G. R. Mueller, A. Kuebler, N. Birbaumer and G. Pfurtscheller: Clinical application of an EEG-based brain-computer interface: a case study in a patient with severe motor impairment. Clinical Neurophysiology, 114:399–409, 2003.

[26] Nijholt, A., D. Tan, G. Pfurtscheller, C. Brunner, J.D. Millan, B. Allison, B. Graimann, F. Popescu, B. Blankertz and K.-R. Mueller: Brain-computer interfacing for intelligent systems. IEEE Intelligent Systems, 23:72–79, 2008.

[27] Novak, D., D. Cuesta-Frau, T. Al ani, M. Aboy, R. Mico and L. Lhotska: Speech recognition methods applied to biomedical signals processing. In 26th Annual International Conference of the IEEE, volume 1, pages 118–121. Engineering in Medicine and Biology Society, 2004.

[28] Nunez, P. L.: Electric Fields of the Brain: the Neurophysics of EEG. Oxford University Press, 1981.

[29] Oldfield, R.C.: The assessment and analysis of handedness: the Edinburgh Inventory. Neuropsychologia, 9(1):97–113, 1971.

[30] Petersen, S.E., P.T. Fox, M.I. Posner, M. Mintun and M.E. Raichle: Positron emission tomographic studies of the cortical anatomy of single-word processing. Nature, 331:585–589, 1988.

[31] Ramachandran, V. S.: Encyclopedia of the Human Brain. Academic Press, 2002.

[32] Scherer, R., G.R. Mueller, C. Neuper, B. Graimann and G. Pfurtscheller: An asynchronously controlled EEG-based virtual keyboard: Improvement of the spelling rate. IEEE Transactions on Biomedical Engineering, 51(6):979–984, 2004.

[33] Schmidt, R.F. and G. Thews (editors): Physiologie des Menschen. Springer, 1997.

[34] Suppes, P., Z.-L. Lu, J. Epelboim and B. Han: Invariance between subjects of brain wave representations of language. Proc. Natl. Acad. Sci. USA, 96:12953–12958, October 1999.

[35] Suppes, P., Z.-L. Lu and B. Han: Brain Wave Recognition of Words. Proc. Natl. Acad. Sci. USA, 94:14965–14969, December 1997.

[36] Toga, A.W. and J.C. Mazziotta: Brain Mapping, 2nd ed. Academic Press, 2002.

[37] Trimmel, M.: Angewandte und Experimentelle Neuropsychophysiologie. Springer-Verlag, 1990.

[38] Waibel, A., M. Bett, F. Metze, K. Ries, T. Schaaf, T. Schultz, H. Soltau, Y. Hua and K. Zechner: Advances in automatic meeting record creation and access. In ICASSP '01: Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), volume 1, pages 597–600, 2001.


[39] Wand, M.: Wavelet-based Preprocessing of Electroencephalographic and Electromyographic Signals for Speech Recognition. Studienarbeit, Lehrstuhl Prof. Waibel, Interactive Systems Laboratories, Carnegie Mellon University, Pittsburgh, PA, USA and Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany, June 2007.

[40] Wester, M.: Unspoken Speech - Speech Recognition Based On Electroencephalography. Master's thesis, Lehrstuhl Prof. Waibel, Interactive Systems Laboratories, Carnegie Mellon University, Pittsburgh, PA, USA and Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany, 2006.

[41] Williams, S. M.: Handedness Inventories: Edinburgh Versus Annett. Neuropsychology, 5(1):43–48, 1991.

[42] Wolpaw, J. R., N. Birbaumer, D.J. McFarland, G. Pfurtscheller and T.M. Vaughan: Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113(6):767–791, 2002.

[43] Wolpaw, J. R., D. J. McFarland, T. M. Vaughan and G. Schalk: The Wadsworth Center brain-computer interface (BCI) research and development program. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 11(2):207–207, 2003.