Adam Mickiewicz University in Poznań

Faculty of Modern Languages and Literature

Institute of Linguistics

Communicative Alignment of Synthetic Speech

Jolanta Bachan

Doctoral dissertation

Supervisor:

prof. UAM dr hab. inż. Grażyna Demenko

Poznań 2011


Communicative Alignment of Synthetic Speech – Jolanta Bachan

Acknowledgements

I wholeheartedly thank Professor Grażyna Demenko for her help, support and supervision of my work over the years of cooperation. I would like to thank Professor Piotra Łobacz for her continuing support, belief and trust in me. I would like to extend special thanks to Professor Dafydd Gibbon for his invaluable help and discussions about my work. Further thanks to Professor Maciej Karpiński for providing his dialogue corpus and teaching me about human-computer interaction and dialogue analysis in various classes. Thanks to Professor Władysław Zabrocki for always being helpful in solving problems connected with the academic procedures for the thesis.

I would also like to thank my colleagues and all the students, friends and relatives who willingly took part in my experiments. Without their input, my work would not have been possible.

Further thanks to Bielefeld University, the Kulczyk Family Foundation, the Scholarship Foundation of Professor Władysław Kuraszkiewicz and the International Speech Communication Association for awarding me scholarships, which helped me to focus on my academic work and develop my scientific interests.

And finally, immeasurable thanks to my parents and my brother for their unfailing support and love.

The research presented in this thesis was partly carried out within the scope of the supervisor's research grant (grant promotorski) no. N N104 11 98 38 awarded by the Minister of Science and Higher Education.


Table of Contents
Acknowledgements ............................................................................................................... i

Index of Tables .................................................................................................................. viii

Index of Figures .................................................................................................................. xi

Chapter 1:   Introduction ...................................................................................................... 1

1.1 Objectives of the thesis .............................................................................................. 1

1.2 Motivation of the thesis ............................................................................................. 2

1.3 Alignment and accommodation ................................................................................. 4

1.4 Modelling dialogue .................................................................................................... 6

1.5 Contributions of the present research ........................................................................ 8

1.6 Overview ................................................................................................................... 9

Chapter 2:   Alignment – critical overview ........................................................................ 10

2.1 Chapter overview ..................................................................................................... 10

2.2 Basic alignment models ........................................................................................... 10

2.2.1 Alignment as a social phenomenon  ................................................................  11

2.2.2 Alignment as Audience Design  ......................................................................  12

2.2.3 Alignment as Priming  ....................................................................................  12

2.2.4 Alignment as Inter-level Interaction  ..............................................................  13

2.2.5 Alignment in human-computer interaction  ....................................................  14

2.2.6 Alignment, coordination, and situation models  .............................................  15

2.2.7 Levels of alignment  ........................................................................................  17

2.2.8 Error trapping with misalignment  ..................................................................  19

2.3 Communicative signs: function and processing ...................................................... 19

2.3.1 Levelt & Schriefers’s ‘sign pie’  .....................................................................  19

2.3.2 The revised Interactive Alignment model of dialogue processing  .................  23

2.4 Speech acts and dialogue acts .................................................................................. 25

2.5 Summary .................................................................................................................. 28

Chapter 3:   Dialogue modelling ........................................................................................ 29

3.1 Dialogue systems ..................................................................................................... 33

3.2 Dialogue system components .................................................................................. 34

3.3 Spoken dialogue systems ......................................................................................... 35


3.4 Human-computer interaction ................................................................................... 35

3.5 Summary .................................................................................................................. 36

Chapter 4:   Corpus linguistic study of dialogue interaction .............................................. 37

4.1 Chapter overview ..................................................................................................... 37

4.2 Aim of the corpus linguistic study ........................................................................... 37

4.3 Speech material - PoInt corpus ................................................................................ 38

4.4 Annotation ............................................................................................................... 39

4.4.1 Annotation procedure  .....................................................................................  39

4.4.2 Dialogue act annotation  .................................................................................  40

4.4.3 Phonemic annotation  ......................................................................................  42

4.4.4 Processing of annotations for dialogue analysis  ............................................  42

4.4.5 Notes on material preparation  ........................................................................  43

4.5 Time structure of the dialogue ................................................................................. 44

4.6 Most frequent dialogue act sequences ..................................................................... 45

4.6.1 Dialogue initiation  .........................................................................................  45

4.6.2 Dialogue termination  .....................................................................................  45

4.6.3 Turns  ..............................................................................................................  45

4.7 Frequency of dialogue acts ...................................................................................... 46

4.7.1 Dialogue flow  .................................................................................................  56

4.7.2 Overlapping speech  ........................................................................................  56

4.7.3 Non-overlapping speech  ................................................................................  60

4.8 Conclusions ............................................................................................................. 66

Chapter 5:   Modelling dialogue sequences with finite automata ...................................... 67

5.1 Chapter overview ..................................................................................................... 67

5.2 Automaton models ................................................................................................... 67

5.3 First steps in realistic automaton creation ............................................................... 68

5.4 Generalisations over finite regular languages .......................................................... 70

5.4.1 Prefix generalisations  .....................................................................................  70

5.4.2 Suffix generalisations  .....................................................................................  74

5.5 Generalisations over non-finite regular languages .................................................. 76

5.5.1 Local generalisations  .....................................................................................  76

5.5.2 Non-local generalisations  ...............................................................................  76


5.6 Turn automata .......................................................................................................... 77

5.7 Evaluation of dialogue act automata ........................................................................ 80

5.7.1 General evaluation criteria  .............................................................................  80

5.7.2 NDFST interpreter online tool  .......................................................................  80

5.7.3 Evaluation results  ...........................................................................................  82

5.8 Loop-free automata evaluation ................................................................................ 83

5.9 Iterative automata .................................................................................................... 85

5.10 Further issues: dialogue flow and alignment ......................................................... 86

5.10.1 Generalised turn automaton at time line  ......................................................  89

5.11 Summary ................................................................................................................ 93

Chapter 6:   Speech synthesis module ................................................................................ 94

6.1 Chapter overview ..................................................................................................... 94

6.2 The role of speech synthesis .................................................................................... 94

6.3 Synthesis experiment with corpus linguistic analysis ............................................. 96

6.3.1 MBROLA micro-voice creation  ....................................................................  96

6.4 Automatic Close Copy Speech synthesis ................................................................. 97

6.5 MBROLA full voice creation .................................................................................. 99

6.5.1 MBROLA data flow architecture  ...................................................................  99

6.5.2 Corpus specification  .......................................................................................  99

6.5.3 Text corpus creation  .....................................................................................  101

6.6 The Mbrolator software ......................................................................................... 103

6.7 The phone and diphone sets ................................................................................... 103

6.7.1 Phoneme set  .................................................................................................  103

6.7.2 Diphone set  ..................................................................................................  105

6.7.3 Search for diphones  ......................................................................................  105

6.7.4 Annotation of the original synthesis corpus  .................................................  107

6.7.5 Annotation file format  ..................................................................................  107

6.7.6 Search procedure in available diphone database  ..........................................  110

6.7.7 Diphone search in synthesis text and online  ...............................................  111

6.8 Phonetically rich sentence extractor ...................................................................... 112

6.8.1 Diphone set creation  .....................................................................................  112

6.8.2 Available text resources  ................................................................................  113


6.9 Software ................................................................................................................. 113

6.9.1 Sentence extraction procedure  .....................................................................  113

6.9.2 Results of sentence extraction  ......................................................................  113

6.9.3 Automatic diphone extraction system architecture  ......................................  114

6.9.4 Automatic diphone extraction system design  ...............................................  115

6.9.5 Automatic diphone extraction system implementation  ................................  117

6.9.6 BLF to TextGrid conversion  ........................................................................  117

6.9.7 PE-SAMPA TextGrid to SAMPA TextGrid conversion  ...............................  118

6.9.8 Find all diphones in TextGrid files  ..............................................................  121

6.9.9 Diphone extraction  .......................................................................................  122

6.9.10 Evaluation of the automatically extracted diphones  ..................................  124

6.9.11 Generate TextGrids for diphones  ...............................................................  124

6.9.12 Concatenate diphones  ................................................................................  125

6.9.13 PL2 synthetic Polish male voice evaluation  ..............................................  127

6.10 Summary .............................................................................................................. 131

Chapter 7:   Dialogue corpus for demonstration prototype .............................................. 132

7.1 Chapter overview ................................................................................................... 132

7.2 Corpus design ........................................................................................................ 132

7.2.1 Prompt speech material and the recording scenarios  ...................................  133

7.2.2 Subjects  ........................................................................................................  134

7.2.3 Recordings  ...................................................................................................  135

7.3 Implementation ...................................................................................................... 137

7.3.1 Creation of maps  ..........................................................................................  137

7.3.2 Creation of diapixes  .....................................................................................  138

7.3.3 Reading task  .................................................................................................  140

7.3.4 Instruction to the subjects  ............................................................................  141

7.3.5 Recording scenario  .......................................................................................  142

7.4 Corpus creation ...................................................................................................... 144

7.5 Corpus annotation .................................................................................................. 148

7.5.1 General analysis of the corpus  .....................................................................  153

7.5.2 Analysis of the selected dialogue  .................................................................  154

7.5.3 Duration analysis: the nPVI index  ...............................................................  156


7.6 Prototype dialogue synthesis ................................................................................. 158

7.6.1 Diphone extraction for prototype MBROLA micro-voices  .........................  158

7.6.2 ACCS synthesis of the dialogue  ...................................................................  159

7.6.3 ACCS synthesis of the filled pauses “yyy”  ..................................................  160

7.7 Finite State Transducer model of the map ............................................................. 162

7.8 Summary ................................................................................................................ 171

Chapter 8:   Demonstration dialogue system .................................................................... 172

8.1 Overview ............................................................................................................... 172

8.2 Requirement specifications .................................................................................... 172

8.3 Design .................................................................................................................... 174

8.3.1 The street map and data elicitation  ..............................................................  174

8.4 Implementation ...................................................................................................... 178

8.4.1 Implemented utterances  ...............................................................................  183

8.5 Evaluation .............................................................................................................. 185

8.6 Results ................................................................................................................... 188

8.7 Summary ................................................................................................................ 193

Chapter 9:   Summary and conclusions ............................................................................ 194

Bibliography ................................................................................................................... 197

Software ............................................................................................................................ 205

Appendix A Dialogue act matrix ...................................................................................... 206

Appendix B Loop-free automata for speaker 1 ................................................................ 208

Appendix C Reduction of multi-layered labels ................................................................ 220

Appendix C.1 Speaker 1 .............................................................................................. 220

Appendix C.2 Speaker 2 .............................................................................................. 221

Appendix D Generalisation tables .................................................................................... 223

Appendix D.1 Prefix generalisation table for speaker 1 .............................................. 223

Appendix D.2 Prefix generalisation table for speaker 2 .............................................. 224

Appendix D.3 Suffix generalisation table for speaker 1 .............................................. 225

Appendix D.4 Suffix generalisation table for speaker 2 .............................................. 226

Appendix E Semi-coupled automata for speaker 1 and speaker 2 ................................... 228

Appendix F Loop-free automata ...................................................................................... 230

Appendix F.1 Loop-free automata for speaker 1 ......................................................... 230


Appendix F.2 Loop-free automata for speaker 2 ......................................................... 231

Appendix G Iterative automata ........................................................................................ 233

Appendix G.1 Iterative automata for speaker 1 ........................................................... 233

Appendix G.2 Iterative automata for speaker 2 ........................................................... 234

Appendix G.3 Generalised automata for speaker 1 ..................................................... 237

Appendix H Automata evaluation .................................................................................... 239

Appendix H.1 Generalised automata ........................................................................... 239

Appendix H.2 Semi-coupled automata ........................................................................ 241

Appendix I Phonetically rich sentence extractor .............................................................. 244

Appendix J Automatic diphone extractor – scripts .......................................................... 251

Appendix J.1 BLF2TextGrid converter ....................................................................... 251

Appendix J.2 extendedPL2PL1 TextGrid converter .................................................... 255

Appendix J.3 Find diphones ........................................................................................ 260

Appendix J.4 Cut out individual diphones .................................................................. 264

Appendix J.5 Generate TextGrids for diphones ........................................................... 268

Appendix J.6 Concatenate diphones ............................................................................ 271

Appendix K Text material used for the Polish MBROLA voice creation ........................ 276

Appendix K.1 Phonetically rich sentences .................................................................. 276

Appendix K.2 Word list ............................................................................................... 286

Appendix L Perception test sentences .............................................................................. 288

Appendix L.1 Test 1 ..................................................................................................... 288

Appendix L.2 Test 2 ..................................................................................................... 289

Appendix M Map task: emergency scenario .................................................................... 290

Appendix M.1 The map for the leading person ........................................................... 290

Appendix M.2 The map for the following person ....................................................... 291

Appendix N Map task: neutral scenario  .......................................................................... 292

Appendix N.1 The map for the leading person ............................................................ 292

Appendix N.2 The map for the following person ........................................................ 293

Appendix O Draw waveform, pitch and annotation for stereo sounds – Praat script ........ 294

Appendix P Demonstration dialogue system script .......................................................... 296


Index of Tables
Table 1: Dialogue excerpt with lexical alignment ................................................................ 5

Table 2: Processing modules in speech generation and their relation to phases of lexical 

access (Levelt & Schriefers 1987: 398) ...................................................................... 22

Table 3: Abbreviation of dialogue act functions ................................................................ 41

Table 4: Basic statistics of the studied material; N – number of sequences, n ≤ 2 – number of sequences with the length of one or two dialogue acts ................. 45

Table 5: Dialogue act length ............................................................................................... 47

Table 6: Frequency of different dialogue acts in the whole dialogue for both speakers .... 48

Table 7: Number of dialogue acts at the beginning (S) and end (E) of dialogue act sequences in a turn, and single turns (M) built by one utterance; o – open meeting, s – social communication management ............................................................................ 49

Table 8: Number of different dialogue acts at the beginning of a sequence in a turn ........ 50

Table 9: Dialogue acts at the beginning of a turn for speaker 1 and speaker 2 .................. 51

Table 10: Number of different dialogue acts at a single-utterance turn, with time measurements; Dur – duration, Avg – average length ................................................ 51

Table 11: Dialogue acts of single-utterance turns for speaker 1 and speaker 2 .................. 53

Table 12: Number of different dialogue acts at the end of a sequence in a turn ................ 54

Table 13: Dialogue acts at the end of a turn for speaker 1 and speaker 2 .......................... 56

Table 14: Overlapping dialogue acts: spk 2 starts talking before spk 1 has finished ......... 58

Table 15: Overlapping dialogue acts: spk 1 starts talking before spk 2 has finished ......... 59

Table 16: Non-overlapping dialogue acts: spk 2 starts talking after spk 1 has finished .... 61

Table 17: Non-overlapping dialogue acts: spk 1 starts talking after spk 2 has finished .... 62

Table 18: Normalised difference of speakers’ speech at different categories. ID – ID of the 

dialogue chunk (position in dialogue), Dur – speech duration ................................... 64

Table 19: Difference between the main categories ............................................................. 64

Table 20: Excerpt of table with loop-free automata for each sequence of dialogue acts for 

speaker 2  .................................................................................................................... 69

Table 21: Examples of reduction of multi-layered labels to one-layered labels for speaker 2 

sorted alphabetically. ID – ID of the automaton ......................................................... 70

Table 22: A fragment of the prefix generalisation table for speaker 2 ............................... 71


Table 23: Initial dialogue acts in sequences for each of the speakers ................................ 73

Table 24: Most frequent two dialogue acts at the beginning of a part for speaker 1 and 

speaker 2.  ................................................................................................................... 73

Table 25: Loop-free automata combining sequences with the same prefix. ...................... 74

Table 26: Suffix generalisation table for speaker 2. M – match ......................................... 75

Table 27: Loop-free automata and their counterparts with loops for speaker 2. ................ 77

Table 28: Fragment of evaluation table of loop-free automata for speaker 1 .................... 84

Table 29: An evaluation table of iterative automaton for speaker 1. .................................. 85

Table 30: Extended SAMPA phoneme labels used for annotation (Demenko et al. 2003) ........................................................................................................................... 100

Table 31: Polish SAMPA transcription used in the PL1 Polish female MBROLA voice 

(Szklanny & Marasek 2002) ..................................................................................... 104

Table 32: Mismatches between BLF and PL1 SAMPA ................................................... 104

Table 33: Fragment of BLF file input resource. ............................................................... 108

Table 34: The format of an interval in TextGrid file ........................................................ 118

Table 35: The mapping table of PE-SAMPA set onto SAMPA set ................................... 119

Table 36: The phones [c] and [J] from the BLF SAMPA annotation convention and their 

equivalents in the PL1 diphone database. ................................................................. 120

Table 37: Different transcriptions of the word “kiedy” .................................................... 120

Table 38: The DIPH file format with three example lines from a DIPH file .................... 121

Table 39: Diphone label normalisation table .................................................................... 122

Table 40: The SEG file format with three example lines from the SEG file ................... 123

Table  41:  Results  for  Test  1  –  average  correctly  recognised  words  in  predictable  and 

unpredictable sentences. N – number of words ........................................................ 130

Table 42: Test results for Test 2. MOS/5 – Mean Opinion Score out of 5, STDV – standard 

deviation, Max:Min scores given by subjects ........................................................... 131

Table 43: Pros and cons of using either a telephone or a Skype call for communication between interlocutors ................................................................................................ 137

Table 44: Difference between diapixes from the emergency scenario ............................. 138

Table 45: Difference between diapixes from the neutral scenario ................................... 139

Table 46: Data of the corpus recording. Age diff – age difference between the interlocutors, calculated as B’s age minus A’s age ................................................................ 147


Table 47: Dialogue act frequencies and their statistics used for dialogue annotation. N is the number of dialogue acts (DA) ............................................................................. 151

Table 48: Dialogue statistics of emergency dialogue (pair ID: 12). Total dialogue duration 156.49 sec ................................................................................................................. 154

Table 49: Special events frequencies ................................................................................ 156

Table 50: Min, Max and Mean (M) pitch values (F0) for Speaker A and Speaker B across 

the five recording tasks. ............................................................................................ 156

Table 51: nPVI for duration of phones, syllables and pitch values of filled pauses (“yyy”). 

N is number of items ................................................................................................. 158

Table 52: Diphone manual selection process ................................................................... 159

Table 53: Utterance exchange in the emergency map task dialogue ................................ 164

Table 54: Transitions of FSA designed for the dialogue system. ..................................... 176

Table 55: Informal and formal utterances and their English translations available to the 

dialogue system ......................................................................................................... 184

Table 56: General data of people who participated in the dialogue system evaluation . . . 186

Table  57:  Questionnaire  of  assessment  of  7  areas  of  the  dialogue  system  and  their 

correspondence to the dialogue system domains ...................................................... 187

Table 58: Dialogue reconstruction based on one log file entry for informal speech style  

................................................................................................................................. ..188

Table 59: Basic statistics of functional testing of the dialogue system ............................ 189

Table 60: Results of the judgement testing of the dialogue system in 7 categories. Numbers 

in brackets stand for average assessment across the 7 categories and 2 scenarios for 

females (F), males (M) and overall (All) .................................................................. 191

Table 61: Explenation of abbrieviations of dialogue act types. ........................................ 239


Index of Figures
Figure 1: Simplified architecture of a spoken dialogue system

Figure 2: The Saussurean sign model

Figure 3: Levelt & Schriefers's ‘sign pie’ (1987: 396)

Figure 4: Levelt & Schriefers's image of the activation of a linguistic sign in speech production (Levelt & Schriefers 1987: 396)

Figure 5: An outline of lexical access in speech production (Levelt 1992: 4)

Figure 6: Schematic representation of the stages of comprehension and production processes according to the interactive alignment model (Pickering & Garrod 2004: 176)

Figure 7: Schematic representation of the stages of comprehension and production processes according to the autonomous transmission account (Pickering & Garrod 2004: 177)

Figure 8: A model of human-computer interaction (Schomaker et al. 1995, from Gibbon, Mertins & Moore 2000)

Figure 9: The Praat window displaying the stereo speech signal of the dialogue with its annotation tiers

Figure 10: Temporal sequences and overlaps in a dialogue

Figure 11: Percentage representation of frequency of dialogue acts at the initial position in a turn

Figure 12: Number of different dialogue acts at the beginning of a sequence in a turn

Figure 13: Percentage representation of frequency of dialogue acts in single-utterance turns

Figure 14: Number of different dialogue acts at a single-utterance turn

Figure 15: Percentage representation of frequency of dialogue acts at the final position in a turn

Figure 16: Number of different dialogue acts at the end of a sequence in a turn

Figure 17: Difference between the two most numerous dialogue categories

Figure 18: A basic dialogue model implemented as FSA

Figure 19: Combined automata 2_back without loops created by suffix generalisation

Figure 20: Combined automata 1_back with loops created by suffix generalisation


Figure 21: A semi-coupled automaton 1 for spk1 and spk2

Figure 22: A generalised automaton of dialogue acts for speaker 1, the follower of the instructions in the map task

Figure 23: A generalised automaton of dialogue acts for speaker 2, the instruction giver in the map task

Figure 24: Automaton of typical dialogue flow

Figure 25: An automaton generating the direction description dialogue type

Figure 26: An automaton generating the misunderstanding dialogue type

Figure 27: Generalised turn automaton

Figure 28: Generalised turn automaton for spk 1 with dialogue act occurrence probability

Figure 29: Generalised turn automaton for spk 2 with dialogue act occurrence probability

Figure 30: Linear representation of generalised turn automata for spk 1 and spk 2

Figure 31: Visualisation of overlapping speech being produced by generalised turn automata for spk 1 and spk 2

Figure 32: Integrated generalised linear 4-stage turn automata for two speakers

Figure 33: Integrated generalised "overlapping" 4-stage turn automata for two speakers

Figure 34: Mbrolation, the MBROLA micro-voice creation procedure

Figure 35: Comparison of original recording with microvoice and PL1 female voice

Figure 36: Data flow chart for MBROLA voice creation and runtime synthesis

Figure 37: Phonetically rich sentence extraction procedure

Figure 38: Architecture of the automatic diphone extraction system

Figure 39: Design of the automatic diphone extraction software. PE-SAMPA – the Polish extended SAMPA

Figure 40: Conversion flow of text files in the automatic diphone extraction system

Figure 41: Diphone WAV file with automatically generated annotation

Figure 42: Diphone files ordering according to the diphone ID

Figure 43: Diapixes from the emergency scenario

Figure 44: Diapixes for the neutral scenario (adapted from Bradlow et al. 2007)

Figure 45: Recording setting of the dialogue corpus

Figure 46: MX Skype Recorder window


Figure 47: TimeLeft timer used for the recording of the emergency scenarios

Figure 48: A person in the emergency setting at the corpus recording

Figure 49: Annotation of dialogues on speech and special tiers for each speaker

Figure 50: Annotation of dialogues on several tiers for Speaker A (channel 2, bottom) and Speaker B (channel 1, top)

Figure 51: Dialogue acts frequency

Figure 52: Speaker A's and Speaker B's waveforms, pitch contours and annotation tiers of a synthesised dialogue excerpt from 17.5 to 21.5 seconds

Figure 53: Examples of the ACCS synthesised filled pauses for Speaker A (top) and Speaker B (bottom)

Figure 54: (A) Emergency map with all junctions marked for selection for the FST nodes; (B) Emergency dialogue automaton with the nodes representing the reachable junctions selected

Figure 55: Map FST with utterance exchanges IDs

Figure 56: Emergency map presented to the human user for the communication scenario with the dialogue system

Figure 57: Map task dialogue as a basis for map traversal automaton

Figure 58: Dialogue system architecture

Figure 59: Dialogue manager automaton with dialogue acts

Figure 60: Dialogue manager automaton with exemplar utterances

Figure 61: Visualisation of the implementation of the dialogue system main algorithm

Figure 62: Dialogue system evaluation setting

Figure 63: Semi-coupled automaton 2

Figure 64: Semi-coupled automaton 3

Figure 65: Semi-coupled automaton 4

Figure 66: Generalised automaton 1 for speaker 1

Figure 67: Generalised automaton 2 for speaker 1

Figure 68: Generalised automaton 3 for speaker 1

Figure 69: Generalised automaton 4 for speaker 1

Figure 70: Generalised automaton 5 for speaker 1


Chapter 1: Introduction

1.1 Objectives of the thesis
The central  claim of  the  thesis  is  that  a  dialogue system should  be well-motivated by 

dialogue theory and by analysis of actual dialogues, and that the resulting system should be 

tested  in  a  real-world  scenario.  Based  on  this  claim,  the  thesis  concentrates  on 

methodology  and  investigates  a  wide  range  of  methods  required  for  fulfilling  these 

requirements  adequately.  The  operational  aim  is  to  provide  a  simple  proof-of-concept 

dialogue system based on the claim and combining written and spoken communication. 

The operational aim is therefore not to develop a fully functional dialogue system, but a 

prototype which illustrates the main claim and the methodology of the thesis in a simulated 

stressful emergency scenario.

The alignment theories discussed by Pickering and Garrod (2004) will be the focus of 

the present work. According to the alignment theories, alignment in dialogue takes place 

on semantic, syntactic and pragmatic levels. In the present thesis the work is focused on 

the semantic level and the thesis claim is:

Alignment of semantic representations is essential for successful communication in a 
dialogue.

The intention is to test semantic alignment both descriptively, using the dialogue act 

approach of Bunt (2000) and with two corpus linguistic studies, and operationally, with a 

finite state text-in-voice-out dialogue system which has been specially designed for this 

purpose. The finite state dialogue system uses a male Polish synthetic voice which was 

created for this application, and an innovative combination of two finite state systems: a 

finite state dialogue manager which controls a finite state map traversal system. To assure 

success  in  communication,  routines  for  recovery  from  misalignment  have  also  been 

addressed in the dialogue manager. 
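The control relationship between the two finite state systems can be pictured with a minimal sketch. The state names, junction labels, events and transition tables below are hypothetical illustrations chosen for this example; they are not the automata of the prototype, which are presented in Chapter 8.

```python
# Hypothetical sketch: a dialogue manager automaton which controls a
# map traversal automaton. All states, junctions and events are invented.

# Map traversal automaton: (junction, direction) -> next junction
MAP_TRANSITIONS = {
    ("start", "left"): "shop",
    ("shop", "right"): "roundabout",
    ("roundabout", "straight"): "hospital",
}

# Dialogue manager automaton: (state, event) -> next state
DIALOGUE_TRANSITIONS = {
    ("instruct", "understood"): "instruct",
    ("instruct", "not_understood"): "repair",
    ("instruct", "arrived"): "done",
    ("repair", "understood"): "instruct",
    ("repair", "not_understood"): "repair",
    ("repair", "arrived"): "done",
}


def run_dialogue(user_inputs, goal="hospital"):
    """Step both automata over a list of (direction, understood) pairs.

    The map position only advances when the move was understood and is
    valid on the map; otherwise the dialogue manager enters a repair
    state, which stands in for recovery from misalignment.
    """
    dialogue_state, junction = "instruct", "start"
    trace = [(dialogue_state, junction)]
    for direction, understood in user_inputs:
        if dialogue_state == "done":
            break
        move = MAP_TRANSITIONS.get((junction, direction))
        if understood and move is not None:
            junction = move
            event = "arrived" if junction == goal else "understood"
        else:
            event = "not_understood"
        dialogue_state = DIALOGUE_TRANSITIONS[(dialogue_state, event)]
        trace.append((dialogue_state, junction))
    return trace


trace = run_dialogue(
    [("left", True), ("right", False), ("right", True), ("straight", True)]
)
```

In this sketch the second input is not understood, so the dialogue manager moves to the repair state while the map position stays at the same junction; traversal resumes only after the repeated instruction has been understood.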

The methodologies which are dealt with include:


1. Linguistic dialogue theories.

2. Theory-based corpus linguistic description of dialogue.

3. Dialogue modelling with automata.

4. The speech synthesis component of a dialogue system: voice creation for the speech synthesis module and its evaluation.

5. Dialogue corpus creation and evaluation with microvoices (synthetic voices which 

only cover a restricted range of the language, for experimental purposes).

6. Dialogue system demonstration prototype and evaluation.

In order to create the demonstration prototype, the specific computational linguistic 

issues to be addressed include:

1. Dialogue design based on a formal analysis of the dialogue act in the first corpus 

linguistic study, with finite state modelling, and on a scenario-specific dialogue act 

analysis in the second corpus linguistic study. 

2. Formal-informal  speech style  selection  in  a  realistic  stress  scenario  (emergency 

dialogue with a hospital call-centre).

3. Formal properties of automaton models.

4. Information extraction from two corpora for dialogue modelling.

5. Information extraction from text and speech corpora, and creation of a speech corpus for building the synthetic voice.

1.2 Motivation of the thesis
In the information society people need to cooperate more and more with computer systems, 

and therefore computer systems need to be designed which make this cooperation easier. 

Typical activities such as looking for timetables on the internet, booking flights via online 

forms and changing the settings of a mobile phone in call centres are very common. The 

human user has to  follow automatic  instructions because in general  there is  no human 

operator. However, such communication systems are not natural: the processes are often lengthy and time-consuming, and they are always restricted to the pre-defined options of the system. In certain situations, when these options fail, the customers are redirected to human operators because the required tasks are too complex for the system. Two main issues are involved here: first, the ‘intelligence’ of the system, and second, the ‘naturalness’ of the input-output interaction. The present study concentrates on input-output interaction with text-in-voice-out dialogue, a common configuration in commercial information systems such as satellite navigation devices and screen readers for the blind.

Because talking is more natural than dialling numbers or filling in text forms, many institutions provide call centres where people can choose to talk about their problems or requests with a human operator. However, human work time is very expensive and one person can basically deal with just one customer at a time. Therefore, in the technologies concerned with making input-output interaction more natural, much effort is being put into the development of dialogue systems which can communicate with a human being via the speech signal and deal with more than one customer at a time (cf. the Verbmobil project, Wahlster 2000, and the SmartKom project, Wahlster 2006). Such a dialogue system has a speech recognition module, which receives human speech input and converts it to a form which is understandable by the computer, and a speech synthesis module, which produces synthetic speech to provide information back to the user. The communication between the human user and the computer system is administered by a dialogue manager which decides on the next actions the system should take. In addition to acoustic speech recogniser and speech synthesiser components, the system also includes computational linguistic components such as a machine-readable lexicon together with a parser which extracts meaning from the pre-processed human speech, and a natural language generation module which converts the reply created by the dialogue manager into natural language form. An example of a dialogue system architecture is shown in Figure 1.


Figure 1: Simplified architecture of a spoken dialogue system
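The chain of components described above can be sketched as a sequence of function calls. This is only an illustration under simplifying assumptions: every function is a stub operating on text rather than on a speech signal, and the lexicon entries, action labels and reply strings are invented for the example.

```python
from typing import List

# Hypothetical sketch of the component chain of a spoken dialogue system.
# Each component is a stub; real modules would process audio signals.

LEXICON = {"help": "REQUEST_HELP", "hospital": "PLACE_HOSPITAL"}  # toy lexicon


def recognise(speech_signal: str) -> str:
    """Stand-in for a speech recogniser: the 'signal' is already text."""
    return speech_signal.lower().strip()


def parse(text: str) -> List[str]:
    """Stand-in parser: look up each word in the machine-readable lexicon."""
    return [LEXICON[word] for word in text.split() if word in LEXICON]


def manage(meaning: List[str]) -> str:
    """Stand-in dialogue manager: decide on the next system action."""
    if "REQUEST_HELP" in meaning:
        return "ASK_LOCATION"
    return "ASK_REPEAT"


def generate(action: str) -> str:
    """Stand-in natural language generation module."""
    replies = {
        "ASK_LOCATION": "Where are you now?",
        "ASK_REPEAT": "Could you repeat that, please?",
    }
    return replies[action]


def synthesise(text: str) -> str:
    """Stand-in speech synthesiser: mark the text as synthetic speech."""
    return f"<speech>{text}</speech>"


def respond(speech_signal: str) -> str:
    """Run one user turn through the whole component chain."""
    return synthesise(generate(manage(parse(recognise(speech_signal)))))


reply = respond("I need help")
```

Keeping each component behind its own function mirrors the modular architecture: any stub could be replaced by a real recogniser, parser or synthesiser without changing the others.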

1.3 Alignment and accommodation
In recent years new aspects of communication have been investigated which are relevant 

for developing natural human-computer dialogue interaction. These include alignment of 

communication form and content between the interlocutors (Pickering & Garrod 2004) and 

accommodation of interlocutors to each other (Giles et al. 1992). It has been noticed that, while communicating, interlocutors tend to adopt each other’s behaviour, such as style of speaking, vocabulary and gestures.

In the present context, alignment means adaptation on the syntactic, semantic and pragmatic levels of communication between the two interlocutors, including the choice of similar lexical items and speaking style. However, it needs to be emphasised that the form, content and degree of alignment depend on the communication situation and the status relations between the interlocutors. The main distinction to be made for emergency scenarios is between alignment in public and private situations. In public situations in which interlocutors do not know each other, the degree of alignment of their behaviours has been found to be smaller than in face-to-face conversations between two close friends (Batliner et al. 2008). In fact, there may be deliberate non-alignment between a call-centre operator and an emotional caller, in order to calm the caller.

Table 1 presents a dialogue excerpt from the dialogue corpus recorded for the present study which contains an example of lexical alignment. Here Speaker A, while giving instructions, talks about the roundabout. Speaker B does not see the roundabout, so Speaker A defines it as a ‘circular flower bed’. In order to be understood, because Speaker A is nervous, Speaker B adopts the word ‘flower bed’ to refer to the roundabout, but then immediately reverts to the regular word ‘roundabout’. Speaker A also starts to use the word ‘roundabout’ again, apparently unintentionally, because later in the dialogue her focus is on giving the next route instructions and she no longer thinks about the roundabout.

Table 1: Dialogue excerpt with lexical alignment

Polish: A: [route description] Przy rondzie są roboty, więc trzeba będzie je objechać [route description]
English: A: [route description] At the roundabout there are roadworks, so they must be passed by [route description]

Polish: B: Może Pani powtórzyć. Nie widzę tutaj ronda po drodze.
English: B: Can you repeat. I don’t see any roundabout on the way.

Polish: A: Znaczy... rondo, to jest taki, taki okrągły kwietnik. [yyy] jest [yyy] między sklepem a lodziarnią.
English: A: It means... the roundabout, this is, such a circular flower bed [yyy] is [yyy] between the shop and the ice cream parlor.

Polish: [route description]
English: A: [route description]

Polish: A: Następnie objechać rondo – ten taki okrągły kwietnik.
English: A: Then go round the roundabout – this circular flower bed.

Polish: B: Czyli po tym jak skręce w prawo...
English: B: So after having turned right...

Polish: A: Tak.
English: A: Yes.

Polish: B: Muszę jeszcze skręcić w lewo, żeby dojechać do tego kwietnika.
English: B: I again have to turn left to get to this flower bed.

Polish: A: Tak, tak, tak. Jestem dość zdenerwowana iii iii wszystko... wszystko wydaje mi się takie... Przykro mi.
English: A: Yes, yes, yes. I’m quite nervous aaand aaand everything... everything seems to me so... I’m sorry.

Polish: B: Dobrze. Proszę się uspokoić. Czyli na rondzie gdzie muszę skręcić?
English: B: Good. Please, calm down. So at the roundabout, where do I have to turn?

Polish: A: [yyy][um] Na rondzie musi Pani [yyy] na rondzie musi Pani skręcić w [y] obok lodziarni [route description]
English: A: [yyy] [um] At the roundabout you have to [yyy] at the roundabout you have to turn at [y] next to the ice cream parlor.

The present study focusses on basic aspects of alignment which are relevant for human-computer communication in stressful emergency scenarios in public. In public stress situations it is necessary to know how the conversation is conducted in terms of formal and informal styles, rather than what emotions, in the usual senses of the term (‘fear’, ‘anger’, ‘sadness’, ‘happiness’ …, ‘neutral’; cf. Ortony & Turner 1990, Murray & Arnott 2008, Bachan & Surmanowicz 2008), are expressed in the interlocutors’ speech: for present purposes, negative emotions such as ‘fear’, ‘anger’ and ‘sadness’ are included in the concept of ‘stress’. The ‘informal’ and ‘formal’ styles are more related to private versus public situations than to emotion, and both may occur in stress scenarios. These distinctions are taken into account in the dialogue system demonstration prototype.

If one of the interlocutors finds themselves in a difficult position and undergoes great stress, the interlocutor to whom the stressed person talks will try to align their speech on the syntactic (including lexical), semantic and pragmatic levels (Branigan et al. 2000), but will not try to empathise with the emotional state of their interlocutor. In the course of the conversation, the interlocutors will start to use the same vocabulary (Brennan & Clark 1996; Clark & Wilkes-Gibbs 1986; Wilkes-Gibbs & Clark 1992), but both speakers’ voices will not necessarily start to sound nervous because of the stress affecting one of the interlocutors. However, this is not necessarily the case with a professional call centre operator.

It is assumed that the speaking style towards a stressed person (or a person in any other emotional state) is different from that towards a person who does not show any emotions. The dialogue system should be able to recognise the emotional states of its users and, based on the prosodic and lexical speech characteristics, apply a speech style which will be aligned with these emotional states (cf. Batliner et al. 2003).

1.4 Modelling dialogue
The  goals  of  the  present  investigation  include  providing  explicit  models  for  relevant 

aspects of human-human communication connected with alignment and accommodation. 

The literature on these topics does not consider ways of aligning synthetic speech with the 

human interlocutor in their interaction, focussing specifically on stressed and emotional 

speech in crisis situations, although acceptable human-computer interaction is the subject 

of much research. The models should enable appropriate speech style selection in these 

situations, based on the observations that existing models of emotion are both too simple 

and too speculative, that actors imitating crisis speech are not producing authentic crisis 

speech, and that in public stress scenarios the formal-informal style dimension is more 

relevant than emotion space.

The general working hypothesis is that it  is possible to replace traditional emotion 

label sets with a generic model of the following type (which would also apply to ‘emotion’ 

in addition to ‘style’ if required):


TRIGGER_SITUATION → STYLE → STYLE_MANIFESTATION

The trigger situation is the particular public stress scenario which requires a certain 

formal or informal communication style. The style manifestation is the set of syntactic, 

lexical  and phonological  conventions  which  are  associated  with  the  chosen style.  The 

specific hypothesis is that it is possible to design and implement a speech style selection 

module based on this model to drive synthesiser-interlocutor alignment, and to implement 

it  in  speech  synthesis  software.  Such  a  module  should  improve  the  naturalness  and 

efficiency  of  human-computer  communication.  In  the  spoken  dialogue  demonstration 

prototype, the styles and style manifestations are considered, but an automatic recognition 

of alternative trigger situations (age, gender, social status, task etc.) is not included since a 

specific single simulation scenario (a variety of map task with university graduate students) 

is used.
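The three-stage model can be illustrated as two successive lookups, one from trigger situation to style and one from style to style manifestation. The following is a minimal sketch only: the trigger situation labels and the particular address forms and greetings are assumptions made for the example, not the inventory used in the prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleManifestation:
    """A small set of conventions associated with a style (illustrative)."""
    address_form: str
    greeting: str


# TRIGGER_SITUATION -> STYLE (hypothetical trigger labels)
TRIGGER_TO_STYLE = {
    "public_emergency_call": "formal",
    "private_emergency_call": "informal",
}

# STYLE -> STYLE_MANIFESTATION (hypothetical Polish forms)
STYLE_TO_MANIFESTATION = {
    "formal": StyleManifestation(address_form="Pani/Pan", greeting="Dzień dobry"),
    "informal": StyleManifestation(address_form="ty", greeting="Cześć"),
}


def select_manifestation(trigger_situation: str):
    """Follow the chain trigger situation -> style -> style manifestation."""
    style = TRIGGER_TO_STYLE[trigger_situation]
    return style, STYLE_TO_MANIFESTATION[style]


style, manifestation = select_manifestation("public_emergency_call")
```

Because the trigger situation is fixed in the demonstration prototype, only the second lookup would vary there; the first mapping is included to show where automatic recognition of trigger situations would plug in.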

For  the  spoken  dialogue  demonstration  prototype,  the  focus  is  on  the  dialogue 

manager and speech synthesis modules.

 In human communication the interlocutors tend to align their behaviours, not only 

speech, but also gestures and body movements. The present investigation is not concerned 

with  multimodal  communication  of  this  kind;  consequently,  the  selected  scenario  is  a 

telephone-like scenario with no visual contact between interlocutors. The present study is 

also not concerned with recognising and manipulating phonetic features of speech, e.g. 

prosodic and paralinguistic features such as voice quality, intonation, rhythm and tempo of 

human speech. However, styles are also characterised by lexical items and other markers 

such as hesitation phenomena, repetitions and curses, suggesting different behavioural and 

expressive  states  of  the  interlocutor.  Based  on the  analysis  of  these  items,  a  dialogue 

system  can  generate  a  kind  of  output  which  would  be  expected  in  human-human 

communication.  These  stylistic  markers  in  human-human  communication  may  also 

indicate  that  the communication  is  not  successful;  if  a  recognition  module were to  be 

developed, situations when the system cannot understand the speaker would need to be 

modelled. In such situations the dialogue manager should select a different trigger situation 

for planning the conversation. Similarly, the dialogue manager may also apply a different 

speaking style  to  be  generated  by the  speech synthesis  module.  Such a  system would 

analyse  the  trigger  situation,  for  example,  domestic  violence,  and compare this  trigger 


situation with the phonetic features manifesting human emotions, for example, fear. If the 

dialogue manager finds a scenario to be used in such a situation, it applies the appropriate 

scenario and an appropriate speech style, for example a reassuring style.
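As an illustration of how lexical style markers could be counted in a transcribed utterance, the following sketch counts filled pauses and immediate word repetitions. It assumes that filled pauses are transcribed in square brackets, as in the excerpt in Table 1; the function name and the choice of markers are hypothetical, and no recognition module of this kind is part of the prototype.

```python
import re

# Assumption: filled pauses are transcribed as [y], [yyy], [um], etc.
FILLED_PAUSE = re.compile(r"\[(?:y+|um)\]")


def marker_counts(utterance: str) -> dict:
    """Count filled pauses and immediate word repetitions in an utterance."""
    pauses = len(FILLED_PAUSE.findall(utterance))
    # Remove the bracketed annotations before tokenising the words.
    text = FILLED_PAUSE.sub(" ", utterance).lower()
    tokens = re.findall(r"\w+", text)
    repetitions = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return {"filled_pauses": pauses, "repetitions": repetitions}


nervous = marker_counts(
    "Tak, tak, tak. Jestem dość zdenerwowana iii iii wszystko... wszystko"
)
hesitant = marker_counts("[yyy][um] Na rondzie musi Pani [yyy] na rondzie")
```

Counts of this kind could serve as one input to style selection, for example by switching to a reassuring style when repetitions or filled pauses become frequent.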

1.5 Contributions of the present research
First, the Pickering and Garrod (2004) approach to alignment is criticised and modified in 

the area of semantic alignment. The first criticism is that Pickering and Garrod are not 

precise about what semantic alignment is. In the present research, two corpus linguistic 

studies are undertaken for this purpose, and in the operational system a map with certain 

unforeseeable  properties  is  used  as  a  reference  point  for  semantic  alignment,  and 

negotiation of a route through the map requires semantic alignment of different types. The 

second criticism is that Pickering and Garrod only deal with cooperative alignment. The 

present  research  does  not  deal  with  non-cooperative  alignment,  but  it  deals  with 

cooperative non-alignment to some extent, between a professional call-centre operator and 

a caller.

Second, the dialogue act approach of Bunt is criticised because in his earlier work, at 

the time of the corpus linguistic studies, the dialogue acts were simply listed abstractly, 

with no empirical illustration. A selection of Bunt’s dialogue acts was made for the purpose 

of the present research, and investigated in the corpus linguistic studies. In later work, Bunt 

(2010) added empirical information, but did not deal with scenarios such as the emergency 

calling scenario. A second criticism is that in the earlier work, and to a large extent in the 

later work, Bunt does not deal with sequences of dialogue acts, but only with a hierarchical 

classification of dialogue acts. In the present research, sequences of dialogue acts in the 

corpus and also in the operational system are modelled with finite state automata.
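What modelling sequences of dialogue acts with a finite state automaton amounts to can be shown with a small acceptor. The dialogue act labels and transitions below are hypothetical and serve only as an illustration; the automata derived from the corpora are presented in Chapter 5.

```python
# Hypothetical finite state acceptor over dialogue act labels.
# States and labels are invented for illustration.
TRANSITIONS = {
    ("q0", "instruct"): "q1",
    ("q1", "acknowledge"): "q0",
    ("q1", "check"): "q2",
    ("q2", "confirm"): "q0",
    ("q2", "disconfirm"): "q0",
}
INITIAL_STATE = "q0"
FINAL_STATES = {"q0"}


def accepts(dialogue_acts):
    """Return True if the sequence of dialogue acts is accepted."""
    state = INITIAL_STATE
    for act in dialogue_acts:
        state = TRANSITIONS.get((state, act))
        if state is None:  # no transition defined: sequence is rejected
            return False
    return state in FINAL_STATES


well_formed = accepts(["instruct", "check", "confirm", "instruct", "acknowledge"])
ill_formed = accepts(["instruct", "confirm"])
```

Unlike a hierarchical classification of dialogue acts, such an acceptor constrains their order: a confirmation is only accepted after a check, and an instruction must be answered before the next one is given.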

Third, the present research has an operational outcome: a text-input-voice-output dialogue system which is intended to test the points listed above, together with an evaluation of this system. An original feature of the system is the use of two finite state systems, one as a dialogue manager and the other as a map traversal algorithm, with the dialogue manager controlling the map traversal module. One further original contribution in this context is the new Polish male voice PL2 for the MBROLA (Dutoit et al. 1996) speech synthesis system.


1.6 Overview
Following the introduction to the topic and the research aims presented in this chapter, in 

Chapter 2 a brief selection of relevant theoretical linguistic approaches on alignment to 

dialogue  description  and  their  implications  for  development  of  the  spoken  dialogue 

demonstration  prototype  are  discussed.  In  Chapter  3  dialogue  modelling  is  briefly 

introduced and components of dialogue systems are presented. In Chapter 4 a pilot study in 

which theoretical principles are applied to actual dialogue description is undertaken. In this 

study the  research  is  carried  on  an  existing  dialogue corpus.  Chapter  5  presents  work 

development  of  provisional  automaton  models  of  the  dialogue.  The  aim is  to  develop 

techniques and tools for dialogue modelling in the prototype dialogue system. Chapter 6 is 

concerned  with  prerequisites  for  developing a  speech synthesis  module  for  a  dialogue 

system. It presents results of diphone search in existing text and speech corpora as well as 

introduces two tools for efficient diphone database creations developed for this purpose. 

The creation of a speech corpus used for Polish male synthetic voice creation is presented 

together with evaluation of the voice. Chapters 7 and 8 present the test of the thesis claim. 

Chapter  7  is  a  corpus  linguistic  study of  the  kinds  of  alignment  in  public  emergency 

dialogues which are required for designing the spoken dialogue demonstration prototype. 

In this chapter, the creation of the dialogue corpus is presented and prompt materials and

recording  techniques  are  discussed.  The  addressed  scenarios  are  stressful  emergency 

situations and neutral dialogues based on map and diapix tasks. The development of the 

spoken dialogue demonstration prototype, including evaluation with human users, is dealt 

with in Chapter 8. The chapter presents an innovative technique combining two finite-state

automata which work together in the dialogue system: one for map traversal, and one

for dialogue negotiation.  Chapter 9 is concerned with the conclusions from the present 

work and tasks for the future.
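
The cooperation of the two automata mentioned for Chapter 8 can be illustrated with a minimal Python sketch. It is not the implementation of the prototype: the state names, the dialogue-act labels and the three-step map are hypothetical and chosen only for illustration. The sketch assumes that the dialogue manager is a finite-state automaton and that one of its transitions hands control to a second automaton which traverses the map.

```python
# Hypothetical map-traversal automaton: states are map locations,
# inputs are movement instructions.
MAP_FSA = {
    ("start", "north"): "bridge",
    ("bridge", "east"): "church",
    ("church", "south"): "goal",
}

# Hypothetical dialogue-manager automaton: states are dialogue phases,
# inputs are dialogue-act labels.
DIALOGUE_FSA = {
    ("greeting", "hello"): "await_instruction",
    ("await_instruction", "instruct"): "await_instruction",
    ("await_instruction", "bye"): "closed",
}


class DialogueSystem:
    """Dialogue manager that controls a map-traversal module."""

    def __init__(self):
        self.dialogue_state = "greeting"
        self.location = "start"

    def step(self, act, direction=None):
        """Process one user dialogue act and return the system response."""
        key = (self.dialogue_state, act)
        if key not in DIALOGUE_FSA:
            return "Sorry, I did not understand."  # dialogue state unchanged
        self.dialogue_state = DIALOGUE_FSA[key]
        if act == "instruct":
            # The dialogue manager delegates the movement to the map automaton.
            move = (self.location, direction)
            if move not in MAP_FSA:
                return f"I cannot go {direction} from {self.location}."
            self.location = MAP_FSA[move]
            return f"I am at the {self.location}."
        return "Goodbye." if self.dialogue_state == "closed" else "Hello."
```

In this sketch the map automaton never changes the dialogue state; only the dialogue manager decides when a map transition is attempted, which mirrors the control relation described above.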

Much of the empirical and technical material (materials for speech corpus recording 

scenarios, tables with results of empirical studies, automaton models of dialogue structure, 

code of software tools) is included in Appendices in order to avoid distraction from the 

main argument in the text.


Chapter 2:   Alignment – critical overview

2.1 Chapter overview
The specification of a dialogue system depends partly on linguistic, psycholinguistic and 

logical  specifications  of  the  domain  of  language  in  dialogue.  The  discussion  of  these 

concepts will be very selective and brief, because relevant studies tend to be very general

from the point of view of speech technology; they are important foundations for dialogue

system development but not the focus of attention in the present research. In this chapter,

the relevant concepts of ‘alignment’, ‘coordination’, ‘common ground’, ‘speech act’,

‘dialogue act’, ‘sign’, and ‘language-as-product’ vs. ‘language-as-action’ will be discussed.

The discussion mainly follows the approach of Pickering & Garrod (2004) and Levelt 

(1992).  The  main  thesis  of  this  chapter  is  that  alignment  in  dialogue  takes  place  on 

syntactic, lexical, semantic, and pragmatic levels of language as well as on the obvious 

levels of pronunciation and prosody of speech.

2.2 Basic alignment models
Alignment was defined in the Introduction as adaptation on the syntactic, semantic and 

pragmatic levels of communication between the two interlocutors, including the choice of 

lexical items and speaking style; the form, content and degree of alignment depend on the 

communication situation and status relations  between the interlocutors.  For the present 

investigation,  the  main  distinction  for  emergency  scenarios  is  to  be  made  between 

alignment in public situations and alignment in private situations, which affect the use of 

different  utterance  styles.  The  problem  of  emotional  alignment  is  important,  but  not 

directly relevant for communication in public situations. Even if a person calling a call-

centre is highly stressed and emotional, it is not a good idea for the call-centre response to 

use the same emotional utterance types, but the response must still be aligned on the basis 

of appropriate strategies for achieving successful communication with a stressed person.

There are several questions which must be answered clearly.


What  function does  alignment  have  in  communication?  For  the present  study,  the 

following function is the most important:

The general function of alignment is coordination between interlocutors in order to 
achieve a successful outcome of communication.

Alignment in dialogue is a component of communication and a social activity, and a

successful outcome may be defined on many different levels: alignment of pronunciation,

alignment of vocabulary, alignment of syntax, and also alignment of descriptive semantic

content and alignment of pragmatic functionality. The further issue of whether alignment is a

conscious strategic behaviour or a subconscious implicit behaviour is not the

focus of the present study.

What  is  the  purpose  of  alignment  in  a  dialogue  system?  Alignment  is  a  kind  of 

behaviour  control  procedure  during  communicative  interaction.  People  may  use  many 

levels of alignment procedure in communication, including the language features which 

have been mentioned already, and also gestures of the face, the hands and the position of 

the body. Clark (1985) suggests that other kinds of non-linguistic coordinated activity, such 

as  dancing,  and  cooperation  on  the  same  practical  task,  may  be  subject  to  the  same 

principles of alignment.

What approaches to modelling alignment have been proposed? Pickering and Garrod 

(2004) outline four approaches which will be discussed below.

2.2.1 Alignment as a social phenomenon

As a social phenomenon, alignment in communication depends on status relations between 

the speakers and listeners, who  consider the social effect of their utterances. The principle 

of  alignment  as  a  social  phenomenon  is  that  people  want  to  communicate  politely, 

cooperatively and successfully with each other (Grice 1975; Giles et al. 1992; Allwood et 

al. 2000). It is true that there are also types of communication which are not based on 

cooperation  but  on  conflict  and  aggressiveness.  In  these  communication  scenarios 

alignment may be deliberately avoided, but in some way alignment is still a reference point 

for communication. However, in the stressful emergency dialogue scenario involved in the 

present study, successful communication will be cooperative and potentially supported by 

alignment.


From the point of view of dialogue system development, an exclusively social view of 

alignment  is  too  restrictive  because  it  concentrates  on  the  obvious  observation  that 

alignment is a social phenomenon. But this is incomplete: by concentrating on pragmatics,

the view does not take into consideration the necessary formal dimensions of communication,

such as appropriate formulation (pronunciation, lexicon and syntax) and adequate expression

of content (semantics).

2.2.2 Alignment as Audience Design

Another  model  of  alignment  considered  by  Pickering  &  Garrod  (2004)  is  the 

audience design model. In the case of audience design, the speaker chooses expressions 

most likely to be correctly understood and accepted by the listener. The aim of this is to 

enhance communication on the basis of beliefs which the speaker has about the hearer. The 

main problems with the theory about the Audience Design mechanism of alignment are:

1. From a processing point of view, the Audience Design model is very complex to compute.

Many levels  of language,  speech and interaction have to  be taken into account 

during the alignment process, involving listener modelling and inference making.

2. The Audience Design model does not provide a robust procedure, since each aspect 

of alignment depends on many assumptions which may not be true.

3. The Audience Design model does not explain the other pragmatic, social and non-

linguistic dimensions of alignment which affect the speaker.

2.2.3 Alignment as Priming

Alignment seen as Priming involves mechanisms of linguistic representation which are 

generally considered as being automatic, like other priming processes. Priming means the 

preparation of a speaker or hearer for behaving in a certain way on the basis of previous 

perception or behaviour.  In this view, Pickering & Garrod (2004) claim that alignment 

automatically falls  out  of linguistic  processing,  because priming applies to  many other 

kinds  of  linguistic  behaviour.  Pickering  & Garrod  point  out  that  this  view  offers  the 

following features:

1. Priming  is  cognitively  economical:  the  processes  involved  are  those  which  are 

involved in regular speech production.

2. Priming is robust: the need for detailed listener models is not present, information 

is taken from perception of the immediate context.

3. Priming explains linguistic repetitions and imitation.

4. Priming is computationally less complex for common kinds of phonetic and

phonological alignment, which is very rapid and is ‘resource-free’, i.e. does not

involve huge cognitive resources.

5. Alignment is a process which takes place below awareness levels.

Alignment is a process which does not only concern normal speakers. It also concerns 

speakers with some kinds of impairment, such as autism. In an experiment, the alignment 

of Noun Phrase structure in children was examined (Allen et al. 2011). In this experiment, 

the syntactic alignment behaviour of autistic (Autistic Spectrum Disorder, ASD) and non-autistic

children was compared. The children with ASD spontaneously

converged, or aligned, syntactic structure with an interlocutor. Children with

ASD were more likely to produce a passive structure to describe a picture after hearing 

their interlocutor use a passive structure to describe an unrelated picture when playing a 

card  game.  Furthermore,  they  converged  syntactic  structure  with  their  interlocutor  to 

approximately the same extent as did both chronological and verbal age-matched controls: 

children with ASD: 24%, chronological-age-matched controls: 21%, verbal-age-matched controls: 20%.

These results suggest that the linguistic impairment that is characteristic of children with 

ASD, and in particular their difficulty with interactive language usage, cannot be explained 

in terms of a general deficit in linguistic imitation such as alignment by Priming.

The Priming point of view can also be criticised. Priming does not explain the more 

abstract levels of alignment, since it is based exclusively on the perception of linguistic 

input,  and it  does  not  account  for  functional  properties  of  alignment  in  increasing the 

chance of cooperative and successful communication.

2.2.4 Alignment as Inter-level Interaction

In the Interaction model, alignment automatically takes place at several different levels of 

language  at  the  same  time.  Pickering  and  Garrod  (2004)  consider  that  the  Interactive 

Alignment model is too strong if it is taken literally. For example, it is not always the case 

that alignment at one level of representation leads to alignment at other levels. Alignment

at the lexical level, for instance, may mask an underlying misalignment at the semantic level

when ambiguity is involved: “John!” may denote either John Brown or John Smith.

The Inter-Level Interaction model will  be referred to again below. For the present 

study, the point is that the model implies that the different views of alignment may not be 

competitors. They may occur in combination as simultaneous and interacting procedures in 

a multiple mechanism composed of the described components: social behaviour, audience 

orientated, primed, interactive or all of these. The components do not have to be

mutually exclusive, and some contexts may require any combination of these components,

or all the components.

2.2.5 Alignment in human-computer interaction

In  studies  of  human-computer  interaction,  it  has  been  suggested  that  the  way humans 

interact with computers is related to beliefs about the social status of interlocutors, beliefs 

and knowledge about computers, and beliefs about the linguistic capability of interlocutors. 

It appears that there may be a lower degree of alignment when speakers are to interact with 

people  of  lower  social  status  and  more  alignment  when  the  speaker  believes  their 

interlocutors to be linguistically less capable. In human-computer interaction it seems

that people communicate with computers as if computers were like people who are rather

stupid and of lower social status.

In an experiment using the Reverse Wizard of Oz scenario,  lexical alignment was 

investigated by Branigan (Branigan 2009, cf. Pearson et al. 2006): 83% alignment

occurred when people believed they were interacting with a computer, which was true,

and 44% alignment occurred when people believed they were interacting with a human,

which was not true, as they were interacting with a computer.

Similarly, in a second experiment, which used advertisements for an older dialogue system

priced at $10 and a new system from 2003 priced at $299, there was 80% alignment with the basic

version of the program, and 42% alignment with the advanced version of the program.

These experimental results suggest that people align more with computers than with 

people,  and apparently they transfer  their  beliefs  about  people  they align  less  with  to 

computers: they also align more with stupid computers than with smarter ones (or

rather computers that they think are stupid or smart).

2.2.6 Alignment, coordination, and situation models

All of the views discussed so far leave many issues open, in particular the functionality of 

alignment: what actually is successful communication? In the following sections a number 

of issues in this area will be discussed briefly, mainly based again on Pickering & Garrod 

(2004).

According to Clark (1985), dialogue is a joint activity, and its coordination is similar to that in

other coordinated activities, such as ballroom dancing or two lumberjacks using a two-handed

saw. An obvious case which is not mentioned by Pickering & Garrod or Clark is that of

some kinds of sports, such as tennis, baseball, football, boxing or wrestling.

According to another approach, coordination occurs when interlocutors share the same 

linguistic representation at some level (Branigan et al. 2000; Garrod & Anderson 1987).

Pickering and Garrod (2004) prefer to call the first case ‘coordination’ and the second 

case ‘alignment’. Alignment occurs at a particular level when interlocutors have the same 

representation at that level. So dialogue is coordinated, but also aligned. But it is not clear 

whether there are other alignment levels in the other activities which are coordinated. This 

is not discussed by Pickering & Garrod.

Pickering  & Garrod (2004)  continue  their  discussion  of  alignment  by introducing 

situation models and relating them to other alignment concepts:

1. Alignment  of  situation  models  (Zwaan & Radvansky 1998)  forms  the  basis  of 

successful dialogue.

2. The  way that  alignment  of  situation  models  is  achieved  is  by a  primitive  and 

resource-free priming mechanism.

3. The same priming mechanism for situation models produces alignment at  other 

levels of representation, such as the lexical and syntactic.

4. Interconnections  between  the  levels  mean  that  alignment  at  one  level  leads  to 

alignment at other levels.

5. Another primitive mechanism allows interlocutors to repair misaligned

representations interactively.

6. More sophisticated and potentially costly strategies that depend on modelling the 

hearer’s beliefs are only needed if  the primitive mechanisms do not  succeed in 

producing alignment.

On this  basis,  they propose their  own version of  the Interactive Alignment  account  of 

dialogue alignment.

In a dialogue system, the users are in a certain situation which has to be modelled. A 

situation model as introduced by Pickering & Garrod is described as a multi-dimensional 

representation of the situation under discussion (Johnson-Laird 1983; Sanford & Garrod 

1981; van Dijk & Kintsch 1983; Zwaan & Radvansky 1998). According to Zwaan and 

Radvansky,  the key dimensions  encoded in situation models  are  space,  time,  causality, 

intentionality, and reference to main individuals under discussion. This is clearly relevant 

for the current research.

Pickering & Garrod criticise approaches which propose two situation

models, one for the speaker and one for the hearer, because they are too complicated and

inefficient. However, the criteria of complexity and efficiency are not clear. For a dialogue

system in which new information has to be communicated, this criticism is not justified. 

There are also other situations in which two models may be needed: for deception, lying, 

hiding confidential information. Therefore full alignment of the situation models may not 

be  possible.  Lack  of  alignment  also  occurs  when  misunderstandings  happen.  So 

misalignment may have to be tolerated, and error-correction mechanisms may be needed.

In the present study, the central question to be tackled is how (or to what extent) the

dialogue system can align with the key dimensions of the situation model, namely space,

time, causality, intentionality, and reference to main individuals under discussion.

If the system in the emergency call centre is able to align to these dimensions with a 

high degree of accuracy, then it should be able to assign the appropriate priority to the phone

call and classify the call, as well as follow instructions about the emergency location:

this is situation model alignment. The situation model provides a set of features for the 

TRIGGER_SITUATION part of the model presented in the Introduction.
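
How alignment to these dimensions might be checked can be sketched in Python. The sketch rests on assumptions which are not part of the model itself: each of the five dimensions is reduced to a single comparable value, all dimensions are weighted equally, and agreement means exact equality. The class and function names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class SituationModel:
    """The five dimensions named by Zwaan & Radvansky (1998)."""
    space: str
    time: str
    causality: str
    intentionality: str
    individuals: frozenset


def alignment_degree(a: SituationModel, b: SituationModel) -> float:
    """Fraction of dimensions on which two situation models agree."""
    names = [f.name for f in fields(SituationModel)]
    matches = sum(getattr(a, n) == getattr(b, n) for n in names)
    return matches / len(names)


def misaligned_dimensions(a: SituationModel, b: SituationModel) -> list:
    """Dimensions which still have to be negotiated or repaired."""
    return [f.name for f in fields(SituationModel)
            if getattr(a, f.name) != getattr(b, f.name)]
```

For a caller model and a system model which differ only in the set of individuals, `alignment_degree` returns 0.8 and `misaligned_dimensions` returns `["individuals"]`, which would tell the system which dimension to ask about next.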


In an extreme case if two people are in very different associations, such as a stressed 

caller and a call-centre employee, or if two people come from different cultures and speak 

different  languages,  it  is  still  possible  for them to align their  situation models through 

explicit  negotiation  (Brennan  &  Clark  1996;  Clark  &  Wilkes-Gibbs  1986;  Garrod  & 

Anderson  1987;  Schober  1993).  According  to  Pickering  & Garrod  (2004),  the  global 

alignment of the situational models seems to result from the local alignment at the level of 

the linguistic  representations  being used,  and they propose  that  this  kind of  alignment 

works via a priming mechanism: If a hearer hears an utterance that activates a particular 

representation, then priming creates an expectation that makes it more likely that the hearer 

will subsequently produce an utterance that uses that representation when he takes on the 

speaker role. This kind of interactive priming becomes an essential part of Pickering & 

Garrod’s approach to alignment.  
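
The priming mechanism described here can be illustrated with a toy Python model. It is a sketch under simplifying assumptions and not Pickering & Garrod's own formalisation: activation is a single number per expression, hearing an expression adds a fixed boost, and all earlier activations decay by a constant factor at each turn. The parameter values and the class name are hypothetical.

```python
class PrimedLexicon:
    """Toy model of priming: expressions which were heard gain activation."""

    def __init__(self, boost=1.0, decay=0.5):
        self.boost = boost      # activation added when an expression is heard
        self.decay = decay      # factor applied to earlier activations per turn
        self.activation = {}

    def hear(self, expression):
        """Decay earlier activations, then prime the expression just heard."""
        for key in self.activation:
            self.activation[key] *= self.decay
        self.activation[expression] = self.activation.get(expression, 0.0) + self.boost

    def choose(self, candidates):
        """Pick the most activated candidate; ties go to the first one listed."""
        return max(candidates, key=lambda c: self.activation.get(c, 0.0))
```

A speaker who has just heard "emergency vehicle" will, in this model, prefer it over "ambulance" on the next turn, until a more recently heard expression overrides it.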

The starting point for the Pickering & Garrod approach was apparently Garrod and 

Anderson (1987), who introduced a principle of output/input coordination: in a maze game 

task, players tended to make the same semantic and pragmatic choices that held for the 

utterances that they had just heard. In other words, what they said, i.e. their outputs, tended 

to match what they heard, i.e. their inputs at the level of the situation model. During the 

course of interaction the semantic and pragmatic representations used for producing output 

and processing input became aligned. These studies (cf. Garrod & Anderson 1987;

Brown-Schmitt et al. 2005) provide evidence for alignment of situation models in comprehension.

The conclusion to be drawn for the present study is the interesting fact that if there is a 

factor constraining the speaker’s situation model, it also constrains the listener’s situation

model.

2.2.7 Levels of alignment

In the Introduction, alignment was defined with reference to different levels of language, 

and  in  the  literature  relations  such  as  repetition  and  imitation  are  mentioned  in  this 

connection.  Transcriptions  of  dialogues  (see  the  corpus  linguistic  study in  Chapter  4) 

contain numerous repeated linguistic elements and structures, which indicates

that there is alignment not only of the situation model, but also at other levels (Aijmer

1996; Schenkein 1980; Tannen 1989).  As Pickering & Garrod point out,  the following 

levels may become aligned during dialogue:

1. Lexicon: the same expressions tend to be used while referring to particular objects; 

the expressions become shorter and more similar when used with the same

interlocutor and get modified if the interlocutor changes (Brennan & Clark 1996; 

Clark & Wilkes-Gibbs 1986; Wilkes-Gibbs & Clark 1992).

2. Syntax: interlocutors tend to use the same syntactic structures (Branigan et al.

2000).

3. Phonetics:  the  articulation  of  interlocutors’  repeated  expressions  becomes 

increasingly  reduced,  i.e.  the  expressions  developed  during  a  dialogue  are 

shortened  and  harder  to  recognise  when  heard  in  isolation.  Additionally, 

interlocutors  tend  to  align  accent  and  speech  rate  (Giles  et  al.  1992;  Giles  & 

Powesland 1975). 

4. Semantics  and  pragmatics:  some  evidence  on  comprehension  was  provided  by 

Levelt and Kelter (1982, Experiment 6), in which subjects were presented with

question-answer pairs and their task was to assess their naturalness. Pairs in which

a repeated form was used got the best scores. This suggests that people prefer to get

responses aligned with their own form.
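
Repetition at the lexical level can be quantified in a simple way. The following Python sketch computes the proportion of word types in a turn which were already used in the preceding turn. This measure is an assumption made for illustration only; it is not taken from Pickering & Garrod, and it ignores inflection, word order and synonymy.

```python
def lexical_overlap(previous_turn: str, current_turn: str) -> float:
    """Share of word types in the current turn repeated from the previous turn."""
    previous = set(previous_turn.lower().split())
    current = set(current_turn.lower().split())
    if not current:
        return 0.0  # an empty turn repeats nothing
    return len(previous & current) / len(current)
```

For the hypothetical turns "turn left at the church" and "left at the church okay" the function returns 0.8, since four of the five word types in the second turn are repeated.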

Pickering and Garrod (2004) say that in successful dialogue the interlocutors develop 

aligned situation models and aligned representations at all linguistic levels. Additionally, 

priming at one level leads to priming at other levels.  

However, Pickering and Garrod are not very precise about the formal properties of 

semantic alignment, and they do not underline the importance of alignment on the semantic 

level being essential for successful communication. Also, they do not deal with cooperative

non-alignment, where one person is stressed and the other person does not align but tries to

persuade the first to align with the stress-free person, a case which is required for the scenarios in

the present research.


2.2.8 Error trapping with misalignment

An important activity in dialogue is error trapping, in this case recovery from a state of 

misalignment, when the interlocutors’ interpretations of utterances differ, for instance with

ambiguities. In dialogue it happens that people use the same name, but they think of two 

different people. These interlocutors align on the superficial level, but their situation model 

is misaligned. In such cases the interlocutors need to use recovery mechanisms which will 

help them establish alignment, i.e. establish which person they are referring to.

The recognition of errors and the treatment of errors are necessary properties of a

spoken dialogue system.
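
A minimal Python sketch of such error trapping for the ambiguous-name case is given below. It assumes, hypothetically, that the system keeps a table of the referents it knows for each name, and that a clarification question is the only recovery strategy. The function name and the table are illustrative and not part of the prototype.

```python
def trap_reference_error(expression, known_referents):
    """Return ('aligned', referent) or ('clarify', question) for a name.

    known_referents maps a name to the list of referents which the
    hearer knows under that name.
    """
    candidates = known_referents.get(expression, [])
    if len(candidates) == 1:
        return ("aligned", candidates[0])      # unambiguous: no repair needed
    if not candidates:
        return ("clarify", f"Who is {expression}?")
    options = " or ".join(candidates)          # ambiguous: ask which one is meant
    return ("clarify", f"Do you mean {options}?")
```

With a table in which "John" maps to both John Brown and John Smith, the function returns the clarification question "Do you mean John Brown or John Smith?" instead of silently picking one referent.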

2.3 Communicative signs: function and processing
Communication  uses  signs,  and alignment  means the alignment  of  signs  with all  their 

properties  which  are  involved  in  communication.  Alignment  processes  cover  syntax, 

semantics  and pragmatics.  Therefore understanding what  alignment  is  also  depends on 

understanding what a sign is.

The de Saussure sign model (1913) is shown in Figure 2, which shows the meaning-

form (signifié-signifiant) relation, which de Saussure sees as a mental relation between the 

concept and the sound image. The picture in the middle illustrates the relation. 

Figure 2: The Saussurean sign model

2.3.1 Levelt & Schriefers’s ‘sign pie’

The  Levelt  &  Schriefers  (1987)  model,  which  is  known  as  the  ‘sign  pie’,  has  three 

components, unlike de Saussure’s model, which has two components. The third component 

is syntax, which answers a criticism of de Saussure’s model (and the models of Bühler and 

Jakobson) which do not explicitly contain a syntax component. The sign pie model, which 

is also a mental model, is visualised in Figure 3.


Figure 3: Levelt & Schriefers's ‘sign pie’ (1987:396).

Levelt & Schriefers (1987: 396) point out:

An item’s syntactic properties always play a crucial role in the sentence generation 
process. They determine the syntactic environments that must be realized if that item 
is to be used, and these in turn impose constraints on the syntactic properties of further  
items to be retrieved. Or to put it differently: where concepts clearly serve as input for 
lexical access in speech production, yielding sound images as output,  syntax plays 
both input and output roles.

Examples of the importance of syntax are found with prepositions, which may depend 

more on grammatical relations than on meaning relations. The Levelt & Schriefers model 

is used as the basis for a model of activation in communication, as shown in Figure 4.

Figure 4: Levelt & Schriefers image of the activation of a linguistic sign in  
speech production (Levelt & Schriefers 1987: 396)

The extended model of Levelt and Schriefers shows a move from the language-as-product

view of traditional sign models to the language-as-action approach which is

necessary in psycholinguistics and speech technology. Pickering & Garrod comment on the 

language-as-product tradition as follows:

The  language-as-product  tradition  is  derived  from the  integration  of  information-
processing psychology with generative grammar and focuses on mechanistic accounts 
of how people compute different levels of representation. (Pickering & Garrod 2004: 
170)

They point out that in the language-as-action tradition

utterances are interpreted with respect to a particular context and takes into account  
the goals and intentions of the participants.  This tradition has typically considered 
processing in  dialogue using apparently natural  tasks  (e.g.,  Clark 1992;  Fussell  & 
Krauss 1992). (Pickering & Garrod 2004: 170)

Finally they compare the two traditions:

Whereas psycholinguistic accounts in the language-as-product tradition are admirably 
well-specified,  they  are  almost  entirely  decontextualized  and,  quite  possibly, 
ecologically invalid. On the other hand, accounts in the language-as-action tradition 
rarely make contact  with the basic  processes of production or  comprehension,  but 
rather present analyses of psycholinguistic processes purely in terms of their goals 
(e.g., the formulation and use of common ground; Clark 1985; Clark 1996; Clark & 
Marshall 1981). (Pickering & Garrod 2004: 170)

Although Pickering  & Garrod claim that  the  product  approach is  not  relevant  for 

alignment, this is not true in the context of computation. A product is at the same time a 

result of processing, and also an input for processing. In spoken dialogue, one product (for 

example a situation model) is changed into another product (a modified situation model) 

by processing. So the two approaches are not as incompatible as Pickering & Garrod claim.

The Levelt model is extended in other work. The Levelt production model has three

main components, namely the planning, formulation (with two subcomponents) and

articulation components (Table 2).


Table 2: Processing modules in speech generation and their relation to phases of lexical  
access (Levelt & Schriefers 1987: 398)

Processor | Input | Output | Relation to Lexical Access

Conceptualiser | communicative intention | preverbal message | creating a lexical item’s conceptual conditions

Grammatical encoder | preverbal message | surface structure | retrieval of lemma, i.e. making the item’s syntactic properties available, given appropriate conceptual or syntactic conditions

Sound form encoder | surface structure | phonetic or articulatory plan for utterance | retrieval of the lexeme, i.e. the item’s stored sound form specifications, and its phonological integration in the articulatory plan

Articulator | phonetic plan | overt speech | executing the item’s context-dependent articulatory program

The Formulator (Figure 5), which is the most relevant component in this context, is 

characterised as follows by Levelt (1992): in speech production the formulator is

described as a process whose input is the lexical concept (the message) and whose output

is a phonetic or articulatory plan for the item. The appropriate item from the mental lexicon

is selected and is integrated into the developing grammatical encoding. An articulatory

program is created for the selected lexical item on the basis of its stored phonological code 

and the phonological context of the utterance as a whole.
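
The chain of processors, in which the output of each module serves as the input of the next, can be sketched in Python as follows. The sketch is a strong simplification for illustration only: the two-word lexicon is hypothetical, the sound forms are rough SAMPA-like transcriptions, and syntactic structure is reduced to the order of the lemmas.

```python
# Toy lexicon: hypothetical concepts, lemmas and rough SAMPA-like sound forms.
LEMMAS = {"CALL": "call", "AMBULANCE": "ambulance"}
LEXEMES = {"call": "k O: l", "ambulance": "{ m b j U l @ n s"}


def conceptualiser(intention):
    """Communicative intention -> preverbal message (list of concepts)."""
    return intention.upper().split()


def grammatical_encoder(message):
    """Preverbal message -> surface structure (retrieval of lemmas)."""
    return [LEMMAS[concept] for concept in message]


def sound_form_encoder(surface):
    """Surface structure -> phonetic plan (retrieval of lexemes)."""
    return [LEXEMES[lemma] for lemma in surface]


def articulator(plan):
    """Phonetic plan -> overt speech, here simply a string of phone symbols."""
    return " | ".join(plan)


def produce(intention):
    """Run the four processors in sequence, each feeding the next."""
    return articulator(sound_form_encoder(grammatical_encoder(conceptualiser(intention))))
```

Keeping the processors as separate functions reflects the modular view of the model: the grammatical encoder and the sound form encoder together correspond to the Formulator and could be replaced independently of the other stages.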

Figure 5: An outline of lexical access in speech production (Levelt 1992: 4)


2.3.2 The revised Interactive Alignment model of dialogue processing

According to Pickering and Garrod (2004: 175) 

the  interactive  alignment  model  assumes  that  successful  dialogue  involves  the 
development of aligned representations by the interlocutors. This occurs by priming 
mechanisms  at  each  level  of  linguistic  representation,  by percolation  between  the 
levels so that alignment at one level enhances alignment at other levels, and by repair 
mechanisms when alignment goes awry. 

Figure 6 illustrates the alignment process. The linguistic levels of two interlocutors are 

linked. In  Figure 6, A and B represent two interlocutors in a dialogue in this schematic 

representation of the stages of comprehension and production processes according to the 

interactive alignment model. The horizontal links show the channels by which alignment 

takes place at these levels by means of the Priming mechanism, including lexical priming, 

syntactic priming, etc.

Figure 6: Schematic representation of the stages of comprehension and production  
processes according to the interactive alignment model (Pickering & Garrod 2004: 176)


The interactive alignment model does not apply in this way to monologues, whose 

goal  is  not  to  become  aligned  with  the  listener,  although  indirect  alignment  (with 

previously experienced communication) may occur. In a monologue the speaker tries to 

formulate  the  message  in  such  a  way  that  the  listener  can  obtain  the  appropriate 

representation  corresponding  to  the  speaker’s  message.  The  important  fact  is  that  in 

monologue (including writing) the speaker’s and the listener’s representations may never

align, because the automatic mechanism of alignment is not present. The alignment mechanism

occurs only when the speaker gets regular feedback from the interlocutors and on the basis 

of this he or she can control the alignment process. In dialogue,  priming is the central 

mechanism in the process of alignment and mutual understanding. Thus dialogue indicates 

the  important  functional  role  of  priming (Pickering  and Garrod 2004).  The process  of 

interactive alignment by priming is supported by further factors:

1. The use of routine procedures in dialogue.

2. The use of implicit common ground (background knowledge which is assumed to 

be shared) and explicit common ground (which is mentioned in the dialogue).

Pickering & Garrod discuss an alternative model, the autonomous transmission model, 

in which the transfer of information between producers and comprehenders takes place via 

decoupled production and comprehension processes that are isolated from each other (see 

Figure 7). Communication takes place only through the acoustic medium and the messages 

are constructed independently by the speaker and the hearer.

Pickering and Garrod (2004) argue that the autonomous transmission model is not

appropriate  for  dialogue.  In  dialogue  the  production  and  comprehension  processes  are 

coupled and this is the core of the interactive alignment model.

However, it is not clear how the interactive alignment model can be formalised

precisely: the horizontal connections between levels do not exist independently of the

physical signal transmission. Therefore, contrary to what Pickering and Garrod claim, the 

interactive alignment processes at different levels in the overall alignment procedure can 

only be reconstructed from an autonomous transmission model  of  physical  contact  via 

speech.


Figure 7: Schematic representation of the stages of comprehension and production 
processes according to the autonomous transmission account (Pickering & Garrod 2004: 177)

2.4 Speech acts and dialogue acts
Austin (1962) presented two theories of speech acts. In the first theory,  he distinguishes 

between constative and performative utterances. Constative utterances can be true or false, 

as in traditional propositions. Performative utterances cannot be true or false, but perform 

some action, for example questions, commands, promises, etc. In the second theory, he

treats the functions or ‘force’ of utterances and claims that there is no basic

distinction between constative and performative utterances, which all share certain types

of force:

1. locutionary force (propositional content of utterance – predicates and arguments),

2. illocutionary force (conventional use of utterances to create social links between

the interlocutors, i.e. “doing things with words”, as signalled by “hereby”

(Polish: “niniejszym”) and by illocutionary verbs, i.e. speech act verbs),


3. perlocutionary force (effect the utterance has on the hearer).

The forces are indicated by language forms and structures:

1. word order,

2. stress,

3. intonation contour,

4. punctuation,

5. the mood of the verb,

6. the  so-called  performative  verbs  (e.g.  ‘say’,  ‘tell’,  ‘confess’,  ‘promise’,  ‘warn’, 

‘baptise’).
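
As a rough illustration, some of these indicators can be collected from the written form of an utterance. The sketch below covers only the performative verbs listed above, final punctuation and the word “hereby”; stress and intonation contour would require the speech signal. The function name `force_indicators` and the output keys are hypothetical, and the cues are not sufficient to determine the force of an utterance.

```python
import re

# Illustrative subset: the performative verbs mentioned in the text.
PERFORMATIVE_VERBS = {"say", "tell", "confess", "promise", "warn", "baptise"}


def force_indicators(utterance):
    """Collect simple surface cues that may signal the force of an utterance."""
    words = re.findall(r"[a-z']+", utterance.lower())
    cues = {}
    found = [w for w in words if w in PERFORMATIVE_VERBS]
    if found:
        cues["performative_verbs"] = found
    stripped = utterance.rstrip()
    if stripped.endswith("?"):
        cues["punctuation"] = "question mark"
    elif stripped.endswith("!"):
        cues["punctuation"] = "exclamation mark"
    if "hereby" in words:
        cues["hereby"] = True
    return cues


print(force_indicators("I hereby promise to come."))
print(force_indicators("Will you tell him?"))
```

The matching is deliberately naive: only uninflected verb forms are recognised, and a verb such as ‘tell’ is reported as a cue whether or not it is used performatively.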

Searle (1969) extended and modified Austin’s theory and developed 9 constitutive

rules which define a successful utterance in terms of the role of the speaker and his beliefs

about the hearer. He distinguishes between

utterance acts (produced words), propositional acts (assigning meaning to the utterance

acts) and illocutionary acts (similar to Austin’s ‘illocutionary force’). Searle gives the

example of ‘promise’:

Given that a speaker S utters a sentence T in the presence of a hearer H, then, in the 
literal utterance of T, S sincerely and non-defectively promises that p to H if and only 
if the following conditions 1-9 obtain. (Searle 1969: 57)

Searle’s felicity conditions for promising (1969) are:

1. Normal input and output conditions obtain.

2. S expresses the proposition that p in the utterance of T.

3. In expressing that p, S predicates a future act A of S.

4. H would prefer S’s doing A to his not doing A, and S believes H would prefer his 

doing A to his not doing A.

5. It is not obvious to both S and H that S will do A in the normal course of events. 

6. S intends to do A. 

7. S intends that the utterance of T will place him under an obligation to do A.

8. S intends (i-1) to produce in H the knowledge (K) that the utterance of T is to count 

as placing S under an obligation to do A. S intends to produce K by means of the 



Communicative Alignment of Synthetic S