Adam Mickiewicz University in Poznań
Faculty of Modern Languages and Literature
Institute of Linguistics

Communicative Alignment of Synthetic Speech

Jolanta Bachan

Doctoral dissertation
Supervisor: prof. UAM dr hab. inż. Grażyna Demenko

Poznań 2011

Communicative Alignment of Synthetic Speech – Jolanta Bachan

Acknowledgements

I wholeheartedly thank Professor Grażyna Demenko for her help, support and supervision of my work over the years of our cooperation.

I would like to thank Professor Piotra Łobacz for her continuing support, belief and trust in me.

I would like to extend special thanks to Professor Dafydd Gibbon for his invaluable help and discussions about my work.

Further thanks go to Professor Maciej Karpiński for providing his dialogue corpus and for teaching me about human-computer interaction and dialogue analysis in various classes.

Thanks to Professor Władysław Zabrocki for always being helpful in solving problems connected with the academic procedures for the thesis.

I would also like to thank my colleagues and all the students, friends and relatives who willingly took part in my experiments. Without their input, my work would not have been possible.

Further thanks go to Bielefeld University, the Kulczyk Family Foundation, the Scholarship Foundation of Professor Władysław Kuraszkiewicz and the International Speech Communication Association for awarding me scholarships, which helped me to focus on my academic work and develop my scientific interests.

And finally, immeasurable thanks to my parents and my brother for their unfailing support and love.

The research presented in this thesis was partly carried out within the scope of the supervisor's research grant (grant promotorski) no. N N104 11 98 38 received from the Minister of Science and Higher Education.

Table of Contents

Acknowledgements
Index of Tables
Index of Figures
Chapter 1: Introduction
1.1 Objectives of the thesis
1.2 Motivation of the thesis
1.3 Alignment and accommodation
1.4 Modelling dialogue
1.5 Contributions of the present research
1.6 Overview
Chapter 2: Alignment – critical overview
2.1 Chapter overview
2.2 Basic alignment models
2.2.1 Alignment as a social phenomenon
2.2.2 Alignment as Audience Design
2.2.3 Alignment as Priming
2.2.4 Alignment as Inter-level Interaction
2.2.5 Alignment in human-computer interaction
2.2.6 Alignment, coordination, and situation models
2.2.7 Levels of alignment
2.2.8 Error trapping with misalignment
2.3 Communicative signs: function and processing
2.3.1 Levelt & Schriefers’s ‘sign pie’
2.3.2 The revised Interactive Alignment model of dialogue processing
2.4 Speech acts and dialogue acts
2.5 Summary
Chapter 3: Dialogue modelling
3.1 Dialogue systems
3.2 Dialogue system components
3.3 Spoken dialogue systems
3.4 Human-computer interaction
3.5 Summary
Chapter 4: Corpus linguistic study of dialogue interaction
4.1 Chapter overview
4.2 Aim of the corpus linguistic study
4.3 Speech material - PoInt corpus
4.4 Annotation
4.4.1 Annotation procedure
4.4.2 Dialogue act annotation
4.4.3 Phonemic annotation
4.4.4 Processing of annotations for dialogue analysis
4.4.5 Notes on material preparation
4.5 Time structure of the dialogue
4.6 Most frequent dialogue act sequences
4.6.1 Dialogue initiation
4.6.2 Dialogue termination
4.6.3 Turns
4.7 Frequency of dialogue acts
4.7.1 Dialogue flow
4.7.2 Overlapping speech
4.7.3 Non-overlapping speech
4.8 Conclusions
Chapter 5: Modelling dialogue sequences with finite automata
5.1 Chapter overview
5.2 Automaton models
5.3 First steps in realistic automaton creation
5.4 Generalisations over finite regular languages
5.4.1 Prefix generalisations
5.4.2 Suffix generalisations
5.5 Generalisations over non-finite regular languages
5.5.1 Local generalisations
5.5.2 Non-local generalisations
5.6 Turn automata
5.7 Evaluation of dialogue act automata
5.7.1 General evaluation criteria
5.7.2 NDFST interpreter online tool
5.7.3 Evaluation results
5.8 Loop-free automata evaluation
5.9 Iterative automata
5.10 Further issues: dialogue flow and alignment
5.10.1 Generalised turn automaton at time line
5.11 Summary
Chapter 6: Speech synthesis module
6.1 Chapter overview
6.2 The role of speech synthesis
6.3 Synthesis experiment with corpus linguistic analysis
6.3.1 MBROLA micro-voice creation
6.4 Automatic Close Copy Speech synthesis
6.5 MBROLA full voice creation
6.5.1 MBROLA data flow architecture
6.5.2 Corpus specification
6.5.3 Text corpus creation
6.6 The Mbrolator software
6.7 The phone and diphone sets
6.7.1 Phoneme set
6.7.2 Diphone set
6.7.3 Search for diphones
6.7.4 Annotation of the original synthesis corpus
6.7.5 Annotation file format
6.7.6 Search procedure in available diphone database
6.7.7 Diphone search in synthesis text and online
6.8 Phonetically rich sentence extractor
6.8.1 Diphone set creation
6.8.2 Available text resources
6.9 Software
6.9.1 Sentence extraction procedure
6.9.2 Results of sentence extraction
6.9.3 Automatic diphone extraction system architecture
6.9.4 Automatic diphone extraction system design
6.9.5 Automatic diphone extraction system implementation
6.9.6 BLF to TextGrid conversion
6.9.7 PE-SAMPA TextGrid to SAMPA TextGrid conversion
6.9.8 Find all diphones in TextGrid files
6.9.9 Diphone extraction
6.9.10 Evaluation of the automatically extracted diphones
6.9.11 Generate TextGrids for diphones
6.9.12 Concatenate diphones
6.9.13 PL2 synthetic Polish male voice evaluation
6.10 Summary
Chapter 7: Dialogue corpus for demonstration prototype
7.1 Chapter overview
7.2 Corpus design
7.2.1 Prompt speech material and the recording scenarios
7.2.2 Subjects
7.2.3 Recordings
7.3 Implementation
7.3.1 Creation of maps
7.3.2 Creation of diapixes
7.3.3 Reading task
7.3.4 Instruction to the subjects
7.3.5 Recording scenario
7.4 Corpus creation
7.5 Corpus annotation
7.5.1 General analysis of the corpus
7.5.2 Analysis of the selected dialogue
7.5.3 Duration analysis: the nPVI index
7.6 Prototype dialogue synthesis
7.6.1 Diphone extraction for prototype MBROLA micro-voices
7.6.2 ACCS synthesis of the dialogue
7.6.3 ACCS synthesis of the filled pauses “yyy”
7.7 Finite State Transducer model of the map
7.8 Summary
Chapter 8: Demonstration dialogue system
8.1 Overview
8.2 Requirement specifications
8.3 Design
8.3.1 The street map and data elicitation
8.4 Implementation
8.4.1 Implemented utterances
8.5 Evaluation
8.6 Results
8.7 Summary
Chapter 9: Summary and conclusions
Bibliography
Software
Appendix A Dialogue act matrix
Appendix B Loop-free automata for speaker 1
Appendix C Reduction of multi-layered labels
Appendix C.1 Speaker 1
Appendix C.2 Speaker 2
Appendix D Generalisation tables
Appendix D.1 Prefix generalisation table for speaker 1
Appendix D.2 Prefix generalisation table for speaker 2
Appendix D.3 Suffix generalisation table for speaker 1
Appendix D.4 Suffix generalisation table for speaker 2
Appendix E Semi-coupled automata for speaker 1 and speaker 2
Appendix F Loop-free automata
Appendix F.1 Loop-free automata for speaker 1
Appendix F.2 Loop-free automata for speaker 2
Appendix G Iterative automata
Appendix G.1 Iterative automata for speaker 1
Appendix G.2 Iterative automata for speaker 2
Appendix G.3 Generalised automata for speaker 1
Appendix H Automata evaluation
Appendix H.1 Generalised automata
Appendix H.2 Semi-coupled automata
Appendix I Phonetically rich sentence extractor
Appendix J Automatic diphone extractor – scripts
Appendix J.1 BLF2TextGrid converter
Appendix J.2 extendedPL2PL1 TextGrid converter
Appendix J.3 Find diphones
Appendix J.4 Cut out individual diphones
Appendix J.5 Generate TextGrids for diphones
Appendix J.6 Concatenate diphones
Appendix K Text material used for the Polish MBROLA voice creation
Appendix K.1 Phonetically rich sentences
Appendix K.2 Word list
Appendix L Perception test sentences
Appendix L.1 Test 1
Appendix L.2 Test 2
Appendix M Map task: emergency scenario
Appendix M.1 The map for the leading person
Appendix M.2 The map for the following person
Appendix N Map task: neutral scenario
Appendix N.1 The map for the leading person
Appendix N.2 The map for the following person
Appendix O Draw waveform, pitch and annotation for stereo sounds – Praat script
Appendix P Demonstration dialogue system script

Index of Tables

Table 1: Dialogue excerpt with lexical alignment
Table 2: Processing modules in speech generation and their relation to phases of lexical access (Levelt & Schriefers 1987: 398)
Table 3: Abbreviation of dialogue act functions
Table 4: Basic statistics of the studied material; N – number of sequences, n ≤ 2 – number of sequences with the length of one or two dialogue acts
Table 5: Dialogue act length
Table 6: Frequency of different dialogue acts in the whole dialogue for both speakers
Table 7: Number of dialogue acts at the beginning (S) and end (E) of dialogue act sequences in a turn, and single turns (M) built by one utterance; o - open meeting, s - social communication management
Table 8: Number of different dialogue acts at the beginning of a sequence in a turn
Table 9: Dialogue acts at the beginning of a turn for speaker 1 and speaker 2
Table 10: Number of different dialogue acts at a single-utterance turn, with time measurements; Dur – duration, Avg – average length
Table 11: Dialogue acts of single-utterance turns for speaker 1 and speaker 2
Table 12: Number of different dialogue acts at the end of a sequence in a turn
Table 13: Dialogue acts at the end of a turn for speaker 1 and speaker 2
Table 14: Overlapping dialogue acts: spk 2 starts talking before spk 1 has finished
Table 15: Overlapping dialogue acts: spk 1 starts talking before spk 2 has finished
Table 16: Non-overlapping dialogue acts: spk 2 starts talking after spk 1 has finished
Table 17: Non-overlapping dialogue acts: spk 1 starts talking after spk 2 has finished
Table 18: Normalised difference of speakers’ speech at different categories. ID – ID of the dialogue chunk (position in dialogue), Dur – speech duration
Table 19: Difference between the main categories
Table 20: Excerpt of table with loop-free automata for each sequence of dialogue acts for speaker 2
Table 21: Examples of reduction of multi-layered labels to one-layered labels for speaker 2 sorted alphabetically. ID – ID of the automaton
Table 22: A fragment of the prefix generalisation table for speaker 2
Table 23: Initial dialogue acts in sequences for each of the speakers
Table 24: Most frequent two dialogue acts at the beginning of a part for speaker 1 and speaker 2
Table 25: Loop-free automata combining sequences with the same prefix
Table 26: Suffix generalisation table for speaker 2. M – match
Table 27: Loop-free automata and their counterparts with loops for speaker 2
Table 28: Fragment of evaluation table of loop-free automata for speaker 1
Table 29: An evaluation table of iterative automaton for speaker 1
Table 30: Extended SAMPA phoneme labels used for annotation (Demenko et al. 2003)
Table 31: Polish SAMPA transcription used in the PL1 Polish female MBROLA voice (Szklanny & Marasek 2002)
Table 32: Mismatches between BLF and PL1 SAMPA
Table 33: Fragment of BLF file input resource
Table 34: The format of an interval in TextGrid file
Table 35: The mapping table of PE-SAMPA set onto SAMPA set
Table 36: The phones [c] and [J] from the BLF SAMPA annotation convention and their equivalents in the PL1 diphone database
Table 37: Different transcriptions of the word “kiedy”
Table 38: The DIPH file format with three exemplar lines from a DIPH file
Table 39: Diphone label normalisation table
Table 40: The SEG file format with three exemplar lines from the SEG file
Table 41: Results for Test 1 – average correctly recognised words in predictable and unpredictable sentences. N – number of words
Table 42: Test results for Test 2. MOS/5 – Mean Opinion Score out of 5, STDV – standard deviation, Max:Min scores given by subjects
Table 43: Pros & cons of using either the telephone or the Skype call for communication between interlocutors
Table 44: Difference between diapixes from the emergency scenario
Table 45: Difference between diapixes from the neutral scenario
Table 46: Data of the corpus recording. Age diff – age difference between the interlocutors, counted as B’s age – A’s age
Table 47: Dialogue act frequencies and their statistics used for dialogue annotation. N is the number of DA
Table 48: Dialogue statistics of emergency dialogue (pair ID: 12). Total dialogue duration 156.49 sec
Table 49: Special events frequencies
Table 50: Min, Max and Mean (M) pitch values (F0) for Speaker A and Speaker B across the five recording tasks
Table 51: nPVI for duration of phones, syllables and pitch values of filled pauses (“yyy”). N is number of items
Table 52: Diphone manual selection process
Table 53: Utterance exchange in the emergency map task dialogue
Table 54: Transitions of FSA designed for the dialogue system
Table 55: Informal and formal utterances and their English translations available to the dialogue system
Table 56: General data of people who participated in the dialogue system evaluation
Table 57: Questionnaire of assessment of 7 areas of the dialogue system and their correspondence to the dialogue system domains
Table 58: Dialogue reconstruction based on one log file entry for informal speech style
Table 59: Basic statistics of functional testing of the dialogue system
Table 60: Results of the judgement testing of the dialogue system in 7 categories. Numbers in brackets stand for average assessment across the 7 categories and 2 scenarios for females (F), males (M) and overall (All)
Table 61: Explanation of abbreviations of dialogue act types

Index of Figures

Figure 1: Simplified architecture of a spoken dialogue system
Figure 2: The Saussurean sign model
Figure 3: Levelt & Schriefers's ‘sign pie’ (1987: 396)
Figure 4: Levelt & Schriefers's image of the activation of a linguistic sign in speech production (Levelt & Schriefers 1987: 396)
Figure 5: An outline of lexical access in speech production (Levelt 1992: 4)
Figure 6: Schematic representation of the stages of comprehension and production processes according to the interactive alignment model (Pickering & Garrod 2004: 176)
Figure 7: Schematic representation of the stages of comprehension and production processes according to the autonomous transmission account (Pickering & Garrod 2004: 177)
Figure 8: A model of human-computer interaction (Schomaker et al. 1995, from Gibbon, Mertins & Moore 2000)
Figure 9: The Praat window displaying the stereo speech signal of the dialogue with its annotation tiers
Figure 10: Temporal sequences and overlaps in a dialogue
Figure 11: Percentage representation of frequency of dialogue acts at the initial position in a turn
Figure 12: Number of different dialogue acts at the beginning of a sequence in a turn
Figure 13: Percentage representation of frequency of dialogue acts in single-utterance turns
Figure 14: Number of different dialogue acts at a single-utterance turn
Figure 15: Percentage representation of frequency of dialogue acts at the final position in a turn
Figure 16: Number of different dialogue acts at the end of a sequence in a turn
Figure 17: Difference between the two most numerous dialogue categories
Figure 18: A basic dialogue model implemented as FSA
Figure 19: Combined automata 2_back without loops created by suffix generalisation
Figure 20: Combined automata 1_back with loops created by suffix generalisation
Figure 21: A semi-coupled automaton 1 for spk1 and
spk2.................................................78 Figure 22: A generalised automaton of dialogue acts for speaker 1, the follower of the instructions in the map task..........................................................................................79 Figure 23: A generalised automaton of dialogue acts for speaker 2, the instructor giver in the map task..................................................................................................................79 Figure 24: Automaton of typical dialogue flow....................................................................86 Figure 25: An automaton generating the direction description dialogue type.....................87 Figure 26: An automaton generating the misunderstanding dialogue type..........................87 Figure 27: Generalised turn automaton................................................................................88 Figure 28: Generalised turn automaton for spk 1 with dialogue act occurrence probability ......................................................................................................................................89 Figure 29: Generalised turn automaton for spk 2 with dialogue act occurrence probability ......................................................................................................................................89 Figure 30: Linear representation of generalised turn automata for spk1 and spk 2.............90 Figure 31: Visualisation of overlapping speech being produced by generalised turn automata for spk 1 and spk 2........................................................................................90 Figure 32: Integrated generalised linear 4-stage turn automata for two speakers................91 Figure 33: Integrated generalised "overlapping" 4-stage turn automata for two speakers...92 Figure 34: Mbrolation, the MBROLA micro-voice creation procedure..............................97 Figure 35: Comparison of original recording 
with microvoice and PL1 female voice .......98 Figure 36: Data flow chart for MBROLA voice creation and runtime synthesis.................99 Figure 37: Phonetically rich sentence extraction procedure...............................................114 Figure 38: Architecture of the automatic diphone extraction system.................................115 Figure 39: Design of the automatic diphone extraction software. PE-SAMPA – the Polish extended SAMPA........................................................................................................116 Figure 40: Conversion flow of text files in the automatic diphone extraction system.......117 Figure 41: Diphone WAV file with automatically generated annotation............................125 Figure 42: Diphone files ordering according to the diphone ID. ......................................126 Figure 43: Diapixes from the emergency scenario.............................................................139 Figure 44: Diapixes for the neutral scenario (adopted from Bradlow et al. 
2007).............140 Figure 45: Recording setting of the dialogue corpus.........................................................142 Figure 46: MX Skype Recorder window...........................................................................143 xii Communicative Alignment of Synthetic Speech – Jolanta Bachan Figure 47: TimeLeft timer used for the recording of the emergency scenarios.................144 Figure 48: A person in the emergency setting at the corpus recording..............................145 Figure 49: Annotation of dialogues on speech and special tiers for each speaker.............149 Figure 50: Annotation of dialogues on several tiers for Speaker A (channel 2, bottom) and Speaker B (channel 1, top)..........................................................................................152 Figure 51: Dialogue acts frequency....................................................................................155 Figure 52: Speaker's A and Speaker's B waveforms, pitch contours and annotation tiers of a synthesised dialogue excerpt at 17.5 to 21.5 second..................................................160 Figure 53: Examples of the ACCS synthesised filled pauses for Speaker A (top) and Speaker B (bottom).....................................................................................................161 Figure 54: (A) Emergency map with all junctions marked for selection for the FST nodes; (B) Emergency dialogue automaton with the nodes representing the reachable junctions selected .......................................................................................................163 Figure 55: Map FST with utterance exchanges IDs...........................................................168 Figure 56: Emergency map presented to the human user for the communication scenario with the dialogue system.............................................................................................175 Figure 57: Map task dialogue as a basis for map 
traversal automaton...............................176 Figure 58: Dialogue system architecture............................................................................178 Figure 59: Dialogue manager automaton with dialogue acts.............................................179 Figure 60: Dialogue manager automaton with exemplar utterances..................................180 Figure 61: Visualisation of the implementation of the dialogue system main algorithm..182 Figure 62: Dialogue system evaluation setting..................................................................186 Figure 63: Semi-coupled automaton 2...............................................................................228 Figure 64: Semi-coupled automaton 3...............................................................................228 Figure 65: Semi-coupled automata 4..................................................................................229 Figure 66: Generalised automaton 1 for speaker 1.............................................................237 Figure 67: Generalised automaton 2 for speaker 1.............................................................237 Figure 68: Generalised automaton 3 for speaker 1.............................................................237 Figure 69: Generalised automaton 4 for speaker 1.............................................................238 Figure 70: Generalised automaton 5 for speaker 1.............................................................238 xiii Communicative Alignment of Synthetic Speech – Jolanta Bachan Chapter 1: Introduction 1.1 Objectives of the thesis The central claim of the thesis is that a dialogue system should be well-motivated by dialogue theory and by analysis of actual dialogues, and that the resulting system should be tested in a real-world scenario. Based on this claim, the thesis concentrates on methodology and investigates a wide range of methods required for fulfilling these requirements adequately. 
The operational aim is to provide a simple proof-of-concept dialogue system based on the claim and combining written and spoken communication. The operational aim is therefore not to develop a fully functional dialogue system, but a prototype which illustrates the main claim and the methodology of the thesis in a simulated stressful emergency scenario.

The alignment theories discussed by Pickering and Garrod (2004) will be the focus of the present work. According to the alignment theories, alignment in dialogue takes place on semantic, syntactic and pragmatic levels. In the present thesis the work is focused on the semantic level and the thesis claim is:

Alignment of semantic representations is essential for successful communication in a dialogue.

The intention is to test semantic alignment both descriptively, using the dialogue act approach of Bunt (2000) and with two corpus linguistic studies, and operationally, with a finite state text-in-voice-out dialogue system which has been specially designed for this purpose. The finite state dialogue system uses a male Polish synthetic voice which was created for this application, and an innovative combination of two finite state systems: a finite state dialogue manager which controls a finite state map traversal system. To assure success in communication, routines for recovery from misalignment have also been addressed in the dialogue manager.

The methodologies which are dealt with include:
1. Linguistic dialogue theories.
2. Theory-based corpus linguistic description of dialogue.
3. Dialogue modelling with automata.
4. Speech synthesis component of a dialogue system and voice creation for speech synthesis module and its evaluation.
5. Dialogue corpus creation and evaluation with microvoices (synthetic voices which only cover a restricted range of the language, for experimental purposes).
6. Dialogue system demonstration prototype and evaluation.
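The combination of two finite state systems mentioned above, a dialogue manager which controls a map traversal system, can be pictured with a minimal Python sketch. It is a toy illustration under stated assumptions, not the implementation of the prototype: the class names, state names, dialogue act labels and junction names are all hypothetical.

```python
# Toy illustration only: all state, act and junction names are hypothetical.

class MapTraversalFSA:
    """Finite state model of a route: states are junctions, inputs are directions."""

    def __init__(self, transitions, start, goal):
        self.transitions = transitions  # {(junction, direction): next_junction}
        self.state = start
        self.goal = goal

    def step(self, direction):
        """Move along the map if the direction is valid here; report success."""
        target = self.transitions.get((self.state, direction))
        if target is None:
            return False  # no such move from the current junction
        self.state = target
        return True

    def at_goal(self):
        return self.state == self.goal


class DialogueManagerFSA:
    """Finite state dialogue manager which drives the map traversal automaton."""

    def __init__(self, route_map):
        self.state = "GREETING"
        self.route_map = route_map

    def respond(self, user_text):
        """Consume one typed user turn and return the label of the system's act."""
        if self.state == "GREETING":
            self.state = "INSTRUCT"
            return "ask_instruction"
        if self.state == "INSTRUCT":
            direction = user_text.strip().lower()
            if self.route_map.step(direction):
                if self.route_map.at_goal():
                    self.state = "DONE"
                    return "confirm_arrival"
                return "acknowledge"
            # Recovery from misalignment: stay in INSTRUCT and ask again.
            return "request_repeat"
        return "closing"


route = MapTraversalFSA(
    {("hospital", "left"): "roundabout", ("roundabout", "right"): "accident"},
    start="hospital",
    goal="accident",
)
manager = DialogueManagerFSA(route)
for turn in ["hello", "up", "left", "right"]:
    print(turn, "->", manager.respond(turn))
```

In this sketch the map automaton never talks to the user directly: every move is requested by the dialogue manager, and a failed move keeps the manager in the same state, which is one simple way of picturing a recovery routine.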
In order to create the demonstration prototype, the specific computational linguistic issues to be addressed include:
1. Dialogue design based on a formal analysis of the dialogue act in the first corpus linguistic study, with finite state modelling, and on a scenario-specific dialogue act analysis in the second corpus linguistic study.
2. Formal-informal speech style selection in a realistic stress scenario (emergency dialogue with a hospital call-centre).
3. Formal properties of automaton models.
4. Information extraction from two corpora for dialogue modelling.
5. Information extraction from text and speech corpora and a speech corpus creation for synthetic speech voice creation.

1.2 Motivation of the thesis

In the information society people need to cooperate more and more with computer systems, and therefore computer systems need to be designed which make this cooperation easier. Typical activities such as looking for timetables on the internet, booking flights via online forms and changing the settings of a mobile phone in call centres are very common. The human user has to follow automatic instructions because in general there is no human operator. However, such communication systems are not natural, often the processes are lengthy and time-consuming, and they are always restricted to the pre-defined options of the system. In certain situations, when these options fail the customers are redirected to human operators as the required tasks are too complex for the system. Two main issues are involved here: first, the ‘intelligence’ of the system, and second, the ‘naturalness’ of the input-output interaction. The present study concentrates on input-output interaction with text-in-voice-out dialogue, a common configuration in commercial information systems such as satellite navigation devices and screen readers for the blind.
Because talking is more natural than dialling numbers or filling in text forms, many institutions provide call centres where people can choose to talk about their problems or requests with a human operator. However, human work time is very expensive and one person can basically deal with just one customer at a time. Therefore, in the technologies concerned with making input-output issues more natural, much effort is being put into the development of dialogue systems which can communicate with a human being via the speech signal and deal with more than one customer at a time (cf. the Verbmobil project, Wahlster 2000, and the SmartKom project, Wahlster 2006). Such a dialogue system has a speech recognition module, which receives human speech input and converts it to a form which is understandable by the computer, and a speech synthesis module, which produces synthetic speech to provide information back to the user. The communication between the human user and the computer system is administered by a dialogue manager which decides on the next actions the system should take. In addition to acoustic speech recogniser and speech synthesiser components, the system also includes computational linguistic components such as a machine-readable lexicon together with a parser which extracts meaning from the pre-processed human speech, and a natural language generation module which converts the reply created by the dialogue manager into natural language form. An example of a dialogue system architecture is shown in Figure 1.

Figure 1: Simplified architecture of a spoken dialogue system

1.3 Alignment and accommodation

In recent years new aspects of communication have been investigated which are relevant for developing natural human-computer dialogue interaction. These include alignment of communication form and content between the interlocutors (Pickering & Garrod 2004) and accommodation of interlocutors to each other (Giles et al. 1992).
It has been noticed that while communicating, interlocutors tend to adopt each other’s behaviour, such as style of speaking, vocabulary and gestures. In the present context, alignment is understood as adaptation on the syntactic, semantic and pragmatic levels of communication between the two interlocutors, including the choice of similar lexical items and speaking style. However, it needs to be emphasised that the form, content and degree of alignment depend on the communication situation and status relations between the interlocutors. The main distinction to be made for emergency scenarios is between alignment in public and private situations. In public situations in which interlocutors do not know each other the degree of alignment of their behaviours has been found to be smaller than in face-to-face conversations between two close friends (Batliner et al. 2008). In fact, there may be deliberate non-alignment between a call-centre operator and an emotional caller, in order to calm the caller.

Table 1 presents a dialogue excerpt from the dialogue corpus recorded for the present study, which shows an example of lexical alignment. Here Speaker A, while giving instructions, talks about the roundabout. Speaker B does not see the roundabout, so Speaker A defines it as a ‘circular flower bed’. In order to be understood, because Speaker A is nervous, Speaker B adopts the word ‘flower bed’ to refer to the roundabout, but then immediately returns to the regular word ‘roundabout’. Speaker A starts to use the word ‘roundabout’ again, apparently unintentionally, because later in the dialogue her focus is on giving the next instructions of the route and she no longer thinks of the roundabout.
Table 1: Dialogue excerpt with lexical alignment

| Polish | English |
|---|---|
| A: [route description] Przy rondzie są roboty, więc trzeba będzie je objechać [route description] | A: [route description] At the roundabout there are roadworks, so they must be passed by [route description] |
| B: Może Pani powtórzyć. Nie widzę tutaj ronda po drodze. | B: Can you repeat. I don’t see any roundabout on the way. |
| A: Znaczy... rondo, to jest taki, taki okrągły kwietnik. [yyy] jest [yyy] między sklepem a lodziarnią. | A: It means... the roundabout, this is, such a circular flower bed [yyy] is [yyy] between the shop and the ice cream parlor. |
| A: [route description] | A: [route description] |
| A: Następnie objechać rondo – ten taki okrągły kwietnik. | A: Then go round the roundabout – this circular flower bed. |
| B: Czyli po tym jak skręcę w prawo... | B: So after having turned right... |
| A: Tak. | A: Yes. |
| B: Muszę jeszcze skręcić w lewo, żeby dojechać do tego kwietnika. | B: I again have to turn left to get to this flower bed. |
| A: Tak, tak, tak. Jestem dość zdenerwowana iii iii wszystko... wszystko wydaje mi się takie... Przykro mi. | A: Yes, yes, yes. I’m quite nervous aaand aaand everything... everything seems to me so... I’m sorry. |
| B: Dobrze. Proszę się uspokoić. Czyli na rondzie gdzie muszę skręcić? | B: Good. Please, calm down. So at the roundabout, where do I have to turn? |
| A: [yyy][um] Na rondzie musi Pani [yyy] na rondzie musi Pani skręcić w [y] obok lodziarni [route description] | A: [yyy] [um] At the roundabout you have to [yyy] at the roundabout you have to turn at [y] next to the ice cream parlor. |

The present study focusses on basic aspects of alignment which are relevant for human-computer communication in stressful emergency scenarios in public. In public stress situations it is necessary to know whether the conversation is conducted in a formal or an informal style, rather than which emotions, in the usual senses of the term (‘fear’, ‘anger’, ‘sadness’, ‘happiness’ …, ‘neutral’; cf. Ortony & Turner 1990, Murray & Arnott 2008, Bachan & Surmanowicz 2008), are expressed in the interlocutors’ speech: for present purposes, negative emotions such as ‘fear’, ‘anger’, ‘sadness’ are included in the concept of ‘stress’. The ‘informal’ and ‘formal’ styles are more related to private versus public situations than to emotion, and both may occur in stress scenarios. These distinctions are taken into account in the dialogue system demonstration prototype.

If one of the interlocutors finds themselves in a difficult position and undergoes great stress, the interlocutor to whom the stressed person talks will try to align their speech on the syntactic (including lexical), semantic and pragmatic levels (Branigan et al. 2000), but will not try to empathise with the emotional state of their interlocutor. In the course of the conversation, the interlocutors will start to use the same vocabulary (Brennan & Clark 1996; Clark & Wilkes-Gibbs 1986; Wilkes-Gibbs & Clark 1992), but both speakers’ voices will not necessarily start to sound nervous because of the stress affecting one of the interlocutors. However, this is not necessarily the case with a professional call centre operator. It is assumed that the speaking style towards a stressed person (or a person in any other emotional state) is different from that towards a person who does not show any emotions. The dialogue system should be able to recognise the emotional states of its users and, based on the prosodic and lexical speech characteristics, apply a speech style which will be aligned with these emotional states (cf. Batliner et al. 2003).

1.4 Modelling dialogue

The goals of the present investigation include providing explicit models for relevant aspects of human-human communication connected with alignment and accommodation.
The literature on these topics does not consider ways of aligning synthetic speech with the human interlocutor in their interaction, focussing specifically on stressed and emotional speech in crisis situations, although acceptable human-computer interaction is the subject of much research. The models should enable appropriate speech style selection in these situations, based on the observations that existing models of emotion are both too simple and too speculative, that actors imitating crisis speech are not producing authentic crisis speech, and that in public stress scenarios the formal-informal style dimension is more relevant than emotion space. The general working hypothesis is that it is possible to replace traditional emotion label sets with a generic model of the following type (which would also apply to ‘emotion’ in addition to ‘style’ if required):

TRIGGER_SITUATION → STYLE → STYLE_MANIFESTATION

The trigger situation is the particular public stress scenario which requires a certain formal or informal communication style. The style manifestation is the set of syntactic, lexical and phonological conventions which are associated with the chosen style. The specific hypothesis is that it is possible to design and implement a speech style selection module based on this model to drive synthesiser-interlocutor alignment, and to implement it in speech synthesis software. Such a module should improve the naturalness and efficiency of human-computer communication.

In the spoken dialogue demonstration prototype, the styles and style manifestations are considered, but an automatic recognition of alternative trigger situations (age, gender, social status, task etc.) is not included since a specific single simulation scenario (a variety of map task with university graduate students) is used. For the spoken dialogue demonstration prototype, the focus is on the dialogue manager and speech synthesis modules.
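The generic model TRIGGER_SITUATION → STYLE → STYLE_MANIFESTATION can be read as two successive mappings. The following minimal Python sketch illustrates this reading only; it is not the speech style selection module of the prototype. The trigger labels, the fields of the manifestation record and the informal utterance are assumptions made for the example; the formal request for repetition is taken from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleManifestation:
    """Hypothetical bundle of conventions associated with one style."""
    address_form: str      # how the interlocutor is addressed
    repeat_request: str    # wording of a request for repetition
    allow_fillers: bool    # whether filled pauses such as "yyy" may be used


# First mapping (assumed labels): trigger situation -> style.
TRIGGER_TO_STYLE = {
    "public_emergency_call": "formal",
    "private_call_between_friends": "informal",
}

# Second mapping (illustrative values): style -> style manifestation.
STYLE_TO_MANIFESTATION = {
    "formal": StyleManifestation("Pani/Pan", "Może Pani powtórzyć?", False),
    "informal": StyleManifestation("ty", "Możesz powtórzyć?", True),
}


def select_manifestation(trigger_situation, default_style="formal"):
    """Apply both mappings; unknown triggers fall back to the default style."""
    style = TRIGGER_TO_STYLE.get(trigger_situation, default_style)
    return style, STYLE_TO_MANIFESTATION[style]


style, manifestation = select_manifestation("public_emergency_call")
print(style, manifestation.repeat_request)
```

Keeping the two mappings separate reflects the structure of the model: a new trigger situation can be assigned to an existing style without touching the conventions that realise that style.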
In human communication the interlocutors tend to align their behaviours, not only speech, but also gestures and body movements. The present investigation is not concerned with multimodal communication of this kind; consequently, the selected scenario is a telephone-like scenario with no visual contact between interlocutors. The present study is also not concerned with recognising and manipulating phonetic features of speech, e.g. prosodic and paralinguistic features such as voice quality, intonation, rhythm and tempo of human speech. However, styles are also characterised by lexical items and other markers such as hesitation phenomena, repetitions and curses, suggesting different behavioural and expressive states of the interlocutor. Based on the analysis of these items, a dialogue system can generate a kind of output which would be expected in human-human communication.

These stylistic markers in human-human communication may also indicate that the communication is not successful; if a recognition module were to be developed, situations when the system cannot understand the speaker would need to be modelled. In such situations the dialogue manager should select a different trigger situation for planning the conversation. Similarly, the dialogue manager may also apply a different speaking style to be generated by the speech synthesis module. Such a system would analyse the trigger situation, for example, domestic violence, and compare this trigger situation with the phonetic features manifesting human emotions, for example, fear. If the dialogue manager finds a scenario to be used in such a situation, it applies the appropriate scenario and an appropriate speech style, for example a reassuring style.

1.5 Contributions of the present research

First, the Pickering and Garrod (2004) approach to alignment is criticised and modified in the area of semantic alignment.
The first criticism is that Pickering and Garrod are not precise about what semantic alignment is. In the present research, two corpus linguistic studies are undertaken for this purpose, and in the operational system a map with certain unforeseeable properties is used as a reference point for semantic alignment, and negotiation of a route through the map requires semantic alignment of different types. The second criticism is that Pickering and Garrod only deal with cooperative alignment. The present research does not deal with non-cooperative alignment, but it deals with cooperative non-alignment to some extent, between a professional call-centre operator and a caller.

Second, the dialogue act approach of Bunt is criticised because in his earlier work, at the time of the corpus linguistic studies, the dialogue acts were simply listed abstractly, with no empirical illustration. A selection of Bunt’s dialogue acts was made for the purpose of the present research, and investigated in the corpus linguistic studies. In later work, Bunt (2010) added empirical information, but did not deal with scenarios such as the emergency calling scenario. A second criticism is that in the earlier work, and to a large extent in the later work, Bunt does not deal with sequences of dialogue acts, but only with a hierarchical classification of dialogue acts. In the present research, sequences of dialogue acts in the corpus and also in the operational system are modelled with finite state automata.

Third, the present research has an operational outcome: a text-input-voice-output dialogue system which is intended to test the points listed above, and an evaluation of this system. The system uses two finite state systems, one as a dialogue manager and the other as a map traversal algorithm, with the dialogue manager controlling the map traversal module. One further original contribution in this context is the new Polish male voice PL2 for the MBROLA (Dutoit et al. 1996) speech synthesis system.

1.6 Overview

Following the introduction to the topic and the research aims presented in this chapter, in Chapter 2 a brief selection of relevant theoretical linguistic approaches to alignment and dialogue description, and their implications for the development of the spoken dialogue demonstration prototype, are discussed. In Chapter 3 dialogue modelling is briefly introduced and components of dialogue systems are presented. In Chapter 4 a pilot study in which theoretical principles are applied to actual dialogue description is undertaken. In this study the research is carried out on an existing dialogue corpus. Chapter 5 presents the development of provisional automaton models of the dialogue. The aim is to develop techniques and tools for dialogue modelling in the prototype dialogue system. Chapter 6 is concerned with prerequisites for developing a speech synthesis module for a dialogue system. It presents the results of a diphone search in existing text and speech corpora and introduces two tools developed for efficient diphone database creation. The creation of a speech corpus used for Polish male synthetic voice creation is presented together with an evaluation of the voice.

Chapters 7 and 8 present the test of the thesis claim. Chapter 7 is a corpus linguistic study of the kinds of alignment in public emergency dialogues which are required for designing the spoken dialogue demonstration prototype. In this chapter, the creation of the dialogue corpus is presented and prompt materials and recording techniques are discussed. The addressed scenarios are stressful emergency situations and neutral dialogues based on map and diapix tasks. The development of the spoken dialogue demonstration prototype, including evaluation with human users, is dealt with in Chapter 8.
The chapter presents an innovative technique combining two finite state automata which work together in the dialogue system: one for map traversal, and one for dialogue negotiation. Chapter 9 is concerned with the conclusions from the present work and tasks for the future. Much of the empirical and technical material (materials for speech corpus recording scenarios, tables with results of empirical studies, automaton models of dialogue structure, code of software tools) is included in Appendices in order to avoid distraction from the main argument in the text.

Chapter 2: Alignment – critical overview

2.1 Chapter overview

The specification of a dialogue system depends partly on linguistic, psycholinguistic and logical specifications of the domain of language in dialogue. The discussion of these concepts will be very selective and brief, because relevant studies tend to be very general from the point of view of speech technology; they are important foundations for dialogue system development but not the focus of attention in the present research. In this chapter, the relevant concepts of ‘alignment’, ‘coordination’, ‘common ground’, ‘speech act’, ‘dialogue act’, ‘sign’, and ‘language-as-product’ vs. ‘language-as-action’ will be discussed. The discussion mainly follows the approach of Pickering & Garrod (2004) and Levelt (1992). The main thesis of this chapter is that alignment in dialogue takes place on syntactic, lexical, semantic, and pragmatic levels of language as well as on the obvious levels of pronunciation and prosody of speech.

2.2 Basic alignment models

Alignment was defined in the Introduction as adaptation on the syntactic, semantic and pragmatic levels of communication between the two interlocutors, including the choice of lexical items and speaking style; the form, content and degree of alignment depend on the communication situation and status relations between the interlocutors.
For the present investigation, the main distinction for emergency scenarios is to be made between alignment in public situations and alignment in private situations, which affects the use of different utterance styles. The problem of emotional alignment is important, but not directly relevant for communication in public situations. Even if a person calling a call centre is highly stressed and emotional, it is not a good idea for the call-centre response to use the same emotional utterance types, but the response must still be aligned on the basis of appropriate strategies for achieving successful communication with a stressed person. There are several questions which must be answered clearly.

What function does alignment have in communication?

For the present study, the following function is the most important: the general function of alignment is coordination between interlocutors in order to achieve a successful outcome of communication. Alignment in dialogue is a component of communication and a social activity, and a successful outcome may be defined on many different levels: alignment of pronunciation, alignment of vocabulary, alignment of syntax, and also alignment of descriptive semantic content and alignment of pragmatic functionality. Whether alignment is a conscious, strategic behaviour or a subconscious, implicit behaviour is a separate issue which is not in the focus of the present study.

What is the purpose of alignment in a dialogue system?

Alignment is a kind of behaviour control procedure during communicative interaction. People may use many levels of alignment procedure in communication, including the language features which have been mentioned already, and also gestures of the face, the hands and the position of the body.
Clark (1985) suggests that other kinds of non-linguistic coordinated activity, such as dancing and cooperation on the same practical task, may be subject to the same principles of alignment.

What approaches to modelling alignment have been proposed?

Pickering and Garrod (2004) outline four approaches which will be discussed below.

2.2.1 Alignment as a social phenomenon

As a social phenomenon, alignment in communication depends on status relations between the speakers and listeners, who consider the social effect of their utterances. The principle of alignment as a social phenomenon is that people want to communicate politely, cooperatively and successfully with each other (Grice 1975; Giles et al. 1992; Allwood et al. 2000). It is true that there are also types of communication which are based not on cooperation but on conflict and aggressiveness. In these communication scenarios alignment may be deliberately avoided, but in some way alignment is still a reference point for communication. However, in the stressful emergency dialogue scenario involved in the present study, successful communication will be cooperative and potentially supported by alignment.

From the point of view of dialogue system development, an exclusively social view of alignment is too restrictive because it concentrates on the obvious observation that alignment is a social phenomenon. This is incomplete: by concentrating on pragmatics, the view does not take into consideration the necessary formal dimensions of communication, such as appropriate formulation (pronunciation, lexicon and syntax) and adequate expression of content (semantics).

2.2.2 Alignment as Audience Design

Another model of alignment considered by Pickering & Garrod (2004) is the Audience Design model. In the case of Audience Design, the speaker chooses the expressions most likely to be correctly understood and accepted by the listener.
The aim of this is to enhance communication on the basis of beliefs which the speaker has about the hearer. The main problems with the theory of the Audience Design mechanism of alignment are:

1. From a processing point of view, Audience Design is very complex to compute. Many levels of language, speech and interaction have to be taken into account during the alignment process, involving listener modelling and inference making.
2. The Audience Design model does not provide a robust procedure, since each aspect of alignment depends on many assumptions which may not be true.
3. The Audience Design model does not explain the other pragmatic, social and non-linguistic dimensions of alignment which affect the speaker.

2.2.3 Alignment as Priming

Alignment seen as Priming involves mechanisms of linguistic representation which are generally considered to be automatic, like other priming processes. Priming means the preparation of a speaker or hearer for behaving in a certain way on the basis of previous perception or behaviour. In this view, Pickering & Garrod (2004) claim that alignment automatically falls out of linguistic processing, because priming applies to many other kinds of linguistic behaviour. Pickering & Garrod point out that this view offers the following features:

1. Priming is cognitively economical: the processes involved are those which are involved in regular speech production.
2. Priming is robust: detailed listener models are not needed, since information is taken from perception of the immediate context.
3. Priming explains linguistic repetitions and imitation.
4. Priming is computationally less complex for common kinds of phonetic and phonological alignment, which is very rapid and is ‘resource-free’, i.e. does not involve huge cognitive resources.
5. Alignment is a process which takes place below awareness levels.
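
The listed features of Priming can be illustrated with a small computational sketch. The code below is only a toy illustration under stated assumptions and is not a model proposed by Pickering & Garrod or implemented in this thesis: the class name `PrimedSpeaker`, the numeric activation values and the `boost` and `decay` parameters are all hypothetical. It shows how a speaker's choice between competing forms for the same concept can shift towards a form just heard, without any model of the listener, and how this effect fades as further utterances are perceived.

```python
class PrimedSpeaker:
    """Toy speaker whose lexical choice is biased by recently heard forms.

    All numbers are illustrative assumptions, not empirical values.
    """

    def __init__(self, baseline, boost=1.0, decay=0.5):
        # baseline: concept -> {form: resting activation of that form}
        self.baseline = baseline
        # residue: (concept, form) -> activation left over from perception
        self.residue = {}
        self.boost = boost
        self.decay = decay

    def hear(self, concept, form):
        """Perceive a form: older priming fades, the heard form is boosted."""
        for key in self.residue:
            self.residue[key] *= self.decay
        key = (concept, form)
        self.residue[key] = self.residue.get(key, 0.0) + self.boost

    def say(self, concept):
        """Produce the currently most active form for a known concept."""
        forms = dict(self.baseline[concept])
        for (primed_concept, form), value in self.residue.items():
            if primed_concept == concept:
                forms[form] = forms.get(form, 0.0) + value
        return max(forms, key=forms.get)


if __name__ == "__main__":
    speaker = PrimedSpeaker({"vehicle": {"car": 0.6, "automobile": 0.4}})
    print(speaker.say("vehicle"))           # unprimed choice
    speaker.hear("vehicle", "automobile")   # interlocutor uses the other form
    print(speaker.say("vehicle"))           # choice immediately after priming
```

In this sketch the only input to production is the speaker's own perception of the immediate context, which corresponds to features 1 and 2 above; no beliefs about the hearer are represented.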

Alignment is a process which does not only concern normal speakers. It also concerns speakers with some kinds of impairment, such as autism. In an experiment, the alignment of Noun Phrase structure in children was examined (Allen et al. 2011). In this experiment, the syntactic alignment behaviour of children with Autistic Spectrum Disorder (ASD) and of non-autistic children was compared. The children with ASD spontaneously converged, or aligned, syntactic structure with an interlocutor. Children with ASD were more likely to produce a passive structure to describe a picture after hearing their interlocutor use a passive structure to describe an unrelated picture when playing a card game. Furthermore, they converged syntactic structure with their interlocutor to approximately the same extent as did both chronological and verbal age-matched controls: autistic children 24%, age-matched children 21%, verbal-age-matched controls 20%. These results suggest that the linguistic impairment that is characteristic of children with ASD, and in particular their difficulty with interactive language usage, cannot be explained in terms of a general deficit in linguistic imitation such as alignment by Priming.

The Priming point of view can also be criticised. Priming does not explain the more abstract levels of alignment, since it is based exclusively on the perception of linguistic input, and it does not account for functional properties of alignment in increasing the chance of cooperative and successful communication.

2.2.4 Alignment as Inter-level Interaction

In the Interaction model, alignment automatically takes place at several different levels of language at the same time. Pickering and Garrod (2004) consider that the Interactive Alignment model is too strong if it is taken literally.
For example, it is not always the case that alignment at one level of representation leads to alignment at other levels. Alignment at the lexical level may mask an underlying misalignment at the semantic level, for instance when ambiguity is involved: “John!” may denote John Brown or John Smith. The Inter-Level Interaction model will be referred to again below. For the present study, the point is that the model implies that the different views of alignment may not be competitors. They may occur in combination as simultaneous and interacting procedures in a multiple mechanism composed of the described components: social behaviour, audience orientated, primed, interactive or all of these. The components do not have to be mutually exclusive, and some contexts may require any combination of these components, or all of them.

2.2.5 Alignment in human-computer interaction

In studies of human-computer interaction, it has been suggested that the way humans interact with computers is related to beliefs about the social status of interlocutors, beliefs and knowledge about computers, and beliefs about the linguistic capability of interlocutors. It appears that there may be a lower degree of alignment when speakers interact with people of lower social status and more alignment when the speaker believes their interlocutors to be linguistically less capable. In human-computer interaction it seems that people communicate with computers as if computers were like people who are rather stupid and of lower social status. In an experiment using the Reverse Wizard of Oz scenario, lexical alignment was investigated by Branigan (Branigan 2009, cf. Pearson et al.
2006): 83% alignment occurred when people believed they were interacting with a computer, which was true, and 44% alignment occurred when people believed they were interacting with a human, which was not true, as they were in fact interacting with a computer. Similarly, in a second experiment, in which the system was advertised either as an older dialogue system for $10 or as a new system from 2003 for $299, there was 80% alignment with the basic version of the program and 42% alignment with the advanced version of the program.

These experimental results suggest that people align more with computers than with people, and apparently they transfer their beliefs about people they align less with to computers: they also align more with stupid computers than with smarter ones (or rather, with computers that they think are stupid or smart).

2.2.6 Alignment, coordination, and situation models

All of the views discussed so far leave many issues open, in particular the functionality of alignment: what actually is successful communication? In the following sections a number of issues in this area will be discussed briefly, mainly based again on Pickering & Garrod (2004).

According to Clark (1985), dialogue is a joint activity, and its coordination is similar to that in other coordinated activities, such as ballroom dancing or lumberjacks using a two-handed saw. An obvious case which is not mentioned by Pickering & Garrod or Clark is that of some kinds of sport, such as tennis, baseball, football, boxing and wrestling. According to another approach, coordination occurs when interlocutors share the same linguistic representation at some level (Branigan et al. 2000; Garrod & Anderson 1987). Pickering and Garrod (2004) prefer to call the first case ‘coordination’ and the second case ‘alignment’. Alignment occurs at a particular level when interlocutors have the same representation at that level. So dialogue is coordinated, but also aligned.
But it is not clear whether there are other alignment levels in the other activities which are coordinated. This is not discussed by Pickering & Garrod.

Pickering & Garrod (2004) continue their discussion of alignment by introducing situation models and relating them to other alignment concepts:

1. Alignment of situation models (Zwaan & Radvansky 1998) forms the basis of successful dialogue.
2. The way that alignment of situation models is achieved is by a primitive and resource-free priming mechanism.
3. The same priming mechanism for situation models produces alignment at other levels of representation, such as the lexical and syntactic.
4. Interconnections between the levels mean that alignment at one level leads to alignment at other levels.
5. Another primitive mechanism allows interlocutors to repair misaligned representations interactively.
6. More sophisticated and potentially costly strategies that depend on modelling the hearer’s beliefs are only needed if the primitive mechanisms do not succeed in producing alignment.

On this basis, they propose their own version of the Interactive Alignment account of dialogue alignment.

In a dialogue system, the users are in a certain situation which has to be modelled. A situation model as introduced by Pickering & Garrod is described as a multi-dimensional representation of the situation under discussion (Johnson-Laird 1983; Sanford & Garrod 1981; van Dijk & Kintsch 1983; Zwaan & Radvansky 1998). According to Zwaan and Radvansky, the key dimensions encoded in situation models are space, time, causality, intentionality, and reference to main individuals under discussion. This is clearly relevant for the current research. Pickering & Garrod criticise approaches which propose two situation models, one for the speaker and one for the hearer, because they are too complicated and inefficient.
But the criteria of complexity and efficiency are not clear. For a dialogue system in which new information has to be communicated, this criticism is not justified. There are also other situations in which two models may be needed: for deception, lying, or hiding confidential information. Therefore full alignment of the situation models may not be possible. Lack of alignment also occurs when misunderstandings happen. So misalignment may have to be tolerated, and error-correction mechanisms may be needed.

In the present study, the central question to be tackled is how (or to what extent) the dialogue system can align with the key dimensions of the situation model, namely space, time, causality, intentionality, and reference to main individuals under discussion. If the system in the emergency call centre is able to align to these dimensions with a high degree of accuracy, then it should be able to assign the appropriate priority to the phone call and classify the call, as well as follow instructions about the emergency location: this is situation model alignment. The situation model provides a set of features for the TRIGGER_SITUATION part of the model presented in the Introduction.

In an extreme case, if two people are in very different situations, such as a stressed caller and a call-centre employee, or if two people come from different cultures and speak different languages, it is still possible for them to align their situation models through explicit negotiation (Brennan & Clark 1996; Clark & Wilkes-Gibbs 1986; Garrod & Anderson 1987; Schober 1993).
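
The idea of aligning on the five dimensions named by Zwaan and Radvansky can be made concrete with a minimal data-structure sketch. Everything in the code below is an illustrative assumption rather than the representation used in the prototype: the class `SituationModel`, the helper functions, the use of plain strings as dimension values, and the simple rule that a dimension counts as aligned only when both models contain the same filled-in value. The example values in the usage note are invented.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SituationModel:
    """One interlocutor's view of the situation (hypothetical encoding)."""
    space: Optional[str] = None
    time: Optional[str] = None
    causality: Optional[str] = None
    intentionality: Optional[str] = None
    reference: Optional[str] = None


def misaligned_dimensions(a: SituationModel, b: SituationModel) -> List[str]:
    """Return the dimensions on which the two models are not yet aligned.

    A dimension is treated as aligned only if both models have a value
    for it and the values are identical (a deliberately crude criterion).
    """
    result = []
    for field in fields(SituationModel):
        value_a = getattr(a, field.name)
        value_b = getattr(b, field.name)
        if value_a is None or value_b is None or value_a != value_b:
            result.append(field.name)
    return result


def alignment_degree(a: SituationModel, b: SituationModel) -> float:
    """Fraction of the five dimensions on which the models agree."""
    total = len(fields(SituationModel))
    return (total - len(misaligned_dimensions(a, b))) / total
```

With a caller model in which all five dimensions are filled and a system model that so far shares only space, time and causality, `misaligned_dimensions` returns `["intentionality", "reference"]`. A dialogue manager could use such a list to decide which dimension to negotiate explicitly next, in the spirit of the error-correction mechanisms mentioned above.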
According to Pickering & Garrod (2004), the global alignment of the situation models seems to result from the local alignment at the level of the linguistic representations being used, and they propose that this kind of alignment works via a priming mechanism: if a hearer hears an utterance that activates a particular representation, then priming creates an expectation that makes it more likely that the hearer will subsequently produce an utterance that uses that representation when he takes on the speaker role. This kind of interactive priming becomes an essential part of Pickering & Garrod’s approach to alignment.

The starting point for the Pickering & Garrod approach was apparently Garrod and Anderson (1987), who introduced a principle of output/input coordination: in a maze game task, players tended to make the same semantic and pragmatic choices that held for the utterances that they had just heard. In other words, what they said, i.e. their outputs, tended to match what they heard, i.e. their inputs, at the level of the situation model. During the course of interaction the semantic and pragmatic representations used for producing output and processing input became aligned. The studies (cf. Garrod & Anderson 1987, Brown-Schmitt et al. 2005) provide evidence for alignment of situation models in comprehension. The conclusion to be drawn for the present study is that if there is a factor constraining the speaker’s situation model, it also constrains the listener’s situation model.

2.2.7 Levels of alignment

In the Introduction, alignment was defined with reference to different levels of language, and in the literature relations such as repetition and imitation are mentioned in this connection.
Transcriptions of dialogues (see the corpus linguistic study in Chapter 4) contain numerous repeated linguistic elements and structures, which indicates that there is alignment not only of the situation model, but also at other levels (Aijmer 1996; Schenkein 1980; Tannen 1989). As Pickering & Garrod point out, the following levels may become aligned during dialogue:

1. Lexicon: the same expressions tend to be used while referring to particular objects; the expressions become shorter and more similar when used with the same interlocutor and get modified if the interlocutor changes (Brennan & Clark 1996; Clark & Wilkes-Gibbs 1986; Wilkes-Gibbs & Clark 1992).
2. Syntax: interlocutors tend to use the same syntactic structures (Branigan et al. 2000).
3. Phonetics: the articulation of interlocutors’ repeated expressions becomes increasingly reduced, i.e. the expressions developed during a dialogue are shortened and harder to recognise when heard in isolation. Additionally, interlocutors tend to align accent and speech rate (Giles et al. 1992; Giles & Powesland 1975).
4. Semantics and pragmatics: some evidence on comprehension was provided by Levelt and Kelter (1982, Experiment 6), in which subjects were presented with question-answer pairs and their task was to assess their naturalness. Pairs in which the form was repeated got the best scores. This suggests that people prefer to get responses aligned with their own form.

Pickering and Garrod (2004) say that in successful dialogue the interlocutors develop aligned situation models and aligned representations at all linguistic levels. Additionally, priming at one level leads to priming at other levels. However, Pickering and Garrod are not very precise about the formal properties of semantic alignment, and they do not underline that alignment on the semantic level is essential for successful communication.
Also, they do not deal with cooperative non-alignment, where one person is stressed and the other person does not align but tries to persuade the first to align with the stress-free person, and which is required for the scenarios in the present research.

2.2.8 Error trapping with misalignment

An important activity in dialogue is error trapping, in this case recovery from a state of misalignment, when the interlocutors’ interpretations of utterances differ, for instance with ambiguities. In dialogue it happens that people use the same name, but they think of two different people. These interlocutors align on the superficial level, but their situation models are misaligned. In such cases the interlocutors need to use recovery mechanisms which will help them establish alignment, i.e. establish who the person they refer to is. The recognition of errors and the treatment of errors are necessary properties of a spoken dialogue system.

2.3 Communicative signs: function and processing

Communication uses signs, and alignment means the alignment of signs with all their properties which are involved in communication. Alignment processes cover syntax, semantics and pragmatics. Therefore understanding what alignment is also depends on understanding what a sign is.

The de Saussure sign model (1913) is shown in Figure 2, which shows the meaning-form (signifié-signifiant) relation, which de Saussure sees as a mental relation between the concept and the sound image. The picture in the middle illustrates the relation.

Figure 2: The Saussurean sign model

2.3.1 Levelt & Schriefers’s ‘sign pie’

The Levelt & Schriefers (1987) model, which is known as the ‘sign pie’, has three components, unlike de Saussure’s model, which has two components. The third component is syntax, which answers the criticism that de Saussure’s model (like the models of Bühler and Jakobson) does not explicitly contain a syntax component.
The sign pie model, which is also a mental model, is visualised in Figure 3.

Figure 3: Levelt & Schriefers's ‘sign pie’ (1987: 396)

Levelt & Schriefers (1987: 396) point out:

An item’s syntactic properties always play a crucial role in the sentence generation process. They determine the syntactic environments that must be realized if that item is to be used, and these in turn impose constraints on the syntactic properties of further items to be retrieved. Or to put it differently: where concepts clearly serve as input for lexical access in speech production, yielding sound images as output, syntax plays both input and output roles.

Examples of the importance of syntax are found with prepositions, which may depend more on grammatical relations than on meaning relations. The Levelt & Schriefers model is used as the basis for a model of activation in communication, as shown in Figure 4.

Figure 4: Levelt & Schriefers's image of the activation of a linguistic sign in speech production (Levelt & Schriefers 1987: 396)

The extended model of Levelt and Schriefers shows a move from the language-as-product view of traditional sign models to the language-as-action approach which is necessary in psycholinguistics and speech technology. Pickering & Garrod comment on the language-as-product tradition as follows:

The language-as-product tradition is derived from the integration of information-processing psychology with generative grammar and focuses on mechanistic accounts of how people compute different levels of representation. (Pickering & Garrod 2004: 170)

They point out that in the language-as-action tradition utterances are interpreted with respect to a particular context, taking into account the goals and intentions of the participants.
This tradition has typically considered processing in dialogue using apparently natural tasks (e.g., Clark 1992; Fussell & Krauss 1992). (Pickering & Garrod 2004: 170)

Finally they compare the two traditions:

Whereas psycholinguistic accounts in the language-as-product tradition are admirably well-specified, they are almost entirely decontextualized and, quite possibly, ecologically invalid. On the other hand, accounts in the language-as-action tradition rarely make contact with the basic processes of production or comprehension, but rather present analyses of psycholinguistic processes purely in terms of their goals (e.g., the formulation and use of common ground; Clark 1985; Clark 1996; Clark & Marshall 1981). (Pickering & Garrod 2004: 170)

Although Pickering & Garrod claim that the product approach is not relevant for alignment, this is not true in the context of computation. A product is at the same time a result of processing and an input for processing. In spoken dialogue, one product (for example a situation model) is changed into another product (a modified situation model) by processing. So the two approaches are not as incompatible as Pickering & Garrod claim.

The Levelt model is extended in other work. The Levelt production model has three main components: planning, formulation (with two subcomponents) and articulation (Table 2).

Table 2: Processing modules in speech generation and their relation to phases of lexical access (Levelt & Schriefers 1987: 398)

Processor | Input | Output | Relation to Lexical Access
--- | --- | --- | ---
Conceptualiser | communicative intention | preverbal message | creating a lexical item’s conceptual conditions
Grammatical encoder | preverbal message | surface structure | retrieval of lemma, i.e. making the item’s syntactic properties available, given appropriate conceptual or syntactic conditions
Sound form encoder | surface structure | phonetic or articulatory plan for utterance | retrieval of the lexeme, i.e. the item’s stored sound form specifications, and its phonological integration in the articulatory plan
Articulator | phonetic plan | overt speech | executing the item’s context-dependent articulatory program

The Formulator (Figure 5), which is the most relevant component in this context, is characterised as follows by Levelt (1992):

In speech production the formulator is described as a process whose input is the lexical concept (the message) and whose output is a phonetic or articulatory plan for the item. The appropriate item for the mental lexicon is selected and is integrated into the developing grammatical encoding. An articulatory program is created for the selected lexical item on the basis of its stored phonological code and the phonological context of the utterance as a whole.

Figure 5: An outline of lexical access in speech production (Levelt 1992: 4)

2.3.2 The revised Interactive Alignment model of dialogue processing

According to Pickering and Garrod (2004: 175), the interactive alignment model assumes that successful dialogue involves the development of aligned representations by the interlocutors. This occurs by priming mechanisms at each level of linguistic representation, by percolation between the levels so that alignment at one level enhances alignment at other levels, and by repair mechanisms when alignment goes awry.

Figure 6 illustrates the alignment process. The linguistic levels of two interlocutors are linked. In Figure 6, A and B represent two interlocutors in a dialogue in this schematic representation of the stages of comprehension and production processes according to the interactive alignment model.
The horizontal links show the channels by which alignment takes place at these levels by means of the Priming mechanism, including lexical priming, syntactic priming, etc.

Figure 6: Schematic representation of the stages of comprehension and production processes according to the interactive alignment model (Pickering & Garrod 2004: 176)

The interactive alignment model does not apply in this way to monologues, whose goal is not to become aligned with the listener, although indirect alignment (with previously experienced communication) may occur. In a monologue the speaker tries to formulate the message in such a way that the listener can obtain the appropriate representation corresponding to the speaker’s message. The important fact is that in monologue (including writing) the speaker’s and the listener’s representations may never align, because the automatic mechanism of alignment is not present. The alignment mechanism operates only when the speaker gets regular feedback from the interlocutors and can control the alignment process on this basis.

In dialogue, priming is the central mechanism in the process of alignment and mutual understanding. Thus dialogue indicates the important functional role of priming (Pickering and Garrod 2004). The process of interactive alignment by priming is supported by further factors:

1. The use of routine procedures in dialogue.
2. The use of implicit common ground (background knowledge which is assumed to be shared) and explicit common ground (which is mentioned in the dialogue).

Pickering & Garrod discuss an alternative model, the autonomous transmission model, in which the transfer of information between producers and comprehenders takes place via decoupled production and comprehension processes that are isolated from each other (see Figure 7).
Communication takes place only through the acoustic medium and the messages are constructed independently by the speaker and the hearer.

Pickering and Garrod (2004) say that the autonomous transmission model is not appropriate for dialogue. In dialogue the production and comprehension processes are coupled, and this is the core of the interactive alignment model. However, it is not clear how the interactive alignment model can be represented in a precise model: the horizontal connections between levels do not exist independently of the physical signal transmission. Therefore, contrary to what Pickering and Garrod claim, the interactive alignment processes at different levels in the overall alignment procedure can only be reconstructed from an autonomous transmission model of physical contact via speech.

Figure 7: Schematic representation of the stages of comprehension and production processes according to the autonomous transmission account (Pickering & Garrod 2004: 177)

2.4 Speech acts and dialogue acts

Austin (1962) presented two theories of speech acts. In the first theory, he distinguishes between constative and performative utterances. Constative utterances can be true or false, as in traditional propositions. Performative utterances cannot be true or false, but perform some action, for example questions, commands, promises, etc. In the second theory, the functions or ‘force’ of utterances are treated, and it is claimed that there are no basic distinctions between constative and performative utterances, which all share certain types of force:

1. locutionary force (propositional content of utterance: predicates and arguments),
2. illocutionary force (conventional use of utterances to create social links between the interlocutors), “doing things with words”; “hereby” = “niniejszym”; illocutionary verbs = speech act verbs,
3. perlocutionary force (effect the utterance has on the hearer).

The forces are indicated by language forms and structures:

1. word order,
2. stress,
3. intonation contour,
4. punctuation,
5. the mood of the verb,
6. the so-called performative verbs (e.g. ‘say’, ‘tell’, ‘confess’, ‘promise’, ‘warn’, ‘baptise’).

Searle (1969) extended and modified Austin’s theory and developed 9 constitutive rules which define a successful utterance and the role of the speaker and his beliefs about the hearer in producing it. For this purpose he distinguishes between utterance acts (produced words), propositional acts (assigning meaning to the utterance acts) and illocutionary acts (similar to Austin’s ‘illocutionary force’). Searle gives the example of ‘promise’:

Given that a speaker S utters a sentence T in the presence of a hearer H, then, in the literal utterance of T, S sincerely and non-defectively promises that p to H if and only if the following conditions 1-9 obtain. (Searle 1969: 57)

Searle’s formulation of the felicity conditions for promising (1969) is:

1. Normal input and output conditions obtain.
2. S expresses the proposition that p in the utterance of T.
3. In expressing that p, S predicates a future act A of S.
4. H would prefer S’s doing A to his not doing A, and S believes H would prefer his doing A to his not doing A.
5. It is not obvious to both S and H that S will do A in the normal course of events.
6. S intends to do A.
7. S intends that the utterance of T will place him under an obligation to do A.
8. S intends (i-1) to produce in H the knowledge (K) that the utterance of T is to count as placing S under an obligation to do A. S intends to produce K by means of the 26 Communicative Alignment of Synthetic S