Faculty of Mathematics and Computer Science
Adam Mickiewicz University, Poznań

Michał Junczyk, MSc

Application of speech datasets management methods for the evaluation of Automatic Speech Recognition systems for Polish

PhD thesis

Supervisor: prof. dr hab. Krzysztof Jassem
Discipline of science: Computer and information sciences
Field of science: Natural sciences


Wydział Matematyki i Informatyki
Uniwersytet im. Adama Mickiewicza w Poznaniu

mgr inż. Michał Junczyk

Zastosowanie metod zarządzania zbiorami nagrań mowy do oceny jakości systemów automatycznego rozpoznawania mowy dla języka polskiego

Rozprawa doktorska

Promotor: prof. dr hab. Krzysztof Jassem
Dyscyplina naukowa: Informatyka
Dziedzina nauki: Nauki ścisłe i przyrodnicze


Declaration

I, Michał Junczyk, declare that the work in this dissertation titled "Application of speech datasets management methods for the evaluation of Automatic Speech Recognition systems for Polish" was carried out by me. This work has not been submitted to Adam Mickiewicz University or any other educational institution for the award of a degree or educational qualification. The information published in this dissertation has been obtained and presented in accordance with academic rules and ethical conduct. Information obtained from other sources has been referenced appropriately.

Dedication

I am grateful to many who supported me throughout this PhD journey. I greatly appreciate the unwavering support, insightful feedback, and patient guidance of my supervisor, prof. dr hab. Krzysztof Jassem. I thank the leadership and staff of the Faculty of Mathematics and Computer Science and the Doctoral School of Exact Sciences for a supportive and stimulating environment. I am grateful for the feedback and support from my mentors and colleagues at Samsung and Allegro, especially dr Mikołaj Wypych, dr inż. Bartosz Broda, dr inż. Marcin Sowański, mgr inż. Ireneusz Gawlik, mgr inż. Robert Mroczkowski, dr Aleksander Wawer, dr inż. Paweł Zawistowski and, last but not least, mgr inż. Paweł Cyrta. A heartfelt thank you to my parents for their lifelong support and belief in me. Lastly, to my beloved wife and children: your patience, sacrifices, and unwavering support made this journey possible.

Abstract

Automatic speech recognition (ASR) systems transform spoken language into written text, enabling virtual assistants, transcription tools, and intelligent home control. These systems rely on large and diverse speech datasets that reflect the linguistic and acoustic characteristics of the target population and user group. The Polish language, spoken by over 50 million people, presents ASR with unique challenges and opportunities due to its rich phonetic and morphological structure.

Public-domain speech datasets are often underutilized due to discoverability and interoperability issues. Limited access to evaluation datasets makes it difficult to verify and replicate quality tests of ASR systems, and a comprehensive assessment of multiple ASR systems requires an efficient data management structure. This study addresses these issues by creating comprehensive, accessible, and actively maintained datasets and by promoting best practices in ASR benchmarking inspired by international standards.

The study examined and cataloged 53 publicly available speech datasets, curated a benchmark dataset from 24 sources, and developed a quality assessment process for ASR systems. The curated dataset includes nearly 400,000 recordings and over 800 hours of speech from 5,000 speakers.
Selected recordings were used to compare 7 ASR systems and 25 models. The research revealed significant differences in the performance of ASR systems across test scenarios. All resources and results have been made publicly available to promote transparency, peer review, and collaboration within the research community.

This study improved methods for the management of speech data and the benchmarking of ASR systems. The comprehensive review and catalog increased the discoverability of Polish ASR speech datasets, and the curated BIGOS and PELCRA datasets provide an extensive resource of diverse speech recordings. The use of Polish ASR datasets for comparative purposes has increased threefold compared to previous studies. Improved documentation and analysis of the test data, together with the public availability of the datasets and assessment tools, will positively impact the ability to validate and compare evaluation results. The development of a data management methodology and a benchmarking system enabled reliable assessments and comparative analyses of ASR systems, as well as a better understanding of the strengths and weaknesses of ASR systems for Polish.

In summary, the conducted research improves the practical usefulness of Polish ASR datasets for academic and industrial applications and contributes to the promotion of methods, tools, and good practices for the benchmarking of ASR systems.

Contents

Glossary

1 Introduction
  1.1 Problem background
    1.1.1 The role of datasets in the training and evaluation of machine learning systems
    1.1.2 The role of speech datasets in the training and evaluation of ASR systems
    1.1.3 Challenges in ASR speech dataset management
    1.1.4 Challenges in ASR evaluation
    1.1.5 State of the ASR speech datasets and ASR evaluation for Polish
  1.2 Research aim
  1.3 Research hypothesis
  1.4 Research objectives and questions
  1.5 Research scope
  1.6 Limitations
  1.7 Methodology adopted
  1.8 Contributions

2 Literature Review
  2.1 Introduction
  2.2 Benchmarking of Machine Learning Systems
    2.2.1 Challenges in ML benchmarking
    2.2.2 Examples of methods for curating ML benchmarking datasets
  2.3 Benchmarking of Automatic Speech Recognition Systems
    2.3.1 Introduction
    2.3.2 Overview of ASR benchmark design considerations
    2.3.3 ASR use scenarios
    2.3.4 Technical challenges
    2.3.5 Performance metrics
    2.3.6 Evaluation results analysis
  2.4 ASR speech datasets management methods and tools
    2.4.1 Introduction
    2.4.2 ASR speech dataset lifecycle
    2.4.3 Overview of the ASR dataset management methods
    2.4.4 Challenges related to ASR speech datasets management
  2.5 ASR speech datasets and benchmarks for Polish
    2.5.1 ASR speech datasets for Polish
    2.5.2 ASR speech benchmarks for Polish
  2.6 Overview of tools for dataset management and ASR evaluation
    2.6.1 ASR speech datasets management tools
    2.6.2 ASR evaluation tools

3 Methodology
  3.1 Overview
  3.2 RO1: Survey of ASR speech datasets for Polish
    3.2.1 Research objectives and questions
    3.2.2 Research methodology
  3.3 RO2: Design and curation of ASR benchmark dataset for Polish
    3.3.1 Research objectives and questions
    3.3.2 Research methodology
    3.3.3 Dataset analysis process
    3.3.4 Dataset release
  3.4 RO3: Survey of ASR benchmarks for Polish
    3.4.1 Research objectives and questions
    3.4.2 Research methodology
  3.5 RO4: Design and implementation of a system for ASR systems benchmarking
    3.5.1 Research objectives and questions
    3.5.2 Research methodology
  3.6 RO5: Use of curated dataset for benchmarking ASR systems for Polish
    3.6.1 Research objectives and questions
    3.6.2 Research methodology
  3.7 RO6: Organization of competition for the ASR community
    3.7.1 Research objectives and questions
    3.7.2 Research methodology
  3.8 Summary
    3.8.1 Overview of the data management framework

4 Results
  4.1 RO1: Survey of ASR speech datasets for Polish
    4.1.1 Introduction
    4.1.2 ASR speech datasets survey results overview
    4.1.3 ASR speech data survey results
    4.1.4 Survey availability
  4.2 RO2: Design and curation of ASR benchmark dataset for Polish
    4.2.1 Introduction
    4.2.2 Datasets features derived from the documentation
    4.2.3 Datasets features derived from the analysis of datasets contents
    4.2.4 Availability of curated datasets
  4.3 RO3: Survey of ASR benchmarks for Polish
    4.3.1 Introduction
    4.3.2 Results
  4.4 RO5: Use of curated dataset for benchmarking ASR systems for Polish
    4.4.1 Introduction
    4.4.2 Evaluation setups
    4.4.3 Evaluation scenarios
    4.4.4 Reference and ASR Transcripts Normalization
    4.4.5 Evaluation results sharing
  4.5 RO6: Organization of open competition for the ASR community
    4.5.1 Introduction
    4.5.2 Program selection and task creation
    4.5.3 Comparison of community ASR solutions with other systems for Polish

5 Discussion
  5.1 Overview
  5.2 RO1: Survey of ASR speech datasets for Polish
    5.2.1 Results overview
    5.2.2 Observations from community feedback
    5.2.3 Implications
    5.2.4 Limitations
    5.2.5 Future research directions
  5.3 RO2: Design and curation of ASR benchmark dataset for Polish
    5.3.1 Results overview
    5.3.2 Observations
    5.3.3 Implications
  5.4 RO3: Survey of ASR benchmarks for Polish
    5.4.1 Results overview
    5.4.2 Implications
  5.5 RO4: Design and implementation of system for ASR systems benchmarking
    5.5.1 Results overview
  5.6 RO5: Using a curated dataset to benchmark ASR systems for Polish
    5.6.1 Results overview
    5.6.2 Observations
    5.6.3 Implications
    5.6.4 Methodological gaps in ASR benchmarking addressed in this study
    5.6.5 Limitations and future work
  5.7 RO6: Organization of competition for the ASR community

6 Conclusion
  6.1 Main Research Questions and Answers
  6.2 Contributions and achievements
  6.3 Future Directions
  6.4 Research Impact

7 Appendix
  7.1 ASR speech datasets survey
    7.1.1 Attributes of speech datasets catalog
    7.1.2 Attributes of ASR benchmarks survey
    7.1.3 Freely available speech datasets for Polish ASR
    7.1.4 Commercially available speech datasets for Polish ASR
    7.1.5 Dataset subsets cards
    7.1.6 Commercial ASR systems pricing
    7.1.7 Freely available ASR models sizes
    7.1.8 Call for participation in 2024 Polish ASR challenge

List of Figures

2.1 Internal and external issues identified in the ML evaluation practices. Source: [72]
2.2 ML tasks and learning problems universe. Source: [72]
2.3 Examples of WER sliced into groups A, B, and C, with the width of the bars reflecting relative sizes of those groups. Source: [2]
2.4 ASR speech dataset lifecycle
3.1 Overall research framework
3.2 Process of analysis of curated datasets
3.3 ASR evaluation process
3.4 ASR evaluation process data flow
3.5 BIGOS data management framework
4.1 Normalized cumulative size of Polish ASR speech datasets
4.2 ASR benchmark results — POLSL dataset. Source: [99]
4.3 ASR benchmark results — BOR dataset scenario 1, year 2018. Source: [99]
4.4 ASR benchmark results — PolEval, year 2019. Source: [59]
4.5 ASR benchmark results — DiaBiz corpus, year 2022. Source: [110]
4.6 ASR benchmark results — Whisper, MLS corpus, year 2022. Source: [117]
4.7 ASR benchmark results — Whisper, CommonVoice corpus, year 2022. Source: [117]
4.8 ASR benchmark results — Whisper, VoxPopuli corpus, year 2022. Source: [117]
4.9 ASR benchmark results — Whisper, FLEURS corpus, year 2022. Source: [117]
4.10 ASR evaluation results — SpokesBiz corpus, year 2023. Source: [111]
4.11 ASR benchmark results — Accuracy of medical terms recognition, Kuligowska et al., year 2023. Source: [68]
4.12 ASR benchmark results — Recognition errors classification, Kuligowska et al., year 2023. Source: [68]
4.13 ASR benchmark results — Medical terms recognition, Zielonka et al., year 2023. Source: [153]
4.14 Mean WER per system for all BIGOS dataset subsets
4.15 Mean WER for all PELCRA dataset subsets
4.16 Box plot of WER for all systems per specific subset of BIGOS dataset
4.17 Box plot of WER for all systems per specific subset of PELCRA dataset
4.18 Comparison of WER for the most accurate free and paid ASR systems. BIGOS dataset
4.19 Comparison of WER for the least accurate free and paid ASR systems. BIGOS dataset
4.20 Comparison of WER for the most accurate free and paid ASR systems. PELCRA dataset
4.21 Comparison of WER for the least accurate free and paid ASR systems. PELCRA dataset
4.22 WER for freely available systems for various model sizes. BIGOS dataset
4.23 WER for freely available systems for various model sizes. PELCRA dataset
4.24 Average WER as a function of audio duration. BIGOS dataset
4.25 Mean WER as a function of audio duration. PELCRA dataset. Best paid and free systems. The size of the point corresponds to the number of samples
4.26 Mean WER as a function of speech rate for top systems. BIGOS dataset
4.27 Mean WER as a function of speech rate for top systems. PELCRA dataset
4.28 Standard deviation in WER across speaker age groups. PELCRA dataset
4.29 Impact of normalization on error rates on BIGOS dataset
4.30 Impact of normalization on error rates on PELCRA dataset
4.31 Management framework extension to incorporate results from PolEval open challenge
5.1 WER scores of top systems in Polish ASR benchmarks
5.2 Number of datasets and vocabulary domains in Polish ASR benchmarks
5.3 Number of evaluated models in Polish ASR benchmarks
5.4 Number of dataset-system-model combinations in Polish ASR benchmarks

List of Tables

2.1 ASR use scenarios overview. Inspired by the work of Aksenova et al. [2]
2.2 Types of sources, recipients, and modes for various ASR use scenarios. Inspired by the work of Aksenova et al. [2]
2.3 Vertical aspects of ASR challenges. Inspired by: [2]
2.4 Practical challenges of ASR evaluation process
2.5 Metrics used for ASR evaluation
2.6 Differences between WER, MER and WIL values for different input/output combinations. Source: [90]
2.7 Evaluation results analysis dimensions
2.8 Stages and methods of speech data management
2.9 Tools for ASR datasets management
2.10 Tools for ASR evaluation
3.1 Attributes of datasets and their relevance to ASR evaluation
3.2 Sample of PELCRA SNUV references and ASR outputs
3.3 Overview of factors enhancing specific dataset utility for ASR evaluation purposes
3.4 Overview of factors decreasing datasets' utility for ASR evaluation purposes
3.5 Meta-data and partitioning of source datasets — BIGOS dataset
3.6 Meta-data and partitioning of source datasets — PELCRA dataset
3.7 Meta-data and partitioning of source datasets
3.8 Meta-data and partitioning of source datasets
3.9 Attributes in the BIGOS utterance data object
3.10 Metrics used for analysis of datasets contents
3.11 Design considerations for ASR evaluation system
3.12 Methods of normalizing references and hypotheses
3.13 Evaluation scenarios and their analysis dimensions
3.14 Relation between research question and evaluation scenarios
3.15 Whisper model types. Source: Whisper model card
3.16 ASR systems evaluated in the study
3.17 Evaluated ASR systems usage cost and license type
4.1 Polish ASR datasets survey summary
4.2 Summary of audio dataset availability and characteristics by year
4.3 Institutions contributing speech datasets for Polish
4.4 Data catalogs and platforms hosting ASR speech datasets for Polish
4.5 Audio devices for all available datasets
4.6 Audio devices for publicly available datasets
4.7 Audio devices for commercially available datasets
4.8 Distribution of sampling rate for publicly reported ASR speech datasets for Polish
4.9 Distribution of speech types for publicly reported ASR speech datasets for Polish
4.10 Speaker and recordings meta-data availability in available speech datasets
4.11 Summary statistics of curated datasets
4.12 BIGOS dataset subset license and language coverage
4.13 PELCRA for BIGOS dataset subset license and language coverage
4.14 BIGOS dataset subset domains and speech types
4.15 PELCRA for BIGOS dataset subset domains and speech types
4.16 PELCRA for BIGOS dataset subset domains and speech types
4.17 PELCRA for BIGOS dataset subset domains and speech types
4.18 Audio content size metrics for BIGOS dataset
4.19 Audio content size metrics for PELCRA dataset
4.20 Text content size metrics for BIGOS dataset
4.21 Text content size metrics for PELCRA dataset
4.22 Text content features for BIGOS dataset
4.23 Text content features for PELCRA dataset
4.24 Audio content features for BIGOS dataset
4.25 Audio content features for PELCRA dataset
4.26 Average duration of audio recordings and utterances — BIGOS dataset
4.27 Average duration of audio recordings and utterances — PELCRA dataset
4.28 Coverage of speaker meta-data — BIGOS dataset
4.29 Coverage of speaker meta-data — PELCRA dataset
4.30 Publication date and number of downloads of BIGOS datasets as of June 6th, 2024
4.31 Overview of sections providing relevant results to research questions RQ7-RQ13
4.32 Overview of ASR use-cases covered in Polish ASR benchmarks to date
4.33 Public domain ASR benchmarks 2018-2023
4.34 Overview of domains, speech types, audio sources and recording devices
4.35 Datasets size and number of domains, recordings, and speakers
4.36 Acoustic conditions, annotations, and speaker meta-data across Polish ASR benchmarks
4.37 Overview of metrics employed in Polish ASR systems benchmarks
4.38 Publicly reported evaluations of ASR models for Polish language
4.39 Number of reported independent evaluations and benchmarks per system
4.40 ASR systems supporting Polish not yet evaluated in the public domain
4.41 Types of ASR systems evaluated in public domain ASR benchmarks 2018-2023
4.42 ASR benchmarks performed in this study
4.43 ASR systems evaluation scenarios overview
4.44 Evaluation details for BIGOS dataset
4.45 Evaluation details for PELCRA dataset
4.46 WER statistics — BIGOS dataset
4.47 WER statistics — PELCRA dataset
4.48 WER statistics for all ASR systems and specific subsets of BIGOS dataset
4.49 WER statistics for all ASR systems and specific subsets of PELCRA dataset
4.50 WER statistics for free and paid ASR systems on BIGOS dataset
4.51 Best and worst systems for BIGOS dataset
4.52 WER statistics for the most accurate free and commercial systems. BIGOS dataset
4.53 WER statistics for the least accurate free and commercial systems. BIGOS dataset
4.54 WER statistics for free and paid ASR systems evaluated on PELCRA dataset
4.55 Best and worst systems for PELCRA dataset
4.56 The most accurate systems WER statistics. PELCRA dataset
4.57 The least accurate systems WER statistics. PELCRA dataset
4.58 Average WER for free systems with information about model size
4.59 Average WER for free systems with information about model size
4.60 Mean WER for specific audio duration ranges. BIGOS dataset. Best paid and free systems
4.61 Mean WER for specific audio duration ranges for top paid and free systems. PELCRA dataset
4.62 Number of samples with speaker gender information
4.63 Values and differences in mean WER scores per speaker gender
4.64 Number of samples with speaker gender information
4.65 Values and differences in average WER scores per speaker gender for 689 samples from PELCRA dataset
4.66 Mean WER across systems and age ranges. PELCRA dataset
4.67 Standard deviation and maximum difference in WER across age groups. PELCRA dataset
4.68 Reduction of error rates caused by normalization of references and hypotheses for BIGOS dataset
4.69 Reduction of error rates caused by normalization of references and hypotheses for PELCRA dataset
5.1 Benchmark dataset design requirements validation
5.2 Evaluated models, datasets, and their combinations
7.1 Publicly and freely available speech datasets for Polish
7.2 Commercially available speech datasets for Polish
7.3 Dataset size per split — Clarin Mobile
7.4 Dataset features per split — Clarin Mobile
7.5 Dataset size per split — Clarin Studio
7.6 Dataset features per split — Clarin Studio
7.7 Dataset size per split — MLS
7.8 Dataset features per split — MLS
7.9 Dataset size per split — Munich AI Labs Librivox
7.10 Dataset features per split — Munich AI Labs Librivox
7.11 Dataset size per split — Common Voice
7.12 Dataset features per split — Common Voice
7.13 Dataset size per split — AZON read
7.14 Dataset features per split — AZON read
7.15 Dataset size per split — AZON spontaneous
7.16 Dataset features per split — AZON spontaneous
7.17 Dataset size per split — PWR Maleset
7.18 Dataset features per split — PWR Maleset
7.19 Dataset size per split — PWR Shortwords
7.20 Dataset features per split — PWR Shortwords
7.21 Dataset size per split — PWR Very Important Utterances
7.22 Dataset features per split — PWR Very Important Utterances
7.23 Dataset size per split — Google FLEURS
7.24 Dataset features per split — Google FLEURS
7.25 Dataset size per split — Minds-14
7.26 Dataset features per split — Minds-14
7.27 Dataset size per split — PolEval 22
7.28 Dataset features per split — PolEval 22
7.29 Dataset size per split — Spokes Mix Emo
7.30 Dataset features per split — Spokes Mix Emo
7.31 Dataset size per split — Spokes Mix Luz
7.32 Dataset features per split — Spokes Mix Luz
7.33 Dataset size per split — Spokes Mix Parl
7.34 Dataset features per split — Spokes Mix Parl
7.35 Dataset size per split — Spokes Biz Bio
7.36 Dataset features per split — Spokes Biz Bio
7.37 Dataset size per split — Spokes Biz Interviews
7.38 Dataset features per split — Spokes Biz Interviews
7.39 Dataset size per split — Spokes Biz Luz
7.40 Dataset features per split — Spokes Biz Luz
7.41 Dataset size per split — Spokes Biz Podcasts
7.42 Dataset features per split — Spokes Biz Podcasts
7.43 Dataset size per split — Spokes Biz Presentations
7.44 Dataset features per split — Spokes Biz Presentations
7.45 Dataset size per split — Spokes Biz Various 1
7.46 Dataset features per split — Spokes Biz Various 1
7.47 Dataset size per split — Spokes Biz Various 2
7.48 Dataset features per split — Spokes Biz Various 2
7.49 Dataset size per split — Spokes Biz Interviews
7.50 Dataset features per split — Spokes Biz Interviews
7.51 Commercial ASR services pricing
7.52 Size of freely available ASR models

Glossary

Machine Learning task: An abstract problem statement, defined either in natural language or formally. Tasks vary in granularity, creating a hierarchy, e.g., from "dog vs. cat classification" to "image classification". They frame contributions in the Machine Learning field and are instantiated by learning problems for evaluation. Examples include MNIST, CIFAR-10, and ImageNet for the "image classification" task.

AMU ASR Leaderboard: Publicly accessible leaderboard presenting the results of benchmarking ASR systems supporting Polish on the BIGOS datasets.

ASR: Automatic Speech Recognition (ASR) is a technology that enables machines to process speech input and translate it into text. Also known as Speech Recognition or Speech-to-Text (STT).

Audio encoding: The manner in which a digital audio signal is encoded for storage or transmission. The most popular audio encodings for speech and ASR applications are the lossless encodings PCM and FLAC and the lossy encodings Opus and Speex.

AZON: Repository of open data from Wrocław University of Technology (Atlas Zasobów Otwartej Nauki), https://zasobynauki.pl/projekt-azon-2/.
Benchmark: A learning problem serving as an indicator of progress on an ML task. Benchmarks often include a leaderboard and an open competition. For example, within the ILSVRC competition (ImageNet Large Scale Visual Recognition Challenge), increasing accuracy on the ImageNet benchmark dataset reflects advancements in the image classification task.

Benchmark dataset: A curated, widely accepted reference used to evaluate and compare algorithms, models, or systems in a specific domain. It provides a consistent basis for comparison and objective assessment. [58, 9, 27, 96] Benchmark datasets with specific metrics are referred to as learning problems and represent more abstract tasks. [72]

benchmark saturation: A phenomenon that occurs when a learning problem becomes too easy for current ML models, leading to a plateau in performance. This can happen for various reasons, such as overfitting to test data, advancements in technology, or an insufficiently challenging dataset or evaluation metric.

BIGOS: Benchmark Intended Grouping of Open Speech, a set of curated datasets intended to facilitate benchmarking of ASR systems. Currently, BIGOS is focused on the Polish language. (Bigos in Polish means "cabbage stew"; the name is inspired by the work of Google on SpeechStew [12].)

CER: Character Error Rate.

Common Voice: Large-scale, multilingual speech dataset collected through crowdsourcing by the Mozilla Corporation.

data curation: A broad set of data management techniques, such as acquisition, formatting, documentation, enrichment, annotation, and quality verification, aimed at improving the practical utility of datasets.

FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech.

Forced alignment: A method of analyzing and synchronizing speech content with its transcription to achieve temporal alignment.

Gated datasets: An access mode for datasets on Hugging Face that allows authors to control dataset usage by requiring users to request access, providing their username and email. Authors can approve requests manually or automatically and may ask for additional information. See Hugging Face's documentation on gated datasets: https://huggingface.co/docs/hub/en/datasets-gated

GitHub: Web-based platform for version control and collaborative software development using Git. Supports public and private repositories, and features such as pull requests, issue tracking, and project wikis. https://github.com/

GMM: Gaussian Mixture Model.

Hallucinations: In ASR systems, hallucinations are outputs that do not match any spoken input. They can result from background noise, poor audio quality, or ASR model limitations, leading to incorrect transcriptions and reduced system reliability.

Hugging Face Datasets: Python library by Hugging Face designed for efficient handling and processing of large datasets. Offers simple access to a wide range of datasets and tools for dataset loading, transformation, and evaluation. Supports various data formats and integrates with machine learning workflows. https://huggingface.co/docs/datasets/index

Hugging Face Hub: Platform for hosting, sharing, and discovering machine learning models and datasets.
Provides tools for collaborative development, resource discovery, version control, and deployment of models and datasets, enhancing accessibility and community engagement. https://huggingface.co/datasets

JIWER: Python library for calculating evaluation metrics for ASR systems, such as WER, based on the minimum edit distance between one or more reference and hypothesis sentences.

Kaldi framework: Toolkit for speech recognition written in C++, licensed under the Apache License v2.0, intended for speech recognition researchers.

Learning problem: A learning problem consists of a dataset of (input, output) pairs and an evaluation metric to score solutions (functions from input to output). It is fully defined by these components without needing external semantics or data; for example, the ILSVRC-2012 dataset (ImageNet) with top-1 accuracy as the metric. [72]

LibriVox: Platform providing free, public-domain audiobooks read by volunteers; a common source of read speech for ASR datasets.

Librosa: Python library for music and audio analysis. Provides the building blocks necessary to create music information retrieval systems.

Machine Learning: Development of algorithms and statistical models that perform specific tasks without explicit instructions. These algorithms and models learn from data and make predictions or decisions based on it.

MER: Match Error Rate (MER) is calculated as the number of errors (insertions, deletions, and substitutions) divided by the total number of hits and errors, addressing the issue of unbounded WER in cases of high insertion counts.

ML evaluation set: A subset of data used to assess the performance of machine learning models. It is separate from the training data and is utilized to provide an unbiased evaluation of a model's accuracy, generalization, and effectiveness on unseen data.

MLS: Multilingual LibriSpeech.

Natural Language Processing (NLP): Research field lying at the intersection of computer science, artificial intelligence, and linguistics that focuses on making human communication, such as speech and text, understandable to computers. Involves a variety of tasks (speech or language generation and understanding), techniques (parsing, stemming, tokenization, etc.), and applications (translation, question answering, summarization, etc.).

NeMo toolkit: Toolkit for building conversational AI models (including ASR and NLP) provided by NVIDIA.

Pandas: Open-source Python library for the analysis and manipulation of tabular and time-series data.

PELCRA: Research group from the University of Łódź. PELCRA stands for Polish and English Language Corpora for Research and Applications.

PELCRA for BIGOS: Openly accessible speech dataset in the BIGOS format, curated from datasets developed by the PELCRA group. [111, 112, 109]

Polish ASR speech data catalog: Structured information about existing Polish ASR speech datasets, including availability, license, size, content characteristics, etc. Available on GitHub and Hugging Face and described in an article in the PSICL journal. [53]

Polish ASR speech datasets survey: Survey of available speech datasets for Polish ASR development.

RTF: Real-Time Factor. A metric determining how long a system takes to process a given length of input signal compared to the duration of the signal itself. For real-time systems, the RTF must be lower than 1.

Sampling rate: Number of sound samples per time unit.
Usually expressed in kilohertz (kHz), where 1 kHz corresponds to 1,000 samples per second.

SDE: Speech Data Explorer, a tool for the exploration and analysis of speech datasets provided as part of the NVIDIA NeMo toolkit.

Semantic Word Error Rate (SWER): An ASR evaluation metric that extends traditional WER by incorporating semantic weights. Proposed by Somnath Roy [125], SWER assigns higher weights to errors involving key semantic words or named entities, reflecting their impact on transcript meaning. It uses NLP techniques to calculate semantic similarity and includes CER for spelled-out entities, providing a nuanced measure of ASR accuracy that aligns with human judgment.

SemDist: Semantic Distance (SemDist) is a metric proposed by Kim et al. [56] that uses advanced language models such as BERT to measure the semantic similarity between a reference transcription and ASR output. Unlike WER, SemDist evaluates the semantic correctness of the outputs, identifying deviations that alter the conveyed message. It uses token-level embeddings and similarity measures, such as cosine distance, to provide a quantitative measure of semantic accuracy.

SER: Sentence Error Rate.

SOX: SoX (Sound eXchange) is a cross-platform command-line utility that converts various formats of computer audio files into other formats. It can also apply various effects to these sound files and play and record audio files on most platforms.

Speech dataset: A collection of digital recordings of speech together with their annotations, metadata, and documentation.

test data leakage: Leakage occurs when a model accesses unauthorized information during training, leading to inflated performance metrics and compromised evaluation integrity.

Text to Speech (TTS): Technology that converts written text into spoken words, enabling machines to read text in a human-like voice. Also known as speech synthesis.

Transformers: Open-source Python library by Hugging Face offering state-of-the-art pre-trained NLP models for tasks such as text classification, translation, and question answering. It simplifies model integration, fine-tuning, and deployment. https://huggingface.co/docs/transformers/en/index

WAV: RIFF WAVE audio format.

WER: Word Error Rate (WER) is a metric for evaluating ASR performance, calculated as the number of word-level edit operations (deletions, insertions, and substitutions) required to match the hypothesis to the reference, divided by the total number of words in the reference text.

WIL: Word Information Lost. Used as a metric to assess the efficiency of conveying information encoded as text, e.g., the accuracy of ASR systems. WIL provides a simple approximation of the proportion of word information lost in a sequence of words. Defined as 1 - WIP, where WIP stands for Word Information Preserved. [90]

WRR: Word Recognition Rate (WRR) is a metric for evaluating ASR performance, calculated as 100% minus the WER. It measures the proportion of correctly recognized words in the ASR output.
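For reference, the error-rate metrics defined in this glossary can be written compactly, following the definitions of Morris et al. [90]. With $S$, $D$, and $I$ denoting the numbers of word substitutions, deletions, and insertions in the minimum-edit-distance alignment, $H$ the number of correctly matched words (hits), and $N_{\mathrm{ref}}$, $N_{\mathrm{hyp}}$ the word counts of the reference and the hypothesis:

\begin{align*}
\mathrm{WER} &= \frac{S + D + I}{N_{\mathrm{ref}}}, \qquad \mathrm{WRR} = 1 - \mathrm{WER},\\
\mathrm{MER} &= \frac{S + D + I}{H + S + D + I},\\
\mathrm{WIL} &= 1 - \mathrm{WIP}, \qquad \mathrm{WIP} = \frac{H}{N_{\mathrm{ref}}} \cdot \frac{H}{N_{\mathrm{hyp}}},\\
\mathrm{RTF} &= \frac{T_{\mathrm{processing}}}{T_{\mathrm{audio}}}.
\end{align*}

Note that WER is unbounded (a large number of insertions can push it above 1), whereas MER and WIL are confined to the interval [0, 1].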
Chapter 1

Introduction

ASR systems process human speech signals into the corresponding textual transcriptions. Recent progress in machine learning technology, powerful computational resources, and an abundance of data have significantly advanced ASR technology, resulting in a notable improvement in the accuracy of speech-to-text conversion.

In 2017, Microsoft announced that the precision of its ASR system for English was on par with manual transcription when evaluated on the Switchboard corpus [49]; the quality in terms of the WER (word error rate) metric was below 5%. In 2022, the Whisper ASR system achieved an average WER of 5% across multiple test datasets for high-resource languages [117]. ASR technology is now widely used in various applications such as virtual assistants, meeting aids, voice search, smart home controls, and transcription tools. The increasing global demand for ASR solutions has made it a focal point of research aimed at improving speech recognition performance. Companies that develop ASR systems are constantly working to reduce error rates for new applications, domains, and languages to enhance the user experience and increase market adoption.

The Polish language is spoken by more than 50 million people around the world and is the sixth most spoken language in the European Union. The number of commercial and freely available voice technology solutions and applications for Polish is steadily growing [95]. In July 2023, more than 50 speech data resources were available for training and evaluating ASR systems [53]. New language resources are being introduced thanks to global initiatives such as Mozilla Common Voice [3] or Multilingual LibriSpeech [115], and to local projects, e.g., DiaBiz [112] and Spokes [111].

So far, no research has been conducted to survey, validate, or improve the usefulness of existing Polish ASR speech datasets. There are also no widely adopted speech datasets for performing benchmarks of ASR systems for the Polish language. International studies typically use popular multilingual datasets, for example Common Voice [3] or MLS [115], while Polish studies mostly use locally created datasets [111, 110, 65]. Although many datasets are available under permissive licenses, they are often used exclusively for specific studies due to interoperability concerns. The same practical obstacles may also contribute to the restricted use of all accessible speech datasets for Polish ASR benchmarks. This restriction limits their usefulness for researchers and the public, who rely on benchmarks and leaderboards to track progress and identify suitable models [91, 34].

There is an ongoing debate within the international ASR community about the establishment of a standardized evaluation methodology [2, 137]. Adopting a standard methodology and standard datasets enables comparing results between different studies. However, to make the results relevant to various usage scenarios, a variety of representative datasets is needed. Examples from other fields of ML [96] show that it is important for the community to take advantage of existing datasets and to easily contribute new ones for benchmarking purposes [10, 24, 145, 126]. To monitor technological advances over time, it is preferable to perform systematic benchmarking instead of one-time assessments. However, since there are many applications and variables of ASR that impact its effectiveness, establishing a common standard of evaluation is not trivial [2].

The objective of this thesis is to increase the practical utility of available speech datasets for the evaluation of Polish ASR systems. The proposed framework consists of a set of methods for speech data management and ASR evaluation. The key contributions include:

1. Survey of Polish ASR speech datasets and curation of the Polish ASR speech data catalog
2. Survey of Polish ASR benchmarks
3. Curation of a benchmark dataset from publicly available sources
4. Development of a framework for ASR systems benchmarking

The thesis also discusses the strengths and limitations of existing speech datasets and outlines potential research directions to further improve ASR data management and benchmarking practices for Polish ASR.

1.1 Problem background

1.1.1 The role of datasets in the training and evaluation of machine learning systems

Datasets are essential in the development of Machine Learning (ML) because they convey the signal used during the training, testing, and validation of ML models. These datasets encode useful information, allowing algorithms to recognize patterns within the input data. Relevant information can be obtained from the original source or annotated via a dedicated process, typically by trained humans. ML datasets must be diverse and representative of the target usage to ensure the accurate performance of models on new data during operation.

In 2020, Andrew Ng, the co-creator of the Google Brain project, introduced the term "Data-Centric AI" to the public discourse. He noted that currently 90% of academic work follows the "Model-Centric" paradigm, which assumes that data are fixed and that quality improvement is achieved through changes in the architecture and the model training process. According to the "Data-Centric AI" paradigm, the model architecture and training process remain constant, and the quality improvement of the model is achieved by increasing the quality and size of the data used for training and testing the system. He also noted that developing datasets that are not only widely applied but also actively maintained by the community is a challenge. In practice, the data preparation process is often treated as a one-time effort, and the limited adoption of standards for documentation and quality assurance methods adds to the challenge [36, 8, 53, 45, 116]. As a result, the benchmark results of ML systems reported at academic conferences can present a distorted picture of the state of technology development [72]. This can result from the relative simplicity of the task represented by the test set (e.g., the LibriSpeech corpus contains only recordings of read speech in a quiet environment) [100]. Another factor contributing to the reduced reliability of benchmark results is inaccuracies in data labels. For example, research from December 2022 indicates that in the 10 most popular test sets used in benchmarks, an average of 3.3% of the data had incorrect labels [93].

1.1.2 The role of speech datasets in the training and evaluation of ASR systems

The ASR community is highly dependent on extensive training datasets that accurately represent the speech and acoustic patterns of the target population, as well as the operating conditions of the ASR systems. The construction of datasets of this kind is a difficult undertaking that requires specialized infrastructure, meticulous planning, nimble recruitment operations, and resource-intensive data quality control [42, 3, 115, 52]. Moreover, to conduct responsible and informative testing of ASR systems [2], one needs access to evaluation datasets that are free of errors, contain abundant metadata, and are up-to-date. This necessity makes the management of ASR speech datasets even more complex and demanding.

Another challenge facing the ASR community is the discovery of relevant datasets that already exist.
Currently, there is no centralized repository dedicated to ASR speech datasets, either multilingual or for the Polish language. As a result, researchers and industry practitioners have to rely on information dispersed among many sources and may struggle to accurately determine the number of available datasets and their characteristics, such as size, recording devices, utterance domain, audio and transcription quality, and others. Ideally, a comprehensive data catalog should include download links to dataset samples, allowing seamless and in-depth inspection of the datasets of interest, in addition to the aforementioned metadata descriptors.

Speech datasets commonly include distinct sets for training, validation, and testing. Validation sets assist in fine-tuning model parameters, while test sets gauge the final model's performance, offering an impartial evaluation of its functionality in real-world scenarios. It is worth highlighting that speech datasets should encompass various languages, dialects, accents, speech styles, and noise environments in order for ASR systems to be robust. This diversity guarantees that the ASR system can handle a wide range of speech variations and operate effectively across different settings and speaker demographics.

In addition, the training of ASR models is heavily based on extensive and varied speech datasets. These datasets encode a wide spectrum of phonetic, linguistic, and acoustic attributes essential for precise speech recognition. Datasets featuring conversational speech, ambient noise, and authentic speech patterns (e.g., pauses and interruptions) enable developers to efficiently handle real-world use cases like voice-activated assistants or IVR (Interactive Voice Response) systems.

Finally, speech datasets also serve a role in the examination and mitigation of biases in ASR systems. Datasets comprising a diverse array of voices and speech attributes can help to recognize and minimize bias related to accents, dialects, age, gender, and more. [1]

New methods and resources are actively developed for the evaluation of ASR systems in academia and at technology companies such as Google, Apple, Amazon, Meta, Appen [75], or Rev [21, 22]. However, details of the specific methods or confidential datasets used to create commercial products are not disclosed due to their confidential nature. For-profit entities contribute predominantly to the curation of novel datasets from publicly accessible sources, e.g., [115, 19]. The relevant findings are presented at conferences related to speech technologies, such as Interspeech or Language Resources and Evaluation (LREC, https://lrec-coling-2024.org/), as well as at workshops such as the NeurIPS workshop on Evaluation and Benchmarks (https://neurips.cc/).

1.1.3 Challenges in ASR speech dataset management

ASR practitioners managing speech datasets face numerous practical challenges.

Data identification
Identifying the right data for the task is often difficult. The information is spread across numerous data repositories and publications. Furthermore, there is no widely established standard for documenting and evaluating the potential application of speech datasets. Often, without manual inspection of the contents of the datasets, it is not feasible to determine their quality or suitability for a specific task.

Data formatting
Although some data may be freely available and easily accessible, the diversity of audio and text file formats, data quality issues, and limited documentation can require significant data wrangling efforts before a dataset can be used effectively, as sketched below.
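As an illustration of such formatting work, the following minimal sketch uses the Hugging Face Datasets library (see the Glossary) to decode heterogeneous audio into a single specification; the dataset identifier and column names are hypothetical placeholders rather than references to a specific resource.

    # pip install datasets soundfile librosa
    from datasets import Audio, load_dataset

    # Hypothetical dataset ID and column names, used only for illustration.
    ds = load_dataset("example-org/pl-speech-corpus", split="test")

    # Decode and resample every recording to 16 kHz on the fly, regardless
    # of the on-disk format (WAV, FLAC, MP3, ...), so all subsets share one
    # common audio specification.
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

    sample = ds[0]
    print(sample["audio"]["sampling_rate"])  # 16000
    print(sample["audio"]["array"].shape)    # decoded waveform (NumPy array)
    print(sample["text"])                    # reference transcription

A unified specification of this kind allows the same evaluation code to iterate over many source datasets without per-dataset branching.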
Legal and licensing concerns
Legal and licensing limitations may apply to the use of speech data, particularly when using data from public sources or third parties.

Data privacy and ethics
Managing datasets that contain sensitive or personal data requires strict adherence to privacy laws and ethical guidelines, including obtaining consent from participants and anonymizing data where possible.

Language evolution and terminology
Language is constantly changing, with new vocabulary, expressions, and meanings frequently evolving. It is an ongoing challenge to ensure that speech datasets remain up-to-date with these linguistic shifts.

Data bias
Speech datasets can unintentionally exhibit biases toward specific demographic groups (such as age, gender, accent, and dialect), resulting in disparities in the performance of ASR systems for different user groups. [1]

Audio data quality
The accuracy of ASR may decrease due to background noise or poor audio quality. Therefore, it is crucial to manage these factors during data collection. If background noise or distorted speech is essential for the ecological validity of the ASR application under study, relevant metadata and documentation must be included to ensure an accurate interpretation of the evaluation results.

Data annotation quality
Annotation can be time-consuming and susceptible to human error, particularly when dealing with large datasets, complex domains, and diverse annotation teams.

Managing versions of datasets
It is essential to maintain version control and provide users with accurate and up-to-date datasets as they are modified and improved over time. Effective dataset management practices are necessary for this purpose.

Data storage and retrieval
The size of high-fidelity audio files presents difficulties in storage and distribution, particularly with large datasets.

Striking a balance between size and manageability
Although larger datasets can improve ASR performance, they also present difficulties in terms of computational resources and training duration. Therefore, determining the optimal balance between the size of the dataset and the ease of management is a critical issue.

1.1.4 Challenges in ASR evaluation

Common challenges
These are the challenges faced in the ASR evaluation process.

Lack of ground truth
There may be no definitive ground-truth transcription for the audio data being analyzed, for example, in the case of multiple spelling conventions.

Domain-specific challenges
ASR systems may perform differently depending on the domain or context. For example, a system trained on news broadcasts may not perform as well on telephone conversations. Hence, a careful selection of appropriate evaluation datasets that represent the target domain is required. For example, significant discrepancies have been reported in recent comparisons of the accuracy of ASR systems on medical terminology in Polish [68, 153].

Metric selection
Different metrics are used in the scientific literature, the most popular being WER (Word Error Rate). Depending on the ASR application, the appropriate evaluation metric and method should be used; a minimal example of metric computation is sketched below.
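As a minimal sketch of metric computation, the snippet below uses the JIWER library (listed in the Glossary) to score a hypothesis against a reference; the example transcripts and the normalization chain are illustrative assumptions, not the exact configuration used in this study.

    # pip install jiwer
    import jiwer

    reference  = "dzień dobry dzwonię w sprawie zamówienia"  # ground-truth transcript
    hypothesis = "dzien dobry dzwonie w sprawie zamowienia"  # ASR system output

    # Illustrative normalization applied to both sides before scoring;
    # the choice of steps strongly affects the reported error rates.
    normalize = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.Strip(),
    ])
    ref, hyp = normalize(reference), normalize(hypothesis)

    print("WER:", jiwer.wer(ref, hyp))  # word error rate
    print("MER:", jiwer.mer(ref, hyp))  # match error rate
    print("WIL:", jiwer.wil(ref, hyp))  # word information lost
    print("CER:", jiwer.cer(ref, hyp))  # character error rate

Because normalization can hide or expose whole classes of errors (casing, punctuation, diacritics), reporting it alongside the scores is essential for comparability.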
Limited resources

Evaluation of ASR requires significant resources, including data storage, computing and cloud usage costs, human expertise, and time for the analysis of results.

Conflicts of interest

Commercial sources often showcase and explain ASR solutions through company reports, testimonials, or white papers. These providers typically strive to highlight the strengths of their products. As a result, there is a need for independent comparative research on existing ASR systems, focusing on evaluating their performance, scalability, and accessibility to provide practical benefits for particular applications or domains.

Challenges in industrial settings

Additional factors must be taken into account when creating an ASR system in an industrial environment. To ensure and continuously monitor the quality of technology, products, or services, companies conduct continuous research, implementations, and tests with the aim of improving product features and eliminating defects and their causes. To test the quality of a solution based on machine learning algorithms under conditions that match actual use, it is necessary to prepare and continuously update test data that are representative of the specific requirements of the offered solution, for example, the language of the target user group, the device, and the domain. Moreover, ASR systems must be tested to determine the impact of disturbances and modifications of the acoustic signal, such as:
• variable characteristics of sound processing in a given type or specific model of device,
• distance and position of the user relative to the device,
• the presence of discontinuities and additive noise in the speech signal.

Ideally, ASR testing should also verify robustness to speech variations resulting from individual user characteristics such as gender, accent, age, language proficiency, ethnic background, emotional or health condition, articulation quality, and so on. To check whether the quality requirements of an ASR-based product or service are met, it is necessary to perform a series of tests on a sample representative of real-use conditions.

In practice, obtaining representative test data before deploying a service or product to the market is a significant challenge and requires substantial investment in preparing the appropriate environment, scenarios, and processes to acquire and control data quality. This is because numerous companies do not possess sufficient resources and know-how to record new utterances under controlled conditions or to transcribe and annotate existing recordings. The requirements and characteristics of real-world usage data evolve rapidly. The more quality criteria are considered, the more extensive the resources required to design, create, curate, and validate ASR evaluation datasets. Companies developing ASR commercially require dedicated processes and systems to ensure the quality and availability of data for continuously changing product requirements. A coherent methodology includes data typologies, data standards, annotation protocols, operating procedures, and systems for data collection and annotation.

1.1.5 State of the ASR speech datasets and ASR evaluation for Polish

In recent years, the field of NLP has experienced a surge in benchmarks designed to evaluate the most widely available systems on a wide range of datasets [145, 144].
The most advanced research on the methodology for evaluating ASR systems, and on the requirements for the data used for this purpose, concerns the English language and, to a much lesser extent, selected European languages, such as German. In addition, there has been growing interest in ASR benchmarks in the international community [139, 34, 2, 1, 26]. In Poland, the growing interest in data for AI development and in benchmarks for Natural Language Processing (NLP) is evidenced by the annually organized PolEval competitions [59], the KLEJ initiative (Comprehensive List of Language Evaluations) [126], and LEPISZCZE [4].

The first benchmark of Polish ASR systems was conducted in 2018. Three commercial ASR systems were evaluated on a set of recordings representing the domain and acoustic conditions of security officer training [99]. In 2019, the first open competition was organized under the PolEval initiative [59]. Six community-provided systems were evaluated using datasets created from recordings of the Polish Parliament. The next benchmark, in 2022, compared the accuracy of 3 commercial ASR systems using recordings from the customer support domain [112]. The most recent benchmarks focused on the recognition accuracy of medical terms [153, 68].

The major challenges of Polish ASR benchmarks include:
• limited utilization of publicly available speech datasets,
• limited reproducibility due to lack of access to evaluation datasets,
• lack of independent quality verification of the test sets used in evaluations,
• limited number of evaluated systems.

1.2 Research aim

The primary aim of this thesis was to design and implement a data management framework to increase the utility of the available Polish speech datasets for the evaluation of ASR systems. The initial stage involved creating a taxonomy and organizing metadata on existing speech datasets using publicly accessible information. The subsequent stage covered the quantitative evaluation of the characteristics of the datasets to determine their usefulness for ASR evaluation. The selected datasets were then consolidated, refined, and made openly accessible. The final stage was the development of an evaluation system and the use of the curated Speech dataset to compare various ASR systems for the Polish language.

1.3 Research hypothesis

The hypothesis advanced in this thesis is the following:

The creation of an extensive data management framework will make it possible to reliably and objectively evaluate the ASR systems available for Polish.

1.4 Research objectives and questions

This section presents the main research objectives (RO) and research questions (RQ).

RO1: Survey of ASR speech datasets for Polish

The first objective was to survey existing ASR speech datasets for Polish. The research questions addressed were:
• RQ 1: How to systematically categorize Polish ASR speech datasets using public information?
• RQ 2: What is the current state of Polish ASR speech datasets?
• RQ 3: How can the survey findings be shared for community feedback?

RO2: Design and curation of the speech dataset for Polish

The second objective was to curate a dataset for the evaluation of ASR systems for Polish. The research questions considered were:
• RQ 4: What factors are crucial in designing and curating a dataset for benchmarking purposes?
• RQ 5: What data curation steps are required to create a Benchmark dataset from publicly available speech datasets?
• RQ 6: Which public Polish speech datasets can be used as benchmarks?
• RQ 7: How can the curated dataset be shared with the community?

RO3: Survey of ASR benchmarks for Polish

The next goal was to categorize and review Polish ASR benchmarks with respect to datasets, systems, tasks, domains, and evaluation metrics. The specific research questions included:
• RQ 8: How to categorize Polish ASR benchmarks using public information?
• RQ 9: What methods, datasets, and ASR systems have been used in Polish ASR benchmarks?
• RQ 10: Which Polish ASR systems have not been evaluated?
• RQ 11: Which benchmarks evaluated commercial and free systems?
• RQ 12: Which ASR system performs best?
• RQ 13: What are the main conclusions from the ASR benchmarks?
• RQ 14: How to share the survey results with the community?

RO4: Design and implementation of a system for ASR benchmarking

The following objective was the development of a system enabling the evaluation and comparison of ASR systems. The research focused on the following aspects:
• RQ 15: What tools and systems exist for ASR benchmarking?
• RQ 16: What challenges arise in evaluating multiple ASR systems, and what strategies can address them?
• RQ 17: How can the system be extended to new ASR systems, datasets, languages, metrics, and normalization methods?

RO5: Using the curated dataset to benchmark ASR systems for Polish

The goal of RO5 was to use the self-curated Speech dataset (RO2) and the evaluation system (RO4) to compare ASR systems for Polish. The specific research questions included:
• RQ 18: What is the ASR accuracy for different datasets?
• RQ 19: What is the accuracy gap between commercial and free systems?
• RQ 20: Does ASR accuracy vary with speech features?
• RQ 21: Is there an accuracy difference by age or gender?
• RQ 22: How to share evaluation results with the community?

RO6: Organization of an open competition for the ASR community

The goal was to organize a public contest for ASR practitioners to compare their solutions with the latest advances.
• RQ 23: What platforms can host a Polish ASR community challenge?
• RQ 24: How to compare community solutions with state-of-the-art ASR systems?

1.5 Research scope

1. Curation of the Polish ASR speech data catalog. Publicly available information about Polish speech datasets was manually annotated with a dedicated taxonomy. The resulting Polish ASR speech data catalog was used to select datasets for further curation. The practical utility of the catalog was evaluated through a user survey.

2. Curation of benchmark datasets from publicly available speech datasets. The datasets were selected from the speech data catalog according to the ASR evaluation criteria. They underwent automatic refinement, including the standardization of audio and metadata formats, and were organized into training, validation, and test sets. Erroneous samples were removed.

3. Analysis of the curated datasets' contents and preparation of a dashboard for dataset feature inspection. A detailed analysis of the curated datasets was performed. A dedicated dashboard was created to inspect and explore the characteristics of these datasets. This tool allowed for a comprehensive inspection of dataset attributes and facilitated a better understanding of the data.

4. Survey of Polish ASR benchmarks. A comprehensive survey was conducted to identify existing benchmarks for Polish ASR systems. The survey involved analyzing the available benchmarks, their methodologies, and the datasets they used. Insights were derived to highlight the gaps and areas for improvement in current Polish ASR benchmarks.
5. Implementation of a system for ASR evaluation. A robust system for the evaluation of ASR systems was developed. The system included tools for automatic and manual assessment of ASR output, incorporating various evaluation metrics such as WER (Word Error Rate), CER (Character Error Rate), and others. The system was designed to be scalable and adaptable for continuous benchmarking.

6. Benchmarking of ASR systems for the Polish language. The curated datasets were used to evaluate and compare the performance of ASR systems for the Polish language. In total, 25 models were evaluated. The results were made available to the community through the ASR leaderboard.

7. Publication of the Polish ASR leaderboard. A publicly accessible ASR leaderboard was developed, enabling comparison of ASR system performance. Interactive dashboards were included to allow users to explore the results in detail and compare different systems based on various criteria.

8. Organization of an open ASR challenge. The curated datasets were used to organize an open challenge for the Polish ASR community. This challenge aimed to engage the community in improving ASR technology for Polish and to benchmark new systems against the curated datasets.

1.6 Limitations

This section lists the limitations of the research conducted.

1. Language specificity: The research is confined to the Polish language, a language with distinct linguistic attributes. Its findings may not extend to ASR systems for languages with divergent phonetic or grammatical structures.

2. Dataset selection: This study is based on a selection of publicly accessible Polish speech datasets intended for ASR. The limited scope of datasets might influence the applicability of the research to broader speech data contexts and corpus linguistic research.

3. Data curation constraints: Collecting new speech recordings or annotations is beyond the scope of this work. Manual annotation was used to inspect existing data and validate automatic curation methods. No new recordings or annotations were added.

4. Technological focus: The study focused on ASR technology, particularly speech-to-text accuracy. Metrics such as latency, real-time factor, voice biometrics, and downstream task evaluation were not considered.

5. Resource availability: Research on the accuracy of commercial ASR systems and large ASR models was limited by funding and computational resources.

6. Temporal constraints: The study covers speech datasets available up to December 2023 and ASR systems up to March 2024.

7. Demographic and use case coverage: The research does not fully represent all segments of the Polish-speaking population, including unique dialects or speech variances.

8. Methodological boundaries: Evaluation results are based on selected automatic metrics. The linguistic and acoustic analysis was limited to selected aspects.

9. Commercial and academic solutions: The analysis included various commercial and free ASR systems for Polish, though not all solutions are covered due to the rapidly evolving landscape.

1.7 Methodology adopted

The methodology adopted in the research consisted of several steps, listed below.

Survey of Polish ASR speech datasets

The method consisted of a review of publicly accessible information to catalog Polish ASR speech datasets. Specific activities included:
• Literature review and identification of existing speech datasets.
• Development of a taxonomy and classification framework.
• Cataloging of speech datasets according to the framework.
• Developing a publicly accessible digital repository and dashboard.

Curation of datasets for the evaluation of Polish ASR systems

The method utilized publicly available sources to curate diverse datasets for Polish ASR development. Specific activities included:
• Selection of speech datasets based on the curated data catalog.
• Data unification, normalization, and formatting.
• Developing a publicly accessible digital repository and dashboard.

Evaluation of ASR systems for Polish

The method used curated datasets to compare ASR systems in various scenarios. Specific activities included:
• Selecting evaluation metrics.
• Evaluating ASR systems using recordings from curated datasets.
• Analyzing performance, highlighting strengths and weaknesses.
• Developing a public dashboard with results.

Organization of the Polish ASR challenge

Curated datasets were used to organize an open competition allowing the comparison of state-of-the-art ASR systems with community-developed systems. Specific activities included:
• Selecting a competition platform.
• Establishing participation and evaluation guidelines.

1.8 Contributions

Below are the major contributions of this work to the Polish ASR field:

1. Creation of the largest Polish ASR speech data catalog, documenting 53 datasets with 65 attributes.
2. Development of a metadata schema for cataloging ASR speech datasets.
3. Analysis of the current state of Polish ASR datasets and a proposal of future research directions.
4. Distribution of two datasets curated from 24 publicly available datasets.
5. Performing and sharing the analysis of the content of the curated datasets.
6. Performing the survey and creating the catalog of Polish ASR benchmarks.
7. Development of an extensible system for ASR evaluation.
8. Comprehensive evaluation of Polish ASR systems involving 7 systems, 25 models, and 24 datasets.
9. Development of a publicly accessible ASR leaderboard with interactive dashboards.
10. Improvement of reproducibility and guidance for future ASR advancements by providing public access to data catalogs, curated datasets, evaluation tools, and dashboards.
11. Organization of an open challenge for the ASR community using curated datasets.

Chapter 2

Literature Review

2.1 Introduction

This section presents literature relevant to the following topics:
• Challenges in benchmarking Machine Learning and ASR systems.
• Challenges, methods, and tools for the management of ASR speech datasets.
• ASR speech datasets and benchmarks for the Polish language.

Based on the review, the datasets, methods, and tools required to create the research artifacts and achieve the research objectives were selected.

2.2 Benchmarking of Machine Learning Systems

2.2.1 Challenges in ML benchmarking

Liao et al. provide a comprehensive overview of challenges and systemic issues in benchmarking practices across various subfields of machine learning (ML) [72]. In the meta-review, the authors studied more than 107 articles that describe benchmarks from subfields such as computer vision, natural language processing, recommender systems, and reinforcement learning. The major conclusion is that inconsistency in evaluation standards and methodologies has led to claimed advances in machine learning that do not withstand thorough examination or do not possess the broad applicability initially assumed. The authors introduced the concepts of internal and external validity of ML evaluations.
Internal validity concerns the "correctness and fairness of evaluations in the context of a specific learning problem" [72]. Internal validity is negatively affected by incorrect baseline comparisons, errors in the construction of the test set, and overfitting due to test data leakage. External validity, on the other hand, refers to the "applicability and generalizability of the evaluation findings in different learning problems, tasks, or real-world scenarios" [72]. If the metrics and the dataset are misaligned with the real-world scenario, the benchmark result may not accurately reflect the progress or performance of the ML application under the target conditions. Failures of both types are common and contribute to a misleading representation of progress within the ML field. Figure 2.1 presents specific issues of internal and external validity throughout the ML lifecycle.

Figure 2.1: Internal and external issues identified in the ML evaluation practices. Source: [72]

The authors also propose a useful distinction between terms that are often used interchangeably in the ML benchmarking context: learning problems and tasks. A learning problem comprises a dataset of input and output pairs and an associated evaluation metric to score the proposed solutions (functions over the input space). An example is the Librispeech dataset with WER as the metric to score ASR systems. A task is described in a more general manner, either in everyday language or formally. There is no fixed definition of a task, and the goal is not to set specific task definitions. Tasks can be found at different levels of detail, for example, from 'dog vs. cat classification' to 'animal classification' to 'image classification', which naturally gives rise to a hierarchy (see Figure 2.2). For the purpose of evaluation, tasks are usually instantiated by learning problems. Given the above definitions, a "benchmark is a learning problem framed as an indicator of progress on some task" [72]. Benchmarks typically include a ranking system, contest, or other framework that defines the current state-of-the-art. Enhancing WER performance on the English Librispeech dataset can be seen as an improvement in the ASR task, but only within the specific scope and use case determined by the dataset.

Figure 2.2: ML tasks and learning problems universe. Source: [72]

The recommendations to improve the robustness and reliability of ML benchmarks include:
1. adoption of more rigorous experimental designs,
2. improved documentation standards,
3. sharing of research artifacts, enabling replication and inspection,
4. development of benchmarks that more accurately reflect real-world conditions.

2.2.2 Examples of methods for curating ML benchmarking datasets

Introduction

Evaluation of ML solutions can be challenging. Factors such as the specific learning problem, the task at hand, the context of the application, and the objectives of the study must be taken into account for a benchmark to be useful. In addition, evaluation datasets are available from various sources, but their formatting, documentation, or access methods are often inconsistent. As a result, choosing and organizing the evaluation process can be an additional burden for ML professionals and data scientists. Therefore, accessible, curated, and maintained public benchmark resources are essential to identify the strengths and weaknesses of different ML methodologies.
The curation involves several processes to ensure the utility of the datasets for benchmarking purposes. This section presents examples of such curation processes and selected methods, based on popular benchmarks from various ML subfields.

Examples of datasets curated for benchmarking purposes

PMLB alpha 2017 The Penn Machine Learning Benchmark (PMLB) [96] is a curated collection of 165 datasets from a wide range of sources, covering real-world, simulated, and toy problems. The datasets were standardized with numerically encoded categorical features. Instances with fewer than 10 examples per class were removed to maintain reasonable learning scenarios. The curated datasets were then made available via a Python interface to simplify retrieval and working with the data. The authors performed a comparison of the meta-features of the datasets and found that they lacked the diversity to properly benchmark ML algorithms. The study also identified datasets for which the corresponding benchmarks matched or exceeded human baselines or reached a performance plateau, resulting in so-called benchmark saturation, as well as more challenging datasets, offering a range of difficulties for testing Machine Learning methods. The original 2017 article was presented as an ongoing project, and the collection is still being developed.

PMLB v1.0 2020 The updated version of the PMLB benchmarking suite [121] was released in 2020 (https://epistasislab.github.io/pmlb/). The original collection, which covered classification tasks, was expanded to include regression tasks. Each dataset was enhanced with a standardized metadata file that contains information about its original source, a description of its purpose, related publications, keywords, and details about individual features and their coding schemes. The structured metadata format simplified the validation process, leading to improved data accuracy and easier addition of new datasets by the community. The user experience was enhanced with a new contribution guide and an improved website interface that allows browsing, sorting, filtering, and searching for datasets. Support for the R library was also added. Pandas-profiling reports covering feature correlations and the identification of duplicates and missing values were added for each dataset, allowing users to make informed decisions regarding necessary modifications prior to using a specific dataset.

GLUE 2019 The GLUE (General Language Understanding Evaluation) benchmark (https://gluebenchmark.com/) is a collection of tools and an assembly of existing datasets for nine NLP tasks, such as question answering, sentiment analysis, and textual entailment. GLUE includes test data that were never made public and a hand-crafted diagnostic dataset for detailed linguistic analysis. Manually annotated examples serve as a tool for error analysis, qualitative model comparison, and the development of adversarial examples [145]. The benchmark's focus is not to reflect overall performance or generalization in downstream applications, but rather to understand the performance of general versus specialized models and their capabilities and limitations in handling complex linguistic phenomena.

SUPERGLUE 2020 SuperGLUE [144] builds on its forerunner, the GLUE benchmark, by incorporating a range of more challenging language comprehension tasks. SuperGLUE was developed in response to the realization that performance on the GLUE benchmark had exceeded that of non-specialist humans.
New tasks were collected by issuing an open invitation for task suggestions within the NLP community. The tasks were selected based on their level of challenge for existing NLP methods and covered a variety of formats, such as coreference resolution and question answering. The datasets were derived from preexisting data to guarantee availability and consistency. The tasks must have publicly available training data, have an automatic performance measure that correlates well with human evaluation, and should not require specialized knowledge beyond standard English proficiency. Human performance baselines were established for all tasks, ensuring ample scope for enhancing model performance. The benchmark was launched with a modular toolkit that facilitates model training, testing, and assessment. This toolkit was based on commonly used frameworks such as PyTorch and includes conventional models like BERT for initial evaluations. The leaderboard (super.gluebenchmark.com) was structured to promote fair competition and meaningful comparisons of models. The guidelines for submissions are explicit on data usage, and the tasks are designed to reduce overfitting and enhance the interpretability of model performance across a range of NLP tasks.

MMLU 2021 The Massive Multitask Language Understanding (MMLU) benchmark (https://huggingface.co/datasets/cais/mmlu) is designed to assess text models across a broad spectrum of fields and complexity levels. MMLU covers 15,908 questions from 57 topics. The questions were manually collected by graduate and undergraduate students from openly accessible online resources. The few-shot development (training) set has 5 questions for each subject, the validation set has 1,540 questions, and the test set has 14,079 questions. Each subject has questions of different difficulty levels, from elementary to high school, college, and professional. This makes it possible to gauge the depth of knowledge of a model and its capacity to deal with increasingly difficult content. Baseline results from both non-specialized human test-takers and experts are available. This comparison offers a context for assessing the performance of language models in relation to human abilities. MMLU is designed for zero-shot and few-shot settings to evaluate the ability of models to generalize and apply knowledge without extensive fine-tuning, as in many real-world scenarios (a prompt-construction sketch is given after the BIG-Bench entry below).

BIG-Bench 2022 BIG-bench [135] (https://huggingface.co/datasets/bigbench), which stands for Beyond the Imitation Game, is a benchmark for language models comprising 204 tasks put forward by 450 authors from 132 different institutions. The tasks are varied and cover a wide range of topics, including linguistics, childhood development, mathematics, common sense reasoning, biology, physics, social bias, software development, and more. BIG-bench's emphasis is on tasks that are thought to exceed the abilities of current language models. The tasks come in various formats, such as multiple-choice and text-completion questions. The curation process was carried out transparently and cooperatively. Contributions were collected through GitHub pull requests and then subjected to a peer review process. This approach guaranteed a broad spectrum of tasks and viewpoints. Expert human raters were employed to complete all tasks, establishing a reference point for evaluating the performance of the language models. BIG-bench was created with the intention of facilitating ongoing contributions of tasks and evaluations, ensuring its continued relevance.
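To make the few-shot protocol described for MMLU above concrete, the following sketch builds a 5-shot prompt from the dev split. It assumes the Hugging Face hosting at cais/mmlu cited earlier; the subject name and field names reflect that hosting and may differ in other releases.

from datasets import load_dataset

SUBJECT = "anatomy"  # one of the 57 MMLU subjects
dev = load_dataset("cais/mmlu", SUBJECT, split="dev")    # 5 solved exemplars
test = load_dataset("cais/mmlu", SUBJECT, split="test")  # questions to score

LETTERS = "ABCD"

def render(example, with_answer):
    # Render one multiple-choice question, optionally revealing the answer.
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(example["choices"]))
    answer = LETTERS[example["answer"]] if with_answer else ""
    return f"{example['question']}\n{options}\nAnswer: {answer}".rstrip()

# Few-shot prompt: five solved dev questions followed by one unsolved test question.
prompt = "\n\n".join(render(ex, True) for ex in dev)
prompt += "\n\n" + render(test[0], False)
print(prompt)

The model's continuation after the final "Answer:" is then compared against the gold letter, which is how zero-shot and few-shot accuracy is typically scored on this benchmark.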
SUPERB 2021 SUPERB (Speech Processing Universal PERformance Benchmark) [139] (https://arxiv.org/abs/2105.01051) is a toolkit and leaderboard for benchmarking the performance of a shared model on a wide range of speech processing tasks with minimal architecture changes and labeled data. Multiple speech processing tasks are included, for example, phoneme recognition, automatic speech recognition, keyword spotting, speaker identification, speaker verification, speaker diarization, intent classification, slot filling, and emotion recognition. For a dataset to be included in the benchmark, it must adhere to the conventional protocols accepted by the speech community, be publicly accessible, and allow universal participation. Datasets considered the standard benchmarks for the various tasks are included, e.g.:
• LibriSpeech: used for phoneme recognition and automatic speech recognition tasks.
• Speech Commands V1.0: utilized for keyword spotting to detect predefined words.
• VoxCeleb1: employed for speaker identification and verification tasks.
• Fluent Speech Commands: used for intent classification.
• IEMOCAP: chosen for emotion recognition tasks.

Each task has specific evaluation metrics, such as WER for speech recognition, accuracy for keyword spotting and speaker identification, and the diarization error rate (DER) for speaker diarization. The goal of the benchmark is to encourage the development of models that can perform well on diverse speech processing tasks with minimal task-specific tuning.

ASR-GLUE 2022 ASR-GLUE [29] is a benchmark for studying the effect of ASR errors on NLU tasks in terms of noise intensity, error type, and speaker variation. Six NLU tasks that are prevalent in speech-based scenarios are included, such as sentiment analysis, paraphrase detection, and natural language inference. Data instances were manually selected from existing NLU task datasets. The selection criteria excluded samples with non-standard words or overly long sentences to ensure clarity and quality in speech-to-text conversion. Six native speakers recorded the selected test samples in different noise environments. This was done to simulate real-world speech variations and introduce controlled ASR errors. The recordings were converted to text using an ASR system trained for this purpose. For tasks that require labeled data, the dataset maintained the original labels of the source datasets, ensuring that the impact of ASR errors could be assessed against known outcomes. The dataset is maintained by Tencent AI Lab, is publicly available (https://drive.google.com/drive/folders/1slqI6pUiab470vCxQBZemQZN-a_ssv1Q), and is open to community contributions.

ESB 2022 The End-to-End Speech Benchmark (ESB) [34] (https://huggingface.co/datasets/esb/datasets) aims to evaluate ASR systems across various domains, eliminating the need for domain-specific adjustments. ESB consists of a range of speech datasets from various domains, including audiobooks, political speeches, and educational talks, among others. Data instances are sourced from existing datasets such as LibriSpeech, Common Voice, VoxPopuli, TED-LIUM, GigaSpeech, SPGISpeech, Earnings-22, and AMI. The source datasets of ESB are freely available and accessible, to encourage broad participation and usage in the speech research community.
Transcription artifacts, such as punctuation and casing, which are usually normalized in many ASR systems, are preserved in this benchmark to enhance the complexity and realism of speech recognition tasks. A diagnostic dataset with manually verified transcriptions is used for the public leaderboard available on the Hugging Face platform (Open ASR Leaderboard, https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).

2.3 Benchmarking of Automatic Speech Recognition Systems

This section presents relevant work on the problem of evaluating ASR systems. Popular methods, metrics, taxonomies, and analysis frameworks are discussed, along with known challenges and design considerations.

2.3.1 Introduction

The evaluation process involves a numerical measurement of the usefulness of the output generated automatically for a given Machine Learning task. In the case of ASR, a Speech dataset and the WER metric are typically used to represent the Machine Learning task as a specific Learning problem [72]. For example, the English ASR task can be assessed as a learning problem consisting of the Librispeech Speech dataset and the WER metric [100]. The task of automatic recognition of Polish customer support conversations can be defined as the learning problem using the DiaBiz corpus and the WER metric [112, 110]. The task of recognizing clean English speech defined using the Librispeech dataset has reached the stage of benchmark saturation [148]. Furthermore, ASR systems can show on-par performance with humans on one set of Benchmark datasets and subpar accuracy across another set of use cases. As reported by Likhomanenko et al., "No single validation or test set from public datasets is adequate to gauge transferability to other public datasets or to real-world audio data" [73]. ASR systems based on an end-to-end architecture can even generate incoherent output when tested on speech from a domain that was not present in the training data [55]. Furthermore, the error rates of contemporary ASR systems evaluated on popular datasets can be lower than those achieved by trained humans [152]. Given the limited transferability of evaluation results between learning problems and datasets, Aksenova et al. [2] suggest that the ultimate objective of the ideal ASR benchmark should be to verify the capacity of an ASR system to generalize across a wide range of use cases.

Methods for comparing ASR systems or technologies can be classified as subjective or objective [15]. Subjective methods involve humans in the evaluation process and are best suited to assess the impact of ASR recognition errors and their root causes [101, 58, 28] or to validate the quality of the evaluation data [148]. Their drawbacks are the inconsistency of quality assessments by human subjects and the cost of applying them at scale. Objective methods offer the advantage of generating reproducible results because they do not require human involvement. Their key benefit is automation, with the resulting lower cost and faster execution. However, effectively evaluating the practical usability of ASR output in the context of the target application remains a challenge due to the complexity of the processes involved [104, 137]. To decide which system offers the best performance, relying solely on accuracy metrics such as WER may not be enough. Additional metrics to be considered include latency (real-time factor, RTF) [127] or precision in the downstream task [129].
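For reference, WER is conventionally defined as WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words in the minimum-edit-distance alignment of the hypothesis against the reference, and N is the number of words in the reference. A minimal sketch of computing WER and CER with the open-source jiwer library follows; the normalization shown (lowercasing, punctuation stripping) is an illustrative choice, not the exact pipeline used in the evaluations reported later in this thesis.

import string
import jiwer

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so that surface formatting
    # differences are not counted as recognition errors.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

reference = "Ala ma kota, a kot ma Alę."
hypothesis = "ala ma kota a kot ma ale"

wer = jiwer.wer(normalize(reference), normalize(hypothesis))
cer = jiwer.cer(normalize(reference), normalize(hypothesis))
print(f"WER: {wer:.3f}  CER: {cer:.3f}")

Note that after normalization the hypothesis still differs from the reference by a single diacritic ("ale" vs. "Alę"), which counts as one word substitution; such diacritic errors are a recurring source of WER inflation for Polish and motivate the careful choice of normalization discussed above.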
2.3.2 Overview of ASR benchmark design considerations

The following aspects have an impact on the utility of an ASR benchmark:
• scope of the evaluated ASR systems,
• diversity of datasets and use scenarios,
• reliability of datasets,
• diversity of analysis dimensions,
• availability of evaluation results,
• reproducibility of evaluation results.

ASR systems, with their wide range of applications and tasks, should ideally be resilient to different types of speech input variation. For instance, an ASR system that generates automatic captions for video meetings should be capable of recognizing words from diverse semantic fields, adjusting to the meeting's subject. The characteristics of speech can also differ across contexts: for instance, the style of speech used for dictating text messages is different from that of a group discussion, where participants might occasionally interrupt each other. Therefore, a benchmark can cover many 'horizontal' and 'vertical' challenges [2]. Horizontal challenges refer to ASR use cases, while vertical challenges refer to the diversity of subjects, encoding formats, and so on. The authors argue that "the more horizontal and vertical areas are covered by a benchmark, the more representative it will be, and hence it is more appropriate to measure ASR progress" [2]. These challenges and related aspects are discussed in more detail in the following subsections.

2.3.3 ASR use scenarios

Ideally, a benchmark for ASR systems covers many ASR use cases. The best way to represent various usage scenarios is the creation of a comprehensive Speech dataset, either by merging existing datasets [73, 12] or by collecting new data to fill the gaps. Aksenova et al. [2] proposed a taxonomy of ASR use cases based on their experience developing an ASR-based customer-facing product at Google. An overview of the challenges and differences in the use cases can be found in Tables 2.1 and 2.2, respectively.

Text dictation serves to enable the input of text into a digital device without manual typing. Typically, it involves relatively slow speech from a single speaker. As the user consciously interacts with a device, the speech is adjusted to maximize the chance of correct understanding [18]. Typical applications include general-purpose dictation on desktop, mobile, and portable devices, medical records transcription [78, 87], legal proceedings transcription [41, 23], language learning with computer-aided pronunciation feedback [82, 119], and speech-to-speech translation [134].

Voice search and control allow individuals to retrieve information or perform tasks through verbal commands. Speech patterns have human-to-device interaction characteristics and often contain specific nouns required to perform the task, for example, to navigate to a location of interest or play a song on a streaming service. Another example is interactive voice response (IVR) applications, where individuals contacting customer service engage with a voice-operated chatbot. This chatbot can either assist in collecting data before transferring the call or be capable of addressing the problems on its own [86].

Voicemails, oration, and audiobooks scenarios include using the ASR system to provide transcription for voicemail messages [48, 5], parliamentary speeches [65, 66, 143, 35, 107, 57, 62, 76, 51, 67, 133], and audiobooks [115, 100]. In these scenarios, speech typically originates from a single speaker.
Spontaneity artifacts such as hesitations, fillers, back-channel speech, disfluencies, false starts, and corrections are present [37, 84]. In the case of audiobooks, human-to-human speech features are less prevalent [50].

Conversations and meetings scenarios typically involve transcribing spontaneous speech among several participants within a single audio recording. As with voicemails, oration, and audiobooks, this type of speech is considered human-to-human speech. The presence of noise, overlapping speech, and distant speech adds to the challenge of recognizing spontaneous speech [54]. Practical applications include the transcription of video meeti