Faculty of Mathematics and Computer Science
Adam Mickiewicz University, Poznań

Michał Junczyk, MSc

Application of speech datasets management methods for the evaluation of Automatic Speech Recognition systems for Polish

PhD thesis

Supervisor: prof. dr hab. Krzysztof Jassem
Discipline of science: Computer and information sciences
Field of science: Natural sciences


Wydział Matematyki i Informatyki
Uniwersytet im. Adama Mickiewicza w Poznaniu

mgr inż. Michał Junczyk

Zastosowanie metod zarządzania zbiorami nagrań mowy do oceny jakości systemów automatycznego rozpoznawania mowy dla języka polskiego

Rozprawa doktorska

Promotor: prof. dr hab. Krzysztof Jassem
Dyscyplina naukowa: Informatyka
Dziedzina nauki: Nauki ścisłe i przyrodnicze


Declaration

I, Michał Junczyk, declare that the work in this dissertation titled "Application of speech datasets management methods for the evaluation of Automatic Speech Recognition systems for Polish" was carried out by me. This work has not been submitted to Adam Mickiewicz University or any other educational institution for the award of a degree or educational qualification. The information published in this dissertation has been obtained and presented in accordance with academic rules and ethical conduct. Information obtained from other sources has been referenced appropriately.

Dedication

I am grateful to many who supported me throughout this PhD journey. I greatly appreciate the unwavering support, insightful feedback, and patient guidance of my supervisor, prof. dr hab. Krzysztof Jassem. I thank the leadership and staff of the Faculty of Mathematics and Computer Science and the Doctoral School of Exact Sciences for a supportive and stimulating environment. I am grateful for the feedback and support from my mentors and colleagues at Samsung and Allegro, especially dr Mikołaj Wypych, dr inż. Bartosz Broda, dr inż. Marcin Sowański, mgr inż. Ireneusz Gawlik, mgr inż. Robert Mroczkowski, dr Aleksander Wawer, dr inż. Paweł Zawistowski and, last but not least, mgr inż. Paweł Cyrta. A heartfelt thank you to my parents for their lifelong support and belief in me. Lastly, to my beloved wife and children: your patience, sacrifices, and unwavering support made this journey possible.

Abstract

Automatic speech recognition (ASR) systems transform spoken language into written text, enabling virtual assistants, transcription tools, and intelligent home control. These systems rely on large and diverse speech datasets that reflect the linguistic and acoustic characteristics of the target population and user group. The Polish language, spoken by over 50 million people, presents ASR with unique challenges and opportunities due to its rich phonetic and morphological structure.

Public-domain speech datasets are often underutilized due to discoverability and interoperability issues. Limited access to evaluation datasets makes it difficult to verify and replicate quality tests of ASR systems, and a comprehensive assessment of multiple ASR systems requires an efficient data management structure. This study addresses these issues by creating comprehensive, accessible, and actively maintained datasets and by promoting best practices in ASR benchmarking inspired by international standards.

The study examined and cataloged 53 publicly available speech datasets, curated a benchmark dataset from 24 sources, and developed a quality assessment process for ASR systems. The curated dataset includes nearly 400,000 recordings and over 800 hours of speech from 5,000 speakers.
Selected recordings were used to compare 7 ASR systems and 25 models. The research revealed significant differences in the performance of ASR systems across test scenarios. All resources and results have been made publicly available to promote transparency, peer review, and collaboration within the research community.

This study improved methods for the management of speech data and the benchmarking of ASR systems. The comprehensive review and catalog increased the discoverability of Polish ASR speech datasets, and the curated BIGOS and PELCRA datasets provide an extensive resource of diverse speech recordings. The use of Polish ASR datasets for comparative purposes has increased threefold compared to previous studies. Improved documentation and analysis of the test data, together with the public availability of the datasets and assessment tools, will positively impact the ability to validate and compare evaluation results. The development of a data management methodology and a benchmarking system enabled reliable assessments and comparative analyses of ASR systems, as well as a better understanding of the strengths and weaknesses of ASR systems for Polish.

In summary, the conducted research improves the practical usefulness of Polish ASR datasets for academic and industrial applications and contributes to the promotion of methods, tools, and good practices for the benchmarking of ASR systems.

Contents

Glossary

1 Introduction
  1.1 Problem background
    1.1.1 The role of datasets in the training and evaluation of machine learning systems
    1.1.2 The role of speech datasets in the training and evaluation of ASR systems
    1.1.3 Challenges in ASR speech dataset management
    1.1.4 Challenges in ASR evaluation
    1.1.5 State of the ASR speech datasets and ASR evaluation for Polish
  1.2 Research aim
  1.3 Research hypothesis
  1.4 Research objectives and questions
  1.5 Research scope
  1.6 Limitations
  1.7 Methodology adopted
  1.8 Contributions

2 Literature Review
  2.1 Introduction
  2.2 Benchmarking of Machine Learning Systems
    2.2.1 Challenges in ML benchmarking
    2.2.2 Examples of methods for curating ML benchmarking datasets
  2.3 Benchmarking of Automatic Speech Recognition Systems
    2.3.1 Introduction
    2.3.2 Overview of ASR benchmark design considerations
    2.3.3 ASR use scenarios
    2.3.4 Technical challenges
    2.3.5 Performance metrics
    2.3.6 Evaluation results analysis
  2.4 ASR speech datasets management methods and tools
    2.4.1 Introduction
    2.4.2 ASR speech dataset lifecycle
    2.4.3 Overview of the ASR dataset management methods
    2.4.4 Challenges related to ASR speech datasets management
  2.5 ASR speech datasets and benchmarks for Polish
    2.5.1 ASR speech datasets for Polish
    2.5.2 ASR speech benchmarks for Polish
  2.6 Overview of tools for dataset management and ASR evaluation
    2.6.1 ASR speech datasets management tools
    2.6.2 ASR evaluation tools

3 Methodology
  3.1 Overview
  3.2 RO1: Survey of ASR speech datasets for Polish
    3.2.1 Research objectives and questions
    3.2.2 Research methodology
  3.3 RO2: Design and curation of ASR benchmark dataset for Polish
    3.3.1 Research objectives and questions
    3.3.2 Research methodology
    3.3.3 Dataset analysis process
    3.3.4 Dataset release
  3.4 RO3: Survey of ASR benchmarks for Polish
    3.4.1 Research objectives and questions
    3.4.2 Research methodology
  3.5 RO4: Design and implementation of a system for ASR systems benchmarking
    3.5.1 Research objectives and questions
    3.5.2 Research methodology
  3.6 RO5: Use of curated dataset for benchmarking ASR systems for Polish
    3.6.1 Research objectives and questions
    3.6.2 Research methodology
  3.7 RO6: Organization of competition for the ASR community
    3.7.1 Research objectives and questions
    3.7.2 Research methodology
  3.8 Summary
    3.8.1 Overview of the data management framework

4 Results
  4.1 RO1: Survey of ASR speech datasets for Polish
    4.1.1 Introduction
    4.1.2 ASR speech datasets survey results overview
    4.1.3 ASR speech data survey results
    4.1.4 Survey availability
  4.2 RO2: Design and curation of ASR benchmark dataset for Polish
    4.2.1 Introduction
    4.2.2 Datasets features derived from the documentation
    4.2.3 Datasets features derived from the analysis of datasets contents
    4.2.4 Availability of curated datasets
  4.3 RO3: Survey of ASR benchmarks for Polish
    4.3.1 Introduction
    4.3.2 Results
  4.4 RO5: Use of curated dataset for benchmarking ASR systems for Polish
    4.4.1 Introduction
    4.4.2 Evaluation setups
    4.4.3 Evaluation scenarios
    4.4.4 Reference and ASR Transcripts Normalization
    4.4.5 Evaluation results sharing
  4.5 RO6: Organization of open competition for the ASR community
    4.5.1 Introduction
    4.5.2 Program selection and task creation
    4.5.3 Comparison of community ASR solutions with other systems for Polish

5 Discussion
  5.1 Overview
  5.2 RO1: Survey of ASR speech datasets for Polish
    5.2.1 Results overview
    5.2.2 Observations from community feedback
    5.2.3 Implications
    5.2.4 Limitations
    5.2.5 Future research directions
  5.3 RO2: Design and curation of ASR benchmark dataset for Polish
    5.3.1 Results overview
    5.3.2 Observations
    5.3.3 Implications
  5.4 RO3: Survey of ASR benchmarks for Polish
    5.4.1 Results overview
    5.4.2 Implications
  5.5 RO4: Design and implementation of system for ASR systems benchmarking
    5.5.1 Results overview
  5.6 RO5: Using a curated dataset to benchmark ASR systems for Polish
    5.6.1 Results overview
    5.6.2 Observations
    5.6.3 Implications
    5.6.4 Methodological gaps in ASR benchmarking addressed in this study
    5.6.5 Limitations and future work
  5.7 RO6: Organization of competition for the ASR community

6 Conclusion
  6.1 Main Research Questions and Answers
  6.2 Contributions and achievements
  6.3 Future Directions
  6.4 Research Impact

7 Appendix
  7.1 ASR speech datasets survey
    7.1.1 Attributes of speech datasets catalog
    7.1.2 Attributes of ASR benchmarks survey
    7.1.3 Freely available speech datasets for Polish ASR
    7.1.4 Commercially available speech datasets for Polish ASR
    7.1.5 Dataset subsets cards
    7.1.6 Commercial ASR systems pricing
    7.1.7 Freely available ASR models sizes
    7.1.8 Call for participation in 2024 Polish ASR challenge

List of Figures

2.1 Internal and external issues identified in the ML evaluation practices. Source: [72]
2.2 ML tasks and learning problems universe. Source: [72]
2.3 Examples of WER sliced into groups A, B, and C, with the width of the bars reflecting relative sizes of those groups. Source: [2]
2.4 ASR speech dataset lifecycle
3.1 Overall research framework
3.2 Process of analysis of curated datasets
3.3 ASR evaluation process
3.4 ASR evaluation process data flow
3.5 BIGOS data management framework
4.1 Normalized cumulative size of Polish ASR speech datasets
4.2 ASR benchmark results — POLSL dataset. Source: [99]
4.3 ASR benchmark results — BOR dataset scenario 1, year 2018. Source: [99]
4.4 ASR benchmark results — PolEval, year 2019. Source: [59]
4.5 ASR benchmark results — DiaBiz corpus, year 2022. Source: [110]
4.6 ASR benchmark results — Whisper, MLS corpus, year 2022. Source: [117]
4.7 ASR benchmark results — Whisper, CommonVoice corpus, year 2022. Source: [117]
4.8 ASR benchmark results — Whisper, VoxPopuli corpus, year 2022. Source: [117]
4.9 ASR benchmark results — Whisper, FLEURS corpus, year 2022. Source: [117]
4.10 ASR evaluation results — SpokesBiz corpus, year 2023. Source: [111]
4.11 ASR benchmark results — Accuracy of medical terms recognition, Kuligowska et al., year 2023. Source: [68]
4.12 ASR benchmark results — Recognition errors classification, Kuligowska et al., year 2023. Source: [68]
4.13 ASR benchmark results — Medical terms recognition, Zielonka et al., year 2023. Source: [153]
4.14 Mean WER per system for all BIGOS dataset subsets
4.15 Mean WER for all PELCRA dataset subsets
4.16 Box plot of WER for all systems per specific subset of BIGOS dataset
4.17 Box plot of WER for all systems per specific subset of PELCRA dataset
4.18 Comparison of WER for the most accurate free and paid ASR systems. BIGOS dataset
4.19 Comparison of WER for the least accurate free and paid ASR systems. BIGOS dataset
4.20 Comparison of WER for the most accurate free and paid ASR systems. PELCRA dataset
4.21 Comparison of WER for the least accurate free and paid ASR systems. PELCRA dataset
4.22 WER for freely available systems for various model sizes. BIGOS dataset
4.23 WER for freely available systems for various model sizes. PELCRA dataset
4.24 Average WER as a function of audio duration. BIGOS dataset
4.25 Mean WER as a function of audio duration. PELCRA dataset. Best paid and free systems. The size of the point corresponds to the number of samples
4.26 Mean WER as a function of speech rate for top systems. BIGOS dataset
4.27 Mean WER as a function of speech rate for top systems. PELCRA dataset
4.28 Standard deviation in WER across speaker age groups. PELCRA dataset
4.29 Impact of normalization on error rates on BIGOS dataset
4.30 Impact of normalization on error rates on PELCRA dataset
4.31 Management framework extension to incorporate results from PolEval open challenge
5.1 WER scores of top systems in Polish ASR benchmarks
5.2 Number of datasets and vocabulary domains in Polish ASR benchmarks
5.3 Number of evaluated models in Polish ASR benchmarks
5.4 Number of dataset-system-model combinations in Polish ASR benchmarks

List of Tables

2.1 ASR use scenarios overview. Inspired by the work of Aksenova et al. [2]
2.2 Types of sources, recipients, and modes for various ASR use scenarios. Inspired by the work of Aksenova et al. [2]
2.3 Vertical aspects of ASR challenges. Inspired by: [2]
2.4 Practical challenges of ASR evaluation process
2.5 Metrics used for ASR evaluation
2.6 Differences between WER, MER and WIL values for different input/output combinations. Source: [90]
2.7 Evaluation results analysis dimensions
2.8 Stages and methods of speech data management
2.9 Tools for ASR datasets management
2.10 Tools for ASR evaluation
3.1 Attributes of datasets and their relevance to ASR evaluation
3.2 Sample of PELCRA SNUV references and ASR outputs
3.3 Overview of factors enhancing specific dataset utility for ASR evaluation purposes
3.4 Overview of factors decreasing datasets' utility for ASR evaluation purposes
3.5 Meta-data and partitioning of source datasets — BIGOS dataset
3.6 Meta-data and partitioning of source datasets — PELCRA dataset
3.7 Meta-data and partitioning of source datasets
3.8 Meta-data and partitioning of source datasets
3.9 Attributes in the BIGOS utterance data object
3.10 Metrics used for analysis of datasets contents
3.11 Design considerations for ASR evaluation system
3.12 Methods of normalizing references and hypotheses
3.13 Evaluation scenarios and their analysis dimensions
3.14 Relation between research question and evaluation scenarios
3.15 Whisper model types. Source: Whisper model card
3.16 ASR systems evaluated in the study
3.17 Evaluated ASR systems usage cost and license type
4.1 Polish ASR datasets survey summary
4.2 Summary of audio dataset availability and characteristics by year
4.3 Institutions contributing speech datasets for Polish
4.4 Data catalogs and platforms hosting ASR speech datasets for Polish
4.5 Audio devices for all available datasets
4.6 Audio devices for publicly available datasets
4.7 Audio devices for commercially available datasets
4.8 Distribution of sampling rate for publicly reported ASR speech datasets for Polish
4.9 Distribution of speech types for publicly reported ASR speech datasets for Polish
4.10 Speaker and recordings meta-data availability in available speech datasets
4.11 Summary statistics of curated datasets
4.12 BIGOS dataset subset license and language coverage
4.13 PELCRA for BIGOS dataset subset license and language coverage
4.14 BIGOS dataset subset domains and speech types
4.15 PELCRA for BIGOS dataset subset domains and speech types
4.16 PELCRA for BIGOS dataset subset domains and speech types
4.17 PELCRA for BIGOS dataset subset domains and speech types
4.18 Audio content size metrics for BIGOS dataset
4.19 Audio content size metrics for PELCRA dataset
4.20 Text content size metrics for BIGOS dataset
4.21 Text content size metrics for PELCRA dataset
4.22 Text content features for BIGOS dataset
4.23 Text content features for PELCRA dataset
4.24 Audio content features for BIGOS dataset
4.25 Audio content features for PELCRA dataset
4.26 Average duration of audio recordings and utterances — BIGOS dataset
4.27 Average duration of audio recordings and utterances — PELCRA dataset
4.28 Coverage of speaker meta-data — BIGOS dataset
4.29 Coverage of speaker meta-data — PELCRA dataset
4.30 Publication date and number of downloads of BIGOS datasets as of June 6th, 2024
4.31 Overview of sections providing relevant results to research questions RQ7-RQ13
4.32 Overview of ASR use-cases covered in Polish ASR benchmarks to date
4.33 Public domain ASR benchmarks 2018-2023
4.34 Overview of domains, speech types, audio sources and recording devices
4.35 Datasets size and number of domains, recordings, and speakers
4.36 Acoustic conditions, annotations, and speaker meta-data across Polish ASR benchmarks
4.37 Overview of metrics employed in Polish ASR systems benchmarks
4.38 Publicly reported evaluations of ASR models for Polish language
4.39 Number of reported independent evaluations and benchmarks per system
4.40 ASR systems supporting Polish not yet evaluated in the public domain
4.41 Types of ASR systems evaluated in public domain ASR benchmarks 2018-2023
4.42 ASR benchmarks performed in this study
4.43 ASR systems evaluation scenarios overview
4.44 Evaluation details for BIGOS dataset
4.45 Evaluation details for PELCRA dataset
4.46 WER statistics — BIGOS dataset
4.47 WER statistics — PELCRA dataset
4.48 WER statistics for all ASR systems and specific subsets of BIGOS dataset
4.49 WER statistics for all ASR systems and specific subsets of PELCRA dataset
4.50 WER statistics for free and paid ASR systems on BIGOS dataset
4.51 Best and worst systems for BIGOS dataset
4.52 WER statistics for the most accurate free and commercial systems. BIGOS dataset
4.53 WER statistics for the least accurate free and commercial systems. BIGOS dataset
4.54 WER statistics for free and paid ASR systems evaluated on PELCRA dataset
4.55 Best and worst systems for PELCRA dataset
4.56 The most accurate systems WER statistics. PELCRA dataset
4.57 The least accurate systems WER statistics. PELCRA dataset
4.58 Average WER for free systems with information about model size
4.59 Average WER for free systems with information about model size
4.60 Mean WER for specific audio duration ranges. BIGOS dataset. Best paid and free systems
4.61 Mean WER for specific audio duration ranges for top paid and free systems. PELCRA dataset
4.62 Number of samples with speaker gender information
4.63 Values and differences in mean WER scores per speaker gender
4.64 Number of samples with speaker gender information
4.65 Values and differences in average WER scores per speaker gender for 689 samples from PELCRA dataset
4.66 Mean WER across systems and age ranges. PELCRA dataset
4.67 Standard deviation and maximum difference in WER across age groups. PELCRA dataset
4.68 Reduction of error rates caused by normalization of references and hypotheses for BIGOS dataset
4.69 Reduction of error rates caused by normalization of references and hypotheses for PELCRA dataset
5.1 Benchmark dataset design requirements validation
5.2 Evaluated models, datasets, and their combinations
7.1 Publicly and freely available speech datasets for Polish
7.2 Commercially available speech datasets for Polish
7.3 Dataset size per split — Clarin Mobile
7.4 Dataset features per split — Clarin Mobile
7.5 Dataset size per split — Clarin Studio
7.6 Dataset features per split — Clarin Studio
7.7 Dataset size per split — MLS
7.8 Dataset features per split — MLS
7.9 Dataset size per split — Munich AI Labs Librivox
7.10 Dataset features per split — Munich AI Labs Librivox
7.11 Dataset size per split — Common Voice
7.12 Dataset features per split — Common Voice
7.13 Dataset size per split — AZON read
7.14 Dataset features per split — AZON read
7.15 Dataset size per split — AZON spontaneous
7.16 Dataset features per split — AZON spontaneous
7.17 Dataset size per split — PWR Maleset
7.18 Dataset features per split — PWR Maleset
7.19 Dataset size per split — PWR Shortwords
7.20 Dataset features per split — PWR Shortwords
7.21 Dataset size per split — PWR Very Important Utterances
7.22 Dataset features per split — PWR Very Important Utterances
7.23 Dataset size per split — Google FLEURS
7.24 Dataset features per split — Google FLEURS
7.25 Dataset size per split — Minds-14
7.26 Dataset features per split — Minds-14
7.27 Dataset size per split — PolEval 22
7.28 Dataset features per split — PolEval 22
7.29 Dataset size per split — Spokes Mix Emo
7.30 Dataset features per split — Spokes Mix Emo
7.31 Dataset size per split — Spokes Mix Luz
7.32 Dataset features per split — Spokes Mix Luz
7.33 Dataset size per split — Spokes Mix Parl
7.34 Dataset features per split — Spokes Mix Parl
7.35 Dataset size per split — Spokes Biz Bio
7.36 Dataset features per split — Spokes Biz Bio
7.37 Dataset size per split — Spokes Biz Interviews
7.38 Dataset features per split — Spokes Biz Interviews
7.39 Dataset size per split — Spokes Biz Luz
7.40 Dataset features per split — Spokes Biz Luz
7.41 Dataset size per split — Spokes Biz Podcasts
7.42 Dataset features per split — Spokes Biz Podcasts
7.43 Dataset size per split — Spokes Biz Presentations
7.44 Dataset features per split — Spokes Biz Presentations
7.45 Dataset size per split — Spokes Biz Various 1
7.46 Dataset features per split — Spokes Biz Various 1
7.47 Dataset size per split — Spokes Biz Various 2
7.48 Dataset features per split — Spokes Biz Various 2
7.49 Dataset size per split — Spokes Biz Interviews
7.50 Dataset features per split — Spokes Biz Interviews
7.51 Commercial ASR services pricing
7.52 Size of freely available ASR models

Glossary

Machine Learning task: An abstract problem statement, defined either in natural language or formally. Tasks vary in granularity, creating a hierarchy, e.g., from "dog vs. cat classification" to "image classification". They frame contributions in the Machine Learning field and are instantiated by learning problems for evaluation. Examples include MNIST, CIFAR-10, and ImageNet for the "image classification" task.

AMU ASR Leaderboard: Publicly accessible leaderboard presenting the results of benchmarking ASR systems supporting Polish on the BIGOS datasets.

ASR: Automatic Speech Recognition (ASR) is a technology that enables machines to process speech input and translate it into text. Also known as Speech Recognition or Speech-to-Text (STT).

Audio encoding: The manner in which a digital audio signal is encoded for storage or transmission. The most popular audio encodings for speech and ASR applications are the lossless encodings PCM and FLAC and the lossy encodings Opus and Speex.

AZON: Repository of open data from Wrocław University of Technology (Atlas Zasobów Otwartej Nauki), https://zasobynauki.pl/projekt-azon-2/.
Benchmark: A learning problem serving as an indicator of progress on an ML task. Benchmarks often include a leaderboard and an open competition. For example, within the ILSVRC competition (ImageNet Large Scale Visual Recognition Challenge), increasing accuracy on the ImageNet benchmark dataset reflects advancements in the image classification task.

Benchmark dataset: A curated, widely accepted reference used to evaluate and compare algorithms, models, or systems in a specific domain. It provides a consistent basis for comparison and objective assessment. [58, 9, 27, 96] Benchmark datasets with specific metrics are referred to as learning problems and represent more abstract tasks. [72]

benchmark saturation: A phenomenon that occurs when a learning problem becomes too easy for current ML models, leading to a plateau in performance. This can happen for various reasons, such as overfitting to test data, advancements in technology, or an insufficiently challenging dataset or evaluation metric.

BIGOS: Benchmark Intended Grouping of Open Speech, a set of curated datasets intended to facilitate benchmarking of ASR systems. Currently, BIGOS is focused on the Polish language. (Bigos in Polish means "cabbage stew"; the name is inspired by the work of Google on SpeechStew [12].)

CER: Character Error Rate.

Common Voice: Large-scale, multilingual speech dataset collected through crowdsourcing by the Mozilla Corporation.

data curation: A broad set of data management techniques, such as acquisition, formatting, documentation, enrichment, annotation, and quality verification, aimed at improving the practical utility of datasets.

FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech.

Forced alignment: A method of analyzing and synchronizing speech content with its transcription to achieve temporal alignment.

Gated datasets: An access mode for datasets on Hugging Face that allows authors to control dataset usage by requiring users to request access, providing their username and email. Authors can approve requests manually or automatically and may ask for additional information. See Hugging Face's documentation on gated datasets: https://huggingface.co/docs/hub/en/datasets-gated

GitHub: Web-based platform for version control and collaborative software development using Git. Supports public and private repositories, and features such as pull requests, issue tracking, and project wikis. https://github.com/

GMM: Gaussian Mixture Model.

Hallucinations: In ASR systems, hallucinations are outputs that do not match any spoken input. They can result from background noise, poor audio quality, or ASR model limitations, leading to incorrect transcriptions and reduced system reliability.

Hugging Face Datasets: Python library by Hugging Face designed for efficient handling and processing of large datasets. Offers simple access to a wide range of datasets and tools for dataset loading, transformation, and evaluation. Supports various data formats and integrates with machine learning workflows. https://huggingface.co/docs/datasets/index

Hugging Face Hub: Platform for hosting, sharing, and discovering machine learning models and datasets.
Provides tools for collaborative development, resource discovery, version control, and deployment of models and datasets, enhancing accessibility and community engagement. https://huggingface.co/datasets

JIWER: Python library for calculating evaluation metrics for ASR systems, such as WER, based on the minimum edit distance between one or more reference and hypothesis sentences.

Kaldi framework: Toolkit for speech recognition written in C++, licensed under the Apache License v2.0, intended for speech recognition researchers.

Learning problem: A learning problem consists of a dataset of (input, output) pairs and an evaluation metric to score solutions (functions from input to output). It is fully defined by these components without needing external semantics or data; for example, the ILSVRC-2012 dataset (ImageNet) with top-1 accuracy as the metric. [72]

LibriVox: Platform providing free, public-domain audiobooks read by volunteers; a common source of read speech for ASR datasets.

Librosa: Python library for music and audio analysis. Provides the building blocks necessary to create music information retrieval systems.

Machine Learning: Development of algorithms and statistical models that perform specific tasks without explicit instructions. These algorithms and models learn from data and make predictions or decisions based on it.

MER: Match Error Rate (MER) is calculated as the number of errors (insertions, deletions, and substitutions) divided by the total number of hits and errors, addressing the issue of unbounded WER in cases of high insertion counts.

ML evaluation set: A subset of data used to assess the performance of machine learning models. It is separate from the training data and is utilized to provide an unbiased evaluation of a model's accuracy, generalization, and effectiveness on unseen data.

MLS: Multilingual LibriSpeech.

Natural Language Processing (NLP): Research field lying at the intersection of computer science, artificial intelligence, and linguistics that focuses on making human communication, such as speech and text, understandable to computers. Involves a variety of tasks (speech or language generation and understanding), techniques (parsing, stemming, tokenization, etc.), and applications (translation, question answering, summarization, etc.).

NeMo toolkit: Toolkit for building conversational AI models (including ASR and NLP) provided by NVIDIA.

Pandas: Open-source Python library for the analysis and manipulation of tabular and time-series data.

PELCRA: Research group from the University of Łódź. PELCRA stands for Polish and English Language Corpora for Research and Applications.

PELCRA for BIGOS: Openly accessible speech dataset in the BIGOS format, curated from datasets developed by the PELCRA group. [111, 112, 109]

Polish ASR speech data catalog: Structured information about existing Polish ASR speech datasets, including availability, license, size, content characteristics, etc. Available on GitHub and Hugging Face and described in an article in the PSICL journal. [53]

Polish ASR speech datasets survey: Survey of available speech datasets for Polish ASR development.

RTF: Real-Time Factor. A metric determining how long a system takes to process a given length of input signal compared to the duration of the signal itself. For real-time systems, the RTF must be lower than 1.

Sampling rate: Number of sound samples per time unit.
Usually expressed in kilohertz (kHz), where 1 kHz corresponds to 1,000 samples per second.

SDE: Speech Data Explorer, a tool for the exploration and analysis of speech datasets provided as part of the NVIDIA NeMo toolkit.

Semantic Word Error Rate (SWER): An ASR evaluation metric that extends traditional WER by incorporating semantic weights. Proposed by Somnath Roy [125], SWER assigns higher weights to errors involving key semantic words or named entities, reflecting their impact on transcript meaning. It uses NLP techniques to calculate semantic similarity and includes CER for spelled-out entities, providing a nuanced measure of ASR accuracy that aligns with human judgment.

SemDist: Semantic Distance (SemDist) is a metric proposed by Kim et al. [56] that uses advanced language models such as BERT to measure the semantic similarity between a reference transcription and ASR output. Unlike WER, SemDist evaluates the semantic correctness of the outputs, identifying deviations that alter the conveyed message. It uses token-level embeddings and similarity measures, such as cosine distance, to provide a quantitative measure of semantic accuracy.

SER: Sentence Error Rate.

SOX: SoX (Sound eXchange) is a cross-platform command-line utility that converts various formats of computer audio files into other formats. It can also apply various effects to these sound files and play and record audio files on most platforms.

Speech dataset: A collection of digital recordings of speech together with their annotations, metadata, and documentation.

test data leakage: Leakage occurs when a model accesses unauthorized information during training, leading to inflated performance metrics and compromised evaluation integrity.

Text to Speech (TTS): Technology that converts written text into spoken words, enabling machines to read text in a human-like voice. Also known as speech synthesis.

Transformers: Open-source Python library by Hugging Face offering state-of-the-art pre-trained NLP models for tasks such as text classification, translation, and question answering. It simplifies model integration, fine-tuning, and deployment. https://huggingface.co/docs/transformers/en/index

WAV: RIFF WAVE audio format.

WER: Word Error Rate (WER) is a metric for evaluating ASR performance, calculated as the number of word-level edit operations (deletions, insertions, and substitutions) required to match the hypothesis to the reference, divided by the total number of words in the reference text.

WIL: Word Information Lost. Used as a metric to assess the efficiency of conveying information encoded as text, e.g., the accuracy of ASR systems. WIL provides a simple approximation of the proportion of word information lost in a sequence of words. Defined as 1 - WIP, where WIP stands for Word Information Preserved. [90]

WRR: Word Recognition Rate (WRR) is a metric for evaluating ASR performance, calculated as 100% minus the WER. It measures the proportion of correctly recognized words in the ASR output.
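For reference, the error-rate metrics defined in this glossary can be written compactly, following the definitions of Morris et al. [90]. With $S$, $D$, and $I$ denoting the numbers of word substitutions, deletions, and insertions in the minimum-edit-distance alignment, $H$ the number of correctly matched words (hits), and $N_{\mathrm{ref}}$, $N_{\mathrm{hyp}}$ the word counts of the reference and the hypothesis:

\begin{align*}
\mathrm{WER} &= \frac{S + D + I}{N_{\mathrm{ref}}}, \qquad \mathrm{WRR} = 1 - \mathrm{WER},\\
\mathrm{MER} &= \frac{S + D + I}{H + S + D + I},\\
\mathrm{WIL} &= 1 - \mathrm{WIP}, \qquad \mathrm{WIP} = \frac{H}{N_{\mathrm{ref}}} \cdot \frac{H}{N_{\mathrm{hyp}}},\\
\mathrm{RTF} &= \frac{T_{\mathrm{processing}}}{T_{\mathrm{audio}}}.
\end{align*}

Note that WER is unbounded (a large number of insertions can push it above 1), whereas MER and WIL are confined to the interval [0, 1].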
Chapter 1

Introduction

ASR systems process human speech signals into the corresponding textual transcriptions. Recent progress in machine learning technology, powerful computational resources, and an abundance of data have significantly advanced ASR technology, resulting in a notable improvement in the accuracy of speech-to-text conversion.

In 2017, Microsoft announced that the precision of its ASR system for English was on par with manual transcription when evaluated on the Switchboard corpus [49]; the quality in terms of the WER (word error rate) metric was below 5%. In 2022, the Whisper ASR system achieved an average WER of 5% across multiple test datasets for high-resource languages [117]. ASR technology is now widely used in various applications such as virtual assistants, meeting aids, voice search, smart home controls, and transcription tools. The increasing global demand for ASR solutions has made it a focal point of research aimed at improving speech recognition performance. Companies that develop ASR systems are constantly working to reduce error rates for new applications, domains, and languages to enhance the user experience and increase market adoption.

The Polish language is spoken by more than 50 million people around the world and is the sixth most spoken language in the European Union. The number of commercial and freely available voice technology solutions and applications for Polish is steadily growing [95]. In July 2023, more than 50 speech data resources were available for training and evaluating ASR systems [53]. New language resources are being introduced thanks to global initiatives such as Mozilla Common Voice [3] or Multilingual LibriSpeech [115], and to local projects, e.g., DiaBiz [112] and Spokes [111].

So far, no research has been conducted to survey, validate, or improve the usefulness of existing Polish ASR speech datasets. There are also no widely adopted speech datasets for performing benchmarks of ASR systems for the Polish language. International studies typically use popular multilingual datasets, for example Common Voice [3] or MLS [115], while Polish studies mostly use locally created datasets [111, 110, 65]. Although many datasets are available under permissive licenses, they are often used exclusively for specific studies due to interoperability concerns. The same practical obstacles may also contribute to the restricted use of all accessible speech datasets for Polish ASR benchmarks. This restriction limits their usefulness for researchers and the public, who rely on benchmarks and leaderboards to track progress and identify suitable models [91, 34].

There is an ongoing debate within the international ASR community about the establishment of a standardized evaluation methodology [2, 137]. Adopting a standard methodology and standard datasets enables comparing results between different studies. However, to make the results relevant to various usage scenarios, a variety of representative datasets is needed. Examples from other fields of ML [96] show that it is important for the community to take advantage of existing datasets and to easily contribute new ones for benchmarking purposes [10, 24, 145, 126]. To monitor technological advances over time, it is preferable to perform systematic benchmarking instead of one-time assessments. However, since there are many applications and variables of ASR that impact its effectiveness, establishing a common standard of evaluation is not trivial [2].

The objective of this thesis is to increase the practical utility of available speech datasets for the evaluation of Polish ASR systems. The proposed framework consists of a set of methods for speech data management and ASR evaluation. The key contributions include:

1. Survey of Polish ASR speech datasets and curation of the Polish ASR speech data catalog
2. Survey of Polish ASR benchmarks
3. Curation of a benchmark dataset from publicly available sources
4. Development of a framework for ASR systems benchmarking

The thesis also discusses the strengths and limitations of existing speech datasets and outlines potential research directions to further improve ASR data management and benchmarking practices for Polish ASR.

1.1 Problem background

1.1.1 The role of datasets in the training and evaluation of machine learning systems

Datasets are essential in the development of Machine Learning (ML) because they convey the signal used during the training, testing, and validation of ML models. These datasets encode useful information, allowing algorithms to recognize patterns within the input data. Relevant information can be obtained from the original source or annotated via a dedicated process, typically by trained humans. ML datasets must be diverse and representative of the target usage to ensure the accurate performance of models on new data during operation.

In 2020, Andrew Ng, the co-creator of the Google Brain project, introduced the term "Data-Centric AI" to the public discourse. He noted that currently 90% of academic work follows the "Model-Centric" paradigm, which assumes that data are fixed and that quality improvement is achieved through changes in the architecture and the model training process. According to the "Data-Centric AI" paradigm, the model architecture and training process remain constant, and the quality improvement of the model is achieved by increasing the quality and size of the data used for training and testing the system. He also noted that developing datasets that are not only widely applied but also actively maintained by the community is a challenge. In practice, the data preparation process is often treated as a one-time effort, and the limited adoption of standards for documentation and quality assurance methods adds to the challenge [36, 8, 53, 45, 116]. As a result, the benchmark results of ML systems reported at academic conferences can present a distorted picture of the state of technology development [72]. This can result from the relative simplicity of the task represented by the test set (e.g., the LibriSpeech corpus contains only recordings of read speech in a quiet environment) [100]. Another factor contributing to the reduced reliability of benchmark results is inaccuracies in data labels. For example, research from December 2022 indicates that in the 10 most popular test sets used in benchmarks, an average of 3.3% of the data had incorrect labels [93].

1.1.2 The role of speech datasets in the training and evaluation of ASR systems

The ASR community is highly dependent on extensive training datasets that accurately represent the speech and acoustic patterns of the target population, as well as the operating conditions of the ASR systems. The construction of datasets of this kind is a difficult undertaking that requires specialized infrastructure, meticulous planning, nimble recruitment operations, and resource-intensive data quality control [42, 3, 115, 52]. Moreover, to conduct responsible and informative testing of ASR systems [2], one needs access to evaluation datasets that are free of errors, contain abundant metadata, and are up-to-date. This necessity makes the management of ASR speech datasets even more complex and demanding.

Another challenge facing the ASR community is the discovery of relevant datasets that already exist.
Currently, there is no centralized repository dedicated to ASR speech datasets, either multilingual or for the Polish language. As a result, researchers and industry practitioners have to rely on information dispersed among many sources and may struggle to accurately determine the number of available datasets and their characteristics, such as size, recording devices, utterance domain, audio and transcription quality, and others. Ideally, a comprehensive data catalog should include download links to dataset samples, allowing seamless and in-depth inspection of the datasets of interest, in addition to the aforementioned metadata descriptors.

Speech datasets commonly include distinct sets for training, validation, and testing. Validation sets assist in fine-tuning model parameters, while test sets gauge the final model's performance, offering an impartial evaluation of its functionality in real-world scenarios. It is worth highlighting that speech datasets should encompass various languages, dialects, accents, speech styles, and noise environments in order for ASR systems to be robust. This diversity guarantees that the ASR system can handle a wide range of speech variations and operate effectively across different settings and speaker demographics.

In addition, the training of ASR models is heavily based on extensive and varied speech datasets. These datasets encode a wide spectrum of phonetic, linguistic, and acoustic attributes essential for precise speech recognition. Datasets featuring conversational speech, ambient noise, and authentic speech patterns (e.g., pauses and interruptions) enable developers to efficiently handle real-world use cases like voice-activated assistants or IVR (Interactive Voice Response) systems.

Finally, speech datasets also serve a role in the examination and mitigation of biases in ASR systems. Datasets comprising a diverse array of voices and speech attributes can help to recognize and minimize bias related to accents, dialects, age, gender, and more. [1]

New methods and resources are actively developed for the evaluation of ASR systems in academia and at technology companies such as Google, Apple, Amazon, Meta, Appen [75], or Rev [21, 22]. However, details of the specific methods or confidential datasets used to create commercial products are not disclosed due to their confidential nature. For-profit entities contribute predominantly to the curation of novel datasets from publicly accessible sources, e.g., [115, 19]. The relevant findings are presented at conferences related to speech technologies, such as Interspeech or Language Resources and Evaluation (LREC, https://lrec-coling-2024.org/), as well as at workshops such as the NeurIPS workshop on Evaluation and Benchmarks (https://neurips.cc/).

1.1.3 Challenges in ASR speech dataset management

ASR practitioners managing speech datasets face numerous practical challenges.

Data identification
Identifying the right data for the task is often difficult. The information is spread across numerous data repositories and publications. Furthermore, there is no widely established standard for documenting and evaluating the potential application of speech datasets. Often, without manual inspection of the contents of the datasets, it is not feasible to determine their quality or suitability for a specific task.

Data formatting
Although some data may be freely available and easily accessible, the diversity of audio and text file formats, data quality issues, and limited documentation can require significant data wrangling efforts before a dataset can be used effectively, as sketched below.
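As an illustration of such formatting work, the following minimal sketch uses the Hugging Face Datasets library (see the Glossary) to decode heterogeneous audio into a single specification; the dataset identifier and column names are hypothetical placeholders rather than references to a specific resource.

    # pip install datasets soundfile librosa
    from datasets import Audio, load_dataset

    # Hypothetical dataset ID and column names, used only for illustration.
    ds = load_dataset("example-org/pl-speech-corpus", split="test")

    # Decode and resample every recording to 16 kHz on the fly, regardless
    # of the on-disk format (WAV, FLAC, MP3, ...), so all subsets share one
    # common audio specification.
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

    sample = ds[0]
    print(sample["audio"]["sampling_rate"])  # 16000
    print(sample["audio"]["array"].shape)    # decoded waveform (NumPy array)
    print(sample["text"])                    # reference transcription

A unified specification of this kind allows the same evaluation code to iterate over many source datasets without per-dataset branching.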
Legal and licensing concerns
Legal and licensing limitations may apply to the use of speech data, particularly when using data from public sources or third parties.

Data privacy and ethics
Managing datasets that contain sensitive or personal data requires strict adherence to privacy laws and ethical guidelines, including obtaining consent from participants and anonymizing data where possible.

Language evolution and terminology
Language is constantly changing, with new vocabulary, expressions, and meanings frequently evolving. It is an ongoing challenge to ensure that speech datasets remain up-to-date with these linguistic shifts.

Data bias
Speech datasets can unintentionally exhibit biases toward specific demographic groups (such as age, gender, accent, and dialect), resulting in disparities in the performance of ASR systems for different user groups. [1]

Audio data quality
The accuracy of ASR may decrease due to background noise or poor audio quality. Therefore, it is crucial to manage these factors during data collection. If background noise or distorted speech is essential for the ecological validity of the ASR application under study, relevant metadata and documentation must be included to ensure an accurate interpretation of the evaluation results.

Data annotation quality
Annotation can be time-consuming and susceptible to human error, particularly when dealing with large datasets, complex domains, and diverse annotation teams.

Managing versions of datasets
It is essential to maintain version control and provide users with accurate and up-to-date datasets as they are modified and improved over time. Effective dataset management practices are necessary for this purpose.

Data storage and retrieval
The size of high-fidelity audio files presents difficulties in storage and distribution, particularly with large datasets.

Striking a balance between size and manageability
Although larger datasets can improve ASR performance, they also present difficulties in terms of computational resources and training duration. Therefore, determining the optimal balance between the size of the dataset and the ease of management is a critical issue.

1.1.4 Challenges in ASR evaluation

Common challenges
These are the challenges faced in the ASR evaluation process.

Lack of ground truth
There may be no definitive ground-truth transcription for the audio data being analyzed, for example, in the case of multiple spelling conventions.

Domain-specific challenges
ASR systems may perform differently depending on the domain or context. For example, a system trained on news broadcasts may not perform as well on telephone conversations. Hence, a careful selection of appropriate evaluation datasets that represent the target domain is required. For example, significant discrepancies have been reported in recent comparisons of the accuracy of ASR systems on medical terminology in Polish [68, 153].

Metric selection
Different metrics are used in the scientific literature, the most popular being WER (Word Error Rate). Depending on the ASR application, the appropriate evaluation metric and method should be used; a minimal example of metric computation is sketched below.
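As a minimal sketch of metric computation, the snippet below uses the JIWER library (listed in the Glossary) to score a hypothesis against a reference; the example transcripts and the normalization chain are illustrative assumptions, not the exact configuration used in this study.

    # pip install jiwer
    import jiwer

    reference  = "dzień dobry dzwonię w sprawie zamówienia"  # ground-truth transcript
    hypothesis = "dzien dobry dzwonie w sprawie zamowienia"  # ASR system output

    # Illustrative normalization applied to both sides before scoring;
    # the choice of steps strongly affects the reported error rates.
    normalize = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.Strip(),
    ])
    ref, hyp = normalize(reference), normalize(hypothesis)

    print("WER:", jiwer.wer(ref, hyp))  # word error rate
    print("MER:", jiwer.mer(ref, hyp))  # match error rate
    print("WIL:", jiwer.wil(ref, hyp))  # word information lost
    print("CER:", jiwer.cer(ref, hyp))  # character error rate

Because normalization can hide or expose whole classes of errors (casing, punctuation, diacritics), reporting it alongside the scores is essential for comparability.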
Limited resources

Evaluation of ASR requires significant resources, including data storage, computing and cloud usage costs, human expertise, and time for the analysis of results.

Conflicts of interest

Commercial sources often showcase and explain ASR solutions through company reports, testimonials, or white papers. These providers typically strive to highlight the strengths of their products. As a result, there is a need for independent comparative research on existing ASR systems, focusing on evaluating their performance, scalability, and accessibility to provide practical benefits for particular applications or domains.

Challenges in industrial settings

Additional factors must be taken into account when creating an ASR system in an industrial environment. To ensure and continuously monitor the quality of technology, products, or services, companies conduct continuous research, implementations, and tests with the aim of improving product features and eliminating defects and their causes. To test the quality of a solution based on machine learning algorithms under conditions that match actual use, it is necessary to prepare and continuously update test data that are representative of the specific requirements of the offered solution, for example, the language of the target user group, the device, and the domain. Moreover, ASR systems must be tested to determine the impact of disturbances and modifications of the acoustic signal, such as:
• variable characteristics of sound processing in a given type or specific model of device,
• distance and position of the user relative to the device,
• the presence of discontinuities and additive noise in the speech signal.

Ideally, ASR testing should also verify robustness to speech variations resulting from individual user characteristics such as gender, accent, age, language proficiency, ethnic background, emotional or health condition, articulation quality, and so on. To check whether the quality requirements of an ASR-based product or service are met, it is necessary to perform a series of tests on a sample representative of real-use conditions.

In practice, obtaining representative test data before deploying a service or product to the market is a significant challenge and requires substantial investment in preparing the appropriate environment, scenarios, and processes to acquire and control data quality. This is because numerous companies do not possess sufficient resources and know-how to record new utterances under controlled conditions or to transcribe and annotate existing recordings. The requirements and characteristics of real-world usage data evolve rapidly. The more quality criteria are considered, the more extensive the resources required to design, create, curate, and validate ASR evaluation datasets. Companies developing ASR commercially require dedicated processes and systems to ensure the quality and availability of data for continuously changing product requirements. A coherent methodology includes data typologies, data standards, annotation protocols, operating procedures, and systems for data collection and annotation.

1.1.5 State of the ASR speech datasets and ASR evaluation for Polish

In recent years, the field of NLP has experienced a surge in benchmarks designed to evaluate the most widely available systems on a wide range of datasets [145, 144].
The most advanced research on the methodology for evaluating ASR systems, and on the requirements for the data used for this purpose, concerns the English language and, to a much lesser extent, selected European languages, such as German. In addition, there has been growing interest in ASR benchmarks in the international community [139, 34, 2, 1, 26]. In Poland, the growing interest in data for AI development and in benchmarks for Natural Language Processing (NLP) is evidenced by the annually organized PolEval competitions [59], the KLEJ initiative (Comprehensive List of Language Evaluations) [126], and LEPISZCZE [4].

The first benchmark of Polish ASR systems was conducted in 2018. Three commercial ASR systems were evaluated on a set of recordings representing the domain and acoustic conditions of security officer training [99]. In 2019, the first open competition was organized under the PolEval initiative [59]. Six community-provided systems were evaluated using datasets created from recordings of the Polish Parliament. The next benchmark, in 2022, compared the accuracy of 3 commercial ASR systems using recordings from the customer support domain [112]. The most recent benchmarks focused on the recognition accuracy of medical terms [153, 68].

The major challenges of Polish ASR benchmarks include:
• limited utilization of publicly available speech datasets,
• limited reproducibility due to lack of access to evaluation datasets,
• lack of independent quality verification of the test sets used in evaluations,
• limited number of evaluated systems.

1.2 Research aim

The primary aim of this thesis was to design and implement a data management framework to increase the utility of the available Polish speech datasets for the evaluation of ASR systems. The initial stage involved creating a taxonomy and organizing metadata on existing speech datasets using publicly accessible information. The subsequent stage covered the quantitative evaluation of the characteristics of the datasets to determine their usefulness for ASR evaluation. The selected datasets were then consolidated, refined, and made openly accessible. The final stage was the development of an evaluation system and the use of the curated Speech dataset to compare various ASR systems for the Polish language.

1.3 Research hypothesis

The hypothesis advanced in this thesis is the following:

The creation of an extensive data management framework will make it possible to reliably and objectively evaluate the ASR systems available for Polish.

1.4 Research objectives and questions

This section presents the main research objectives (RO) and research questions (RQ).

RO1: Survey of ASR speech datasets for Polish

The first objective was to survey existing ASR speech datasets for Polish. The research questions addressed were:
• RQ 1: How to systematically categorize Polish ASR speech datasets using public information?
• RQ 2: What is the current state of Polish ASR speech datasets?
• RQ 3: How can the survey findings be shared for community feedback?

RO2: Design and curation of the speech dataset for Polish

The second objective was to curate a dataset for the evaluation of ASR systems for Polish. The research questions considered were:
• RQ 4: What factors are crucial in designing and curating a dataset for benchmarking purposes?
• RQ 5: What data curation steps are required to create a Benchmark dataset from publicly available speech datasets?
• RQ 6: Which public Polish speech datasets can be used as benchmarks?
• RQ 7: How can the curated dataset be shared with the community?

RO3: Survey of ASR benchmarks for Polish

The next goal was to categorize and review Polish ASR benchmarks with respect to datasets, systems, tasks, domains, and evaluation metrics. The specific research questions included:
• RQ 8: How to categorize Polish ASR benchmarks using public information?
• RQ 9: What methods, datasets, and ASR systems have been used in Polish ASR benchmarks?
• RQ 10: Which Polish ASR systems have not been evaluated?
• RQ 11: Which benchmarks evaluated commercial and free systems?
• RQ 12: Which ASR system performs best?
• RQ 13: What are the main conclusions from the ASR benchmarks?
• RQ 14: How to share the survey results with the community?

RO4: Design and implementation of a system for ASR benchmarking

The following objective was the development of a system enabling the evaluation and comparison of ASR systems. The research focused on the following aspects:
• RQ 15: What tools and systems exist for ASR benchmarking?
• RQ 16: What challenges arise in evaluating multiple ASR systems, and what strategies can address them?
• RQ 17: How can the system be extended to new ASR systems, datasets, languages, metrics, and normalization methods?

RO5: Using the curated dataset to benchmark ASR systems for Polish

The goal of RO5 was to use the self-curated Speech dataset (RO2) and the evaluation system (RO4) to compare ASR systems for Polish. The specific research questions included:
• RQ 18: What is the ASR accuracy for different datasets?
• RQ 19: What is the accuracy gap between commercial and free systems?
• RQ 20: Does ASR accuracy vary with speech features?
• RQ 21: Is there an accuracy difference by age or gender?
• RQ 22: How to share evaluation results with the community?

RO6: Organization of an open competition for the ASR community

The goal was to organize a public contest for ASR practitioners to compare their solutions with the latest advances.
• RQ 23: What platforms can host a Polish ASR community challenge?
• RQ 24: How to compare community solutions with state-of-the-art ASR systems?

1.5 Research scope

1. Curation of the Polish ASR speech data catalog. Publicly available information about Polish speech datasets was manually annotated with a dedicated taxonomy. The resulting Polish ASR speech data catalog was used to select datasets for further curation. The practical utility of the catalog was evaluated through a user survey.

2. Curation of benchmark datasets from publicly available speech datasets. The datasets were selected from the speech data catalog according to the ASR evaluation criteria. They underwent automatic refinement, including the standardization of audio and metadata formats, and were organized into training, validation, and test sets. Erroneous samples were removed.

3. Analysis of the curated datasets' contents and preparation of a dashboard for dataset feature inspection. A detailed analysis of the curated datasets was performed. A dedicated dashboard was created to inspect and explore the characteristics of these datasets. This tool allowed for a comprehensive inspection of dataset attributes and facilitated a better understanding of the data.

4. Survey of Polish ASR benchmarks. A comprehensive survey was conducted to identify existing benchmarks for Polish ASR systems. The survey involved analyzing the available benchmarks, their methodologies, and the datasets they used. Insights were derived to highlight the gaps and areas for improvement in current Polish ASR benchmarks.
5. Implementation of a system for ASR evaluation. A robust system for the evaluation of ASR systems was developed. The system included tools for automatic and manual assessment of ASR output, incorporating various evaluation metrics such as WER (Word Error Rate), CER (Character Error Rate), and others. The system was designed to be scalable and adaptable for continuous benchmarking.

6. Benchmarking of ASR systems for the Polish language. The curated datasets were used to evaluate and compare the performance of ASR systems for the Polish language. In total, 25 models were evaluated. The results were made available to the community through the ASR leaderboard.

7. Publication of the Polish ASR leaderboard. A publicly accessible ASR leaderboard was developed, enabling comparison of ASR system performance. Interactive dashboards were included to allow users to explore the results in detail and compare different systems based on various criteria.

8. Organization of an open ASR challenge. The curated datasets were used to organize an open challenge for the Polish ASR community. This challenge aimed to engage the community in improving ASR technology for Polish and to benchmark new systems against the curated datasets.

1.6 Limitations

This section lists the limitations of the research conducted.

1. Language specificity: The research is confined to the Polish language, a language with distinct linguistic attributes. Its findings may not extend to ASR systems for languages with divergent phonetic or grammatical structures.

2. Dataset selection: This study is based on a selection of publicly accessible Polish speech datasets intended for ASR. The limited scope of datasets might influence the applicability of the research to broader speech data contexts and corpus linguistic research.

3. Data curation constraints: Collecting new speech recordings or annotations is beyond the scope of this work. Manual annotation was used to inspect existing data and validate automatic curation methods. No new recordings or annotations were added.

4. Technological focus: The study focused on ASR technology, particularly speech-to-text accuracy. Metrics such as latency, real-time factor, voice biometrics, and downstream task evaluation were not considered.

5. Resource availability: Research on the accuracy of commercial ASR systems and large ASR models was limited by funding and computational resources.

6. Temporal constraints: The study covers speech datasets available up to December 2023 and ASR systems up to March 2024.

7. Demographic and use case coverage: The research does not fully represent all segments of the Polish-speaking population, including unique dialects or speech variances.

8. Methodological boundaries: Evaluation results are based on selected automatic metrics. The linguistic and acoustic analysis was limited to selected aspects.

9. Commercial and academic solutions: The analysis included various commercial and free ASR systems for Polish, though not all solutions are covered due to the rapidly evolving landscape.

1.7 Methodology adopted

The methodology adopted in the research consisted of several steps, listed below.

Survey of Polish ASR speech datasets

The method consisted of a review of publicly accessible information to catalog Polish ASR speech datasets. Specific activities included:
• Literature review and identification of existing speech datasets.
• Development of a taxonomy and classification framework.
• Cataloging of speech datasets according to the framework.
• Developing a publicly accessible digital repository and dashboard.

Curation of datasets for the evaluation of Polish ASR systems

The method utilized publicly available sources to curate diverse datasets for Polish ASR development. Specific activities included:
• Selection of speech datasets based on the curated data catalog.
• Data unification, normalization, and formatting.
• Developing a publicly accessible digital repository and dashboard.

Evaluation of ASR systems for Polish

The method used curated datasets to compare ASR systems in various scenarios. Specific activities included:
• Selecting evaluation metrics.
• Evaluating ASR systems using recordings from curated datasets.
• Analyzing performance, highlighting strengths and weaknesses.
• Developing a public dashboard with results.

Organization of the Polish ASR challenge

Curated datasets were used to organize an open competition allowing the comparison of state-of-the-art ASR systems with community-developed systems. Specific activities included:
• Selecting a competition platform.
• Establishing participation and evaluation guidelines.

1.8 Contributions

Below are the major contributions of this work to the Polish ASR field:

1. Creation of the largest Polish ASR speech data catalog, documenting 53 datasets with 65 attributes.
2. Development of a metadata schema for cataloging ASR speech datasets.
3. Analysis of the current state of Polish ASR datasets and a proposal of future research directions.
4. Distribution of two datasets curated from 24 publicly available datasets.
5. Performing and sharing the analysis of the content of the curated datasets.
6. Performing the survey and creating the catalog of Polish ASR benchmarks.
7. Development of an extensible system for ASR evaluation.
8. Comprehensive evaluation of Polish ASR systems involving 7 systems, 25 models, and 24 datasets.
9. Development of a publicly accessible ASR leaderboard with interactive dashboards.
10. Improvement of reproducibility and guidance for future ASR advancements by providing public access to data catalogs, curated datasets, evaluation tools, and dashboards.
11. Organization of an open challenge for the ASR community using curated datasets.

Chapter 2

Literature Review

2.1 Introduction

This section presents literature relevant to the following topics:
• Challenges in benchmarking Machine Learning and ASR systems.
• Challenges, methods, and tools for the management of ASR speech datasets.
• ASR speech datasets and benchmarks for the Polish language.

Based on the review, the datasets, methods, and tools required to create the research artifacts and achieve the research objectives were selected.

2.2 Benchmarking of Machine Learning Systems

2.2.1 Challenges in ML benchmarking

Liao et al. provide a comprehensive overview of challenges and systemic issues in benchmarking practices across various subfields of machine learning (ML) [72]. In the meta-review, the authors studied more than 107 articles that describe benchmarks from subfields such as computer vision, natural language processing, recommender systems, and reinforcement learning. The major conclusion is that inconsistency in evaluation standards and methodologies has led to claimed advances in machine learning that do not withstand thorough examination or do not possess the broad applicability initially assumed. The authors introduced the concepts of internal and external validity of ML evaluations.
Internal validity concerns the "correctness and fairness of evaluations in the context of a specific learning problem" [72]. Internal validity is negatively affected by incorrect baseline comparisons, errors in the construction of the test set, and overfitting due to test data leakage. External validity, on the other hand, refers to the "applicability and generalizability of the evaluation findings in different learning problems, tasks, or real-world scenarios" [72]. If the metrics and the dataset are misaligned with the real-world scenario, the benchmark result may not accurately reflect the progress or performance of the ML application under the target conditions. Failures of both types are common and contribute to a misleading representation of progress within the ML field. Figure 2.1 presents specific issues of internal and external validity throughout the ML lifecycle.

Figure 2.1: Internal and external issues identified in the ML evaluation practices. Source: [72]

The authors also propose a useful distinction between terms that are often used interchangeably in the ML benchmarking context: learning problems and tasks. A learning problem comprises a dataset of input and output pairs and an associated evaluation metric to score the proposed solutions (functions over the input space). An example is the Librispeech dataset with WER as the metric to score ASR systems. A task is described in a more general manner, either in everyday language or formally. There is no fixed definition of a task, and the goal is not to set specific task definitions. Tasks can be found at different levels of detail, for example, from 'dog vs. cat classification' to 'animal classification' to 'image classification', which naturally gives rise to a hierarchy (see Figure 2.2). For the purpose of evaluation, tasks are usually instantiated by learning problems. Given the above definitions, a "benchmark is a learning problem framed as an indicator of progress on some task" [72]. Benchmarks typically include a ranking system, contest, or other framework that defines the current state-of-the-art. Enhancing WER performance on the English Librispeech dataset can be seen as an improvement in the ASR task, but only within the specific scope and use case determined by the dataset.

Figure 2.2: ML tasks and learning problems universe. Source: [72]

The recommendations to improve the robustness and reliability of ML benchmarks include:
1. adoption of more rigorous experimental designs,
2. improved documentation standards,
3. sharing of research artifacts, enabling replication and inspection,
4. development of benchmarks that more accurately reflect real-world conditions.

2.2.2 Examples of methods for curating ML benchmarking datasets

Introduction

Evaluation of ML solutions can be challenging. Factors such as the specific learning problem, the task at hand, the context of the application, and the objectives of the study must be taken into account for a benchmark to be useful. In addition, evaluation datasets are available from various sources, but their formatting, documentation, or access methods are often inconsistent. As a result, choosing and organizing the evaluation process can be an additional burden for ML professionals and data scientists. Therefore, accessible, curated, and maintained public benchmark resources are essential to identify the strengths and weaknesses of different ML methodologies.
The curation involves several processes to ensure the utility of the datasets for benchmarking purposes. This section presents examples of such curation processes and selected methods, based on popular benchmarks from various ML subfields.

Examples of datasets curated for benchmarking purposes

PMLB alpha 2017 The Penn Machine Learning Benchmark (PMLB) [96] is a curated collection of 165 datasets from a wide range of sources, covering real-world, simulated, and toy problems. The datasets were standardized with numerically encoded categorical features. Instances with fewer than 10 examples per class were removed to maintain reasonable learning scenarios. The curated datasets were then made available via a Python interface to simplify retrieval and working with the data. The authors performed a comparison of the meta-features of the datasets and found that they lacked the diversity to properly benchmark ML algorithms. The study also identified datasets for which the corresponding benchmarks matched or exceeded human baselines or reached a performance plateau, resulting in so-called benchmark saturation, as well as more challenging datasets, offering a range of difficulties for testing Machine Learning methods. The original 2017 article was presented as an ongoing project, and the collection is still being developed.

PMLB v1.0 2020 The updated version of the PMLB benchmarking suite [121] was released in 2020 (https://epistasislab.github.io/pmlb/). The original collection, which covered classification tasks, was expanded to include regression tasks. Each dataset was enhanced with a standardized metadata file that contains information about its original source, a description of its purpose, related publications, keywords, and details about individual features and their coding schemes. The structured metadata format simplified the validation process, leading to improved data accuracy and easier addition of new datasets by the community. The user experience was enhanced with a new contribution guide and an improved website interface that allows browsing, sorting, filtering, and searching for datasets. Support for the R library was also added. Pandas-profiling reports covering feature correlations and the identification of duplicates and missing values were added for each dataset, allowing users to make informed decisions regarding necessary modifications prior to using a specific dataset.

GLUE 2019 The GLUE (General Language Understanding Evaluation) benchmark (https://gluebenchmark.com/) is a collection of tools and an assembly of existing datasets for nine NLP tasks, such as question answering, sentiment analysis, and textual entailment. GLUE includes test data that were never made public and a hand-crafted diagnostic dataset for detailed linguistic analysis. Manually annotated examples serve as a tool for error analysis, qualitative model comparison, and the development of adversarial examples [145]. The benchmark's focus is not to reflect overall performance or generalization in downstream applications, but rather to understand the performance of general versus specialized models and their capabilities and limitations in handling complex linguistic phenomena.

SUPERGLUE 2020 SuperGLUE [144] builds on its forerunner, the GLUE benchmark, by incorporating a range of more challenging language comprehension tasks. SuperGLUE was developed in response to the realization that performance on the GLUE benchmark had exceeded that of non-specialist humans.
New tasks were collected by issuing an open invitation for task suggestions within the NLP community. The tasks were selected based on their level of challenge for existing NLP methods and covered a variety of formats, such as coreference resolution and question answering. The datasets were derived from preexisting data to guarantee availability and consistency. The tasks must have publicly available training data, have an automatic performance measure that correlates well with human evaluation, and should not require specialized knowledge beyond standard English proficiency. Human performance baselines were established for all tasks, ensuring ample scope for enhancing model performance. The benchmark was launched with a modular toolkit that facilitates model training, testing, and assessment. This toolkit was based on commonly used frameworks such as PyTorch and includes conventional models like BERT for initial evaluations. The leaderboard (super.gluebenchmark.com) was structured to promote fair competition and meaningful comparisons of models. The guidelines for submissions are explicit on data usage, and the tasks are designed to reduce overfitting and enhance the interpretability of model performance across a range of NLP tasks.

MMLU 2021 The Massive Multitask Language Understanding (MMLU) benchmark (https://huggingface.co/datasets/cais/mmlu) is designed to assess text models across a broad spectrum of fields and complexity levels. MMLU covers 15,908 questions from 57 topics. The questions were manually collected by graduate and undergraduate students from openly accessible online resources. The few-shot development (training) set has 5 questions for each subject, the validation set has 1,540 questions, and the test set has 14,079 questions. Each subject has questions of different difficulty levels, from elementary to high school, college, and professional. This makes it possible to gauge the depth of knowledge of a model and its capacity to deal with increasingly difficult content. Baseline results from both non-specialized human test-takers and experts are available. This comparison offers a context for assessing the performance of language models in relation to human abilities. MMLU is designed for zero-shot and few-shot settings to evaluate the ability of models to generalize and apply knowledge without extensive fine-tuning, as in many real-world scenarios (a prompt-construction sketch is given after the BIG-Bench entry below).

BIG-Bench 2022 BIG-bench [135] (https://huggingface.co/datasets/bigbench), which stands for Beyond the Imitation Game, is a benchmark for language models comprising 204 tasks put forward by 450 authors from 132 different institutions. The tasks are varied and cover a wide range of topics, including linguistics, childhood development, mathematics, common sense reasoning, biology, physics, social bias, software development, and more. BIG-bench's emphasis is on tasks that are thought to exceed the abilities of current language models. The tasks come in various formats, such as multiple-choice and text-completion questions. The curation process was carried out transparently and cooperatively. Contributions were collected through GitHub pull requests and then subjected to a peer review process. This approach guaranteed a broad spectrum of tasks and viewpoints. Expert human raters were employed to complete all tasks, establishing a reference point for evaluating the performance of the language models. BIG-bench was created with the intention of facilitating ongoing contributions of tasks and evaluations, ensuring its continued relevance.
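To make the few-shot protocol described for MMLU above concrete, the following sketch builds a 5-shot prompt from the dev split. It assumes the Hugging Face hosting at cais/mmlu cited earlier; the subject name and field names reflect that hosting and may differ in other releases.

from datasets import load_dataset

SUBJECT = "anatomy"  # one of the 57 MMLU subjects
dev = load_dataset("cais/mmlu", SUBJECT, split="dev")    # 5 solved exemplars
test = load_dataset("cais/mmlu", SUBJECT, split="test")  # questions to score

LETTERS = "ABCD"

def render(example, with_answer):
    # Render one multiple-choice question, optionally revealing the answer.
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(example["choices"]))
    answer = LETTERS[example["answer"]] if with_answer else ""
    return f"{example['question']}\n{options}\nAnswer: {answer}".rstrip()

# Few-shot prompt: five solved dev questions followed by one unsolved test question.
prompt = "\n\n".join(render(ex, True) for ex in dev)
prompt += "\n\n" + render(test[0], False)
print(prompt)

The model's continuation after the final "Answer:" is then compared against the gold letter, which is how zero-shot and few-shot accuracy is typically scored on this benchmark.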
SUPERB 2021 SUPERB (Speech Processing Universal PERformance Benchmark) [139] (https://arxiv.org/abs/2105.01051) is a toolkit and leaderboard for benchmarking the performance of a shared model on a wide range of speech processing tasks with minimal architecture changes and labeled data. Multiple speech processing tasks are included, for example, phoneme recognition, automatic speech recognition, keyword spotting, speaker identification, speaker verification, speaker diarization, intent classification, slot filling, and emotion recognition. For a dataset to be included in the benchmark, it must adhere to the conventional protocols accepted by the speech community, be publicly accessible, and allow universal participation. Datasets considered the standard benchmarks for the various tasks are included, e.g.:
• LibriSpeech: used for phoneme recognition and automatic speech recognition tasks.
• Speech Commands V1.0: utilized for keyword spotting to detect predefined words.
• VoxCeleb1: employed for speaker identification and verification tasks.
• Fluent Speech Commands: used for intent classification.
• IEMOCAP: chosen for emotion recognition tasks.

Each task has specific evaluation metrics, such as WER for speech recognition, accuracy for keyword spotting and speaker identification, and the diarization error rate (DER) for speaker diarization. The goal of the benchmark is to encourage the development of models that can perform well on diverse speech processing tasks with minimal task-specific tuning.

ASR-GLUE 2022 ASR-GLUE [29] is a benchmark for studying the effect of ASR errors on NLU tasks in terms of noise intensity, error type, and speaker variation. Six NLU tasks that are prevalent in speech-based scenarios are included, such as sentiment analysis, paraphrase detection, and natural language inference. Data instances were manually selected from existing NLU task datasets. The selection criteria excluded samples with non-standard words or overly long sentences to ensure clarity and quality in speech-to-text conversion. Six native speakers recorded the selected test samples in different noise environments. This was done to simulate real-world speech variations and introduce controlled ASR errors. The recordings were converted to text using an ASR system trained for this purpose. For tasks that require labeled data, the dataset maintained the original labels of the source datasets, ensuring that the impact of ASR errors could be assessed against known outcomes. The dataset is maintained by Tencent AI Lab, is publicly available (https://drive.google.com/drive/folders/1slqI6pUiab470vCxQBZemQZN-a_ssv1Q), and is open to community contributions.

ESB 2022 The End-to-End Speech Benchmark (ESB) [34] (https://huggingface.co/datasets/esb/datasets) aims to evaluate ASR systems across various domains, eliminating the need for domain-specific adjustments. ESB consists of a range of speech datasets from various domains, including audiobooks, political speeches, and educational talks, among others. Data instances are sourced from existing datasets such as LibriSpeech, Common Voice, VoxPopuli, TED-LIUM, GigaSpeech, SPGISpeech, Earnings-22, and AMI. The source datasets of ESB are freely available and accessible, to encourage broad participation and usage in the speech research community.
Transcription artifacts, such as punctuation and casing, which are usually normalized in many ASR systems, are preserved in this benchmark to enhance the complexity and realism of speech recognition tasks. A diagnostic dataset with manually verified transcriptions is used for the public leaderboard available on the Hugging Face platform (Open ASR Leaderboard, https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).

2.3 Benchmarking of Automatic Speech Recognition Systems

This section presents relevant work on the problem of evaluating ASR systems. Popular methods, metrics, taxonomies, and analysis frameworks are discussed, along with known challenges and design considerations.

2.3.1 Introduction

The evaluation process involves a numerical measurement of the usefulness of the output generated automatically for a given Machine Learning task. In the case of ASR, a Speech dataset and the WER metric are typically used to represent the Machine Learning task as a specific Learning problem [72]. For example, the English ASR task can be assessed as a learning problem consisting of the Librispeech Speech dataset and the WER metric [100]. The task of automatic recognition of Polish customer support conversations can be defined as the learning problem using the DiaBiz corpus and the WER metric [112, 110]. The task of recognizing clean English speech defined using the Librispeech dataset has reached the stage of benchmark saturation [148]. Furthermore, ASR systems can show on-par performance with humans on one set of Benchmark datasets and subpar accuracy across another set of use cases. As reported by Likhomanenko et al., "No single validation or test set from public datasets is adequate to gauge transferability to other public datasets or to real-world audio data" [73]. ASR systems based on an end-to-end architecture can even generate incoherent output when tested on speech from a domain that was not present in the training data [55]. Furthermore, the error rates of contemporary ASR systems evaluated on popular datasets can be lower than those achieved by trained humans [152]. Given the limited transferability of evaluation results between learning problems and datasets, Aksenova et al. [2] suggest that the ultimate objective of the ideal ASR benchmark should be to verify the capacity of an ASR system to generalize across a wide range of use cases.

Methods for comparing ASR systems or technologies can be classified as subjective or objective [15]. Subjective methods involve humans in the evaluation process and are best suited to assess the impact of ASR recognition errors and their root causes [101, 58, 28] or to validate the quality of the evaluation data [148]. Their drawbacks are the inconsistency of quality assessments by human subjects and the cost of applying them at scale. Objective methods offer the advantage of generating reproducible results because they do not require human involvement. Their key benefit is automation, with the resulting lower cost and faster execution. However, effectively evaluating the practical usability of ASR output in the context of the target application remains a challenge due to the complexity of the processes involved [104, 137]. To decide which system offers the best performance, relying solely on accuracy metrics such as WER may not be enough. Additional metrics to be considered include latency (real-time factor, RTF) [127] or precision in the downstream task [129].
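For reference, WER is conventionally defined as WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words in the minimum-edit-distance alignment of the hypothesis against the reference, and N is the number of words in the reference. A minimal sketch of computing WER and CER with the open-source jiwer library follows; the normalization shown (lowercasing, punctuation stripping) is an illustrative choice, not the exact pipeline used in the evaluations reported later in this thesis.

import string
import jiwer

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so that surface formatting
    # differences are not counted as recognition errors.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

reference = "Ala ma kota, a kot ma Alę."
hypothesis = "ala ma kota a kot ma ale"

wer = jiwer.wer(normalize(reference), normalize(hypothesis))
cer = jiwer.cer(normalize(reference), normalize(hypothesis))
print(f"WER: {wer:.3f}  CER: {cer:.3f}")

Note that after normalization the hypothesis still differs from the reference by a single diacritic ("ale" vs. "Alę"), which counts as one word substitution; such diacritic errors are a recurring source of WER inflation for Polish and motivate the careful choice of normalization discussed above.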
2.3.2 Overview of ASR benchmark design considerations

The following aspects have an impact on the utility of an ASR benchmark:
• scope of the evaluated ASR systems,
• diversity of datasets and use scenarios,
• reliability of datasets,
• diversity of analysis dimensions,
• availability of evaluation results,
• reproducibility of evaluation results.

ASR systems, with their wide range of applications and tasks, should ideally be resilient to different types of speech input variation. For instance, an ASR system that generates automatic captions for video meetings should be capable of recognizing words from diverse semantic fields, adjusting to the meeting's subject. The characteristics of speech can also differ across contexts: for instance, the style of speech used for dictating text messages is different from that of a group discussion, where participants might occasionally interrupt each other. Therefore, a benchmark can cover many 'horizontal' and 'vertical' challenges [2]. Horizontal challenges refer to ASR use cases, while vertical challenges refer to the diversity of subjects, encoding formats, and so on. The authors argue that "the more horizontal and vertical areas are covered by a benchmark, the more representative it will be, and hence it is more appropriate to measure ASR progress" [2]. These challenges and related aspects are discussed in more detail in the following subsections.

2.3.3 ASR use scenarios

Ideally, a benchmark for ASR systems covers many ASR use cases. The best way to represent various usage scenarios is the creation of a comprehensive Speech dataset, either by merging existing datasets [73, 12] or by collecting new data to fill the gaps. Aksenova et al. [2] proposed a taxonomy of ASR use cases based on their experience developing an ASR-based customer-facing product at Google. An overview of the challenges and differences in the use cases can be found in Tables 2.1 and 2.2, respectively.

Text dictation serves to enable the input of text into a digital device without manual typing. Typically, it involves relatively slow speech from a single speaker. As the user consciously interacts with a device, the speech is adjusted to maximize the chance of correct understanding [18]. Typical applications include general-purpose dictation on desktop, mobile, and portable devices, medical records transcription [78, 87], legal proceedings transcription [41, 23], language learning with computer-aided pronunciation feedback [82, 119], and speech-to-speech translation [134].

Voice search and control allow individuals to retrieve information or perform tasks through verbal commands. Speech patterns have human-to-device interaction characteristics and often contain specific nouns required to perform the task, for example, to navigate to a location of interest or play a song on a streaming service. Another example is interactive voice response (IVR) applications, where individuals contacting customer service engage with a voice-operated chatbot. This chatbot can either assist in collecting data before transferring the call or be capable of addressing the problems on its own [86].

Voicemails, oration, and audiobooks scenarios include using the ASR system to provide transcription for voicemail messages [48, 5], parliamentary speeches [65, 66, 143, 35, 107, 57, 62, 76, 51, 67, 133], and audiobooks [115, 100]. In these scenarios, speech typically originates from a single speaker.
Spontaneity artifacts such as hesitations, fillers, back-channel speech, disfluencies, false starts, and corrections are present [37, 84]. In the case of audiobooks, human-to-human speech features are less prevalent [50].

Conversations and meetings scenarios typically involve transcribing spontaneous speech among several participants within a single audio recording. As with voicemails, oration, and audiobooks, this type of speech is considered human-to-human speech. The presence of noise, overlapping speech, and distant speech adds to the challenge of recognizing spontaneous speech [54]. Practical applications include the transcription of video meeti