Nowe metody i zbiory danych do inteligentnego przetwarzania dokumentów

Jurkiewicz, Dawid

Files

PhD_Thesis_Dawid_Jurkiewicz.pdf (19.79 MB)

Date

2024

Authors

Jurkiewicz, Dawid

Advisor

Graliński, Filip (supervisor)

Title Alternative

Novel Methods and Datasets for Intelligent Document Processing

Abstract

This thesis aims to contribute innovative solutions and datasets to the Intelligent Document Processing (IDP) domain. The focus is on two key areas within IDP: Span Identification (SI) and Document Understanding (DU). Significant emphasis is placed on the challenges posed by low-data scenarios, which are prevalent in various business use cases. To address them, a few-shot SI dataset and a unique approach to sub-sequence matching from few examples are proposed. Beyond the few-shot setting, methods for identifying and classifying propaganda spans are presented.

Furthermore, a multi-modal, end-to-end Transformer-based model for Document Understanding is introduced. The model efficiently comprehends the layout information, textual semantics, and visual cues present in a document and can answer various document-related questions posed in natural language. Additionally, the first DU benchmark is proposed, allowing the community to measure the state of the DU field accurately. Lastly, a challenging DU competition is showcased. Its task features novel question and answer type pairs over multi-domain, multi-industry, and multi-page documents, encouraging the development of solutions with strong generalization capabilities in low-data regimes.

Description

Faculty of Mathematics and Computer Science (Wydział Matematyki i Informatyki)

Keywords

natural language processing, machine learning, span identification, document understanding, information extraction

URI

https://hdl.handle.net/10593/27672

Collections

Doctoral theses 2010-2026 (restricted access, available from computers in the University Library)
Doctoral theses (WMiI)
