2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)

Abstract

Thanks to the rapid progress of digitalization, Visually-Rich Documents (VRDs) such as PDF files and scanned documents have become among the most widespread sources of knowledge. However, Question Answering on VRDs is challenged by the presence of multi-page relationships between document elements such as tables, figures, and sections. This paper addresses a specific Visual Question Answering subtask on VRDs in which answer generation leverages pairwise element relations in multi-page documents. We explore the performance of text-only and multimodal Transformer-based architectures as well as open-source Large Language Models. The results show that multimodal Transformers outperform the other tested methods, particularly when training samples contain explicit textual references to the elements in the document layout.
