
Embedding Layout in Text for Document Understanding Using Large Language Models

EasyChair Preprint no. 12130

14 pages. Date: February 15, 2024


In this paper, we address the challenge of effectively using Large Language Models (LLMs) for Visually Rich Document Understanding (VRDU), a key part of intelligent document processing systems. While LLMs excel at many Natural Language Processing (NLP) tasks, their usefulness for extracting information from complex structured documents such as invoices and forms is limited. The limitation stems from the difficulty of understanding these documents in context, largely because plain text input lacks layout information. We aim to unlock the potential of LLMs for VRDU by integrating OCR data into an HTML format, which preserves the spatial layout needed for accurate information extraction. The empirical results show a notable improvement of more than 20 percent over baseline performance. This research highlights the potential of LLMs in VRDU and sets the stage for further innovations in automated document processing.
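The idea of carrying layout through HTML can be illustrated with a minimal sketch. It is not the authors' method, which this page does not describe in detail. It assumes OCR output arrives as words with pixel bounding boxes, and every name here (`Word`, `group_rows`, `split_cells`, `ocr_to_html`) and both thresholds (`y_tol`, `gap`) are hypothetical choices made for illustration. Words on roughly the same baseline become a table row, and wide horizontal gaps become cell boundaries.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Word:
    """One OCR token with its bounding box (assumed pixel coordinates)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_rows(words, y_tol=5.0):
    """Group words into rows: a word joins the current row if its vertical
    center is within y_tol of the center of that row's first word."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        cy = (w.y0 + w.y1) / 2
        if rows and abs(cy - rows[-1]["cy"]) <= y_tol:
            rows[-1]["words"].append(w)
        else:
            rows.append({"cy": cy, "words": [w]})
    # Within each row, read left to right.
    return [sorted(r["words"], key=lambda w: w.x0) for r in rows]


def split_cells(row, gap=20.0):
    """Merge neighbouring words into one cell unless a horizontal gap
    wider than `gap` separates them."""
    cells = [[row[0]]]
    for prev, cur in zip(row, row[1:]):
        if cur.x0 - prev.x1 > gap:
            cells.append([cur])
        else:
            cells[-1].append(cur)
    return [" ".join(w.text for w in cell) for cell in cells]


def ocr_to_html(words, y_tol=5.0, gap=20.0):
    """Render OCR words as an HTML table that mirrors their 2D arrangement."""
    lines = ["<table>"]
    for row in group_rows(words, y_tol):
        tds = "".join(f"<td>{escape(c)}</td>" for c in split_cells(row, gap))
        lines.append(f"  <tr>{tds}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Word("Invoice", 10, 10, 70, 22),
        Word("No.", 75, 10, 100, 22),
        Word("12345", 200, 11, 250, 23),
        Word("Total", 10, 40, 50, 52),
        Word("$99", 200, 40, 230, 52),
    ]
    print(ocr_to_html(sample))
```

The resulting string could then be placed in an LLM prompt in place of the flat OCR text, so that key/value pairs that sit side by side on the page stay side by side in the input. Real documents would need more careful handling of skew, multi-line cells, and column alignment than this sketch attempts.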

Keyphrases: document understanding, information extraction, large language model

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:12130,
  author = {Mohammad Minouei and Mohammad Reza Soheili and Didier Stricker},
  title = {Embedding Layout in Text for Document Understanding Using Large Language Models},
  howpublished = {EasyChair Preprint no. 12130},
  year = {EasyChair, 2024}}