OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, because OCR predictions are imperfect and structured data is represented non-uniformly, these knowledge bases inevitably contain OCR noise.

To study this problem, researchers have introduced OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in the documents, which challenge existing OCR solutions used for RAG.

OHRBench and its evaluation protocol consist of three parts: (1) Benchmark Dataset: PDF documents are collected from six domains, human-verified ground truth structured data is extracted, and Q&As are generated from multimodal document elements. (2) RAG Knowledge Base: OCR Processed Structured Data is used to benchmark current OCR solutions, and Perturbed Structured Data is used to assess the impact of different OCR noise types. (3) Evaluation: the impact of OCR is measured on each component and on the overall RAG system.
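The idea of comparing a ground truth knowledge base against an OCR-derived one can be sketched in a few lines of Python. This is a toy illustration only, not OHRBench's code or metrics: the word-overlap retriever, the function names, and the two tiny knowledge bases are all assumptions made up for this example.

```python
# Toy sketch: measure how retrieval accuracy changes when the knowledge base
# comes from noisy OCR output instead of ground truth text.
# Everything here (retriever, data, names) is hypothetical.

def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def overlap_score(query, passage):
    """Fraction of query words that also appear in the passage."""
    q, p = set(tokenize(query)), set(tokenize(passage))
    return len(q & p) / len(q) if q else 0.0

def retrieve(query, kb):
    """Index of the best-scoring passage (ties go to the lowest index)."""
    return max(range(len(kb)), key=lambda i: overlap_score(query, kb[i]))

def retrieval_accuracy(questions, kb):
    """Share of questions whose top passage is the gold passage."""
    hits = sum(retrieve(q, kb) == gold for q, gold in questions)
    return hits / len(questions)

# Same two passages, once as clean text and once with simulated OCR errors.
ground_truth_kb = [
    "the warranty covers battery defects for two years",
    "net revenue rose 12 percent in 2023",
]
ocr_kb = [
    "the warranty covers battery defects for two years",
    "nct rcvcnuc rosc 12 pcrccnt in 2023",  # misrecognized characters
]

# (question, index of the gold passage)
questions = [
    ("how much did net revenue rise", 1),
    ("what does the warranty cover", 0),
]

clean_acc = retrieval_accuracy(questions, ground_truth_kb)
noisy_acc = retrieval_accuracy(questions, ocr_kb)
print(f"ground truth KB: {clean_acc:.2f}, OCR KB: {noisy_acc:.2f}")
```

The same comparison can in principle be repeated at the generation stage and end to end, which is the sense in which the impact of OCR cascades through the pipeline.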

To better understand OCR’s impact on RAG systems, the authors identify two primary types of OCR noise: Semantic Noise, which results from prediction errors, and Formatting Noise, which arises from the non-uniform representation of document elements. Semantic Noise has a significant impact, while Formatting Noise affects specific retrievers and LLMs differently. These findings offer valuable insights for developing RAG-tailored OCR solutions and noise-robust models.
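To make the two noise types concrete, here is a minimal Python sketch. It is not the perturbation procedure used in OHRBench. The character confusion table, the function names, and the choice of a Markdown-to-HTML table rewrite as the formatting example are assumptions made for illustration.

```python
import random

# Hypothetical confusion table: characters a recognizer might mix up.
CONFUSIONS = {"l": "1", "O": "0", "o": "0", "S": "5", "e": "c", "m": "rn"}

def add_semantic_noise(text, rate, seed=0):
    """Replace a fraction of confusable characters, changing the content itself."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)

def add_formatting_noise(markdown_table):
    """Rewrite a pipe table as HTML-like markup: same cells, different representation."""
    rows = []
    for line in markdown_table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if all(set(c) <= set("-: ") for c in cells):
            continue  # skip the header separator row
        rows.append("<tr>" + "".join(f"<td>{c}</td>" for c in cells) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"

table = "| Year | Sales |\n|---|---|\n| 2023 | 100 |"
print(add_semantic_noise("Sales lol", rate=1.0))
print(add_formatting_noise(table))
```

In the first case the words themselves are corrupted, so any downstream component sees wrong content. In the second case every cell value survives, and only the markup around it changes, which is why its effect can depend on how a particular retriever or LLM handles that markup.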

Furthermore, employing Vision-Language Models (VLMs) without OCR in RAG systems can be an effective alternative. The paper reports that a VLM can improve performance by up to 24.5% and approach the performance of the ground truth text baseline, indicating the promising potential of applying VLMs in RAG systems.

Paper: OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation