StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-Time Hybrid Information Structurization

Retrieval-augmented generation (RAG) is a key means of enhancing large language models (LLMs) in many knowledge-based tasks. However, existing RAG methods struggle with knowledge-intensive reasoning tasks, because the useful information these tasks require is badly scattered across documents. This makes it difficult for existing RAG methods to accurately identify key information and perform global reasoning over such noisy augmentation.

Recent work has explored having LLMs mimic human-like thinking by transforming scattered information into various structured formats during inference, so that it better serves knowledge-intensive reasoning tasks.

Motivated by this, researchers have proposed StructRAG, which employs a hybrid information structuring mechanism to construct and utilize structured knowledge in the most suitable format based on task requirements.

The StructRAG framework consists of three sequential modules aimed at effectively processing and utilizing structured knowledge.

1. Hybrid Structure Router: This module identifies the most suitable structure type for a given task based on the question and document information.
2. LLM-based Scattered Knowledge Structurizer: This component converts raw documents into structured knowledge, leveraging strong comprehension and generation capabilities.
3. Structured Knowledge Utilizer: This final module handles complex questions by decomposing them and extracting relevant knowledge, enhancing the accuracy of the final answer.

Together, these modules optimize the use of structured knowledge for knowledge-intensive reasoning tasks.
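The flow through the three modules can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: the function names (route, structurize, utilize, struct_rag), the structure type labels ("table", "graph", "chunk"), and the keyword and word-overlap heuristics are all assumptions that stand in for the trained router and the LLM calls.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StructuredKnowledge:
    structure_type: str   # label chosen by the router (illustrative names only)
    items: List[str]      # knowledge units in the chosen structure


def route(question: str) -> str:
    """Stand-in for the hybrid structure router.

    A keyword heuristic is used here purely for illustration; the text
    describes a trained router, not rules like these.
    """
    q = question.lower()
    if "compare" in q or "highest" in q:
        return "table"
    if "related" in q or "connect" in q:
        return "graph"
    return "chunk"


def structurize(structure_type: str, documents: List[str]) -> StructuredKnowledge:
    """Stand-in for the LLM-based structurizer: one item per non-empty sentence."""
    items = [s.strip() for doc in documents for s in doc.split(".") if s.strip()]
    return StructuredKnowledge(structure_type, items)


def utilize(question: str, knowledge: StructuredKnowledge) -> List[str]:
    """Stand-in for the utilizer.

    Decomposes the question on " and ", then keeps every knowledge item
    that shares at least one word with any sub-question.
    """
    sub_questions = [p.strip() for p in question.replace("?", "").split(" and ") if p.strip()]
    relevant = []
    for item in knowledge.items:
        item_words = set(item.lower().split())
        if any(item_words & set(sq.lower().split()) for sq in sub_questions):
            relevant.append(item)
    return relevant


def struct_rag(question: str, documents: List[str]) -> Tuple[str, List[str]]:
    """Run the three modules in sequence: route, structurize, utilize."""
    structure_type = route(question)
    knowledge = structurize(structure_type, documents)
    evidence = utilize(question, knowledge)
    return structure_type, evidence


if __name__ == "__main__":
    docs = ["Acme revenue was 10. The weather is sunny", "Globex revenue was 12."]
    print(struct_rag("Compare revenue of Acme and Globex?", docs))
```

In a real system each stand-in would be replaced by a model call; the point of the sketch is only the ordering, in which the structure type is decided first and then conditions how documents are converted and queried.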

Furthermore, to obtain a high-performance hybrid structure router, training data are constructed through a synthesizing-simulating-judging pipeline, and the router is then trained with preference training via the DPO (Direct Preference Optimization) algorithm. Experiments on extensive knowledge-intensive reasoning tasks demonstrate that StructRAG is an effective solution: it achieves state-of-the-art performance, with large improvements in scenarios where information is badly scattered.
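For readers unfamiliar with DPO, the per-example objective can be sketched as below. This shows the standard DPO loss for a single preference pair, assuming the sequence log-probabilities have already been computed; how StructRAG applies it (for instance, treating a preferred structure type as the chosen response and a less suitable one as the rejected response) is an assumption here, and the function name, argument names, and beta value are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) preference pair.

    Each argument is the log-probability of a full response under either
    the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for either sign of margin.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy raises the chosen response's probability relative to the rejected one, measured against the reference model, which is what lets preference pairs train the router without a separate reward model.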

Paper: StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-Time Hybrid Information Structurization