Retrieval-Augmented Generation (RAG) systems typically use fixed retrieval strategies, retrieving the same number of documents regardless of question complexity or model capacity. We present Adaptive RAG, a reinforcement learning framework that learns when and how much to retrieve based on question difficulty and model confidence. Using Deep Q-Learning across 7 models (3.8B–120B parameters) on 199,847 questions from 9 QA datasets, we show Adaptive RAG achieves 3.2–6.5% higher accuracy while using 14–37% fewer retrievals than fixed baselines. Efficiency gains grow with model scale: 120B models show a 37% retrieval reduction versus 14% for 4B models. High retrieval variance (σ > 0.95) confirms genuine adaptive behavior.
Traditional RAG systems apply a fixed retrieval strategy regardless of query difficulty. This leads to unnecessary computation for simple queries and insufficient reasoning for complex ones. Adaptive RAG introduces dynamic routing to optimize both performance and efficiency.
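The abstract frames the retrieval decision as Deep Q-Learning. As a toy illustration, the sketch below substitutes a tabular Q-learner under assumed definitions: the state is a discretized difficulty bucket, the actions are retrieval counts 0–4, and the reward trades an accuracy proxy against a per-retrieval cost. These specifics are illustrative assumptions, not the paper's actual state, action, or reward design.

```python
import numpy as np

N_STATES, N_ACTIONS = 3, 5          # difficulty buckets x retrieval counts 0-4
ALPHA, EPS = 0.1, 0.2               # learning rate, exploration rate
rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))

def reward(state: int, n_retrievals: int) -> float:
    # Assumed reward: the answer is correct iff enough evidence was fetched,
    # minus a small cost per retrieval (one-step episodes, so no discounting).
    needed = state + 1
    correct = 1.0 if n_retrievals >= needed else 0.0
    return correct - 0.05 * n_retrievals

# Epsilon-greedy one-step Q-learning over simulated questions.
for _ in range(20000):
    s = int(rng.integers(N_STATES))
    a = int(rng.integers(N_ACTIONS)) if rng.random() < EPS else int(Q[s].argmax())
    Q[s, a] += ALPHA * (reward(s, a) - Q[s, a])

policy = Q.argmax(axis=1)           # learned retrieval count per difficulty bucket
```

In this toy setting the policy converges to fetching just enough evidence per bucket, mirroring the adaptive retrieve-more-when-harder behavior the paper reports.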
Adaptive RAG introduces a dynamic retrieval-routing mechanism that selects the most appropriate reasoning strategy based on query complexity and information requirements. The pipeline begins with a lightweight query analyzer that classifies incoming queries into categories such as simple lookup, multi-hop reasoning, or ambiguous/underspecified queries.
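The analyzer itself is not specified here; a minimal heuristic stand-in might look like the following, where the cue phrases and word-count threshold are illustrative assumptions, not the paper's learned classifier:

```python
def classify_query(query: str) -> str:
    """Heuristic stand-in for the learned query analyzer (illustrative only)."""
    q = query.lower()
    # Comparative or compositional phrasing often signals multi-hop reasoning.
    multi_hop_cues = ("compare", "difference between", "both", "before or after")
    if any(cue in q for cue in multi_hop_cues):
        return "multi_hop"
    # Very short questions tend to be underspecified or ambiguous.
    if len(q.split()) < 4:
        return "ambiguous"
    return "simple_lookup"
```

For example, `classify_query("What is the capital of France?")` returns `"simple_lookup"`, while `classify_query("Who won it?")` is flagged as `"ambiguous"`.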
Based on this classification, a controller module dynamically routes the query through one of multiple execution paths. Simple queries are handled using single-pass retrieval with minimal latency, while complex queries trigger multi-hop retrieval, where evidence is iteratively fetched and refined across multiple steps. For ambiguous or underspecified queries, the system performs iterative retrieval-generation loops to progressively refine context.
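The three execution paths can be sketched as a dispatcher. The `retrieve` and `generate` callables, retrieval depths, and hop limits below are placeholders for whatever retriever and LLM back the system, not the paper's implementation:

```python
def run_pipeline(query: str, route: str, retrieve, generate, max_hops: int = 4) -> str:
    """Dispatch a classified query down one of three execution paths (sketch)."""
    if route == "simple_lookup":
        # Single-pass retrieval: one round trip, minimal latency.
        return generate(query, retrieve(query, k=3))
    if route == "multi_hop":
        # Iteratively fetch evidence, refining the search query each hop.
        docs, hop_query = [], query
        for _ in range(max_hops):
            docs.extend(retrieve(hop_query, k=2))
            # Ask the generator for the next sub-question given evidence so far.
            hop_query = generate("next sub-question for: " + query, docs)
        return generate(query, docs)
    # Ambiguous: alternate retrieval and generation until the draft stabilizes.
    draft = ""
    for _ in range(max_hops):
        docs = retrieve((query + " " + draft).strip(), k=2)
        new_draft = generate(query, docs)
        if new_draft == draft:   # converged: fresh context no longer changes the answer
            break
        draft = new_draft
    return draft
```

Keeping the paths behind one dispatcher lets the controller swap strategies per query without the caller knowing which path ran.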
The retrieved documents are then integrated into the generation phase using context-aware prompting, ensuring that responses remain grounded in external knowledge. Additionally, the framework incorporates relevance scoring and context pruning to filter noisy information and maintain efficiency. This adaptive design enables a balance between accuracy, latency, and computational cost.
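Relevance scoring and context pruning can be sketched with a simple bag-of-words overlap score standing in for whatever learned scorer the system actually uses; the threshold and document budget below are illustrative:

```python
def prune_context(query: str, docs: list[str], max_docs: int = 3,
                  min_score: float = 0.1) -> list[str]:
    """Keep only the documents most relevant to the query (sketch).

    Token overlap is a crude stand-in for a learned relevance model.
    """
    q_tokens = set(query.lower().split())

    def score(doc: str) -> float:
        d_tokens = set(doc.lower().split())
        # Fraction of the document's tokens that also appear in the query.
        return len(q_tokens & d_tokens) / max(len(d_tokens), 1)

    ranked = sorted(docs, key=score, reverse=True)
    # Drop noisy documents below the relevance floor, then cap context size.
    return [d for d in ranked if score(d) >= min_score][:max_docs]
```

Pruning before generation keeps the prompt short and grounded: off-topic passages are discarded rather than competing for the model's attention.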
Adaptive RAG consistently outperforms fixed retrieval strategies across models, achieving a superior balance between accuracy and efficiency. Accuracy improvements over the commonly used Fixed-3 baseline range from +3.2% (large models) to +5.3% (small models), while the number of retrievals drops by up to 37%. For example, on Qwen3-4B, Adaptive RAG achieves 0.764 accuracy with only 2.58 retrievals, outperforming all fixed strategies. Similar gains are observed across scales, including large models such as GPT-OSS-120B, which reaches 0.874 accuracy with just 1.90 retrievals on average.
The method demonstrates genuinely adaptive behavior: retrieval variance (σ = 0.95–1.52) confirms that the system adjusts its retrieval budget to question difficulty. Simple queries use as few as 1.4–1.9 retrievals, while harder queries trigger up to 4, yielding accuracy gains of up to 5.1%. Scaling analysis further shows that larger models require fewer retrievals thanks to stronger parametric knowledge, so Adaptive RAG becomes increasingly efficient as model size grows.