RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners

Abstract

When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1–8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families—Qwen2.5 (1.5B/3B/7B) and Llama3 (1B/3B/8B)—RSAT improves faithfulness 3.7× over SFT alone (0.224→0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses to below 13% format success, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.

Overview

RSAT addresses a critical gap in table reasoning with LLMs: models often produce correct answers but fail to justify them with accurate evidence. Our approach enforces structured outputs consisting of step-by-step reasoning paired with explicit cell references, enabling verifiable and interpretable predictions.

Methodology

RSAT follows a two-phase training paradigm designed to enforce both structured reasoning and faithful attribution. In the first phase, Supervised Fine-Tuning (SFT) is performed on a curated dataset of high-quality reasoning traces, where each example contains step-by-step intermediate reasoning along with explicit cell-level citations. This phase teaches the model the required output format, including JSON-style structured responses, citation grounding, and logical decomposition.
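To make the format concrete, below is an illustrative sketch of the kind of structured output the SFT phase teaches. The field names and the row/column citation scheme are assumptions for illustration, not the paper's exact schema.

```python
import json

# Illustrative example of a structured RSAT-style output: each reasoning
# step carries explicit cell-level citations into the source table.
# Field names ("steps", "citations", row/col indices) are assumed here,
# not taken from the paper.
example_trace = {
    "steps": [
        {
            "reasoning": "Revenue in 2021 was 4.2M (row 'Revenue', column '2021').",
            "citations": [{"row": 3, "col": 2}],
        },
        {
            "reasoning": "4.2M exceeds the 2020 value of 3.8M, so 2021 is higher.",
            "citations": [{"row": 3, "col": 1}, {"row": 3, "col": 2}],
        },
    ],
    "answer": "2021",
}

# The SFT target is the serialized JSON string of such a trace.
print(json.dumps(example_trace, indent=2))
```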

In the second phase, the model is optimized using Group Relative Policy Optimization (GRPO), a reinforcement learning method that removes the need for a critic model and enables direct scoring of generated outputs. For each query, multiple candidate reasoning traces are sampled, and a multi-component reward function evaluates them based on: answer correctness, citation precision, step-level attribution alignment, and format validity.
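As a rough sketch of how such a multi-component reward could be scored, the snippet below combines the four components named above. The weights, the exact-match stand-in for answer quality, and the `nli_entails` callable (an external NLI model scoring whether the cited cells entail a reasoning step) are all assumptions; the paper's exact formulation may differ.

```python
import json
from typing import Callable

def composite_reward(
    raw_output: str,
    table: list[list[str]],
    gold_answer: str,
    nli_entails: Callable[[str, str], float],  # (premise, hypothesis) -> score in [0, 1]
) -> float:
    """Sketch of a multi-component reward; weights are illustrative."""
    try:
        trace = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # format validity fails outright: unparseable output earns nothing
    r_format = 1.0

    steps = trace.get("steps", [])
    all_cites = [c for s in steps for c in s.get("citations", [])]

    def in_table(c: dict) -> bool:
        return 0 <= c.get("row", -1) < len(table) and 0 <= c.get("col", -1) < len(table[0])

    # Citation precision: fraction of citations that name real cells.
    r_cite = sum(in_table(c) for c in all_cites) / len(all_cites) if all_cites else 0.0

    # Step-level attribution alignment: average NLI entailment of each
    # step's text from the text of the cells it cites.
    scores = []
    for s in steps:
        premise = " ; ".join(
            table[c["row"]][c["col"]] for c in s.get("citations", []) if in_table(c)
        )
        scores.append(nli_entails(premise, s.get("reasoning", "")))
    r_faith = sum(scores) / len(scores) if scores else 0.0

    # Answer correctness: exact match used here as a stand-in for F1.
    r_answer = float(str(trace.get("answer", "")).strip().lower()
                     == gold_answer.strip().lower())

    return 0.35 * r_answer + 0.15 * r_cite + 0.40 * r_faith + 0.10 * r_format
```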

The optimization objective encourages the model to prefer outputs that are not only correct but also fully grounded in evidence. By combining structured supervision with reward-driven refinement, RSAT effectively learns to produce faithful, interpretable reasoning traces without increasing model size.
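For reference, the group-relative scoring at the heart of GRPO, which is what removes the need for a critic, reduces to normalizing each sampled trace's reward against the other candidates for the same query. A minimal sketch (the policy-gradient update itself, with its clipped ratio and KL penalty, is omitted):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Each sampled trace's advantage is its reward's z-score within the
    group of candidates for the same query, so no learned value model
    (critic) is required."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# e.g., four candidate traces scored by the composite reward:
print(group_relative_advantages([0.9, 0.4, 0.7, 0.2]))
```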

Results

RSAT achieves consistent improvements across all evaluated models and metrics, with the most significant gains in attribution faithfulness. Averaged across six models, RSAT reaches a faithfulness score of 0.826, compared to 0.224 for SFT-only—representing a 3.7× improvement. Importantly, this gain does not come at the cost of answer quality: Answer F1 improves by +0.09 on average. Structural metrics such as citation validity and format success remain near-perfect (>0.97), confirming that the GRPO stage enhances grounding while preserving the structured reasoning learned during SFT.

Further analysis shows that SFT primarily contributes structural correctness (e.g., +0.61 format gain), while GRPO drives faithfulness improvements (+0.60). Scaling experiments reveal that faithfulness improves with model size, with smaller models (e.g., Qwen2.5 1.5B) already achieving strong grounding (0.847) and larger models approaching saturation (>0.97). In contrast, post-hoc attribution methods collapse, averaging only 12.7% format success, highlighting the importance of generating citations during reasoning rather than retrofitting them afterward.

Project & Paper Links

Project Page | View Paper