
Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

Retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. We match AIME25 accuracy while producing much shorter solutions without explicit length penalties.
MBZUAI · École Polytechnique

Abstract

Large language models trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out “easy” problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, so the model learns to conflate “thinking longer” with “thinking better”. We show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer: rewards become associated with concise solutions early in training, preventing runaway verbosity without any explicit length penalty. RLVR experiments with this recipe on Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507 match baseline pass@1 accuracy on the AIME25 benchmark while generating solutions that are, on average, roughly half as long.

Key Ideas

Easy samples provide a stable positive reward signal associated with short, correct reasoning traces. This regularizes length implicitly while preserving accuracy.

We retain problems with non-trivial success probability and train with a fixed context budget. This shifts reward toward concise solutions without any explicit length penalty.
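
The selection step can be pictured as follows. This is a minimal sketch, assuming per-problem pass rates are estimated by sampling the current policy; the band (0.5, 0.9), the up-weighting factor, and the function names are illustrative assumptions, not the exact recipe.

```python
import random

def select_and_weight(problems, pass_rates, easy_band=(0.5, 0.9), easy_weight=2.0):
    """Retain problems with non-trivial success probability and modestly
    up-weight the moderately easy ones (illustrative thresholds).

    problems   : list of problem records
    pass_rates : estimated pass@1 per problem under the current policy
    """
    kept = []
    for prob, p in zip(problems, pass_rates):
        if p == 0.0 or p == 1.0:
            continue  # drop problems the policy never or always solves
        weight = easy_weight if easy_band[0] <= p <= easy_band[1] else 1.0
        kept.append({"problem": prob, "sampling_weight": weight})
    return kept

def sample_training_batch(weighted, batch_size):
    """Sample a rollout batch proportionally to the per-problem weights."""
    weights = [item["sampling_weight"] for item in weighted]
    return random.choices(weighted, weights=weights, k=batch_size)
```

Because moderately easy problems keep appearing in the batch, short correct traces keep earning reward, which is what supplies the implicit length pressure.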

Contributions

- Implicit length regularization by emphasizing moderately easy problems in RLVR.

- Empirical validation on Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507, matching baseline AIME25 accuracy with much shorter solutions.

- Public release of Frugal-Thinking models and artifacts.

Observations

We use a two-stage RLVR training recipe (sketched after this list):
- Stage 1 keeps moderately easy samples to encourage concise reasoning trajectories under a 16k token budget.
- Stage 2 applies curriculum RLVR on a filtered DeepMath subset to expand coverage while preserving brevity.
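
An illustrative view of the two stages as configuration dictionaries; the field names, dataset identifiers, and the assumption that Stage 2 keeps the same context budget are ours, not the released training configuration.

```python
# Illustrative two-stage RLVR schedule; names and values are assumptions.
STAGE_1 = {
    "data": "easy_retained_mix",      # keeps moderately easy samples
    "max_response_tokens": 16_000,    # fixed context budget from the text
    "easy_sample_upweight": 2.0,      # modest up-weighting of easy problems
    "length_penalty": None,           # no explicit length penalization
}

STAGE_2 = {
    "data": "deepmath_filtered",      # filtered DeepMath subset
    "curriculum": "easy_to_hard",     # expand coverage while preserving brevity
    "max_response_tokens": 16_000,    # assumed unchanged from Stage 1
    "length_penalty": None,
}
```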

Reasoning Performance Evaluation
Figure: Test-time scaling on AIME25. Pass@1 accuracy by context length (8k → 42k); each line traces accuracy across context budgets.
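
The evaluation behind this figure can be sketched as below. The `generate` and `check` interfaces, the sample count, and the exact budget grid are assumptions standing in for the real evaluation harness.

```python
def pass_at_1(model, problems, max_new_tokens, n_samples=16):
    """Estimate pass@1 at a given context budget by averaging
    per-problem success rates over n_samples generations."""
    per_problem = []
    for prob in problems:
        correct = 0
        for _ in range(n_samples):
            answer = model.generate(prob["question"], max_new_tokens=max_new_tokens)
            correct += int(prob["check"](answer))  # verifiable answer check
        per_problem.append(correct / n_samples)
    return sum(per_problem) / len(per_problem)

# Sweep the context budget from 8k to 42k tokens, as in the figure.
budgets = [8_000, 16_000, 24_000, 32_000, 42_000]
# curve = {b: pass_at_1(model, aime25_problems, max_new_tokens=b) for b in budgets}
```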

Training Dynamics


Stage 1: Early training is dominated by overly long, truncated generations with high entropy and low accuracy. As optimization progresses, average response length and clip ratio drop sharply, entropy stabilizes, and AIME25 validation accuracy rises steadily, indicating that concise reasoning and correctness improve together.
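
A minimal sketch of the per-batch logging that would produce these Stage 1 curves (mean response length, clip ratio, entropy proxy, accuracy); the batch record fields are assumed, not taken from the actual training code.

```python
import math

def batch_stats(responses, max_response_tokens=16_000):
    """Summarize one rollout batch for training-dynamics plots.

    responses: list of dicts with
      'tokens'      : generated token ids
      'token_probs' : per-token probabilities under the policy
      'correct'     : bool, verifier outcome
    """
    lengths = [len(r["tokens"]) for r in responses]
    clipped = [len(r["tokens"]) >= max_response_tokens for r in responses]
    entropies = [
        -sum(math.log(p) for p in r["token_probs"]) / max(len(r["token_probs"]), 1)
        for r in responses
    ]
    return {
        "mean_length": sum(lengths) / len(lengths),
        "clip_ratio": sum(clipped) / len(clipped),        # fraction truncated at the budget
        "mean_entropy": sum(entropies) / len(entropies),  # avg negative log-prob per token (entropy proxy)
        "accuracy": sum(r["correct"] for r in responses) / len(responses),
    }
```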

Stage 2: Curriculum RLVR preserves the same concise behavior while expanding coverage to harder problems. We observe the minimum response length rising to ~1.2k tokens due to increased difficulty, without reverting to the excessive verbosity seen early in Stage 1.