The Entropy-Gradient Inversion: A New Perspective on LLM Reasoning Capabilities
Junyao Yang
February 7, 2026
Abstract: While Chain-of-Thought (CoT) prompting and reasoning models (e.g., DeepSeek-R1, o1) have demonstrated remarkable capabilities, their internal mechanisms remain largely opaque. This study introduces a physics-inspired metric, the Nuclear Norm of Logit Gradients, to characterize long-chain reasoning. We discover that reasoning models exhibit a unique "fingerprint": a significant negative correlation between gradient strength and token entropy, the opposite of what traditional base models show. Furthermore, we find that this capability emerges rapidly within the first 200 steps of SFT, whereas pure Reinforcement Learning (cold start) exhibits early oscillation before convergence.
1. Background: The Entropy of "Thinking"
Before diving into the gradient analysis, it is crucial to understand the behavior of "thinking tokens" in modern reasoning models. Recent research provides two key insights:
Thinking Tokens are Information Peaks: According to "Demystifying Reasoning Dynamics with Mutual Information", reasoning steps—such as "Let", "Suppose", or "However"—often manifest as high-entropy peaks. These tokens represent moments where the model branches out to explore complex logical paths, distinguishing them from the low-entropy "regular" tokens used in standard text generation.
High-Entropy Minority Drives Learning: The study "Beyond the 80/20 Rule" highlights that while 80% of tokens are low-entropy, the "high-entropy minority" (the remaining 20%) are the critical drivers for effective Reinforcement Learning in reasoning tasks.
The Conflict: While high entropy is a hallmark of deep reasoning, our study reveals a surprising twist: for reasoning models, these high-entropy tokens are associated with low gradient influence (negative correlation). This suggests that reasoning models have learned to be "structurally stable" even when facing high uncertainty.
For each generated token $t_i$, we compute its gradient influence by backpropagating from the target logit. Given the input sequence, the model produces logits over the vocabulary for the next token; we select the logit $z_i[t_i]$ corresponding to the token that was actually generated and backpropagate it through the network, yielding the gradients with respect to the model weights (the projection-layer gradient matrices analyzed in Section 3).
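As a concrete illustration, the sketch below shows one way to obtain such a per-token gradient with a Hugging Face causal LM: forward the full sequence, pick the logit of the token actually generated at a given position, and call backward() on that scalar. The model id, function name, and the choice of which parameters to collect are illustrative, not the authors' exact implementation.

```python
# Minimal sketch (illustrative model id and bookkeeping, not the authors' exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder; the study uses Llama/Qwen models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

def logit_gradients(text: str, position: int):
    """Backpropagate from the logit of the token generated at `position` (>= 1)
    and collect gradients of the attention projection weights (Q, K, V, O)."""
    ids = tok(text, return_tensors="pt").input_ids
    model.zero_grad()
    logits = model(ids).logits                   # [1, seq_len, vocab]
    target_token = ids[0, position]              # the token that was actually generated
    target_logit = logits[0, position - 1, target_token]
    target_logit.backward()                      # fills .grad on every weight matrix
    return {
        name: p.grad.detach().clone()
        for name, p in model.named_parameters()
        if any(k in name for k in ("q_proj", "k_proj", "v_proj", "o_proj"))
    }
```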
For each position $i$ in the sequence, we compute the entropy of the model's predictive distribution over the vocabulary. Given the context $x_{1:i-1}$, the model outputs logits $z_i$, which are converted to probabilities via softmax, $p_i = \operatorname{softmax}(z_i)$, and the entropy is
$$H_i = -\sum_{v \in \mathcal{V}} p_i(v) \log p_i(v).$$
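This is standard softmax entropy; the minimal sketch below (names illustrative) makes the computation explicit.

```python
# Minimal sketch: per-position predictive entropy from a sequence of logits.
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """logits: [seq_len, vocab_size] -> H_i = -sum_v p_i(v) log p_i(v) per position."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)   # [seq_len]
```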
The gradient-entropy relationship is quantified with the Spearman rank correlation between the mean gradient influence $\bar{I}_t$ and the entropy $H_t$ of each token:
$$\rho = 1 - \frac{6 \sum_t d_t^2}{n(n^2 - 1)},$$
where $d_t = \operatorname{rank}(\bar{I}_t) - \operatorname{rank}(H_t)$ and $n$ is the number of unique tokens. A negative correlation ($\rho < 0$) indicates that high-entropy tokens have low gradient influence: the signature of reasoning models.
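In practice the rank correlation can be delegated to scipy, which also handles ties; the sketch below assumes that per-token gradient-strength scores (e.g., the nuclear norms introduced in Section 3) and entropies have already been computed.

```python
# Minimal sketch: Spearman correlation between gradient strength and entropy.
from scipy.stats import spearmanr

def gradient_entropy_correlation(grad_strengths, entropies) -> float:
    rho, _p_value = spearmanr(grad_strengths, entropies)
    return rho   # rho < 0 is the reasoning-model signature discussed above
```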
To study how reasoning capabilities emerge during training, we use the R1 pipeline consisting of two stages: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).
SFT trains the model to imitate high-quality reasoning trajectories. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}$ of prompt-response pairs, the model minimizes the negative log-likelihood of the response tokens:
$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}).$$
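In code, this objective is token-level cross-entropy over the response. The sketch below follows the common Hugging Face convention of masking prompt positions with -100, which may differ in detail from the authors' training setup.

```python
# Minimal sketch of the SFT objective (assumes a Hugging Face-style model output).
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """input_ids: prompt + response tokens; labels: the same ids with prompt
    positions set to -100 so that only response tokens contribute to the loss."""
    logits = model(input_ids).logits[:, :-1, :]   # position t predicts token t+1
    targets = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,                        # skip masked prompt positions
    )
```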
We use the open-r1/Mixture-of-Thoughts dataset for SFT training (8000 steps), saving checkpoints at regular intervals to track the evolution of the gradient-entropy correlation.
GRPO is a reinforcement learning algorithm that eliminates the need for a separate critic model by using group-based advantage estimation. The algorithm proceeds as follows:
Group Sampling: For each prompt $q$, sample $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$.
Reward Computation: Evaluate each output using reward functions (accuracy, format).
Advantage Estimation: Compute normalized advantages using group statistics.
Policy Update: Update parameters using the clipped surrogate objective.
GRPO Objective: The policy is optimized by maximizing the clipped surrogate objective with a KL divergence penalty. Let $\rho_{i,t}$ denote the probability ratio
$$\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})},$$
and let $\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}$ be the group-normalized advantage. The objective is
$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\big(\rho_{i,t}\,\hat{A}_{i,t},\ \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_{i,t}\big)\; -\; \beta\, \mathbb{D}_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\right].$$
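The group-relative advantage and the clipped surrogate are easy to state in code; the sketch below omits the KL penalty term and uses illustrative tensor shapes and a default clipping value, so it should be read as a schematic of the objective rather than a faithful GRPO trainer.

```python
# Minimal sketch of the GRPO surrogate (KL penalty omitted; shapes illustrative).
import torch

def grpo_loss(logp_new, logp_old, rewards, eps=0.2):
    """logp_new, logp_old: [G, T] per-token log-probs under the current / old policy.
    rewards: [G] scalar reward for each sampled output in the group."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group-normalized advantage
    adv = adv.unsqueeze(1)                                      # broadcast to every token
    ratio = (logp_new - logp_old).exp()                         # rho_{i,t}
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    return -surrogate.mean()                                    # minimize the negation
```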
3. Core Hypothesis: The Gradient-Entropy Inversion
We analyze the attention projection layers (Q, K, V, O) and, for each layer $X \in \{Q, K, V, O\}$ and token position $i$, take the gradient matrix $G_{X,i}$ obtained from this backpropagation. To quantify the "strength" of the gradient, we introduce its Nuclear Norm $s_{X,i}$, the sum of its singular values:
$$s_{X,i} = \|G_{X,i}\|_{*} = \sum_{k} \sigma_k(G_{X,i}).$$
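Computing the nuclear norm itself is a one-liner; the sketch below applies it to a gradient matrix such as those collected by the earlier logit_gradients sketch.

```python
# Minimal sketch: nuclear norm (sum of singular values) of a gradient matrix.
import torch

def nuclear_norm(grad_matrix: torch.Tensor) -> float:
    # equivalently: torch.linalg.matrix_norm(grad_matrix, ord="nuc").item()
    return torch.linalg.svdvals(grad_matrix.float()).sum().item()
```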
Before presenting our experimental results, we first introduce the complete DeepSeek-R1 training pipeline that we follow to train reasoning models and track the evolution of gradient-entropy correlation.
Pipeline Overview: The complete DeepSeek-R1 training pipeline. Starting from a base model, the pipeline alternates between SFT and RL stages. Our experiments track checkpoints at each stage to measure the gradient-entropy correlation evolution.
Tests across different datasets (Base, Safety, and Reasoning samples) confirm that whenever a model possesses reasoning capabilities, the logit gradient nuclear norm correlates negatively with token entropy.
Key Data:
For the Llama Reasoning model, the Spearman correlation reaches -0.649, whereas for the Base model it is merely 0.036.
Figure 1: Spearman correlation between logit gradient nuclear norm and token entropy across different model types. Reasoning models exhibit strong negative correlation (-0.649), while Base and Safe models show weak or positive correlation.
5. SFT Dynamics: The 200-Step Phase Transition
Base Model (Llama/Qwen) → SFT Training (8000 steps) → Checkpoints (200, 400, ..., 2000) → Measure ρ(G, H) (gradient-entropy correlation)
Pipeline 5.1: SFT experiment pipeline. We train base models with supervised fine-tuning on open-r1/Mixture-of-Thoughts and save checkpoints every 200 steps to track correlation changes.
Using the open-r1/Mixture-of-Thoughts dataset, we tracked the correlation every 200 steps over 8000 steps of SFT.
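The checkpoint sweep itself is a simple loop; in the sketch below the checkpoint path pattern and the compute_correlation helper (standing in for the gradient-entropy measurement described earlier) are hypothetical.

```python
# Minimal sketch of the checkpoint sweep (paths and helper are hypothetical).
from transformers import AutoModelForCausalLM

correlations = {}
for step in range(200, 8001, 200):
    ckpt = f"outputs/sft/checkpoint-{step}"               # illustrative path pattern
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    correlations[step] = compute_correlation(model)       # Spearman rho(G, H) on eval prompts
```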
Discovery: The "Phase Transition" happens incredibly fast. Within the first 200 steps of SFT, the Llama model's correlation drops from 0.036 to -0.468. This suggests the structural pattern of reasoning is learned early.
Figure 2: Evolution of Spearman correlation during SFT training (first 2000 steps). The correlation drops sharply within the first 200 steps, indicating rapid acquisition of reasoning structure.
Pipeline 6.1: RL-Zero (Cold Start) experiment pipeline. Following DeepSeek-R1-Zero, we apply GRPO directly to base models without SFT warmup, using open-r1/OpenThoughts-114k-math with accuracy and format rewards.
Pure Reinforcement Learning (RL-Zero) with open-r1/OpenThoughts-114k-math for 8000 steps shows a different trajectory:
Early Oscillation: In the first 300 steps, the models are unstable. Llama's correlation climbs back toward zero, reaching -0.318 at step 300, before dropping again.
Convergence: Despite early volatility, RL eventually converges to the same strong negative correlation (Llama RL-8000 reaches -0.649).
Figure 3: Evolution of Spearman correlation during RL-Zero training (first 1000 steps). Both models show early oscillation before stabilizing, with Llama exhibiting notable fluctuation around step 300.
7. Full Training Trajectory: SFT + RL (8000 Steps Each)
Base Model (Llama/Qwen) → SFT Stage (0 → 8000 steps) → RL Stage, GRPO (8000 → 16000 steps) → Final: ρ = -0.649 (strong negative correlation)
Pipeline 7.1: Complete R1 training pipeline combining SFT and RL stages. This follows the DeepSeek-R1 methodology: first SFT on reasoning data, then RL with GRPO to further enhance reasoning capabilities.
The full R1 pipeline consists of SFT (8000 steps) followed by RL (8000 steps). Key observations:
SFT Phase: Llama reaches -0.606 at checkpoint-8000, Qwen reaches -0.494.
RL Phase: Both models continue to strengthen, reaching -0.649 at RL-8000.
Figure 4: Complete training trajectory of the R1 pipeline: SFT (steps 0-8000) followed by RL (steps 8000-16000). Both Llama and Qwen converge to strong negative correlation (-0.649) by the end of training.
Acknowledgments
This work was completed by Junyao Yang during his internship at Shanghai Artificial Intelligence Laboratory. We would like to express our sincere gratitude to Dongrui Liu and Chen Qian, who served as Junyao Yang's mentors at SHAILAB, for their invaluable guidance, insightful discussions, and continuous support throughout this project.
[1] Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, Jing Shao. "Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning." arXiv preprint arXiv:2506.02867, 2025. https://arxiv.org/abs/2506.02867
[2] Shenzhi Wang, Le Yu, Chang Gao, et al. "Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning." arXiv preprint arXiv:2506.01939, 2025. https://arxiv.org/abs/2506.01939
[3] Ming Li, Yanhong Li, Tianyi Zhou. "What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective." arXiv preprint arXiv:2410.23743, 2025. https://arxiv.org/abs/2410.23743
[4] Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv preprint arXiv:2402.03300, 2024. https://arxiv.org/abs/2402.03300
[5] Daya Guo, Dejian Yang, Haowei Zhang, et al. "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning." Nature, 645(8081), 633–638, 2025. https://doi.org/10.1038/s41586-025-09422-z
Source: Explainable Reasoning Capability in a Token Logit Gradient Perspective