The Entropy-Gradient Inversion: A New Perspective on LLM Reasoning Capabilities

Junyao Yang
February 7, 2026
Abstract: While Chain-of-Thought (CoT) and reasoning models (e.g., DeepSeek-R1, o1) have demonstrated remarkable capabilities, their internal mechanisms remain largely opaque. This study introduces a physics-inspired metric, the Nuclear Norm of Logit Gradients, to characterize long-chain reasoning. We discover that reasoning models exhibit a unique "fingerprint": a significant negative correlation between gradient strength and token entropy, the opposite of the pattern shown by traditional base models. Furthermore, we find that this capability emerges rapidly within the first 200 steps of SFT, whereas pure RL (cold start) exhibits early oscillation before convergence.

1. Background: The Entropy of "Thinking"

Before diving into the gradient analysis, it is crucial to understand the behavior of "thinking tokens" in modern reasoning models. Two recent studies provide the key context: thinking tokens tend to coincide with information peaks along the reasoning trajectory [1], and a small minority of high-entropy tokens drives most of the effective learning during RL [2].

The Conflict: While high entropy is a hallmark of deep reasoning, the present study reveals a surprising twist: in reasoning models, these high-entropy tokens are associated with low gradient influence (a negative correlation). This suggests that reasoning models have learned to remain "structurally stable" even when facing high uncertainty.

2. Methodology

2.1 Computing Token Gradient Influence

For each generated token $t_i$, we compute the gradient influence by backpropagating from the target logit. Given an input sequence, the model produces logits for the next token. We select the logit corresponding to the actual generated token and perform backpropagation:

$$\text{logit}_{t_i} = f_\theta(x_{1:i-1})[t_i]$$

$$G_l = \nabla_{\theta_l}\, \text{logit}_{t_i} \quad \text{for each layer } l$$

The gradient influence for each layer is computed as the L1 norm of the gradient:

$$I_l = \sum_{p \in \theta_l} |G_p|$$

The mean influence across all layers represents the overall gradient strength for that token:

$$\bar{I}_{t_i} = \frac{1}{L} \sum_{l=1}^{L} I_l$$
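
A minimal PyTorch sketch of this per-token computation is given below. It assumes a standard Hugging Face causal LM whose decoder blocks live under `model.layers.<idx>`; the function name, the layer-name parsing, and the one-token-at-a-time batching are illustrative assumptions rather than the exact analysis code.

```python
import torch

def token_gradient_influence(model, input_ids, position):
    """Mean L1 gradient norm across layers for the token generated at `position`.

    `input_ids` has shape (1, seq_len) and 0 < position < seq_len.
    """
    model.zero_grad()
    context = input_ids[:, :position]             # x_{1:i-1}
    target_token = input_ids[0, position].item()  # the actually generated token t_i

    logits = model(context).logits                # (1, position, vocab_size)
    target_logit = logits[0, -1, target_token]    # logit_{t_i}
    target_logit.backward()                       # populates .grad on all parameters

    # Accumulate sum_p |G_p| within each decoder layer ("model.layers.<idx>." naming assumed).
    per_layer = {}
    for name, param in model.named_parameters():
        if param.grad is None or ".layers." not in name:
            continue
        layer_idx = name.split(".layers.")[1].split(".")[0]
        per_layer[layer_idx] = per_layer.get(layer_idx, 0.0) + param.grad.abs().sum().item()

    return sum(per_layer.values()) / len(per_layer)   # mean influence across layers
```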

2.2 Computing Token Entropy

For each position $i$ in the sequence, we compute the entropy of the model's predictive distribution over the vocabulary. Given the context $x_{1:i-1}$, the model outputs logits which are converted to probabilities via softmax:

$$P(v | x_{1:i-1}) = \text{softmax}(f_\theta(x_{1:i-1}))$$

The entropy (in bits) is then computed as:

$$H_i = -\sum_{v \in V} P(v | x_{1:i-1}) \cdot \log_2 P(v | x_{1:i-1})$$

Higher entropy indicates greater uncertainty in the model's prediction, often associated with reasoning or decision points.

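The per-position entropy can be obtained in a single forward pass. A small sketch, again assuming a Hugging Face-style causal LM (the helper name is ours):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_entropies(model, input_ids):
    """Entropy (in bits) of the model's predictive distribution at every position.

    Returns a tensor of shape (seq_len - 1,), where the j-th entry (0-indexed)
    is the entropy of the distribution used to predict the token at position j + 1.
    """
    logits = model(input_ids).logits[:, :-1, :]     # one prediction per context prefix
    log_probs = F.log_softmax(logits, dim=-1)       # log P(v | x_{1:i-1}) in nats
    entropy_nats = -(log_probs.exp() * log_probs).sum(dim=-1)
    return (entropy_nats / math.log(2)).squeeze(0)  # convert nats -> bits
```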

2.3 Computing Gradient-Entropy Correlation

After collecting gradient influences $\{\bar{I}_{t}\}$ and entropies $\{H_t\}$ for each unique token $t$, we compute two correlation metrics:

Pearson Correlation measures linear relationship:

$$r = \frac{\sum_{t}(\bar{I}_t - \mu_I)(H_t - \mu_H)}{\sqrt{\sum_{t}(\bar{I}_t - \mu_I)^2} \cdot \sqrt{\sum_{t}(H_t - \mu_H)^2}}$$

Spearman Correlation measures monotonic relationship based on ranks:

$$\rho = 1 - \frac{6 \sum_{t} d_t^2}{n(n^2 - 1)}$$

where $d_t = \text{rank}(\bar{I}_t) - \text{rank}(H_t)$ and $n$ is the number of unique tokens. A negative correlation ($\rho < 0$) indicates that high-entropy tokens have low gradient influence—the signature of reasoning models.

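Once the per-token statistics are aggregated (here assumed to be dictionaries mapping each unique token id to its mean gradient influence and mean entropy), both correlations follow directly from SciPy:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def gradient_entropy_correlation(influence_by_token, entropy_by_token):
    """Pearson r and Spearman rho between per-token gradient influence and entropy."""
    tokens = sorted(set(influence_by_token) & set(entropy_by_token))
    I = np.array([influence_by_token[t] for t in tokens])
    H = np.array([entropy_by_token[t] for t in tokens])
    r, r_pvalue = pearsonr(I, H)         # linear relationship
    rho, rho_pvalue = spearmanr(I, H)    # rank-based (monotonic) relationship
    return {"pearson": r, "spearman": rho,
            "pearson_p": r_pvalue, "spearman_p": rho_pvalue}
```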

2.4 Training Pipeline: SFT and GRPO

To study how reasoning capabilities emerge during training, we use the R1 pipeline consisting of two stages: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).

2.4.1 Supervised Fine-Tuning (SFT)

SFT trains the model to imitate high-quality reasoning trajectories. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}$ of prompt-response pairs, the model minimizes the negative log-likelihood:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$

We use the open-r1/Mixture-of-Thoughts dataset for SFT training (8000 steps), saving checkpoints at regular intervals to track the evolution of the gradient-entropy correlation.

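For concreteness, the loss above is ordinary next-token cross-entropy restricted to the response tokens. A minimal sketch follows; the prompt/response packing and masking convention here is an assumption, and the actual training uses a full trainer (e.g., the open-r1 recipes):

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_len):
    """Negative log-likelihood of the response y given the prompt x.

    `input_ids`: (1, |x| + |y|) concatenated prompt and response token ids.
    `prompt_len`: number of prompt tokens |x|, which are excluded from the loss.
    """
    logits = model(input_ids).logits[:, :-1, :]   # position t predicts token t + 1
    targets = input_ids[:, 1:].clone()
    targets[:, : prompt_len - 1] = -100           # mask prompt positions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,                        # only response tokens contribute
    )
```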

2.4.2 Group Relative Policy Optimization (GRPO)

GRPO is a reinforcement learning algorithm that eliminates the need for a separate critic model by using group-based advantage estimation. The algorithm proceeds as follows:

  1. Group Sampling: For each prompt $q$, sample $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$.
  2. Reward Computation: Evaluate each output using reward functions (accuracy, format).
  3. Advantage Estimation: Compute normalized advantages using group statistics.
  4. Policy Update: Update parameters using the clipped surrogate objective.

Advantage Estimation: Instead of using a value network, GRPO estimates the advantage by normalizing rewards within each group:

$$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$

GRPO Objective: The policy is optimized by maximizing the clipped surrogate objective with a KL divergence penalty. Let $\rho_{i,t}$ denote the probability ratio:

$$\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$$

The clipped surrogate loss for each token is:

$$L_{i,t} = \min\left( \rho_{i,t} \cdot \hat{A}_{i,t}, \; \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) \cdot \hat{A}_{i,t} \right)$$

The complete GRPO objective is:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ L_{i,t} - \beta \, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right]$$

where $\epsilon$ is the clipping parameter and $\beta$ controls the KL penalty strength.

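The core of the update can be expressed compactly. The sketch below is illustrative only: it assumes equal-length, already-masked sequences and per-token KL estimates supplied by the caller, whereas the actual runs use an existing GRPO implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, kl_to_ref, eps=0.2, beta=0.04):
    """GRPO loss for one prompt's group of G sampled outputs.

    logp_new, logp_old : (G, T) per-token log-probs under the current / old policy.
    rewards            : (G,) scalar reward r_i for each sampled output.
    kl_to_ref          : (G, T) per-token KL estimate against the reference policy.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1)                                      # broadcast over tokens

    ratio = (logp_new - logp_old).exp()                         # rho_{i,t}
    surrogate = torch.min(ratio * adv,                          # L_{i,t}
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # -(1/G) sum_i (1/|o_i|) sum_t [ L_{i,t} - beta * KL ]
    return -(surrogate - beta * kl_to_ref).mean(dim=1).mean()
```

The default values eps=0.2 and beta=0.04 are placeholders for illustration, not the hyperparameters used in our runs.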

2.4.3 Reward Functions

We use verifiable rewards that combine multiple components, weighted by coefficients $\alpha$ and $\beta$ (reward weights, not to be confused with the KL coefficient in the GRPO objective):

$$r_i = \alpha \cdot r_{\text{accuracy}} + \beta \cdot r_{\text{format}}$$

We use the open-r1/OpenThoughts-114k-math dataset for GRPO training (8000 steps), enabling us to track how RL shapes the gradient-entropy relationship.

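The reward components are verifiable by rule. The sketch below assumes a DeepSeek-R1-style `<think>...</think><answer>...</answer>` completion format and exact-match answer checking; both the format convention and the weights are assumptions for illustration, not the exact functions used in training.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the gold answer (a deliberately simple verifier)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if match and match.group(1).strip() == gold_answer.strip() else 0.0

def total_reward(completion: str, gold_answer: str, alpha: float = 1.0, beta: float = 0.2) -> float:
    """r_i = alpha * r_accuracy + beta * r_format (weights are illustrative)."""
    return alpha * accuracy_reward(completion, gold_answer) + beta * format_reward(completion)
```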

3. Core Hypothesis: The Gradient-Entropy Inversion

We analyze the attention projection layers (Q, K, V, O) and, after backpropagation, compute the gradient matrix $G$ of each projection. To quantify the "strength" of this gradient, we introduce the Nuclear Norm $s_{X,i}$:

$$s_{X,i} = ||G_{X,i}||_{*} = \sum_{j} |\sigma_{j}|$$
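
Given the gradients populated by the backward pass from §2.1, the nuclear norm of each projection's gradient matrix follows from its singular values. A sketch assuming a Llama/Qwen-style Hugging Face module layout (attribute names may differ for other architectures):

```python
import torch

def projection_nuclear_norms(model, layer_idx):
    """Nuclear norm ||G||_* (sum of singular values) of each attention projection's gradient."""
    attn = model.model.layers[layer_idx].self_attn
    norms = {}
    for name in ("q_proj", "k_proj", "v_proj", "o_proj"):
        grad = getattr(attn, name).weight.grad
        if grad is None:
            continue
        sigma = torch.linalg.svdvals(grad.float())   # singular values of the gradient matrix
        norms[name] = sigma.sum().item()             # s_{X,i} = sum_j sigma_j
    return norms
```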

Analyzing the relationship between this gradient strength and the entropy of the generated tokens reveals the patterns reported in the following sections.

4. The "Reasoning Fingerprint": Llama vs. Qwen

Before presenting our experimental results, we first introduce the complete DeepSeek-R1 training pipeline that we follow to train reasoning models and track the evolution of gradient-entropy correlation.

Base Model (Llama/Qwen) → Cold Start SFT (long-CoT data) → RL Stage 1 (GRPO + rule-based reward) → Rejection Sampling (filter + refine) → RL Stage 2 (diverse prompts)
SFT data: open-r1/Mixture-of-Thoughts · RL data: open-r1/OpenThoughts-114k-math
Pipeline Overview: The complete DeepSeek-R1 training pipeline. Starting from a base model, the pipeline alternates between SFT and RL stages. Our experiments track checkpoints at each stage to measure the evolution of the gradient-entropy correlation.

Tests across different datasets (Base, Safety, Reasoning samples) confirm that whenever a model possesses reasoning capabilities, the Logit Gradient negatively correlates with Entropy.

Key Data: For the Llama reasoning model, the Spearman correlation on reasoning samples reaches -0.649, whereas for the Base model it is merely 0.036.

Figure 1: Spearman correlation between logit gradient nuclear norm and token entropy across different model types. Reasoning models exhibit strong negative correlation (-0.649), while Base and Safe models show weak or positive correlation.

5. SFT Dynamics: The 200-Step Phase Transition

Base Model (Llama/Qwen) → SFT Training (8000 steps) → Checkpoints at steps 200, 400, ..., 2000 → Measure ρ(G, H) (gradient-entropy correlation)
Pipeline 5.1: SFT experiment pipeline. We train base models with supervised fine-tuning on open-r1/Mixture-of-Thoughts and save checkpoints every 200 steps to track correlation changes.

Using the open-r1/Mixture-of-Thoughts dataset, we tracked the correlation every 200 steps over the full 8000-step SFT run.

Discovery: The "phase transition" happens remarkably fast. Within the first 200 steps of SFT, the Llama model's correlation drops from the base value of 0.036 to -0.468. This suggests that the structural pattern of reasoning is learned very early.

Figure 2: Evolution of Spearman correlation during SFT training (first 2000 steps). The correlation drops sharply within the first 200 steps, indicating rapid acquisition of reasoning structure.

6. RL (Cold Start) Dynamics: Oscillation & Convergence

Base Model (Llama/Qwen) → RL-Zero (GRPO, no SFT warmup) → Checkpoints at steps 100, 200, ..., 1000 → Measure ρ(G, H) (gradient-entropy correlation)
Pipeline 6.1: RL-Zero (Cold Start) experiment pipeline. Following DeepSeek-R1-Zero, we apply GRPO directly to base models without SFT warmup, using open-r1/OpenThoughts-114k-math with accuracy and format rewards.

Pure reinforcement learning (RL-Zero) on open-r1/OpenThoughts-114k-math for 8000 steps shows a different trajectory: rather than a sharp early drop, the correlation oscillates before converging (Figure 3).

Figure 3: Evolution of Spearman correlation during RL-Zero training (first 1000 steps). Both models show early oscillation before stabilizing, with Llama exhibiting notable fluctuation around step 300.

7. Full Training Trajectory: SFT + RL (8000 Steps Each)

Base Model (Llama/Qwen) → SFT Stage (steps 0 → 8000) → RL Stage with GRPO (steps 8000 → 16000) → Final: ρ = -0.649 (strong negative)
Pipeline 7.1: Complete R1 training pipeline combining SFT and RL stages. This follows the DeepSeek-R1 methodology: first SFT on reasoning data, then RL with GRPO to further enhance reasoning capabilities.

The full R1 pipeline consists of SFT (8000 steps) followed by RL (8000 steps). The key observation, shown in Figure 4, is that both Llama and Qwen converge to a strong negative correlation (-0.649) by the end of training.

Figure 4: Complete training trajectory of the R1 pipeline: SFT (steps 0-8000) followed by RL (steps 8000-16000). Both Llama and Qwen converge to strong negative correlation (-0.649) by the end of training.

Acknowledgments

This work was completed by Junyao Yang during his internship at Shanghai Artificial Intelligence Laboratory. We would like to express our sincere gratitude to Dongrui Liu and Chen Qian, who served as Junyao Yang's mentors at SHAILAB, for their invaluable guidance, insightful discussions, and continuous support throughout this project.

References

[1] Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, Jing Shao. "Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning." arXiv preprint arXiv:2506.02867, 2025. https://arxiv.org/abs/2506.02867

[2] Shenzhi Wang, Le Yu, Chang Gao, et al. "Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning." arXiv preprint arXiv:2506.01939, 2025. https://arxiv.org/abs/2506.01939

[3] Ming Li, Yanhong Li, Tianyi Zhou. "What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective." arXiv preprint arXiv:2410.23743, 2025. https://arxiv.org/abs/2410.23743

[4] Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv preprint arXiv:2402.03300, 2024. https://arxiv.org/abs/2402.03300

[5] Daya Guo, Dejian Yang, Haowei Zhang, et al. "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning." Nature, 645(8081), 633–638, 2025. https://doi.org/10.1038/s41586-025-09422-z


Source: Explainable Reasoning Capability in a Token Logit Gradient Perspective

