The Entropy-Gradient Inversion: A New Perspective on LLM Reasoning Capabilities

Junyao Yang
February 7, 2026
Abstract: While Chain-of-Thought (CoT) and reasoning models (e.g., DeepSeek-R1, o1) have demonstrated remarkable capabilities, their internal mechanisms remain largely opaque. This study introduces a physics-inspired metric, the Nuclear Norm of Logit Gradients, to characterize long-chain reasoning. We discover that reasoning models exhibit a unique "fingerprint": a significant negative correlation between gradient strength and token entropy, the opposite of the pattern shown by traditional base models. Furthermore, we find that this capability emerges rapidly within the first 200 steps of SFT, whereas pure RL (cold start) exhibits early oscillation before convergence.

1. Background: The Entropy of "Thinking"

Before diving into the gradient analysis, it is crucial to understand the behavior of "thinking tokens" in modern reasoning models. Two recent studies provide the key context: thinking tokens tend to coincide with information peaks along the reasoning trajectory [1], and a small minority of high-entropy tokens drives most of the effective learning during RL [2].

The Conflict: While high entropy is a hallmark of deep reasoning, the present study reveals a surprising twist: in reasoning models, these high-entropy tokens are associated with low gradient influence (a negative correlation). This suggests that reasoning models have learned to remain "structurally stable" even when facing high uncertainty.

2. Methodology

2.1 Computing Token Gradient Influence

For each generated token $t_i$, we compute the gradient influence by backpropagating from the target logit. Given an input sequence, the model produces logits for the next token. We select the logit corresponding to the actual generated token and perform backpropagation:

$$\text{logit}_{t_i} = f_\theta(x_{1:i-1})[t_i]$$

$$G_l = \nabla_{\theta_l}\, \text{logit}_{t_i} \quad \text{for each layer } l$$

The gradient influence for each layer is computed as the L1 norm of the gradient:

$$I_l = \sum_{p \in \theta_l} |G_p|$$

The mean influence across all layers represents the overall gradient strength for that token:

$$\bar{I}_{t_i} = \frac{1}{L} \sum_{l=1}^{L} I_l$$
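
A minimal PyTorch sketch of this per-token computation is given below. It assumes a standard Hugging Face causal LM whose decoder blocks live under `model.layers.<idx>`; the function name, the layer-name parsing, and the one-token-at-a-time batching are illustrative assumptions rather than the exact analysis code.

```python
import torch

def token_gradient_influence(model, input_ids, position):
    """Mean L1 gradient norm across layers for the token generated at `position`.

    `input_ids` has shape (1, seq_len) and 0 < position < seq_len.
    """
    model.zero_grad()
    context = input_ids[:, :position]             # x_{1:i-1}
    target_token = input_ids[0, position].item()  # the actually generated token t_i

    logits = model(context).logits                # (1, position, vocab_size)
    target_logit = logits[0, -1, target_token]    # logit_{t_i}
    target_logit.backward()                       # populates .grad on all parameters

    # Accumulate sum_p |G_p| within each decoder layer ("model.layers.<idx>." naming assumed).
    per_layer = {}
    for name, param in model.named_parameters():
        if param.grad is None or ".layers." not in name:
            continue
        layer_idx = name.split(".layers.")[1].split(".")[0]
        per_layer[layer_idx] = per_layer.get(layer_idx, 0.0) + param.grad.abs().sum().item()

    return sum(per_layer.values()) / len(per_layer)   # mean influence across layers
```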

2.2 Computing Token Entropy

For each position $i$ in the sequence, we compute the entropy of the model's predictive distribution over the vocabulary. Given the context $x_{1:i-1}$, the model outputs logits which are converted to probabilities via softmax:

$$P(v | x_{1:i-1}) = \text{softmax}(f_\theta(x_{1:i-1}))$$

The entropy (in bits) is then computed as:

$$H_i = -\sum_{v \in V} P(v | x_{1:i-1}) \cdot \log_2 P(v | x_{1:i-1})$$

Higher entropy indicates greater uncertainty in the model's prediction, often associated with reasoning or decision points.

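The per-position entropy can be obtained in a single forward pass. A small sketch, again assuming a Hugging Face-style causal LM (the helper name is ours):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_entropies(model, input_ids):
    """Entropy (in bits) of the model's predictive distribution at every position.

    Returns a tensor of shape (seq_len - 1,), where the j-th entry (0-indexed)
    is the entropy of the distribution used to predict the token at position j + 1.
    """
    logits = model(input_ids).logits[:, :-1, :]     # one prediction per context prefix
    log_probs = F.log_softmax(logits, dim=-1)       # log P(v | x_{1:i-1}) in nats
    entropy_nats = -(log_probs.exp() * log_probs).sum(dim=-1)
    return (entropy_nats / math.log(2)).squeeze(0)  # convert nats -> bits
```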

2.3 Computing Gradient-Entropy Correlation

After collecting gradient influences $\{\bar{I}_{t}\}$ and entropies $\{H_t\}$ for each unique token $t$, we compute two correlation metrics:

Pearson Correlation measures linear relationship:

$$r = \frac{\sum_{t}(\bar{I}_t - \mu_I)(H_t - \mu_H)}{\sqrt{\sum_{t}(\bar{I}_t - \mu_I)^2} \cdot \sqrt{\sum_{t}(H_t - \mu_H)^2}}$$

Spearman Correlation measures monotonic relationship based on ranks:

$$\rho = 1 - \frac{6 \sum_{t} d_t^2}{n(n^2 - 1)}$$

where $d_t = \text{rank}(\bar{I}_t) - \text{rank}(H_t)$ and $n$ is the number of unique tokens. A negative correlation ($\rho < 0$) indicates that high-entropy tokens have low gradient influence—the signature of reasoning models.

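Once the per-token statistics are aggregated (here assumed to be dictionaries mapping each unique token id to its mean gradient influence and mean entropy), both correlations follow directly from SciPy:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def gradient_entropy_correlation(influence_by_token, entropy_by_token):
    """Pearson r and Spearman rho between per-token gradient influence and entropy."""
    tokens = sorted(set(influence_by_token) & set(entropy_by_token))
    I = np.array([influence_by_token[t] for t in tokens])
    H = np.array([entropy_by_token[t] for t in tokens])
    r, r_pvalue = pearsonr(I, H)         # linear relationship
    rho, rho_pvalue = spearmanr(I, H)    # rank-based (monotonic) relationship
    return {"pearson": r, "spearman": rho,
            "pearson_p": r_pvalue, "spearman_p": rho_pvalue}
```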

2.4 Training Pipeline: SFT and GRPO

To study how reasoning capabilities emerge during training, we use the R1 pipeline consisting of two stages: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).

2.4.1 Supervised Fine-Tuning (SFT)

SFT trains the model to imitate high-quality reasoning trajectories. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}$ of prompt-response pairs, the model minimizes the negative log-likelihood:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$

We use the open-r1/Mixture-of-Thoughts dataset for SFT training (8000 steps), saving checkpoints at regular intervals to track the evolution of the gradient-entropy correlation.

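For concreteness, the loss above is ordinary next-token cross-entropy restricted to the response tokens. A minimal sketch follows; the prompt/response packing and masking convention here is an assumption, and the actual training uses a full trainer (e.g., the open-r1 recipes):

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_len):
    """Negative log-likelihood of the response y given the prompt x.

    `input_ids`: (1, |x| + |y|) concatenated prompt and response token ids.
    `prompt_len`: number of prompt tokens |x|, which are excluded from the loss.
    """
    logits = model(input_ids).logits[:, :-1, :]   # position t predicts token t + 1
    targets = input_ids[:, 1:].clone()
    targets[:, : prompt_len - 1] = -100           # mask prompt positions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,                        # only response tokens contribute
    )
```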

2.4.2 Group Relative Policy Optimization (GRPO)

GRPO is a reinforcement learning algorithm that eliminates the need for a separate critic model by using group-based advantage estimation. The algorithm proceeds as follows:

  1. Group Sampling: For each prompt $q$, sample $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$.
  2. Reward Computation: Evaluate each output using reward functions (accuracy, format).
  3. Advantage Estimation: Compute normalized advantages using group statistics.
  4. Policy Update: Update parameters using the clipped surrogate objective.

Advantage Estimation: Instead of using a value network, GRPO estimates the advantage by normalizing rewards within each group:

$$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$

GRPO Objective: The policy is optimized by maximizing the clipped surrogate objective with a KL divergence penalty. Let $\rho_{i,t}$ denote the probability ratio:

$$\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$$

The clipped surrogate loss for each token is:

$$L_{i,t} = \min\left( \rho_{i,t} \cdot \hat{A}_{i,t}, \; \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) \cdot \hat{A}_{i,t} \right)$$

The complete GRPO objective is:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ L_{i,t} - \beta \, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right]$$

where $\epsilon$ is the clipping parameter and $\beta$ controls the KL penalty strength.

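The core of the update can be expressed compactly. The sketch below is illustrative only: it assumes equal-length, already-masked sequences and per-token KL estimates supplied by the caller, whereas the actual runs use an existing GRPO implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, kl_to_ref, eps=0.2, beta=0.04):
    """GRPO loss for one prompt's group of G sampled outputs.

    logp_new, logp_old : (G, T) per-token log-probs under the current / old policy.
    rewards            : (G,) scalar reward r_i for each sampled output.
    kl_to_ref          : (G, T) per-token KL estimate against the reference policy.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1)                                      # broadcast over tokens

    ratio = (logp_new - logp_old).exp()                         # rho_{i,t}
    surrogate = torch.min(ratio * adv,                          # L_{i,t}
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # -(1/G) sum_i (1/|o_i|) sum_t [ L_{i,t} - beta * KL ]
    return -(surrogate - beta * kl_to_ref).mean(dim=1).mean()
```

The default values eps=0.2 and beta=0.04 are placeholders for illustration, not the hyperparameters used in our runs.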

2.4.3 Reward Functions

We use verifiable rewards that combine multiple components, weighted by coefficients $\alpha$ and $\beta$ (reward weights, not to be confused with the KL coefficient in the GRPO objective):

$$r_i = \alpha \cdot r_{\text{accuracy}} + \beta \cdot r_{\text{format}}$$

We use the open-r1/OpenThoughts-114k-math dataset for GRPO training (8000 steps), enabling us to track how RL shapes the gradient-entropy relationship.

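The reward components are verifiable by rule. The sketch below assumes a DeepSeek-R1-style `<think>...</think><answer>...</answer>` completion format and exact-match answer checking; both the format convention and the weights are assumptions for illustration, not the exact functions used in training.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the gold answer (a deliberately simple verifier)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if match and match.group(1).strip() == gold_answer.strip() else 0.0

def total_reward(completion: str, gold_answer: str, alpha: float = 1.0, beta: float = 0.2) -> float:
    """r_i = alpha * r_accuracy + beta * r_format (weights are illustrative)."""
    return alpha * accuracy_reward(completion, gold_answer) + beta * format_reward(completion)
```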

3. Core Hypothesis: The Gradient-Entropy Inversion

We analyze the attention projection layers (Q, K, V, O) and, after backpropagation, compute the gradient matrix $G$ of each projection. To quantify the "strength" of this gradient, we introduce the Nuclear Norm $s_{X,i}$:

$$s_{X,i} = ||G_{X,i}||_{*} = \sum_{j} |\sigma_{j}|$$
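
Given the gradients populated by the backward pass from §2.1, the nuclear norm of each projection's gradient matrix follows from its singular values. A sketch assuming a Llama/Qwen-style Hugging Face module layout (attribute names may differ for other architectures):

```python
import torch

def projection_nuclear_norms(model, layer_idx):
    """Nuclear norm ||G||_* (sum of singular values) of each attention projection's gradient."""
    attn = model.model.layers[layer_idx].self_attn
    norms = {}
    for name in ("q_proj", "k_proj", "v_proj", "o_proj"):
        grad = getattr(attn, name).weight.grad
        if grad is None:
            continue
        sigma = torch.linalg.svdvals(grad.float())   # singular values of the gradient matrix
        norms[name] = sigma.sum().item()             # s_{X,i} = sum_j sigma_j
    return norms
```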

Analyzing the relationship between this gradient strength and the entropy of the generated tokens reveals the patterns reported in the following sections.

4. The "Reasoning Fingerprint": Llama vs. Qwen

Before presenting our experimental results, we first introduce the complete DeepSeek-R1 training pipeline that we follow to train reasoning models and track the evolution of gradient-entropy correlation.

Base Model (Llama/Qwen) → Cold Start SFT (long-CoT data) → RL Stage 1 (GRPO + rule-based reward) → Rejection Sampling (filter + refine) → RL Stage 2 (diverse prompts)
SFT data: open-r1/Mixture-of-Thoughts · RL data: open-r1/OpenThoughts-114k-math
Pipeline Overview: The complete DeepSeek-R1 training pipeline. Starting from a base model, the pipeline alternates between SFT and RL stages. Our experiments track checkpoints at each stage to measure the evolution of the gradient-entropy correlation.

Tests across different datasets (Base, Safety, Reasoning samples) confirm that whenever a model possesses reasoning capabilities, the Logit Gradient negatively correlates with Entropy.

Key Data: For the Llama reasoning model, the Spearman correlation on reasoning samples reaches -0.649, whereas for the Base model it is merely 0.036.

Figure 1: Spearman correlation between logit gradient nuclear norm and token entropy across different model types. Reasoning models exhibit strong negative correlation (-0.649), while Base and Safe models show weak or positive correlation.

5. SFT Dynamics: The 200-Step Phase Transition

Base Model (Llama/Qwen) → SFT Training (8000 steps) → Checkpoints at steps 200, 400, ..., 2000 → Measure ρ(G, H) (gradient-entropy correlation)
Pipeline 5.1: SFT experiment pipeline. We train base models with supervised fine-tuning on open-r1/Mixture-of-Thoughts and save checkpoints every 200 steps to track correlation changes.

Using the open-r1/Mixture-of-Thoughts dataset, we tracked the correlation every 200 steps over the full 8000-step SFT run.

Discovery: The "phase transition" happens remarkably fast. Within the first 200 steps of SFT, the Llama model's correlation drops from the base value of 0.036 to -0.468. This suggests that the structural pattern of reasoning is learned very early.

Figure 2: Evolution of Spearman correlation during SFT training (first 2000 steps). The correlation drops sharply within the first 200 steps, indicating rapid acquisition of reasoning structure.

6. RL (Cold Start) Dynamics: Oscillation & Convergence

Base Model (Llama/Qwen) → RL-Zero (GRPO, no SFT warmup) → Checkpoints at steps 100, 200, ..., 1000 → Measure ρ(G, H) (gradient-entropy correlation)
Pipeline 6.1: RL-Zero (Cold Start) experiment pipeline. Following DeepSeek-R1-Zero, we apply GRPO directly to base models without SFT warmup, using open-r1/OpenThoughts-114k-math with accuracy and format rewards.

Pure reinforcement learning (RL-Zero) on open-r1/OpenThoughts-114k-math for 8000 steps shows a different trajectory: rather than a sharp early drop, the correlation oscillates before converging (Figure 3).

Figure 3: Evolution of Spearman correlation during RL-Zero training (first 1000 steps). Both models show early oscillation before stabilizing, with Llama exhibiting notable fluctuation around step 300.

7. Full Training Trajectory: SFT + RL (8000 Steps Each)

Base Model (Llama/Qwen) → SFT Stage (steps 0 → 8000) → RL Stage with GRPO (steps 8000 → 16000) → Final: ρ = -0.649 (strong negative)
Pipeline 7.1: Complete R1 training pipeline combining SFT and RL stages. This follows the DeepSeek-R1 methodology: first SFT on reasoning data, then RL with GRPO to further enhance reasoning capabilities.

The full R1 pipeline consists of SFT (8000 steps) followed by RL (8000 steps). The key observation, shown in Figure 4, is that both Llama and Qwen converge to a strong negative correlation (-0.649) by the end of training.

Figure 4: Complete training trajectory of the R1 pipeline: SFT (steps 0-8000) followed by RL (steps 8000-16000). Both Llama and Qwen converge to strong negative correlation (-0.649) by the end of training.

Acknowledgments

This work was completed by Junyao Yang during his internship at Shanghai Artificial Intelligence Laboratory. We would like to express our sincere gratitude to Dongrui Liu and Chen Qian, who served as Junyao Yang's mentors at SHAILAB, for their invaluable guidance, insightful discussions, and continuous support throughout this project.

References

[1] Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, Jing Shao. "Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning." arXiv preprint arXiv:2506.02867, 2025. https://arxiv.org/abs/2506.02867

[2] Shenzhi Wang, Le Yu, Chang Gao, et al. "Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning." arXiv preprint arXiv:2506.01939, 2025. https://arxiv.org/abs/2506.01939

[3] Ming Li, Yanhong Li, Tianyi Zhou. "What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective." arXiv preprint arXiv:2410.23743, 2025. https://arxiv.org/abs/2410.23743

[4] Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv preprint arXiv:2402.03300, 2024. https://arxiv.org/abs/2402.03300

[5] Daya Guo, Dejian Yang, Haowei Zhang, et al. "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning." Nature, 645(8081), 633–638, 2025. https://doi.org/10.1038/s41586-025-09422-z


Source: Explainable Reasoning Capability in a Token Logit Gradient Perspective

