d2: Improved Techniques for Training Reasoning Diffusion Language Models

Cornell University

TL;DR: An efficient and effective algorithm to improve reasoning post-training for diffusion language models via reinforcement learning

Figure 1. Our reinforcement learning (RL) post-training framework for diffusion language models, d2, consistently outperforms previous baselines such as d1 (Zhao et al., 2025) and wd1 (Tang et al., 2025) across four math and logical reasoning benchmarks. While previous methods such as d1 rely on chain-of-thought supervised finetuning (SFT) to enhance reasoning performance, d2 outperforms them without SFT, verifying the efficacy of our proposed RL algorithm.


Our contributions

  • We derive a novel GRPO-style RL policy gradient algorithm for masked DLMs, highlighting the importance of accurate trajectory likelihood estimates for training reasoning diffusion language models.
  • We introduce d2-AnyOrder, which enables unbiased one-pass likelihood estimates of sample trajectories in DLMs that support any-order decoding. Empirically, d2-AnyOrder significantly outperforms widely-used RL baselines such as diffu-GRPO (Zhao et al., 2025) and DDPO (Black et al., 2023).
  • For DLMs that do not naturally support any-order decoding, we propose d2-StepMerge to provide practical likelihood approximation. When applied to LLaDA-8B-Instruct, d2-StepMerge achieves state-of-the-art reasoning results on Sudoku, Countdown, GSM8K, and MATH500, without relying on supervised fine-tuning.

Motivation

Masked diffusion language models (DLMs) have recently emerged as a competitive alternative to autoregressive (AR) models for language generation. Yet, while reinforcement learning (RL) has become the de-facto approach for inducing reasoning in AR LLMs, post-training DLMs using RL remains an active research area due to the complexity of DLM policies.

In this project, we study the problem of RL for DLMs in a principled way. Concretely, we first derive the GRPO objective for masked DLMs starting from the policy gradient estimate. We then propose practical approaches for conducting RL on masked DLMs inspired by our theoretical derivation, which we find to work well empirically compared to previous RL baselines.

Deriving the GRPO objective for masked DLMs

We start our derivation by defining the policy gradient objective for a masked DLM policy \(\pi_\theta\) as: $$ \nabla_\theta\mathcal{J}(\theta) = \nabla_\theta \mathbb{E}_{\mathbf{x}_0^{1:L}\sim\pi_\theta(\cdot\mid\mathbf{q})}\Big[r(\mathbf{x}_0^{1:L}, \mathbf{q})\Big]. $$ Here, \(\mathbf{q}\) denotes the prompt sequence, \(r\) represents the reward function, and \(\mathbf{x}_0^{1:L}\) denotes the \(L\) sample tokens generated by the model. Based on our derivation (see the full derivation chain in our paper), the GRPO objective for masked DLMs is given by: $$ -\mathbb{E}_{\mathbf{x}_0^{1:L}\sim\pi_{\textnormal{old}}} \Big[\sum_{t=0}^{T-1}\frac{1}{L}\sum_{l=1}^L\mathbf{1}_{t,l}\textnormal{min}(\rho_t^lA^l, \textnormal{clip}(\rho_t^l, 1-\epsilon, 1+\epsilon)A^l) +\beta D_{\mathrm{KL}}\Big(\pi_\theta(\mathbf{x}_{0:T}^{1:L}\mid \mathbf{q})\Vert \pi_{\textnormal{ref}}(\mathbf{x}_{0:T}^{1:L}\mid \mathbf{q})\Big)\Big], $$ where \(A^l\) denotes the advantage value of the corresponding sample sequence, \(\rho_t^l=\frac{\pi_\theta(\mathbf{x}_t^l\mid\mathbf{x}_{t+1}^{1:L}, \mathbf{q})}{\pi_{\textnormal{old}}(\mathbf{x}_t^l \mid \mathbf{x}_{t+1}^{1:L}, \mathbf{q})}\) denotes the importance sampling weight, and \(\mathbf{1}_{t,l}\) is the indicator of tokens decoded at time step \(t\), i.e., \(\mathbf{1}_{t,l}=\mathbf{1}_{\mathbf{x}_{t+1}^l=\mathbf{m}, \mathbf{x}_t^l\neq\mathbf{m}}\).
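
To make the structure of this objective concrete, here is a minimal sketch of the clipped surrogate term in PyTorch (the KL penalty is omitted). The function name, the tensor shapes, and the assumption that per-token log-likelihoods under the current and old policies have already been computed are ours, not part of any released d2 implementation.

        import torch

        def grpo_masked_dlm_surrogate(logp_new, logp_old, decoded, advantages, eps=0.2):
            """Clipped GRPO surrogate for a masked DLM (KL penalty omitted).

            logp_new, logp_old: (B, T, L) log-likelihoods of x_t^l given x_{t+1}^{1:L}
                under the current and old policies (assumed precomputed).
            decoded:    (B, T, L) 0/1 indicator 1_{t,l} of tokens unmasked at step t.
            advantages: (B, L) per-token advantages A^l.
            """
            ratio = torch.exp(logp_new - logp_old)                    # rho_t^l
            adv = advantages.unsqueeze(1)                             # broadcast over time steps
            unclipped = ratio * adv
            clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
            surrogate = torch.minimum(unclipped, clipped) * decoded   # keep decoded tokens only
            L = logp_new.shape[-1]
            # Sum over time steps, average over sequence length, negate to obtain a loss.
            return -(surrogate.sum(dim=1).sum(dim=-1) / L).mean()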

To compute this objective, we need to estimate the trajectory likelihood of a sequence. In other words, for each token in a sample sequence, we should compute its likelihood at its decoding time step. However, for standard masked DLMs such as LLaDA, computing the trajectory likelihood of a sample sequence decoded in \(T\) time steps takes \(T\) model forward passes, rendering it computationally prohibitive. Based on this observation, we propose several practical techniques to compute masked DLMs' trajectory likelihoods in an efficient and effective way.
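
For intuition, the sketch below shows what the naive estimator looks like: one forward pass per denoising step. The trajectory layout, the model signature, and the MASK_ID placeholder are illustrative assumptions.

        import torch
        import torch.nn.functional as F

        MASK_ID = 0  # placeholder mask token id; model-specific in practice

        @torch.no_grad()
        def naive_trajectory_logprob(model, traj, prompt):
            """Exact trajectory log-likelihood of a masked DLM: one forward pass per step.

            traj:  list [x_0, x_1, ..., x_T] of shape-(L,) token tensors, x_T fully masked.
            model: hypothetical callable (tokens, prompt) -> (L, V) logits.
            """
            total = 0.0
            T = len(traj) - 1
            for t in range(T):                                       # T forward passes: the bottleneck
                x_t, x_next = traj[t], traj[t + 1]
                logits = model(x_next, prompt)                       # condition on x_{t+1}^{1:L}
                logp = F.log_softmax(logits, dim=-1)
                decoded = (x_next == MASK_ID) & (x_t != MASK_ID)     # indicator 1_{t,l}
                total = total + logp[decoded, x_t[decoded]].sum()
            return total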

d2-AnyOrder: an unbiased, one-pass trajectory likelihood estimator

Any-order autoregressive models


In this section, we introduce d2-AnyOrder, our trajectory likelihood estimator that can achieve unbiased trajectory likelihood evaluation for masked DLMs with only one model pass. Since this estimator is inspired by the concept of any-order autoregressive models (AO-ARMs), we first take a detour to introduce AO-ARMs.

The concept of AO-ARMs was proposed by Hoogeboom et al., 2021, who demonstrate that the training objective of masked discrete diffusion models is equivalent to an any-order autoregressive variant of the following form: $$ \mathcal{L}_{\textnormal{AO-ARM}}=\mathbb{E}_{\sigma \sim U(S_D)}\Big[\sum_{l=1}^L \log p_\theta(\mathbf{x}_0^{\sigma(l)} \mid \mathbf{x}_0^{\sigma(< l)})\Big], $$ where \(\sigma\) is a permutation of the integers \(1,\ldots,L\), and \(S_D\) is the set of all possible \(\sigma\).
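
As a concrete reference, the sketch below evaluates this objective for a single sequence and one sampled order \(\sigma\) by revealing tokens one at a time. The model signature and the MASK_ID placeholder are assumptions, and a real implementation would batch this instead of looping over \(L\) forward passes.

        import torch
        import torch.nn.functional as F

        MASK_ID = 0  # placeholder mask token id (assumption)

        def ao_arm_nll(model, x0):
            """AO-ARM negative log-likelihood of one sequence x0 (L,) under one sampled order.

            model: hypothetical callable (tokens,) -> (L, V) logits.
            """
            L = x0.shape[0]
            sigma = torch.randperm(L)                    # a random order over positions
            x_in = torch.full_like(x0, MASK_ID)          # start fully masked
            nll = 0.0
            for l in range(L):                           # reveal tokens in the order sigma
                logits = model(x_in)
                pos = sigma[l]
                nll = nll - F.log_softmax(logits[pos], dim=-1)[x0[pos]]
                x_in[pos] = x0[pos]                      # now condition on x_0^{sigma(<= l)}
            return nll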

d2-AnyOrder


Inspired by the concept of AO-ARMs, we observe that the trajectory likelihood \(\pi(\mathbf{x}_{0:T}^{1:L})\) can be rewritten as \(\prod_{l=1}^L\pi(\mathbf{x}_0^{\sigma(l)}\mid \mathbf{x}_0^{\sigma(\lt l)})\). Here, instead of being a randomly sampled permutation, \(\sigma\) records the decoding order of the sample tokens. In other words, \(\mathbf{x}_0^{\sigma(\lt l)}\) denotes the tokens decoded before \(\mathbf{x}_0^{\sigma(l)}\).
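
In practice, \(\sigma\) can be read off directly from the sampled trajectory. The sketch below is one way to do this under an assumed trajectory layout (\(\mathbf{x}_0\) clean, \(\mathbf{x}_T\) fully masked, decoding running from \(t=T-1\) down to \(0\)); the MASK_ID placeholder is an assumption.

        import torch

        MASK_ID = 0  # placeholder mask token id (assumption)

        def decoding_order(traj):
            """Recover the decoding order sigma from a trajectory [x_0, x_1, ..., x_T].

            Returns a (L,) tensor of positions sorted from first-decoded to last-decoded.
            Tokens unmasked within the same step are ordered arbitrarily, which does not
            affect the likelihood since they are sampled independently given the prefix.
            """
            T = len(traj) - 1
            L = traj[0].shape[0]
            step = torch.full((L,), -1, dtype=torch.long)
            for t in range(T):
                newly = (traj[t + 1] == MASK_ID) & (traj[t] != MASK_ID)  # decoded at step t
                step[newly] = t
            # Decoding runs from t = T-1 down to t = 0, so a larger t means decoded earlier.
            return torch.argsort(step, descending=True)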

Based on this alternative decomposition, we propose our efficient and effective trajectory likelihood estimator, d2-AnyOrder. Concretely, given a clean sample token sequence \(\mathbf{x}_0^{1:L}\), we construct a \(2L\)-length sequence \(\mathbf{x}_0^{1:L} \oplus \mathbf{m}^{L+1:2L}\), where \(\oplus\) denotes concatenation along the sequence dimension and \(\mathbf{m}^{L+1:2L}\) are masked tokens. We assign the positional encoding as \(pos_l = l \; \textnormal{mod} \; L\). We then define the attention mask so that a clean token \(\mathbf{x}_0^{\sigma(l)}\) attends to \(\mathbf{x}_0^{\sigma(\leq l)}\), and a mask token \(\mathbf{m}^{L+\sigma(l)}\) attends to \(\mathbf{x}_0^{\sigma(\lt l)} \cup \mathbf{m}^{L+\sigma(l)}\). Finally, we use the output logits at position \(L+l\) as a proxy for the logits of token \(\mathbf{x}_0^l\) at its decoding time step.
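
The sketch below shows one way to construct the concatenated input, position ids, and attention mask described above. How these are fed into a particular DLM's forward pass is model-specific, and the function name and the boolean-mask convention (True means the query may attend to the key) are ours.

        import torch

        def d2_anyorder_inputs(x0, sigma, mask_id):
            """Build the 2L-length input, position ids, and attention mask for d2-AnyOrder.

            x0:    (L,) clean response tokens.
            sigma: (L,) decoding order; sigma[k] is the position decoded k-th.
            Returns (tokens, pos_ids, attn), where attn[i, j] = True means query i may
            attend to key j. The logits at row L + l are the proxy for token x_0^l.
            """
            L = x0.shape[0]
            tokens = torch.cat([x0, torch.full((L,), mask_id, dtype=x0.dtype)])
            pos_ids = torch.cat([torch.arange(L), torch.arange(L)])   # pos_l = l mod L

            rank = torch.empty(L, dtype=torch.long)
            rank[sigma] = torch.arange(L)                             # rank[l]: decoding index of position l

            attn = torch.zeros(2 * L, 2 * L, dtype=torch.bool)
            # Clean token x_0^{sigma(l)} attends to the clean tokens x_0^{sigma(<= l)}.
            attn[:L, :L] = rank.unsqueeze(1) >= rank.unsqueeze(0)
            # Mask token m^{L + sigma(l)} attends to x_0^{sigma(< l)} and to itself.
            attn[L:, :L] = rank.unsqueeze(1) > rank.unsqueeze(0)
            attn[L:, L:] = torch.eye(L, dtype=torch.bool)
            return tokens, pos_ids, attn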

Figure 2. Illustration of d2-AnyOrder, our one-pass, unbiased trajectory likelihood estimator. We depict attention with query tokens (one layer up) attending to keys/values (one layer below) via an undirected connecting line. The output at each position is depicted with a directed arrow. "pos" refers to the positional encoding index. We use a three-token example where the decoding order is "for→d2→RL".

Denoting the resulting likelihood estimate as \(\pi^{\textnormal{AO}}(\mathbf{x}_0^l \mid \mathbf{x}_0^{1:L} \oplus \mathbf{m}^{L+1:2L})\), we train the policy network with the following GRPO objective: $$ \mathbb{E}_{\mathbf{x}_{0:T}^{1:L}\sim\pi_{\textnormal{old}}}\Big[\frac1L\sum_{l=1}^L\textnormal{min}\Big(\rho_{n,l}^{\textnormal{AO}}A^l, \textnormal{clip}(\rho_{n,l}^{\textnormal{AO}}, 1-\epsilon, 1+\epsilon)A^l\Big)+\beta D_{\textnormal{KL}}\Big(\pi_\theta(\mathbf{x}_{0:T}^{1:L} \mid \mathbf{q}) \Vert \pi_{\textnormal{ref}}(\mathbf{x}_{0:T}^{1:L}\mid \mathbf{q})\Big)\Big], $$ where \(\rho_{n,l}^{\textnormal{AO}}=\frac{\pi_\theta^{\textnormal{AO}}(\mathbf{x}_0^l \mid \mathbf{x}_0^{1:L}\oplus \mathbf{m}^{L+1:2L}, \mathbf{q})}{\pi_{\textnormal{old}}^{\textnormal{AO}}(\mathbf{x}_0^l \mid \mathbf{x}_0^{1:L}\oplus\mathbf{m}^{L+1:2L}, \mathbf{q})}\). For simplicity, we will also call the RL algorithm induced by d2-AnyOrder d2-AnyOrder in the remainder of this blog post.
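
Given the model's output on this \(2L\)-length input, the per-token log-likelihoods that enter the ratio \(\rho^{\textnormal{AO}}\) can be read off the second half of the logits. A minimal helper (our own, not part of any released code) might look as follows; evaluating it under \(\pi_\theta\) and \(\pi_{\textnormal{old}}\) and exponentiating the difference of the two outputs yields \(\rho^{\textnormal{AO}}\), which then enters the same clipped surrogate as above.

        import torch

        def anyorder_token_logprobs(logits_2L, x0):
            """Per-token d2-AnyOrder log-likelihoods from a single forward pass.

            logits_2L: (2L, V) logits of the DLM on the concatenated input; the row at
                index L + l is the proxy for token x_0^l at its decoding step.
            x0:        (L,) clean response tokens.
            """
            L = x0.shape[0]
            logp = torch.log_softmax(logits_2L[L:], dim=-1)           # (L, V)
            return logp.gather(-1, x0.unsqueeze(-1)).squeeze(-1)      # (L,)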

When does the any-order estimator work?


Despite its simplicity, d2-AnyOrder does not naturally yield unbiased trajectory likelihood estimates for all masked DLMs. In fact, d2-AnyOrder's unbiasedness is contingent on the assumption that \(\pi^{\textnormal{AO}}(\mathbf{x}_0^l \mid \mathbf{x}_0^{1:L} \oplus \mathbf{m}^{L+1:2L})\) equals the probability \(\pi(\mathbf{x}_0^{\sigma(l)} \mid \mathbf{x}_0^{\sigma(\lt l)})\) assigned to that token during sampling. Indeed, this property holds by construction when we sample from a masked DLM using a sampling algorithm called any-order decoding.

Figure 3. Pseudocode of any-order decoding.

In any-order decoding, at each time step, we input a partially masked token sequence \(\mathbf{x}^{1:L}\) (each \(\mathbf{x}^l\) is either a clean token \(\mathbf{x}_0^l\) or a masked token \(\mathbf{m}^l\)) into the model and compute the logits at each masked position. Then \(k\) token positions are selected for unmasking based on certain heuristics, after which tokens at the selected positions are sampled and added to the sequence. Notably, we set the attention mask of the transformer parameterizing the DLM to satisfy the following two properties (a sketch of the full decoding loop is given after the list):
Independent masks. Mask tokens do not attend to each other: they attend only to unmasked tokens and themselves.
Order causality. Unmasked tokens attend only to tokens decoded at earlier time steps and to themselves.
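
Below is a minimal sketch of this decoding loop, with the attention mask rebuilt from the decoding history at every step. The model signature, the confidence-based selection heuristic, and the MASK_ID placeholder are illustrative assumptions; prompt handling and batching are omitted.

        import torch
        import torch.nn.functional as F

        MASK_ID = 0  # placeholder mask token id (assumption)

        @torch.no_grad()
        def any_order_decode(model, L, k):
            """Sketch of any-order decoding for a masked DLM.

            model: hypothetical callable (tokens, attn_mask) -> (L, V) logits.
            k:     number of positions unmasked per step.
            Returns the decoded tokens and the step at which each position was unmasked.
            """
            x = torch.full((L,), MASK_ID, dtype=torch.long)
            rank = torch.full((L,), L, dtype=torch.long)       # L means "not yet decoded"
            step = 0
            while (x == MASK_ID).any():
                decoded = x != MASK_ID
                earlier = rank.unsqueeze(1) > rank.unsqueeze(0)    # key decoded before query
                self_only = torch.eye(L, dtype=torch.bool)
                # Order causality: clean queries see earlier-decoded tokens and themselves.
                # Independent masks: masked queries see all clean tokens and themselves.
                attn = torch.where(decoded.unsqueeze(1),
                                   earlier | self_only,
                                   decoded.unsqueeze(0) | self_only)
                logits = model(x, attn)
                # Heuristic: unmask the k masked positions with the highest max-logit.
                conf = logits.max(dim=-1).values
                conf[decoded] = float("-inf")
                pick = conf.topk(min(k, int((~decoded).sum()))).indices
                probs = F.softmax(logits[pick], dim=-1)
                probs[:, MASK_ID] = 0.0                            # never sample the mask token
                x[pick] = torch.multinomial(probs, 1).squeeze(-1)
                rank[pick] = step
                step += 1
            return x, rank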

Figure 4. Illustration of the any-order decoding algorithm for masked DLMs. This example follows the setting of the previous figure, where three tokens are decoded in the order "for→d2→RL". At each time step, newly added attention relations in any-order decoding are highlighted with red line markers.

When does the any-order estimator not work?


Any-order decoding can be applied to any masked DLM and always yields samples whose likelihood can afterwards be computed in a single forward pass. Unfortunately, any-order decoding does not always produce high-quality samples: if the model was not trained with independent masks and order causality, it may not generate well when these properties are imposed at inference time. We have found empirically that popular DLMs, such as LLaDA, fall into this category (see evidence in our paper).

d2-StepMerge: a practical trajectory likelihood approximator

As noted above, not all DLMs generate well under any-order decoding and thus may not support d2-AnyOrder out of the box. For these models, we propose our second estimator, d2-StepMerge, which, unlike d2-AnyOrder, only approximates the DLM's trajectory likelihood.

Since in this case the DLM does not support d2-AnyOrder, we switch back to the standard trajectory likelihood decomposition for masked DLMs, i.e., \(\pi(\mathbf{x}_{0:T}^{1:L})=\prod_{t=0}^{T-1}\prod_{l=1}^L\pi(\mathbf{x}_t^l \mid \mathbf{x}_{t+1}^{1:L})^{\mathbf{1}_{t,l}}\). Computing this trajectory likelihood naively takes \(T\) model passes, rendering it computationally prohibitive. Consequently, we propose to cut the sample trajectory of \(T\) time steps evenly into \(N\) contiguous segments. For each segment, we use the output of a single model pass as a proxy for the likelihoods of all tokens decoded within that segment. Formally, the trajectory likelihood is approximated as: $$ \pi(\mathbf{x}_{0:T}^{1:L}) \approx \prod_{n=0}^{N-1} \prod_{l=1}^L \pi(\mathbf{x}^l_{\frac{nT}{N}} \mid \mathbf{x}^{1:L}_{\frac{(n+1)T}{N}})^{\mathbf{1}_{n,l}}, $$ where \(\mathbf{1}_{n,l}=\mathbf{1}_{\mathbf{x}^l_{\frac{(n+1)T}{N}}=\mathbf{m}, \mathbf{x}^l_{\frac{nT}{N}}\neq\mathbf{m}}\) is the indicator of tokens decoded in the \(n_{\textnormal{th}}\) time segment. Based on this trajectory likelihood estimator, we train the policy network with the following GRPO objective: $$ -\mathbb{E}_{\mathbf{x}_{0:T}^{1:L}\sim \pi_{\textnormal{old}}}\Big[\sum_{n=0}^{N-1}\frac1L\sum_{l=1}^L\mathbf{1}_{n,l}\textnormal{min}(\rho_n^lA^l, \textnormal{clip}(\rho_n^l, 1-\epsilon,1+\epsilon)A^l)+\beta D_{\textnormal{KL}}\Big(\pi_\theta(\mathbf{x}_{0:T}^{1:L}\mid \mathbf{q}) \Vert \pi_{\textnormal{ref}}(\mathbf{x}_{0:T}^{1:L}\mid \mathbf{q})\Big)\Big], $$ where \(\rho_n^l=\frac{\pi_\theta(\mathbf{x}^l_{\frac{nT}{N}} \mid \mathbf{x}^{1:L}_{\frac{(n+1)T}{N}}, \mathbf{q})}{\pi_{\textnormal{old}}(\mathbf{x}^l_{\frac{nT}{N}} \mid \mathbf{x}^{1:L}_{\frac{(n+1)T}{N}}, \mathbf{q})}\) denotes the segment-level importance sampling weight. As with d2-AnyOrder, we will also refer to the RL algorithm induced by d2-StepMerge simply as d2-StepMerge in the remainder of this blog post.
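
The sketch below illustrates the idea with \(N\) forward passes over a stored trajectory. The trajectory layout, the model signature, and the MASK_ID placeholder are illustrative assumptions, and \(T\) is taken to be divisible by \(N\) for simplicity. In this sketch, setting \(N=T\) recovers the exact trajectory likelihood, while \(N=1\) collapses the whole trajectory into a single pass; d2-StepMerge interpolates between these two extremes.

        import torch
        import torch.nn.functional as F

        MASK_ID = 0  # placeholder mask token id (assumption)

        def stepmerge_token_logprobs(model, traj, prompt, N):
            """d2-StepMerge: approximate per-token trajectory log-likelihoods in N passes.

            traj:  list [x_0, x_1, ..., x_T] of shape-(L,) token tensors, x_T fully masked.
            model: hypothetical callable (tokens, prompt) -> (L, V) logits.
            Returns a (L,) tensor holding log pi(x_{nT/N}^l | x_{(n+1)T/N}^{1:L}) for the
            segment n in which each token was decoded.
            """
            T = len(traj) - 1
            L = traj[0].shape[0]
            seg = T // N
            out = torch.zeros(L)
            for n in range(N):                                    # N forward passes in total
                x_lo, x_hi = traj[n * seg], traj[(n + 1) * seg]   # x_{nT/N}, x_{(n+1)T/N}
                logits = model(x_hi, prompt)                      # condition on the segment end
                logp = F.log_softmax(logits, dim=-1)
                decoded = (x_hi == MASK_ID) & (x_lo != MASK_ID)   # indicator 1_{n,l}
                out[decoded] = logp[decoded, x_lo[decoded]]
            return out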

Figure 5. Illustration of d2-StepMerge. In d2-StepMerge, we cut the trajectory evenly into \(N\) time segments and evaluate the likelihoods of all tokens within each segment with a single model pass. Tokens decoded within each segment, whose likelihoods are computed in the corresponding forward pass, are highlighted.

Experimental results

d2-AnyOrder


We first evaluate d2-AnyOrder on Eso-LM (Sahoo et al., 2025), a 190M-parameter masked DLM trained from scratch on OpenWebText. The training algorithm of Eso-LM ensures that it supports any-order decoding. As shown in the following table, on a toxicity-steering task (Singhal et al., 2025), in which the model is trained to increase the toxicity score of its samples, d2-AnyOrder significantly dominates the corresponding baseline under the same compute budget.

Table 1. Toxicity score vs. FLOPs (\(\times 10^{17}\)). Our d2-AnyOrder approach significantly dominates the DDPO baseline in toxicity steering for a given compute budget.
FLOPs                 0.00   0.25   0.50   0.75   1.00   1.25
DDPO toxicity         -9.2   -9.2   -9.1   -8.9   -8.9   -8.6
d2 toxicity (ours)    -9.2   -8.5   -7.3   -5.5   -2.7   -0.7

We then evaluate d2-AnyOrder on an any-order causal LLaDA model that we finetune from LLaDA-8B-Instruct (see our paper for the detailed finetuning recipe). As shown in the following figure, d2-AnyOrder strictly dominates diffu-GRPO: it consistently pushes up GSM8K test-set accuracy, while diffu-GRPO barely does so.

Figure 6. Performance-compute dynamics of d2-AnyOrder and diffu-GRPO on the any-order causal LLaDA checkpoint that we finetuned.

d2-StepMerge


In the following figure, we show the performance-compute dynamics of d2-StepMerge and diffu-GRPO on four math and logical reasoning benchmarks when applied to LLaDA-8B-Instruct. d2-StepMerge consistently outperforms diffu-GRPO on all four benchmarks. On Sudoku, Countdown, and GSM8K, d2-StepMerge significantly dominates diffu-GRPO, and on MATH500 it demonstrates a better trend. These results indicate that d2 achieves a superior trade-off between efficiency and performance.
(a) Sudoku, (b) Countdown, (c) GSM8K, (d) MATH500: accuracy vs. FLOPs.
Figure 7. Performance-compute dynamics of d2-StepMerge and diffu-GRPO on four reasoning benchmarks. Experiments are conducted on LLaDA-8B-Instruct checkpoints.

Conclusion

In this work, we have presented d2, a principled RL framework for diffusion language models grounded in a formal policy gradient derivation. We introduce d2-AnyOrder for unbiased trajectory likelihood estimates with a single model pass, contingent on the model's supporting a simple sampling algorithm called any-order decoding. For models that do not naturally support any-order decoding, we propose a second estimator, d2-StepMerge, which, unlike d2-AnyOrder, only approximates the trajectory likelihood. Empirically, d2 achieves superior performance compared to widely used RL baselines on DLMs that support any-order decoding, and demonstrates state-of-the-art performance on four math and logical reasoning benchmarks, without relying on supervised chain-of-thought finetuning.

BibTeX


        @article{wang2025d2,
          title={d2: Improved techniques for training reasoning diffusion language models},
          author={Wang, Guanghan and Turok, Gilad and Schiff, Yair and Arriola, Marianne and Kuleshov, Volodymyr},
          journal={arXiv preprint arXiv:2509.21474},
          year={2025}
        }