Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL

zenodo.org

eifachposteMB to AI (Reddit RSS)English · 14 hours ago

Recent growth in reinforcement learning (RL) techniques has surfaced a need for a wide variety of specialized training environments. These environments are typically hand-curated, with task and reward difficulties that are fixed rather than adaptive, which makes them ineffective as training signals once a model's performance on the domain improves. As models continue to improve on these environments and reward signals grow increasingly sparse over longer horizons, the model encounters fewer diverse situations during rollouts, leaving it prone to overfitting on specific workflows or tool structures, also known as mode collapse. World models that simulate environment states have previously matched the performance of pure environment rollouts, making them a promising avenue for scaling diversity, given that their outputs can be varied on demand and at scale. However, autoregressive (AR) world models suffer from a fundamental left-to-right bias that prevents them from conditioning on globally interdependent state anchors such as tool schemas, prior turns, and expected outcomes. In this work, we (i) formalize text-based world modeling as a steerable transition-dynamics problem decomposed into initial environment state, task context, tool schemas, domain rules, and steering directives, and (ii) curate a dataset of 239,403 grounded state–action trajectories spanning nine open-source environments and twelve frontier model families. Using this dataset, we present a comparative study between AR LMs and masked diffusion language models (MDLMs), and show that MDLMs, by virtue of bidirectional anchor-aware denoising, produce better coherence, groundedness, and empirically validated rollout diversity than AR LLMs with more than 4x their total parameter count, with comparable inference latency.
We introduce a plug-and-play GRPO training framework with deterministic state checks, and perform zero-shot transfer ablations on three out-of-distribution environments (ScienceWorld, ALFWorld, AppWorld) across three agent backbones from 1.2B–7B parameters (LFM2.5, Qwen3, Mistral), achieving absolute gains of up to 47% over raw baselines without environment-specific fine-tuning. Finally, we conduct a behavioral analysis of failure modes under adversarial scenarios and a human evaluation centered on realism, outcome correctness, and training utility to showcase the world models' reliability. We open source our work to encourage research in this direction.
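
To make the "GRPO with deterministic state checks" idea concrete, here is a minimal sketch and not the paper's framework. It assumes a binary reward from an exact-match check on environment state, and it uses the standard group-relative normalization that GRPO is known for (each reward minus the group mean, divided by the group standard deviation). The names `state_check_reward` and `group_relative_advantages`, and the toy state dictionaries, are hypothetical.

```python
from statistics import mean, pstdev


def state_check_reward(predicted_state: dict, expected_state: dict) -> float:
    """Hypothetical deterministic check: 1.0 if every expected key matches, else 0.0."""
    matches = all(predicted_state.get(k) == v for k, v in expected_state.items())
    return 1.0 if matches else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other rollouts sampled for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four rollouts for one task (illustrative data only).
expected = {"door": "open", "inventory": ["key"]}
rollouts = [
    {"door": "open", "inventory": ["key"]},
    {"door": "closed", "inventory": ["key"]},
    {"door": "open", "inventory": []},
    {"door": "open", "inventory": ["key"], "lamp": "on"},
]
rewards = [state_check_reward(s, expected) for s in rollouts]
advantages = group_relative_advantages(rewards)
```

Because the check is deterministic, the same rollout always gets the same reward. When every rollout in a group passes or every rollout fails, all advantages collapse to zero and the group contributes no learning signal, which is one reason rollout diversity matters for this kind of training.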

Original Reddit post

Autoregressive LLM world models factorize next-state generation left-to-right, which prevents them from conditioning on globally interdependent anchors (tool schemas, trailing status fields, expected outcomes) and yields prefix-consistent but globally incoherent rollouts. MDLMs' any-order denoising objective sidesteps this by learning every conditional direction from the same training signal. Empirically, fine-tuned MDLMs (SDAR-8B, WeDLM-8B) surpass AR baselines of up to 4x their total parameter count on BLEU-1, ROUGE-L, and MAUVE across in- and out-of-domain splits, with lower Self-BLEU and higher Distinct-N confirming reduced prefix mode collapse. GRPO training on MDLM-generated rollouts shows up to +15% absolute task-success gains over training on AR-generated rollouts on held-out ScienceWorld, ALFWorld, and AppWorld across 1.2B–7B backbones (LFM2.5, Qwen3, Mistral) in a zero-shot transfer setting.
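
For readers unfamiliar with Distinct-N, the sketch below shows one common way to compute it: the ratio of unique n-grams to total n-grams across a set of generated rollouts. The whitespace tokenization and the pooling of n-grams over the whole set are assumptions made for illustration, and the paper's exact configuration may differ. The function name `distinct_n` and the sample strings are hypothetical.

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Fraction of n-grams across all texts that are unique (higher means more diverse)."""
    ngrams = []
    for text in texts:
        tokens = text.split()  # assumption: simple whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Illustrative rollouts: one collapsed set and one varied set.
collapsed = ["open the door", "open the door", "open the door"]
varied = ["open the door", "pick up the key", "read the note"]

collapsed_score = distinct_n(collapsed, n=2)
varied_score = distinct_n(varied, n=2)
```

A set of rollouts that keeps repeating the same prefix scores low because its n-grams are mostly duplicates, while a varied set scores closer to 1.0.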

Originally posted by u/Megixist on r/ArtificialInteligence
