Today we consider a recent paper that brings together two very different threads I have been writing about on this blog lately; it manages to join them in a way that is exciting and alarming at the same time. On the positive side, it develops the idea of representation recycling that I discussed in the context of thinking tokens. On the negative side, it's almost literally The Most Forbidden Technique in action, undermining the idea of easily interpretable chain-of-thought reasoning that I covered in Part III of my Year in AI review. And on top of that, it involves process reward models, which were not supposed to work for reasoning models, but here show up from a completely unexpected angle.
Let's dig in.
Introduction
Our story begins with a natural but still underexplored idea in the context of reasoning models: latent reasoning. What if the model could think not in words, writing its intermediate thoughts into a scratchpad, but silently, iteratively refining its internal representations?
Looped Language Models (LoopLMs) implement exactly this idea, recursively running inputs through the same transformer block multiple times before generating each token. Here is an illustration from Zhu et al. (Nov 2025), which we will return to shortly:

It seems like a no-brainer to try to improve such models with reinforcement learning; it should work even better without all the discrete tokens in the chain of thought, right? But such attempts had consistently failed, until a new paper by Williams & Tureci (2026) explained why. They also proposed an elegant solution: Reward Latent Thought Trajectories (RLTT), a method that rewards not just the final result of the model's deliberation but the entire trajectory of its latent thoughts.
Today we look at how looped models work and why standard RL doesn't help them, how RLTT solves the credit assignment problem in latent space, discuss connections with my previous post about thinking tokens and representation recycling in particular, and then (this is the uncomfortable part) what latent reasoning means for AI safety, since it strips away one of the few monitoring tools we actually have.
What Are Looped Language Models?

In my earlier post about the Hierarchical Reasoning Model (which, in the end, didn't seem to take off), I already wrote about how standard Transformers are essentially fixed-depth circuits. No matter how wide the Transformer, no matter how many billions of parameters it has, the number of sequential processing steps is bounded by the number of layers, which imposes fundamental limits on the class of problems the model can solve.
Reasoning models get around this limitation by generating long chains of thought, expressed as tokens in plain natural language in a scratchpad. But this is, in a sense, a crutch. The model emulates iterative computation through text, spending lots of tokens and compute on intermediate text that, strictly speaking, is probably not needed for the reasoning itself.
Looped language models take a fundamentally different and even simpler approach. Instead of one pass through dozens of unique Transformer blocks, a LoopLM runs the input through the same block multiple times. Each pass refines the internal representation, but no intermediate result is decoded into text. The model thinks “silently”, in latent space, and produces a token only after all iterations are complete.
This is quite similar to iterative refinement in numerical methods: the model "converges" to the right answer, with each iteration refining its internal representation of the problem. Like Newton's method for finding roots, you have an initial state, you apply an operation repeatedly, and each pass makes your estimate more accurate. Transformers, after all, are just very sophisticated differentiable operations, and feeding a hidden state back through the same layer should not be that surprising as an approach. After all, it's how recurrent neural networks have always worked, just applied here within a single token's processing.
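To make the idea concrete, here is a toy sketch of a weight-shared looped block, purely illustrative and not the Ouro architecture: the same parameters are applied repeatedly to refine a hidden state, and nothing is decoded in between.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, d))  # shared weights, reused on every loop

def block(h):
    # Toy stand-in for one pass through the shared Transformer block
    # (here: a residual nonlinear step).
    return h + np.tanh(h @ W)

def looped_forward(h0, t_max=4):
    # Iteratively refine the hidden state; no intermediate state is decoded.
    states = [h0]
    h = h0
    for _ in range(t_max):
        h = block(h)
        states.append(h)
    return h, states

h0 = rng.normal(size=d)
h_final, states = looped_forward(h0, t_max=4)
assert len(states) == 5  # initial state + 4 refinements
```

The key point is that `block` has one set of parameters no matter how many times it runs: effective depth grows with `t_max` while the parameter count stays fixed.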
There is some theoretical backing for this too. Saunshi et al. (2025) proved that looped Transformers are strictly more powerful than standard ones in terms of computational expressiveness. While standard Transformers are limited to problems computable in NC1 (a restricted complexity class), looped Transformers can compute more: more iterations means more effective depth even with the same number of parameters.
In practice, the Ouro model by Zhu et al. (Nov 2025), named after the legendary tail-biting Ouroboros, is currently, as far as I know, the only open looped language model that actually achieves good results:

Ouro recursively applies shared-weight Transformer blocks before generating each token and shows reasoning quality comparable to chain-of-thought approaches, but without intermediate tokens and at just 2.6B parameters. This sounds very promising: a tiny model thinking in latent space outperforms much larger models thinking out loud.

The Credit Assignment Problem for Latent Thinking
Let's formalize what happens inside a looped language model. Suppose we have a prompt x and want to generate a sequence y = [y1, … ,yL]. For each token yj, the LoopLM performs Tmax iterations of "internal thinking": at each iteration t, a hidden state hj(t) is refined, and through the language modeling head g we can compute an intermediate distribution over the next token:
Pθ(t)( yj | x, y<j ) = softmax( g( hj(t) ) ),   t = 1, …, Tmax.
But only the final distribution Pθ(Tmax) is used for actual generation; all intermediate distributions Pθ(t) remain "unobserved computations", the model's latent thoughts.
Note the important point here: these intermediate distributions actually represent full predictions of the next token; it's just that we treat these predictions as "drafts" and refine them. At early iterations they're noisy and uncertain, but with each pass through the Transformer block they become more precise, crystallizing into a specific prediction. This entire process happens in latent space.
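A minimal sketch of these latent "drafts", under toy assumptions (random shared weights `W`, a random LM head `G`, neither from the paper): after every loop iteration we could decode a full next-token distribution, but only the last one would be sampled from.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 5
W = rng.normal(scale=0.1, size=(d, d))     # shared block weights (toy)
G = rng.normal(scale=0.1, size=(d, vocab)) # LM head g (hypothetical)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def latent_thoughts(h, t_max=4):
    # Decode a "draft" next-token distribution after every loop iteration.
    drafts = []
    for _ in range(t_max):
        h = h + np.tanh(h @ W)         # one pass through the shared block
        drafts.append(softmax(h @ G))  # intermediate distribution P^(t)
    return drafts

drafts = latent_thoughts(rng.normal(size=d))
# Only drafts[-1] would actually be sampled from; earlier drafts stay latent.
assert len(drafts) == 4
```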
A key question that Williams & Tureci (2026) pose is: why don't standard reinforcement learning methods like GRPO work for looped models? For conventional reasoning models (DeepSeek R1 and the like), GRPO and other policy gradient algorithms work beautifully!
The problem is fundamental and, to be honest, pretty obvious in hindsight: standard GRPO only ever sees the final latent state. In the standard REINFORCE-style gradient:
∇θ JGRPO ≈ (1/G) Σi Ai ∇θ log Pθ(Tmax)( yi | x ),
where Ai is the normalized advantage for rollout i, only the final distribution Pθ(Tmax) appears. All intermediate iterations get gradient signal only through backpropagation from the final state.
This creates the classic credit assignment problem, one of the basic problems of RL: the reward has to travel back through all iterations of latent reasoning. From the RL algorithm's perspective, each token is generated in a single step, even though there is actually a long chain of internal refinements behind it. And gradients, of course, get noisy and diluted as they backpropagate.
This is similar in essence to the challenge of AlphaZero and MuZero: there too, no intermediate steps get rewarded, and with MuZero there is even an additional parallel because multi-step reasoning happens in a learned latent space. But here the "game" is far more complex, and at this scale, pure RL does not seem to work.
Here is how standard credit assignment works in GRPO, where the reward is only directly connected to the final token distribution (illustrations from Williams & Tureci, 2026):

The Ouro authors explicitly noted that RL didn't lead to meaningful improvements for their model. The only prior attempt to solve this, LSRL for the Huginn model (Ren, 2025), required decoding intermediate latent states into text and evaluating that text with an external language model verifier. This is not a very elegant solution: it creates significant computational overhead, and it basically means going back to tokens again.
RLTT: Rewarding the Full Thought Trajectory

The solution proposed by Williams & Tureci (2026) is conceptually very simple. Instead of forming the gradient only through the final distribution, RLTT distributes reward across the entire latent reasoning trajectory:
∇θ JRLTT ≈ (1/G) Σi Ai Σt ⍵t ∇θ log Pθ(t)( yi | x ),   Σt ⍵t = 1,   ⍵t ≥ 0,
where ⍵t are nonnegative weights for each iteration that sum to one. That's the whole idea: we add a weighted (convex) sum over loops to the gradient. This guarantees that every intermediate "latent thought" is directly connected to the reward signal.
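Here is a toy per-token version of the two surrogate objectives, with made-up logits and hypothetical function names; a sketch, not the paper's implementation. The only change from the GRPO term to the RLTT term is the convex sum over loop iterations.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def grpo_token_objective(logits, y, advantage):
    # GRPO: only the final-iteration distribution carries the reward signal.
    return advantage * float(np.log(softmax(logits))[-1, y])

def rltt_token_objective(logits, y, weights, advantage):
    # RLTT: a convex combination of log-probs across ALL loop iterations.
    logp = np.log(softmax(logits))[:, y]  # log P^(t)(y) for t = 1..T_max
    return advantage * float(np.asarray(weights) @ logp)

# logits[t, v]: logits after loop iteration t (3 iterations, vocab of 3).
logits = np.array([[0.1, 0.2, 0.3],
                   [0.5, 0.1, 0.2],
                   [2.0, 0.1, 0.1]])

# Sanity check: with all weight on the last loop, RLTT reduces to GRPO.
assert abs(rltt_token_objective(logits, 0, [0, 0, 1], 1.5)
           - grpo_token_objective(logits, 0, 1.5)) < 1e-9
```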
Standard KL regularization is added relative to a frozen reference policy:
LRLTT(θ) = JRLTT(θ) − β · DKL( Pθ(Tmax) ∥ Pref ).
Importantly, the KL divergence is computed only on the final distribution Pθ(Tmax), not the intermediate ones. This makes sense: only the final distribution determines the model's actual behavior. You want to keep the final policy close to the reference, but allow the intermediate thought patterns to diverge freely, because they are exactly the model's private scratchpad.
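A small sketch of this design choice (toy logits, assumed names): the penalty compares only the final-iteration distributions, so intermediate latent states can drift arbitrarily far from the reference policy.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return float((p * (np.log(p) - np.log(q))).sum())

def rltt_kl_penalty(logits_theta, logits_ref):
    # KL is taken only between the FINAL-iteration distributions;
    # intermediate "latent thoughts" are free to diverge from the reference.
    return kl(softmax(logits_theta[-1]), softmax(logits_ref[-1]))

# Early iterations differ, final iterations agree -> zero penalty.
theta = np.array([[0.0, 1.0], [2.0, 0.0]])
ref   = np.array([[1.0, 0.0], [2.0, 0.0]])
assert rltt_kl_penalty(theta, ref) < 1e-12
```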
Here’s the authors’ illustration for how RLTT works by rewarding the entire latent thought trajectory (compare with GRPO above):

How should you choose the weights ⍵t? The authors explored three strategies:
- exit probability: Ouro trains a special “exit head” that determines when to stop iterating; the weight of each iteration is proportional to the probability that the model would have stopped at that iteration; this is the most "informed" variant since if the model is confident that early iterations haven't converged yet, it gives them less weight;
- progressive: later iterations get more weight, ⍵t ∝ t^α; here the logic is straightforward: later iterations are closer to the “correct” distribution;
- uniform: all iterations get equal weight, ⍵t = 1/Tmax; this encourages the model to form the correct distribution as early as possible.
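The three strategies above can be sketched in a few lines; the function names and the `alpha` default are my own, and the exit probabilities are a stand-in for whatever an Ouro-style exit head would output.

```python
import numpy as np

def uniform_weights(t_max):
    # All iterations count equally.
    return np.full(t_max, 1.0 / t_max)

def progressive_weights(t_max, alpha=2.0):
    # Later iterations get more weight: w_t proportional to t^alpha.
    w = np.arange(1, t_max + 1, dtype=float) ** alpha
    return w / w.sum()

def exit_prob_weights(exit_probs):
    # Weight each iteration by the (hypothetical) probability that the
    # model's exit head would have halted there; renormalize to sum to 1.
    w = np.asarray(exit_probs, dtype=float)
    return w / w.sum()

# All three produce a convex combination over loop iterations:
for w in (uniform_weights(4), progressive_weights(4),
          exit_prob_weights([0.1, 0.2, 0.3, 0.4])):
    assert abs(w.sum() - 1.0) < 1e-9
```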
As we will see below, the difference between strategies is minimal. The key effect comes from the basic idea of distributing the reward across the trajectory, not from a specific set of weights.
Note that computationally RLTT is nearly free: the intermediate logits are already computed during the forward pass through the LoopLM, and the weighted sum over loops is a linear operation. However, there is a catch: RLTT requires storing log-probabilities for every iteration of every token, which linearly increases memory usage by a factor of Tmax. In practice, the authors had to halve the maximum number of tokens per GPU (from 16384 to 8192) and compensate with additional mini-steps. But if we are allowed to ignore memory (not a phrase I expected to say with the RAM and VRAM market being what it is!), RLTT is actually faster than GRPO overall.
Results

Let us start with what happens during training. There are three things that stand out here. First, RLTT consistently achieves higher reward than GRPO, with the gap appearing around step 50 and continuing to grow throughout training.
Second, response length starts dropping: RLTT-trained models produce significantly shorter answers than GRPO-trained ones. This is interesting because there is no length penalty in the reward function, the reward is purely binary: correct answer gets 1, incorrect gets 0. However, RLTT models learn to be more concise anyway. This is an emergent effect: the model learns to converge to correct answers faster in its internal latent space, which means it needs fewer output tokens to express the solution. This connects directly to the "overthinking" phenomenon in reasoning models, that is, the tendency to ramble, double-check, re-derive, and re-verify when a more confident internal process would just produce the answer.
Third, the entropy of the output token distribution drops faster for RLTT than for GRPO. One might worry that entropy is collapsing and the model is losing diversity, but a separate pass@k analysis in the paper says no: the diversity is still there; the model just becomes more confident over time.
The figure below shows the response length evolution (left) and output token entropy evolution (right) during training on the MATH dataset (Williams & Tureci, 2026):

Next, the benchmarks. I don't usually like analyzing tables of numbers, but this one is quite impressive:

The GSM8K result might seem too good to be true, but GSM8K consists of relatively simple arithmetic problems where improvements in latent reasoning have maximum effect.
The most interesting results, in my opinion, are on AIME24 and BeyondAIME. These are relatively hard competition-level (but far from research-level) math problems where the GRPO-trained model regularly hits its token limit before reaching an answer. RLTT solves the same problems in fewer tokens and, as a consequence, more often fits within the limit; this is another way in which better internal reasoning can translate into practical performance.
Finally, another important result is that RLTT transfers to non-mathematical tasks, even though the model was trained exclusively on math. On GPQA, for example, accuracy nearly doubles (from 19.7% to 38.4%), and GPQA requires multi-step reasoning. So it's not that the model memorizes math tricks; rather, latent reasoning itself becomes more efficient across the board.
The authors also conduct an extensive robustness analysis, which I do not want to describe in detail here, but all of it confirms their narrative:
- decoding budget: RLTT is consistently better than GRPO at any budget from 1024 to 4096 tokens, and the smaller the budget, the larger the gap;
- number of loops: testing with different iteration counts, from 1 to 4, RLTT wins at every setting;
- weighting strategies: as I have mentioned above, the difference between uniform, progressive, and exit-probability weights is minimal; this is actually a nice property, because it means the method is robust to this hyperparameter choice;
- pass@k analysis: if you generate k answers and check whether at least one solves the problem, RLTT shows steeper growth in Pass@k with increasing k, that is, the RLTT model has more different correct solution paths than GRPO.
Finally, the authors even provide a theoretical explanation for why RLTT should lead to shorter answers. The idea is that RLTT increases the “cost of uncertainty” per token. In GRPO, uncertainty is counted only on the final distribution, while RLTT averages it across all iterations, including the earlier, noisier ones. Since early iterations are less confident than the final one, average uncertainty across the trajectory is always at least as large as final uncertainty alone.
They formalize this in a theorem: under diminishing marginal utility of additional tokens (which is natural since each next token provides less new information), higher “uncertainty cost” leads to shorter optimal decoding length. Roughly speaking, with RLTT it's “more expensive” for the model to be uncertain on each token, so it learns to solve problems faster.
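A tiny numeric illustration of this "uncertainty cost" argument, with made-up per-iteration distributions that sharpen over the loops (as the paper's picture of latent drafts suggests): since earlier iterations have higher entropy, the trajectory-averaged cost that RLTT sees is at least the final-only cost that GRPO sees.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p)).sum())

# Hypothetical per-iteration next-token distributions, sharpening over loops:
traj = [
    [0.4, 0.3, 0.3],    # early iteration: diffuse, high entropy
    [0.6, 0.2, 0.2],
    [0.8, 0.1, 0.1],
    [0.9, 0.05, 0.05],  # final iteration: confident, low entropy
]

ents = [entropy(p) for p in traj]
avg_cost = sum(ents) / len(ents)  # RLTT-style cost: averaged over the trajectory
final_cost = ents[-1]             # GRPO-style cost: final distribution only
assert avg_cost >= final_cost
```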
This is, of course, a simplified model. But it matches the empirical observations well and provides good intuition.
On top of this, the authors measured the Gradient Signal-to-Noise Ratio (GSNR). On the hardest benchmarks (AIME24 and BeyondAIME), where reward is especially sparse, RLTT shows statistically significantly higher GSNR. In other words, gradients under RLTT contain more useful information per training step. This is important since on hard problems with sparse reward, GRPO gradients are mostly noise, while RLTT can extract a clearer learning signal.
LoopLMs, Thinking Tokens, and Representation Recycling

Attentive readers of this blog may have noticed that looped models look suspiciously similar to one of the mechanisms I discussed in my post about thinking tokens. Qian et al. (2025) proposed Representation Recycling (RR): take the internal representation at some Transformer layer and run it through the same block again. The result of the second pass differs from the first because self-attention sees a different context: already-processed representations instead of raw ones.
But this is essentially the same loop! Representation recycling with one extra pass is a LoopLM with Tmax = 2, applied to one specific Transformer block. The only difference is that in LoopLM it's an architectural decision built in from the start, while in RR it's a post-hoc trick applied to a standard Transformer. This, however, lets RR be more selective: in LoopLM, every token passes through all Tmax iterations while in RR, looping runs selectively, only at information peaks that correspond to “thinking tokens” like “Hmm”, “Wait”, or “Therefore”. In a sense, RR implements the idea of adaptive computational depth: simple tokens get processed normally, while on “hard” tokens the model thinks longer.
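The contrast can be sketched in a few lines, again with a toy stand-in for the Transformer block: RR is just a LoopLM with Tmax = 2 that fires only on selected tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, d))

def block(h):
    # Toy stand-in for one pass through a Transformer block.
    return h + np.tanh(h @ W)

def rr_forward(h, is_thinking_token):
    # Representation recycling as a selective LoopLM with T_max = 2:
    # every token gets one pass; "thinking tokens" get one extra pass.
    h = block(h)
    if is_thinking_token:
        h = block(h)
    return h

h = rng.normal(size=d)
plain = rr_forward(h, is_thinking_token=False)
recycled = rr_forward(h, is_thinking_token=True)
assert not np.allclose(plain, recycled)  # the extra pass changes the state
```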
Training is also different. LoopLMs are trained with looping in mind: shared block weights, early exit mechanisms, the whole architecture designed for iterative refinement. RR works without any special fine-tuning: you take a standard reasoning model and just run an extra pass. This is both a plus (no special training needed) and a minus (the model isn't optimized for repeated processing, so the 3rd or 4th pass tends to hurt rather than help).
I suspect the logical next step is something like adaptive RLTT, where the number of iterations varies per token depending on difficulty. This would combine RR's selectivity with RLTT's power, and the authors themselves point to this as a key direction for future work, along with integration of adaptive halting mechanisms like those already built into Ouro.
The Elephant in the Room: Latent Reasoning and AI Safety

So far we have discussed RLTT as a purely technical achievement. But the whole paradigm of latent reasoning and LoopLMs has a serious implication for AI safety that deserves its own section.
One of the few AI safety techniques that actually works today is chain-of-thought monitoring. Reasoning models “think aloud”, writing their reasoning as text, and that text can be read. Baker et al. (OpenAI, 2025) showed that models engaging in reward hacking, that is, faking tests instead of solving problems, literally write things like “Let's hack” in their reasoning chains, and a simple GPT-4o-based monitor catches this with high accuracy. CoT monitoring turned out to be significantly more effective than monitoring only the model's actions.
Zvi Mowshowitz formulated The Most Forbidden Technique: if you have a tool for observing the model's “thinking”, never use it as a training signal. Because if you start penalizing the model for “bad thoughts” in its chain of thought, it will quickly learn to hide its intentions while continuing to misbehave. OpenAI confirmed this experimentally: when they added a CoT monitor to the reward function, the model at first appeared to behave better but then transitioned to obfuscated reward hacking, which is the same reward hacking but with innocent-looking reasoning. The monitor couldn’t catch the obfuscated version, and the model happily kept cheating.
Hence the recommendation, supported by literally all leading AI safety researchers: don't apply strong optimization pressure to the chain of thought! Leave it free for monitoring! Just look at how many authors signed the consensus paper calling for this (Korbak et al., Jul 2025), and more importantly, who those authors are: Yoshua Bengio, Shane Legg, Owain Evans, Daniel Kokotajlo, Evan Hubinger, Neel Nanda, and many more.
I covered all of this in detail in my AI Safety review for 2025.
Now look at LoopLMs: what does a model like Ouro do? It thinks silently, without intermediate tokens. There is no “Let me think about this” or “Wait, let me reconsider”, or “Let’s hack”, either. The entire reasoning chain is a sequence of hidden states hj(1), hj(2), …, hj(Tmax) that exist only as vectors in latent space.
From an AI safety perspective, this is alarming. CoT monitoring completely loses its meaning if the model's reasoning isn't expressed in text. You can't ask GPT-4o to read a vector of 2048 floating-point numbers and determine whether the model is “thinking” something bad.
Moreover, RLTT makes this worse. The entire point of the method is to get the RL signal into intermediate latent states. This means that we are specifically optimizing exactly the kind of internal “thinking” that, in the case of regular reasoning models, is better left untouched. We are building models that reason ever better and, in all likelihood, ever more opaquely.
This is where we come to a fundamental dilemma. Latent reasoning is more efficient than textual reasoning: it doesn't waste tokens on intermediate text, doesn't suffer from natural language bottlenecks, can operate in a higher-dimensional space. The results from Ouro and RLTT confirm this clearly. Moreover, RLTT shows that these internal thoughts can be effectively improved with RL.
But textual reasoning is more transparent. Yes, a chain of thought doesn't always perfectly reflect the model's internal processes; for example, see my thinking tokens post for cases when real “work” happens in the representations, not in the tokens. But it's at least something we can read, analyze, and monitor. With latent reasoning, we don't even have that.
LSRL, the alternative approach to RLTT, tried to solve this problem by decoding intermediate latent states into text and evaluating them with an external verifier. This, in principle, gives us back at least some transparency. But the RLTT authors fairly point out that decoding intermediate states into text is expensive and might not even be informative, since intermediate LoopLM states weren't trained to be interpretable when decoded (unlike regular chain-of-thought reasoning, which the model reads back as plain text on the next iteration).
Perhaps the right path lies in developing mechanistic interpretability that works directly with latent representations: probing, sparse autoencoders, and similar methods. But right now, these tools are far from the level needed to monitor “latent thoughts” with the same reliability as GPT-4o reading chain of thought.
Ultimately, the development of latent reasoning is yet another example of capability research and safety research moving in opposite directions. Models become smarter, more efficient, more compact… and less transparent at the same time. RLTT is an elegant idea that solves a real technical problem. But the better we get at training models that think silently, the harder it becomes to watch them. Can we afford to lose this monitorability? I am not convinced.
Conclusion

So where do we go from here? Looped language models with RLTT represent a genuine advance in training efficiency and reasoning capability. A 2.6B parameter model thinking in latent space outperforms much larger models reasoning in natural language. The work is technically sound, the results are impressive, and the efficiency gains are real.
In the future, we might see synthesis between the two extremes. Adaptive computation approaches like representation recycling offer a middle path: they perform selective looping at information bottlenecks rather than universal looping at every token. Combining this with mechanistic interpretability research, we might eventually develop tools to understand and monitor latent reasoning as naturally as we read chains of thought today. But we are not there yet, and all competitive incentives push towards efficiency over transparency.
For now, this tradeoff remains unresolved. We have a method that makes models better at reasoning and worse at being monitored. Is it a trade we should make? Is it even a trade we have a choice about? This is perhaps the most important question facing the field right now.
Are you for LoopLMs or against?

