Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs
Abstract
An axiomatic evaluation framework reveals systematic failures in latent thought representations of LLMs across multiple reasoning tasks, demonstrating that current representations fail to satisfy fundamental functional axioms consistently across different model architectures.
We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks. Existing evaluations conflate representation quality with model capacity. Therefore, failures cannot be attributed to the representation rather than to the model that processes it. We formalize four functional axioms (Causality, Minimality, Separability, and Stability) and define a quantitative measure for each, computed directly on the representation independently of downstream accuracy. We audit open-weight LLMs across 23 reasoning tasks (e.g., Spatial Reasoning, Factual QA). We find that no candidate satisfies all four axioms simultaneously, that the representations distinguish task type reliably but cannot distinguish between two questions within the same task, and that the representations encode little information beyond what is already present in the input embedding. The failure is consistent across dense, reasoning-distilled, and RL-trained model families, indicating that the gap is structural rather than a property of model size or training procedure.
Community
It would work better if you used Asolaria ASI as a neural network. It uses 0 GPU.
One axiom I'd add: necessity.
It checks that the model actually uses the latent state. Without it, a model can sometimes make $T$ look good while routing the real answer through residual prompt information or decoder priors.
To support this axiom, I would add three checks, beyond those in the paper, to a real training or optimization framework.
First, use latent ablation necessity. After the model produces $T$, corrupt it in one of three ways: swap it with another example’s $T$, zero it, or inject Gaussian noise. If answer quality barely changes, the model is not using the latent thought. A good latent reasoner should degrade gracefully under small noise but fail under semantic swaps.
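A minimal sketch of the ablation check, under loud assumptions: the decoder is abstracted as a hypothetical `predict(x, T)` callable, latents are plain lists of floats, and the two toy "models" (`uses_latent`, `ignores_latent`) exist only to show what the report looks like in each regime. None of these names come from the paper.

```python
import random

def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])

def corrupt(T, mode, donors, rng, sigma=0.05):
    """Return a copy of latent T under one corruption mode."""
    if mode == "clean":
        return list(T)
    if mode == "zero":
        return [0.0] * len(T)
    if mode == "noise":
        return [t + rng.gauss(0.0, sigma) for t in T]
    if mode == "swap":
        return list(rng.choice(donors))
    raise ValueError(f"unknown mode: {mode}")

def ablation_report(predict, examples, seed=0):
    """Accuracy of predict(x, T) under each corruption of T.

    examples: list of (x, T, y) triples. Swap donors are restricted to
    latents whose gold answer differs, so every swap is a semantic swap.
    """
    rng = random.Random(seed)
    report = {}
    for mode in ("clean", "zero", "noise", "swap"):
        correct = 0
        for x, T, y in examples:
            donors = [T2 for _, T2, y2 in examples if y2 != y]
            correct += int(predict(x, corrupt(T, mode, donors, rng)) == y)
        report[mode] = correct / len(examples)
    return report

# Toy data (an assumption for illustration): the gold answer is x % 3 and
# the latent is a one-hot encoding of that answer.
examples = [(i, [1.0 if k == i % 3 else 0.0 for k in range(3)], i % 3)
            for i in range(30)]

def uses_latent(x, T):
    return argmax(T)      # reads the answer out of the latent

def ignores_latent(x, T):
    return x % 3          # answers from the prompt alone

print(ablation_report(uses_latent, examples))
print(ablation_report(ignores_latent, examples))
```

In this toy setting, the model that reads the latent keeps its accuracy under small noise and loses it under swaps, while the model that ignores the latent scores the same in every condition. A flat report of that second kind is the signature of a decorative latent.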
Second, use counterfactual latent intervention. Take two minimally different inputs $x$ and $x'$ with different answers. Swap their latent thoughts. If the model follows the swapped latent state, $T$ is causally active. If it ignores the swap and answers from the prompt alone, the latent state is decorative.
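The swap test can be sketched as a pair of rates, again assuming a hypothetical `predict(x, T)` interface and toy one-hot latents (both are illustrative assumptions, not the paper's setup). For each pair with different gold answers, each input is decoded with the other input's latent, and the prediction is scored against both answers.

```python
def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])

def swap_report(predict, pairs):
    """Fraction of swapped trials that follow the latent vs. the prompt.

    pairs: list of ((x_a, T_a, y_a), (x_b, T_b, y_b)) with y_a != y_b.
    """
    follows_latent = follows_prompt = trials = 0
    for (x_a, T_a, y_a), (x_b, T_b, y_b) in pairs:
        if y_a == y_b:
            raise ValueError("each pair must have different gold answers")
        # Decode each input with the *other* input's latent thought.
        for x, T_swapped, y_own, y_other in ((x_a, T_b, y_a, y_b),
                                             (x_b, T_a, y_b, y_a)):
            pred = predict(x, T_swapped)
            follows_latent += int(pred == y_other)
            follows_prompt += int(pred == y_own)
            trials += 1
    return {"follows_latent": follows_latent / trials,
            "follows_prompt": follows_prompt / trials}

def make_example(i):
    """Toy example (assumption): gold answer i % 3, one-hot latent."""
    y = i % 3
    return (i, [1.0 if k == y else 0.0 for k in range(3)], y)

# Neighbouring integers always have different answers under i % 3.
pairs = [(make_example(i), make_example(i + 1)) for i in range(0, 20, 2)]

def uses_latent(x, T):
    return argmax(T)

def ignores_latent(x, T):
    return x % 3

print(swap_report(uses_latent, pairs))
print(swap_report(ignores_latent, pairs))
```

Reporting both rates matters because a real model can land in between, or produce a third answer that matches neither input, so the two numbers need not sum to one.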
Third, use multi-window causality, not only final-answer causality. Do not train $T$ only to predict the final answer. Split explicit reasoning traces into many windows and require latent states to substitute for intermediate reasoning prefixes. Otherwise, a latent vector that encodes only “answer = 42” could pass some final-output tests without representing the reasoning process.
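As an evaluation-side illustration of the multi-window idea, here is a toy sketch in which the "reasoning trace" is a list of numbers to add up. Everything here is an assumption made for the example: `encode`, `run_from_text`, and `run_from_latent` are hypothetical stand-ins for a latent encoder and two decoding paths, and the check only compares continuations from an explicit prefix against continuations from the latent that replaces it.

```python
def prefix_windows(steps, n_windows):
    """Split a trace at the interior window boundaries into (prefix, rest)."""
    n = len(steps)
    cuts = sorted({max(1, round(n * k / n_windows))
                   for k in range(1, n_windows)})
    return [(steps[:c], steps[c:]) for c in cuts if c < n]

def window_agreement(encode, run_from_text, run_from_latent, steps, n_windows):
    """Fraction of interior cut points where the latent substitutes for the prefix."""
    windows = prefix_windows(steps, n_windows)
    if not windows:
        raise ValueError("need at least one interior cut point")
    hits = 0
    for prefix, rest in windows:
        reference = run_from_text(prefix, rest)            # explicit prefix
        candidate = run_from_latent(encode(prefix), rest)  # latent replaces it
        hits += int(reference == candidate)
    return hits / len(windows)

# Toy task (assumption): the trace is a list of numbers, the answer is their sum.
steps = [3, 1, 4, 1, 5, 9]

def run_from_text(prefix, rest):
    return sum(prefix) + sum(rest)

def run_from_latent(latent, rest):
    return latent + sum(rest)

def encode_running(prefix):
    return sum(prefix)        # latent tracks the intermediate state

def encode_answer_only(prefix):
    return sum(steps)         # latent just stores the final answer

# Final-answer-only check: the whole trace is the prefix, nothing remains.
final_ok = run_from_latent(encode_answer_only(steps), []) == sum(steps)

print(window_agreement(encode_running, run_from_text, run_from_latent, steps, 3))
print(window_agreement(encode_answer_only, run_from_text, run_from_latent, steps, 3))
print(final_ok)
```

The answer-only encoder passes the final-answer check but fails at every interior cut point, which is the failure mode the windowed test is meant to expose.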
Conversely, when $T$ predictably never helps in a given situation, the model can skip it via early exit to save compute.
Okay, turns out LaTeX is not supported in Paper comments. Embarrassing, but I'm too lazy to edit right now because I'm on mobile.