Self-Evaluation Unlocks
Any-Step Text-to-Image Generation
Modern text-to-image generation is dominated by diffusion and flow matching models due to their stability, scalability, and strong visual fidelity. However, these are inherently multi-step models: they learn local scores or velocities and therefore require dozens of steps to reliably traverse a curved reverse trajectory.
We introduce the Self-Evaluating Model (Self-E), a from-scratch training method for any-step text-to-image generation that requires no distillation from a pretrained teacher. Self-E learns from data much like flow matching, while simultaneously evaluating its own generated samples with its current score estimates, which serve as a dynamic self-teacher. This complements local learning from data with an explicit notion of evaluation at the landing point: the model generates a candidate jump, then critiques and refines it using a learned score estimate at that point.
Two Complementary Signals
Self-E trains a single model with two complementary objectives: a learning-from-data component that provides local trajectory supervision, and a self-evaluation component that targets global distribution matching.
What the data objective learns: local structure, i.e., the local score or velocity information that explains how density varies in nearby states. Concretely, we sample a real image $x_0$ with prompt $c$, add noise to obtain $x_t$, and train the model to predict the clean image from this noisy input using a conditional flow matching objective. This provides local trajectory supervision: it teaches reliable local behavior and is naturally most effective when generation follows a local path with many small steps.
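As a concrete illustration, here is a minimal sketch of this data-phase objective in PyTorch, assuming a linear noise-to-data path with $t=0$ at pure noise and $t=1$ at data; `model(x_t, t, c)` is a hypothetical clean-image predictor, not the actual Self-E interface.

```python
import torch

def data_phase_loss(model, x0, c):
    """Local trajectory supervision (sketch): noise a real image along a
    linear path and train the model to predict the clean image back.
    Under this path, x0-prediction is equivalent to velocity prediction
    up to a time-dependent reweighting."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)  # t=0: noise, t=1: data
    noise = torch.randn_like(x0)
    x_t = (1 - t) * noise + t * x0          # noisy state on the linear path
    pred_x0 = model(x_t, t.flatten(), c)    # predict the clean image
    return torch.mean((pred_x0 - x0) ** 2)
```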
What the self-evaluation objective learns: global correctness of the generated sample, i.e., whether a landed output is realistic and prompt-consistent. Instead of constraining the intermediate generation path, self-evaluation directly targets global distribution matching by treating the model output as a sample from its implicit distribution and pushing it toward the real data distribution. After the model proposes a long-range jump, it uses its own local estimator at the landing point to produce a direction signal indicating how the current sample should move toward a better, more prompt-consistent region. In most of our training, this direction is the classifier score computed from conditional and unconditional predictions, which we find stable and effective for improving text-to-image alignment. Later in training, we incorporate an additional fake score that more directly supports distribution matching.
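For illustration, such a direction signal could be computed as below; this is a hedged sketch of the classifier score, with the `guidance` weight and the null-prompt convention as assumptions rather than the paper's exact choices.

```python
import torch

@torch.no_grad()
def classifier_score_target(model, x_eval, t, c, guidance=4.0):
    """Self-evaluation direction (sketch): the gap between conditional and
    unconditional clean-image estimates at the landing point indicates how
    the sample should move toward a more prompt-consistent region."""
    x0_cond = model(x_eval, t, c)
    x0_uncond = model(x_eval, t, None)  # null prompt for the unconditional branch
    return x0_uncond + guidance * (x0_cond - x0_uncond)
```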
Conceptually, this can be viewed through an environment-agent lens. The environment corresponds to the local estimates learned from real data during training, while the agent is the generator used at inference time. The loop closes when local estimates are reused to evaluate landing points and improve long-range jumps.
Data Phase
The model learns local structure from real samples $(x_0, c)$ and their noisy states $x_t$, producing an evolving local score or velocity signal around noisy inputs. This yields an internal evaluator.
Self-Evaluation Phase
The model proposes a long-range jump and then performs sample evaluation to assess where it lands. This trains the generator to land in higher-density, prompt-consistent regions.
Closed Loop
Better learning from data improves the evaluator. A better evaluator improves few-step behavior. These components reinforce each other throughout training without pretrained teacher distillation.
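Combining the two phases, one training iteration might look like the sketch below, reusing `data_phase_loss` and `classifier_score_target` from the earlier sketches; the single-jump proposal, re-noising time `s`, and weight `lambda_eval` are illustrative assumptions, not the paper's exact schedule.

```python
import torch

def self_e_training_step(model, x0, c, lambda_eval=1.0, s=0.9):
    """One closed-loop iteration (sketch): the data phase trains the local
    evaluator; the self-evaluation phase trains long-range jumps against
    the evaluator's own feedback. No pretrained teacher is involved."""
    # Data phase: local supervision from real samples.
    loss_data = data_phase_loss(model, x0, c)

    # Self-evaluation phase: propose a long-range jump from pure noise.
    b = x0.shape[0]
    noise = torch.randn_like(x0)
    t0 = torch.zeros(b, device=x0.device)
    x_land = model(noise, t0, c)            # landing candidate (keeps gradients)

    # Evaluate the landing point: re-noise it slightly and query the
    # co-trained estimator for a better, prompt-consistent target.
    eps = torch.randn_like(x_land)
    x_eval = (1 - s) * eps + s * x_land.detach()
    t_eval = torch.full((b,), s, device=x0.device)
    target = classifier_score_target(model, x_eval, t_eval, c)  # no-grad teacher

    loss_eval = torch.mean((x_land - target) ** 2)
    return loss_data + lambda_eval * loss_eval
```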
GenEval Overall Across Step Counts
Self-E is consistently state of the art across step budgets and improves monotonically with more steps: at 2, 4, 8, and 50 steps it scores 0.753, 0.781, 0.785, and 0.815, respectively. The largest margin appears in the few-step regime, while performance remains top-tier at 8 and 50 steps.
Quantitative Comparison
| Method | 2 steps | 4 steps | 8 steps | 50 steps |
|---|---|---|---|---|
| SDXL | 0.0021 | 0.1576 | 0.3759 | 0.4601 |
| FLUX.1-Dev | 0.0998 | 0.3198 | 0.5893 | 0.7966 |
| LCM | 0.2624 | 0.3277 | 0.3398 | 0.3303 |
| SANA-1.5 | 0.1662 | 0.5725 | 0.7788 | 0.8062 |
| TiM | 0.6338 | 0.6867 | 0.7143 | 0.7797 |
| SDXL-Turbo | 0.4622 | 0.4766 | 0.4652 | 0.3983 |
| SD3.5-Turbo | 0.3635 | 0.7194 | 0.7071 | 0.6114 |
| Self-E | 0.7531 | 0.7806 | 0.7849 | 0.8151 |
Figure: GenEval overall score vs. number of sampling steps.
The Conceptual Shift
For a fixed noisy input, training can be viewed as learning directions on an energy landscape. In the animations, green corresponds to a score-driven better direction, blue corresponds to the model prediction, and dashed blue corresponds to the supervision signal used to update the model. The key shift is where supervision is applied: matching a local direction at the start versus evaluating the quality of the landing point.
Diffusion
Diffusion provides a static target: for a given noisy input, its score function defines the ground-truth local direction. Training is standard supervised learning: update the model so its prediction aligns with the target.
Even with perfect local matching, inference still needs many steps. Starting from noise, the sampler must integrate these local directions step by step to follow a curved trajectory toward higher-density regions. This is why scaling model size alone cannot remove the step bottleneck: the limitation comes from trajectory geometry and numerical integration, not model capacity.
Animation over training iterations: the model is updated so its prediction aligns with the fixed local target; the dashed arrow indicates the update direction.
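To make the step bottleneck concrete, a generic Euler sampler for the same linear path integrates the local direction step by step; this is an illustrative sketch, not any specific model's sampler.

```python
import torch

@torch.no_grad()
def euler_sample(model, shape, c, steps=50, device="cpu"):
    """Many-step inference (sketch): follow the learned local direction
    from noise (t=0) to data (t=1) with small Euler steps. Fewer steps
    mean larger local error per step, so purely local models degrade
    sharply in the few-step regime."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        tb = torch.full((shape[0],), t, device=device)
        x0_hat = model(x, tb, c)                # local clean-image estimate
        v = (x0_hat - x) / (1.0 - t + 1e-8)     # implied velocity on the linear path
        x = x + v * dt                          # one small local step
    return x
```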
Self-E
Self-E changes the training target from matching a direction to reaching a good destination. At each iteration, the model proposes a long-range jump to a landing candidate. The landed point is evaluated: the local direction at the landing point indicates how to move toward a better, higher-density region. This produces a dynamic supervision signal that teaches the model to directly aim for better destinations.
In other words, the model proposes, the proposal is evaluated, and learning happens from feedback. This outcome-oriented supervision implicitly shapes a reliable shortcut path. Self-E training resembles a refinement step used during diffusion inference, while at inference time Self-E can output the shortcut directly.
Animation over training iterations: the model proposes a long-range jump, is evaluated at the landed point, and is updated using a feedback direction toward a better target.
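At inference time, the same model then supports any step budget by jumping directly to its clean-image estimate and, if more steps remain, re-noising to the next time on the path; the uniform time grid below is an illustrative schedule, not the paper's tuned one.

```python
import torch

@torch.no_grad()
def any_step_sample(model, shape, c, steps=4, device="cpu"):
    """Any-step inference (sketch): each update is a long-range jump to
    the model's landing estimate. With steps=1 this is a single jump;
    larger budgets progressively refine the result."""
    x = torch.randn(shape, device=device)
    times = [i / steps for i in range(steps)]   # t=0 is pure noise
    for i, t in enumerate(times):
        tb = torch.full((shape[0],), t, device=device)
        x0_hat = model(x, tb, c)                # jump to a landing candidate
        if i + 1 < steps:
            s = times[i + 1]
            eps = torch.randn_like(x)
            x = (1 - s) * eps + s * x0_hat      # re-noise to the next time
        else:
            x = x0_hat
    return x
```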
Evaluating by Itself
Evaluating a landing point requires a score-like signal that indicates whether the proposed destination is good, but this signal is not directly available. Prior work typically obtains it from a pretrained diffusion teacher. Self-E instead co-trains the evaluator via learning from data and reuses it to provide feedback to the generator, enabling a fully from-scratch training setup without relying on any external model.
Future Work
Self-E introduces a pretraining paradigm distinct from trajectory-based training, which mainly matches local directions along a path. By jointly learning local estimators from data and using them to supervise long-range jumps, Self-E enables flexible any-step inference without pretrained-teacher distillation.
The current approach is still at an early stage. In extremely low-step regimes, generated images can miss fine details compared with long multi-step inference. Several design choices remain underexplored, including objective weighting, inference-time scheduling, and adaptation to downstream tasks. We expect systematic optimization of these factors to yield further gains.