Self-Evaluation Unlocks
Any-Step Text-to-Image Generation

Authors
1The University of Hong Kong    2Adobe Research
*Corresponding author.    Project lead.
December 2025
Figure 1. One model, any compute: Self-E generates coherent images at 2, 4, 8, and 50 steps.

Introduction

Modern text-to-image models are dominated by diffusion and flow matching due to their stability, scalability, and strong visual fidelity. However, they are inherently multi-step: they learn local scores or velocities and therefore require dozens of steps to reliably traverse a curved reverse trajectory.

We introduce the Self-Evaluating Model (Self-E), a from-scratch training method for any-step text-to-image generation without distillation from a pretrained teacher. Self-E learns from data as in flow matching, while simultaneously employing a self-evaluation mechanism that assesses its own generated samples using current score estimates, effectively serving as a dynamic self-teacher. This complements local learning from data with an explicit notion of evaluation at the landing point: the model generates a candidate jump, then critiques and refines it using a learned score estimate at that landing point.


Method

Two Complementary Signals

Self-E trains a single model with two complementary objectives: a learning-from-data component that provides local trajectory supervision, and a self-evaluation component that targets global distribution matching.

Method overview. Self-E simultaneously learns from data while performing self-evaluation, using the same network in two complementary modes.
Learning from data

What it learns: local structure, i.e., the local score or velocity information that explains how density varies in nearby states. Concretely, we sample a real image $x_0$ with prompt $c$, add noise to obtain $x_t$, and train the model to predict the clean image from this noisy input using a conditional flow matching objective. This provides local trajectory supervision: it teaches reliable local behavior and is naturally most effective when generation follows a local path with many small steps.
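The data-phase objective described above can be sketched as follows. This is a minimal illustration, assuming a linear interpolation path between data and noise and an x0-prediction target; the paper's exact schedule and parameterization may differ.

```python
import numpy as np

def make_fm_example(x0, eps, t):
    """Build one data-phase training example (a sketch).

    The noisy state interpolates linearly between a real image x0
    and Gaussian noise eps; the model is trained to predict x0 back.
    """
    x_t = (1.0 - t) * x0 + t * eps   # noisy state at time t
    target = x0                      # clean-image prediction target
    return x_t, target

def fm_loss(pred, target):
    # conditional flow-matching-style regression loss (MSE)
    return float(np.mean((pred - target) ** 2))
```

Because the target is defined at a single noisy state, this loss only supervises local behavior, which is why it pairs naturally with many-small-step sampling.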

Learning by self-evaluation

What it learns: global correctness of the generated sample, i.e., whether a landed output is realistic and prompt-consistent. Instead of constraining the intermediate generation path, self-evaluation directly targets global distribution matching by treating the model output as a sample from its implicit distribution and pushing it toward the real data distribution. After the model proposes a long-range jump, it uses its own local estimator at the landing point to produce a direction signal that indicates how the current sample should move toward a better, more prompt-consistent region. In most of our training, this direction is the classifier-score computed from conditional and unconditional predictions, which we find stable and effective for improving text-to-image alignment. Later, we incorporate an additional fake-score that more directly supports distribution matching.
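The classifier-score direction mentioned above can be sketched as a CFG-style difference between conditional and unconditional predictions. The `guidance` and `step` parameters here are illustrative assumptions, not the paper's exact weighting.

```python
import numpy as np

def classifier_score_direction(pred_cond, pred_uncond, guidance=1.0):
    """Direction signal at the landing point, formed from conditional
    and unconditional predictions (a CFG-style difference; the exact
    weighting used by Self-E is an assumption here)."""
    return guidance * (pred_cond - pred_uncond)

def evaluation_target(x_landed, direction, step=0.1):
    # nudge the landed sample along the direction; this point is
    # treated as a constant (no gradient) when supervising the
    # generator
    return x_landed + step * direction
```

Supervising the landed sample toward this nudged point pushes long-range jumps into more prompt-consistent regions without constraining the path taken to get there.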

Closed loop

Conceptually, this can be viewed through an environment-agent lens. The environment corresponds to the local estimates learned from real data during training, while the agent is the generator used at inference time. The loop closes when local estimates are reused to evaluate landing points and improve long-range jumps.

1. Data Phase

The model learns local structure from real samples $(x_0, c)$ and their noisy states $x_t$, producing an evolving local score or velocity signal around noisy inputs. This yields an internal evaluator.

2. Self-Evaluation Phase

The model proposes a long-range jump and then performs sample evaluation to assess where it lands. This trains the generator to land in higher-density, prompt-consistent regions.

3. Closed Loop

Better learning from data improves the evaluator. A better evaluator improves few-step behavior. These components reinforce each other throughout training without pretrained teacher distillation.
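The three phases above can be combined into a single schematic objective. This sketch uses a hypothetical `model(x, t, c, cond)` interface and illustrative loss weights; it shows the structure of the loop, not the paper's exact formulation.

```python
import numpy as np

def combined_loss(model, x0, eps, c, t, lam=0.5, step=0.1):
    """Schematic total objective for one training iteration.

    model(x, t, c, cond) returns a clean-image prediction;
    all names and weights here are illustrative.
    """
    # data phase: local supervision on a real noisy state
    x_t = (1.0 - t) * x0 + t * eps
    data_loss = np.mean((model(x_t, t, c, cond=True) - x0) ** 2)

    # self-evaluation phase: one long-range jump from pure noise
    x_land = model(eps, 1.0, c, cond=True)
    direction = (model(x_land, t, c, cond=True)
                 - model(x_land, t, c, cond=False))
    target = x_land + step * direction   # treated as a constant target
    eval_loss = np.mean((x_land - target) ** 2)

    return float(data_loss + lam * eval_loss)
```

The same network appears in both terms: the data-phase term sharpens the local estimator, and the self-evaluation term reuses that estimator to grade the jump, closing the loop without any external teacher.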

Results

GenEval Overall Across Step Counts

Self-E is consistently state of the art across step budgets and improves monotonically with more steps, scoring 0.753, 0.781, 0.785, and 0.815 at 2, 4, 8, and 50 steps, respectively. The largest margin over baselines appears in the few-step regime, while performance remains top-tier at 8 and 50 steps.

Quantitative Comparison

Metric: GenEval Overall (higher is better)

Method       | 2 steps | 4 steps | 8 steps | 50 steps
SDXL         | 0.0021  | 0.1576  | 0.3759  | 0.4601
FLUX.1-Dev   | 0.0998  | 0.3198  | 0.5893  | 0.7966
LCM          | 0.2624  | 0.3277  | 0.3398  | 0.3303
SANA-1.5     | 0.1662  | 0.5725  | 0.7788  | 0.8062
TiM          | 0.6338  | 0.6867  | 0.7143  | 0.7797
SDXL-Turbo   | 0.4622  | 0.4766  | 0.4652  | 0.3983
SD3.5-Turbo  | 0.3635  | 0.7194  | 0.7071  | 0.6114
Self-E       | 0.7531  | 0.7806  | 0.7849  | 0.8151

Overall vs. Steps. GenEval Overall plotted against step count for SDXL, FLUX.1-Dev, LCM, SANA-1.5, TiM, SDXL-Turbo, SD3.5-Turbo, and Self-E.
Qualitative comparison. Side-by-side visual results across different step budgets.
From Matching to Evaluation

The Conceptual Shift

For a fixed noisy input, training can be viewed as learning directions on an energy landscape. In the animations, green corresponds to a score-driven better direction, blue corresponds to the model prediction, and dashed blue corresponds to the supervision signal used to update the model. The key shift is where supervision is applied: match a local direction at the start, or evaluate the quality of the landing point.

Matching at the start point

Diffusion

Diffusion provides a static target: for a given noisy input, its score function defines the ground-truth local direction. Training is standard supervised learning: update the model so its prediction aligns with the target.

Even with perfect local matching, inference still needs many steps. Starting from noise, the sampler must integrate these local directions step by step to follow a curved trajectory toward higher-density regions. This is why scaling model size alone cannot remove the step bottleneck: the limitation lies in trajectory geometry and numerical integration, not in model capacity.
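The integration bottleneck can be seen in a toy Euler sampler: even when the local velocity field is known exactly, coarse steps accumulate discretization error. This is a generic numerical-integration sketch, not the paper's sampler.

```python
import numpy as np

def euler_sample(velocity, x_start, n_steps):
    """Integrate a local velocity field from t=1 (noise) to t=0 (data)
    with fixed-size Euler steps; fewer, larger steps accumulate more
    error on curved trajectories."""
    x = np.asarray(x_start, dtype=float)
    t, dt = 1.0, 1.0 / n_steps
    for _ in range(n_steps):
        x = x - dt * velocity(x, t)   # follow the local direction
        t -= dt
    return x
```

For example, with the exactly known field `velocity(x, t) = x`, whose true endpoint is `x_start * exp(-1)`, 2 Euler steps land noticeably farther from the truth than 50 steps do, even though the local directions are perfect.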

Diffusion training animation: over training iterations, the model is updated so its prediction aligns with the fixed local target; the dashed arrow indicates the update direction.

Supervision
Fixed local target at the start point.
What it learns
A local vector field that supports many-step integration, not a shortcut.
Consequence
Few large steps extrapolate and tend to drift toward average behavior.
Evaluation at the landing point

Self-E

Self-E changes the training target from matching a direction to reaching a good destination. At each iteration, the model proposes a long-range jump to a landing candidate. The landed point is evaluated: the local direction at the landing point indicates how to move toward a better, higher-density region. This produces a dynamic supervision signal that teaches the model to directly aim for better destinations.

In other words, the model proposes, the proposal is evaluated, and learning happens from feedback. This outcome-oriented supervision implicitly shapes a reliable shortcut path. Self-E training resembles a refinement step used during diffusion inference, while at inference time Self-E can output the shortcut directly.
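The shift in where supervision is applied can be made concrete with two target constructions. Both helpers are schematic, with illustrative names and a hypothetical `step` size.

```python
import numpy as np

def start_point_target(score_at_start):
    # diffusion-style supervision: a fixed local target defined
    # at the noisy start point
    return score_at_start

def landing_point_target(x_landed, score_at_landing, step=0.1):
    # Self-E-style supervision (schematic): a dynamic target formed
    # by moving the landed sample along the local score direction
    # evaluated at the landing point
    return x_landed + step * score_at_landing
```

The first target is static and independent of what the model generates; the second depends on the model's own proposal, which is what makes the supervision outcome-oriented.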

Self-E training animation: over training iterations, the model proposes a long-range jump, is evaluated at the landed point, and is updated using a feedback direction toward a better target.

Supervision
Feedback at the landing point, dynamic rather than fixed at the start.
What it learns
How to land in good-enough regions in few steps.
Consequence
The model implicitly learns a shortcut path.
Where does the evaluation signal come from?

Evaluating by itself

Evaluating a landing point requires a score-like signal to indicate whether the proposed destination is good, but this signal is not directly available. Prior work typically obtains it from a pretrained diffusion teacher. Self-E instead co-trains the evaluator via learning from data and reuses it to provide feedback to the generator. This enables a fully from scratch training setup without relying on any external model.

Conclusion

Future Work

Self-E introduces a pretraining paradigm distinct from trajectory-based training, which mainly matches local directions along a path. By jointly learning local estimators from data and using them to supervise long-range jumps, Self-E enables flexible any-step inference without pretrained teacher distillation.

The current approach is still at an early stage. In extremely low-step regimes, generated images can miss fine details compared with long multi-step inference. Several design choices remain underexplored, including objective weighting, inference-time scheduling, and adaptation to downstream tasks. We expect systematic optimization of these factors to yield further gains.

Citation
BibTeX
@article{yu2025selfe,
  title={Self-Evaluation Unlocks Any-Step Text-to-Image Generation},
  author={Yu, Xin and Qi, Xiaojuan and Li, Zhengqi and Zhang, Kai and Zhang, Richard and Lin, Zhe and Shechtman, Eli and Wang, Tianyu and Nitzan, Yotam},
  journal={arXiv preprint},
  year={2025}
}