AMAZE

Probing Visual Planning in Image Editing Models

Shanghai Jiao Tong University  |  Renmin University of China  |  State Key Laboratory of General Artificial Intelligence, BIGAI

Figure 1. The AMAZE tasks. Maze and Queen provide complementary probes of visual planning, spanning continuous vs. discrete solution spaces, sequential vs. parallel planning, and local vs. global constraints.

What is AMAZE?

AMAZE is a benchmark for probing intrinsic visual planning in image editing models. Instead of converting visual planning into text-centric reasoning, the paper studies whether editing models can solve planning tasks directly in image space.

The benchmark contains two abstract puzzle families: Maze and Queen. Maze emphasizes sequential path planning under local geometric constraints, while Queen emphasizes combinatorial placement under global constraints. Their abstract structure minimizes recognition complexity and makes logical reasoning the core challenge.

The accompanying paradigm, EAR (Editing as Reasoning), reformulates planning as a single-step image transformation. AMAZE then evaluates outputs with automatic metrics that separate logical validity from pixel-wise fidelity.

Editing as Reasoning (EAR)

EAR asks an image editing model to transform an input puzzle image directly into a solved image. This avoids explicit step-by-step planning-by-generation and turns the whole planning process into one atomic visual edit.
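
In code, EAR amounts to a single image-to-image call with no intermediate planning steps. The sketch below is a minimal illustration of that protocol; the model.edit(image, instruction) interface and the instruction strings are hypothetical placeholders, not the paper's actual API or prompts.

from PIL import Image

# Illustrative instructions; the benchmark's real prompts are not reproduced here.
INSTRUCTIONS = {
    "maze": "Draw the solution path from the entrance to the exit of this maze.",
    "queen": "Place queens so that every row, column, region, and adjacency constraint holds.",
}

def solve_with_ear(model, puzzle_path: str, task: str) -> Image.Image:
    """Editing as Reasoning: the whole plan is committed in one atomic edit."""
    puzzle = Image.open(puzzle_path).convert("RGB")
    # No step-by-step generation: a single transformation maps the unsolved
    # puzzle image directly to a candidate solved image.
    return model.edit(image=puzzle, instruction=INSTRUCTIONS[task])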

AMAZE supports automatic evaluation on both tasks. Logical validity is measured through Coverage, Violation, and Pass, while pixel-wise fidelity is measured using MSE-In and MSE-Out. The paper reports 98% agreement between the automatic logical metric and human judges.



Figure 2. Overview of EAR. Left: editing-as-reasoning pipeline. Right: automatic evaluation for logical validity and pixel-level fidelity.

Maze denoising traces (circle, hexagon, square, and triangle geometries) and a Queen denoising trace from a fine-tuned Bagel model.

Benchmark Highlights

Complementary Tasks

  • Maze: continuous route planning with circle, hexagon, square, and triangle geometries.
  • Queen: discrete placement planning with row, column, region, and adjacency constraints (a validity checker is sketched after this list).
  • Two complementary reasoning regimes: sequential/local and combinatorial/global.
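
To make the Queen constraint set concrete, here is a minimal checker for a candidate placement. The exact rule set is an assumption inferred from the constraints named above (one queen per row, column, and region, and no two queens adjacent, diagonals included); the benchmark's official verifier may differ.

from itertools import combinations

def is_valid_queen_solution(queens: list[tuple[int, int]],
                            regions: list[list[int]]) -> bool:
    """queens: (row, col) placements; regions: n x n grid of region ids."""
    n = len(regions)
    if len(queens) != n or len(set(queens)) != n:
        return False
    rows = {r for r, _ in queens}
    cols = {c for _, c in queens}
    regs = {regions[r][c] for r, c in queens}
    if len(rows) != n or len(cols) != n or len(regs) != n:
        return False  # some row, column, or region holds two queens
    # Adjacency: no two queens may touch, diagonals included
    # (assumed rule, consistent with the constraint list above).
    return all(max(abs(r1 - r2), abs(c1 - c2)) > 1
               for (r1, c1), (r2, c2) in combinations(queens, 2))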

Automatic Metrics

  • Coverage: the fraction of the target solution that the edited image correctly reproduces.
  • Violation: the fraction of the generated solution that departs from the target.
  • Pass: overall logical correctness, balancing Coverage against Violation (one plausible instantiation is sketched after this list).
  • MSE-In / MSE-Out: pixel-level fidelity inside and outside the solution region.
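
One plausible instantiation of these metrics, assuming the solution regions have already been extracted as boolean masks; the exact formulas, and the threshold behind Pass, are assumptions rather than the paper's definitions.

import numpy as np

def amaze_metrics(pred_img, gt_img, gt_mask, pred_mask):
    """Illustrative metric computation on uint8 images and boolean masks.

    gt_mask:   ground-truth solution region (path cells or queen cells).
    pred_mask: solution region extracted from the edited image.
    """
    inter = np.logical_and(pred_mask, gt_mask).sum()
    coverage = inter / max(gt_mask.sum(), 1)      # share of target recovered
    violation = (pred_mask.sum() - inter) / max(pred_mask.sum(), 1)
    # Pass rule assumed here: full coverage with zero violation.
    passed = coverage == 1.0 and violation == 0.0
    sq_err = (pred_img.astype(np.float64) - gt_img.astype(np.float64)) ** 2
    mse_in = sq_err[gt_mask].mean()    # fidelity inside the solution region
    mse_out = sq_err[~gt_mask].mean()  # fidelity outside it
    return coverage, violation, passed, mse_in, mse_out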

Data Coverage

  • Maze: 2,800 test examples.
  • Queen: 350 test examples.
  • Maze scales from 3×3 to 16×16; Queen scales from 4×4 to 10×10 (an illustrative maze generator follows below).
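
For intuition on how Maze instances scale with grid size, the sketch below builds a perfect maze on an n×n grid with a standard depth-first backtracker. It is purely illustrative: the benchmark's own generator, and its circle, hexagon, and triangle geometries, are not reproduced here.

import random

def generate_maze(n: int, seed: int = 0):
    """Illustrative DFS-backtracker maze on an n x n grid (AMAZE uses n = 3..16).

    Returns a dict mapping each cell to the set of neighbours it opens onto.
    """
    rng = random.Random(seed)
    passages = {(r, c): set() for r in range(n) for c in range(n)}
    stack, visited = [(0, 0)], {(0, 0)}
    while stack:
        r, c = stack[-1]
        # Unvisited orthogonal neighbours of the current cell.
        nbrs = [(r + dr, c + dc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if (r + dr, c + dc) in passages and (r + dr, c + dc) not in visited]
        if nbrs:
            nxt = rng.choice(nbrs)
            passages[(r, c)].add(nxt)   # carve a passage both ways
            passages[nxt].add((r, c))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()                 # dead end: backtrack
    return passages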

Evaluated Models

  • Proprietary: GPT-Image-1, NanoBanana-Pro, Seedream-4.5
  • Open-source: Flux-Kontext-Dev, Qwen-Image-Edit, Bagel, Janus-Pro
  • Includes diffusion-based and autoregressive editing models.

Main Results on AMAZE

Results (%) on Maze and Queen. Lower is better for Violation and MSE, higher is better for Coverage and Pass.

Continuous (Maze) Task

Model                    Violation  Coverage  MSE-In  MSE-Out  Pass@1  Pass@5
Proprietary models
GPT-Image-1                  62.88     58.97   41.16    52.76    5.40    6.06
NanoBanana-Pro               47.76     64.21   24.20    17.21    4.82    9.28
Seedream-4.5                 16.90     25.67   28.82    30.96    2.14    3.21
Open-source models (zero-shot)
Flux-Kontext-Dev             23.84     30.24   30.96    18.31    0.36    3.57
Qwen-Image-Edit              19.37     28.51   18.82     5.70    1.43    2.14
Bagel                        28.91     27.15   11.64     5.84    0.00    1.00
Janus-Pro                     5.41      1.85   57.47    76.80    0.00    0.00
Fine-tuned open-source models
Bagel (fine-tuned)           12.21     51.02    8.66     3.07   11.54   23.64
Janus-Pro (fine-tuned)       35.60     23.33   55.99    50.94    1.43    2.22

Discrete (Queen) Task

Model                    Violation  Coverage  MSE-In  MSE-Out  Pass@1  Pass@5
Proprietary models
GPT-Image-1                  62.91     37.09   11.84     5.87    0.00    2.28
NanoBanana-Pro               32.56     67.43    9.10     1.62   30.35   35.58
Seedream-4.5                 76.86     23.14   11.55     5.95    2.86    2.86
Open-source models (zero-shot)
Flux-Kontext-Dev             78.63     21.37   11.48     7.71    0.92    2.34
Qwen-Image-Edit              69.52     30.47    8.83     5.30    2.86    4.00
Bagel                        61.57     38.43    8.94     1.22    0.00    0.00
Janus-Pro                    84.24     15.76   12.97     9.83    0.00    0.57
Fine-tuned open-source models
Bagel (fine-tuned)           68.27     31.73    6.05     0.63   14.57   14.29
Janus-Pro (fine-tuned)       16.07     83.93    7.91     1.38   12.57   13.03

Zero-shot performance is generally weak. Fine-tuning Bagel on the basic puzzle scales yields strong gains and better generalization, and diffusion-based editing appears more effective than autoregressive editing at developing visual planning ability.
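
Pass@1 and Pass@5 in the tables above imply that several candidate edits are sampled per puzzle. One standard way to score such sampling is the unbiased pass@k estimator of Chen et al. (2021), sketched below; whether AMAZE uses this estimator or simple best-of-k sampling is an assumption (the two coincide when n == k).

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws from n generations
    (c of which pass) is correct; reduces to best-of-k when n == k."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples with misses
    return 1.0 - comb(n - c, k) / comb(n, k)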

Generalization

Cross-geometry generalization

Training on hexagonal mazes transfers best to other geometries, suggesting that a richer directional action space encourages geometry-invariant planning rather than narrow in-domain pattern matching.


Cross-scale generalization

Maze generalizes surprisingly far beyond its simple training scales, while Queen requires larger training scales before non-trivial transfer emerges. This highlights a clear difference between sequential path planning and global combinatorial planning.

Generalization across scales for Maze and Queen

Scaling Effect

Joint scaling of data and compute

Scaling up training data yields early gains that later saturate, while scaling up compute continues to improve performance, especially in later training stages. Queen benefits more from diverse data early on, while Maze benefits from longer optimization.


Human Comparison

Success rate under time budgets

Human solvers improve much more with added time than the model, especially on harder tasks. The model remains comparatively flat, revealing a persistent efficiency gap in visual planning.

Human and model success rates under different time budgets

Correlation with human groups

On Maze, the model correlates most with the 18-year-old group; on Queen, it resembles the 6-year-old group more closely. This suggests that combinatorial global planning remains much harder than sequential route planning.

Correlation between model and human groups

Conclusion

AMAZE shows that current image editing models still have limited abstract visual planning ability. EAR provides a direct way to study reasoning in image space, and supervised fine-tuning on simple tasks yields surprisingly strong generalization. Still, even the best fine-tuned model falls short of the near-instant, zero-shot efficiency of human solvers.

BibTeX

If you find our work or dataset useful for your research, please consider citing:

@article{zhou2026amaze,
  title={AMAZE: Probing Visual Planning in Image Editing Models},
  author={Zhimu Zhou and Yanpeng Zhao and Qiuyu Liao and Bo Zhao and Xiaojian Ma},
  journal={arXiv preprint arXiv:2604.22868},
  year={2026}
}