Parallel Diffusion, Minus the Waiting Room

Guess how many denoising steps you need before a diffusion model stops producing expensive fog and starts producing an actual image. Twenty-eight? Wrong. This paper shows that with the right solver, 20 steps can beat a 28-step baseline on Stable Diffusion 3 Medium, which is the kind of result that makes GPU fans spin a little faster out of respect.¹

Diffusion models are great at making images. They are also great at taking their sweet time. The usual routine is painfully serial: start with noise, clean it a bit, repeat, repeat, repeat. It works. It also feels like watching a very talented painter insist on using one eyelash hair per brushstroke.

The Bottleneck Is Not the Model. It Is the March.

The paper, Parallel Diffusion Solver via Residual Dirichlet Policy Optimization, targets the sampler, not the giant image model behind it.¹ That matters.

Sampling from a diffusion model can be viewed as solving an ordinary differential equation, or ODE, whose solution traces a path from noise to image.²³ Fast samplers already try to take bigger steps along that path. The problem is that big steps cut corners: each step assumes the path is straight for its whole length, so the error grows wherever the path bends. If it bends sharply, quality drops. You get blur, weird textures, and that familiar "almost a dog, somehow also a croissant" energy.
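Here is a minimal sketch of that corner-cutting, assuming a toy ODE (plain exponential decay) in place of a real diffusion model. The function names `euler_solve` and `decay` are made up for illustration. It only shows the general pattern: fewer, bigger Euler steps land further from the true endpoint.

```python
import math

def euler_solve(f, x0, t0, t1, n_steps):
    """Integrate dx/dt = f(x, t) from t0 to t1 using n_steps equal Euler steps."""
    x, t = x0, t0
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = x + h * f(x, t)  # pretend the path is straight for the whole step
        t = t + h
    return x

def decay(x, t):
    # Toy drift, NOT a diffusion model: dx/dt = -x has the exact solution exp(-t).
    return -x

exact = math.exp(-3.0)
errors = {n: abs(euler_solve(decay, 1.0, 0.0, 3.0, n) - exact) for n in (5, 20, 100)}

for n, err in errors.items():
    print(f"{n:>3} steps -> endpoint error {err:.4f}")
```

The error shrinks as the step count grows, which is exactly the trade a low-step sampler is trying to escape.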

The authors propose EPD-Solver, short for Ensemble Parallel Direction solver. The trick is simple to say and annoying to invent: instead of trusting one gradient estimate at each step, use several in parallel, then combine them.¹ Those extra evaluations are independent, so modern hardware can run them at the same time. Latency per step stays close to that of a single evaluation, while the estimate of where the path actually goes gets better.
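To make the "several directions, then combine" idea concrete, here is a rough sketch on the same kind of toy ODE. It is not the paper's algorithm. The probe offsets, the weights, and the names `velocity` and `ensemble_step` are all assumptions chosen for illustration, and a real solver would learn those values rather than hard-code them.

```python
import math
import numpy as np

def velocity(x, t):
    # Toy drift standing in for the diffusion model's gradient estimate.
    return -x

def euler_step(x, t, h):
    """Baseline: one gradient estimate, taken at the start of the step."""
    return x + h * velocity(x, t)

def ensemble_step(x, t, h, offsets, weights):
    """One step that combines gradients taken at several intermediate points."""
    base = velocity(x, t)
    # Every probe depends only on (x, t), so the evaluations below are
    # independent of each other and could run as a single batched call.
    probes = [x + c * h * base for c in offsets]
    directions = [velocity(p, t + c * h) for p, c in zip(probes, offsets)]
    combined = sum(w * d for w, d in zip(weights, directions))
    return x + h * combined

x0 = np.array([1.0])
exact = math.exp(-1.0)  # true value of x at t = 1 for dx/dt = -x, x(0) = 1

x_euler = euler_step(x0, 0.0, 1.0)
x_ensemble = ensemble_step(x0, 0.0, 1.0, offsets=[0.25, 0.75], weights=[0.5, 0.5])

err_euler = abs(x_euler[0] - exact)
err_ensemble = abs(x_ensemble[0] - exact)
print(f"single direction error: {err_euler:.3f}")
print(f"ensemble error:         {err_ensemble:.3f}")
```

On this toy problem the two-probe step lands noticeably closer to the true endpoint than the single lunge, even with weights nobody tuned.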

That idea leans on a geometric hunch: image trajectories mostly live on a lower-dimensional manifold, not in the full chaos-space of all possible pixels.¹⁴ In plain English, the path from noise to "golden retriever in a spacesuit" wiggles less wildly than the raw pixel count suggests. So a few smart parallel peeks can beat one blind lunge.

Tiny Policy, Big Attitude

Then the paper gets sneakier.

After a distillation stage, the authors fine-tune the solver with reinforcement learning. Not the whole diffusion backbone. Just the tiny solver parameter space.¹ They frame the solver as a Dirichlet policy, which is a tidy way to learn how to weight those multiple directions without handing the model a flamethrower and calling it optimization.³⁵
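Why a Dirichlet? Because every sample from it is a set of non-negative weights that sum to one, which is the shape you want for blending directions. The sketch below uses NumPy's `Generator.dirichlet` with made-up concentration values. It shows only that property, not the paper's residual parameterization or its training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical concentration parameters, one per parallel direction.
alpha = np.array([2.0, 5.0, 3.0])

# Each row is one sampled weighting over the three directions.
samples = rng.dirichlet(alpha, size=1000)

# The mean of a Dirichlet is alpha normalized to sum to one.
mean_weights = alpha / alpha.sum()

print("one sampled weighting:", np.round(samples[0], 3))
print("expected weighting:   ", mean_weights)
```

A policy of this kind can explore different mixes during RL while never leaving the set of valid weightings, which is part of why the search space stays small and structured.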

That small-space RL move is one of the paper's best ideas. Most "let's improve generation with RL" stories carry a whiff of chaos. Reward hacking lurks nearby like a raccoon near unsecured trash. Here, the search space is small, structured, and much harder to abuse.

What It Actually Buys You

The headline numbers are not subtle. At a latency budget of 5 function evaluations (NFE), the distilled EPD-Solver reports FID scores (lower is better) of 4.47 on CIFAR-10, 7.97 on FFHQ, 8.17 on ImageNet, and 8.26 on LSUN Bedroom.¹ On text-to-image tasks, the RL-tuned version improves human preference scores on Stable Diffusion v1.5 and SD3-Medium, and beats the official 28-step SD3-Medium baseline with only 20 steps.¹
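For a sense of scale, here is the back-of-envelope arithmetic on the 20-versus-28 result. The speedup figure assumes every step costs the same wall-clock time, which is an assumption of this sketch and not a measurement from the paper. The parallel evaluations also add total compute even when latency holds steady.

```python
baseline_steps = 28  # official SD3-Medium setting cited above
epd_steps = 20       # RL-tuned EPD-Solver setting cited above

steps_saved = baseline_steps - epd_steps
fraction_saved = steps_saved / baseline_steps
speedup = baseline_steps / epd_steps  # assumes equal wall-clock time per step

print(f"{steps_saved} fewer steps ({fraction_saved:.1%} fewer), about {speedup:.2f}x")
```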

That is the real point. Faster sampling usually means uglier images. This paper tries to dodge that trade. Not by retraining a monster model from scratch. By making the numerical solver less clumsy.

It also fits a broader trend. Researchers have been attacking diffusion latency from several angles: better ODE solvers like DPM-Solver,⁶⁷ parallel samplers like ParaDiGMS,⁸ stronger parallel acceleration like ParaTAA,⁹ and even one-step or few-step alternatives like Consistency Models.¹⁰ EPD-Solver sits in the middle of that fight. It does not replace diffusion. It makes diffusion less of a parking violation.

Why You Might Care Outside a Paper PDF

Cheaper sampling means more than quicker anime frogs.

It means local generation becomes more realistic on consumer hardware. On July 22, 2025, AMD and Stability AI announced an SD3 Medium variant tailored for Ryzen AI laptops, which tells you exactly where the industry wants this to go: closer to the device, less cloud babysitting.¹¹ Faster samplers also matter for image enhancement pipelines, where users notice lag immediately. Tools like combb2.io live in that world. If an image model takes forever to sharpen, denoise, or upscale, the magic wears off fast.

There is also an open-source angle. Hugging Face Diffusers exposes schedulers as swappable components,⁷ and DPM-Solver has long had public code and broad integration.⁶ That makes papers like this more than theory bait. If the method holds up, it has a plausible route into real tooling.

The Catch, Because There Is Always One

This is still a sampler paper. It improves the ride, not the car. It does not fix bad prompts, sketchy training data, or the strange fact that image models sometimes render hands like they were assembled during an earthquake.

And while the results are strong, the deeper claim is narrower than hype would suggest: parallel gradient estimates can reduce truncation error in low-step diffusion sampling, and a tiny RL-tuned policy can help pick the mix.¹ That is already enough. No need to dress it up in prophecy robes.

The nicest thing about this work is its restraint. Fewer parameters. Better trajectories. Lower latency. Same big diffusion model, just less dawdling. Read that again.

References

1. Wang R, Li Z, Zhu B, et al. Parallel Diffusion Solver via Residual Dirichlet Policy Optimization. IEEE TPAMI, 2026. DOI: 10.1109/TPAMI.2026.3692227. PubMed: PMID 42118644. arXiv: 2512.22796
2. Wikipedia. Diffusion model. https://en.wikipedia.org/wiki/Diffusion_model
3. Wikipedia. Ordinary differential equation. https://en.wikipedia.org/wiki/Ordinary_differential_equation
4. Zhang T, Wang Z, Huang J, Tasnim MM, Shi W. A Survey of Diffusion Based Image Generation Models: Issues and Their Solutions. arXiv: 2308.13142
5. Wikipedia. Dirichlet distribution. https://en.wikipedia.org/wiki/Dirichlet_distribution
6. Lu C, Zhou Y, Bao F, Chen J, Li C, Zhu J. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. arXiv: 2206.00927. Official code: https://github.com/LuChengTHU/dpm-solver
7. Hugging Face Diffusers documentation. Schedulers overview. https://huggingface.co/docs/diffusers/v0.22.1/en/api/schedulers/overview
8. Shih A, Belkhale S, Ermon S, Sadigh D, Anari N. Parallel Sampling of Diffusion Models. arXiv: 2305.16317
9. Tang Z, Tang J, Luo H, Wang F, Chang TH. Accelerating Parallel Sampling of Diffusion Models. arXiv: 2402.09970
10. Song Y, Dhariwal P, Chen M, Sutskever I. Consistency Models. arXiv: 2303.01469
11. Shilov A. "AMD unveils industry-first Stable Diffusion 3.0 Medium AI model generator tailored for XDNA 2 NPUs." Tom's Hardware, July 22, 2025. https://www.tomshardware.com/tech-industry/artificial-intelligence/amd-unveils-industry-first-stable-diffusion-3-0-medium-ai-model-generator-tailored-for-xdna-2-npus-designed-to-run-locally-on-ryzen-ai-laptops

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.

AIb2.io - AI Research Decoded