Qwen-Image-2.0: Professional infographics, exquisite photorealism (qwen.ai)

358 points by meetpateltech 14 hours ago

142 comments:

by tianqi 12 hours ago

I've seen many comments describing the "horse riding man" example as extremely bizarre (which it actually is), so I'd like to provide some background context here. The "horse riding man" is a Chinese internet meme originating from an entertainment awards ceremony, at which the renowned host Tsai Kang-yong wore an elaborate outfit featuring a horse riding on his back[1]. At the time, he was embroiled in a rumor about an unpublicized homosexual partner, whose name sounded like "Ma Qi Ren", which coincidentally sounds like "horse riding man" in Mandarin. The incident spread widely across the Chinese internet and turned into a meme. So their use of "horse riding man" as an example isn't entirely nonsensical, though the image per se is undeniably bizarre and carries an unsettling vibe.

[1] The photo of the outfit: https://share.google/mHJbchlsTNJ771yBa

by vessenes 7 hours ago

Interesting background! Prompts like this also test the latent space of the image generator - it’s usually the other way round, so if you see a man on top of a horse, you’ve got a less sophisticated embedding feeding the model. In this case, though, that’s quite an image to put out to the interwebs. I looked to see what gender the horse was.

EDIT: After reading the prompt translation, this was more just like a “year of the horse is going to nail white engineers in glorious rendered detail” sort of prompt. I don’t know how SD1.5 would have rendered it, and I think I’ll skip finding out

by rahimnathwani an hour ago

This is fascinating!

From the article it seems the name is 马启仁, not 马骑人, so the guy's name sounds the same as 'horse riding man', but that's not a literal translation of his name.

by yorwba 11 hours ago

There's also the "horse riding astronaut" challenge in image generation: https://garymarcus.substack.com/p/horse-rides-astronaut-redu...

by laughingcurve 5 hours ago

Gary Marcus is not the man to be looking to on this topic

by AlphaAndOmega0 4 hours ago

Gary Marcus successfully predicted ten of the last one AI winters.

He also claimed that LLMs were a failure because of prompts that GPT 3.5 couldn't parse, after the launch of GPT-4, which handled them with aplomb.

by yorwba 4 hours ago

Gary Marcus successfully wrote an article about getting image generation models to show a horse riding an astronaut, which is all I needed him to do. (Actually he wrote two, but this one felt more concise.) Take it as an existence proof, not an endorsement.

by observationist an hour ago

Just like a basilisk, if you never refer to him again, he fades away and doesn't bug people anymore. Let him fight through whatever he needs to if he ever bothers coming up with anything the rest of the world needs to hear; until then, we can enjoy the peace and quiet.

by Lerc 9 hours ago

On the topic of modern Chinese culture, is there the same hostility towards AI-generated imagery in China as there seems to be in America?

For example I think there would be a lot of businesses in the US that would be too afraid of backlash to use AI generated imagery for an itinerary like the one at https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwe...

by tianqi 8 hours ago

Since China has a population of 1.4 billion people with vastly differing levels of cognition, I find it difficult to claim I can summarize "modern Chinese culture". But within my range of observation, no. Chinese people not only have no hostility toward AI but actively pursue and revere it with fervor. They widely perceive AI as an advanced force, a new opportunity for everyone, a new avenue for making money, and a new chance to surpass others. At most, some consumers might associate businesses using AI-generated content with a budget-conscious brand image, but that's not hostility.

by Lerc 7 hours ago

>Since China has a population of 1.4 billion people with vastly differing levels of cognition, I find it difficult to claim I can summarize "modern Chinese culture"

Ha! An American would have no such qualms.

by idiotsecant 4 hours ago

Well, they would modify it slightly to claim "real American culture is..." In general, the range of 'real' America is about 300 miles, in my experience.

by yorwba 6 hours ago

There's definitely some hostility: https://mp.weixin.qq.com/s/A5shO-6nZIXZvJUEzrx03Q

by yieldcrv 4 hours ago

> "Why did such a strange metaphor like 'the sound of an electrocardiogram machine moving paper' appear in this story that had nothing to do with medicine?"

this is sending me, I don't know what's funnier, this translation being accurate or inaccurate

by cubefox 6 hours ago

While I don't doubt this was one influence, there was also an infamous problem with Dall-E 2, which was perfectly able to generate an astronaut riding a horse but completely unable to generate a horse riding an astronaut.

This problem is infamous because it persisted (unlike other early problems, like creating the wrong number of fingers) into much more capable models, and the Qwen Image people are certainly very aware of this difficult test. Even Imagen 4 Ultra, which might be the most advanced pure diffusion model without an editing loop, fails at it.

And obviously an astronaut is similar to a man, which connects this benchmark to the Chinese meme.

by popalchemist 13 minutes ago

Super tone-deaf and inappropriate. Not realizing how it would read to the uninformed is a bad look. Myopic.

by TacticalCoder 37 minutes ago

Very interesting! What's weird though is that the Chinese do not even pretend: every single picture is generated with Asian-looking people.

But on the one picture that honestly looks like a man getting ass-raped by a horse, it's a white man.

I mean even in the west where you can hardly see an ad with a white couple anymore, they don't go that far (at least not yet).

White people are a minority on earth and anti-white racism sure seems to be alive and well (btw my family is of all the colors and we speak three languages at home, so don't even try me).

by badhorseman 11 hours ago

Why not ask simply for a man, or even a Han man given the race of Tsai Kang-yong? Why a white man, and why a man wearing medieval clothing? Give your head a wobble.

by DustinEchoes 9 hours ago

Yep, it’s the only image on the entire page with a non-Chinese person in it. Given the prompt, the message is clear.

by yorwba 6 hours ago

The message is "We watched Lord of the Rings and Game of Thrones and liked the medieval aesthetic enough to emulate it."

by badhorseman 5 hours ago

Reminds me of the bit of Lord of the Rings where muscular horses dominate European peasant men, as per the prompt translation.

by yorwba 4 hours ago

Yes, in those movies, the hot white guys (and sometimes girls) usually ride on top of the muscular horses. So when you want to show a horse riding a man as a visual gag, why not make the man a hot white guy with a gruff beard?

You act as though they first decided to make an image representing Westerners and then chose that particular scene as an intentional insult, but you need to consider that they likely made thousands of test images, most of which were just playing around with the model's capabilities and not specifically crafted for the announcement post.

So why did this one get picked? I think it boils down to the visual gag being funny and the movie-like quality.

by popalchemist 5 minutes ago

Racial/cultural tension is part of the context in which this image is appearing. Not only because of historical tensions, but because this image appears as part of this generation's Manhattan Project style arms race toward AGI and global dominance. Your denial of that is a reflection of your own ignorance.

by bogzz 2 hours ago

Fun fact, the Serbian parliament building has two statues of horses riding men in front of it.

Which is really apt because in Serbian "konj", or horse, is a colloquial word for moron. So, horses riding people is a perfect representation of the reality of the Serbian government.

Another fun fact, the parliament building in HL2's City 17 was modelled from that building.

by vunderba 7 hours ago

Couple of thoughts:

1. I’d wager that given their previous release history, this will be open‑weight within 3-4 weeks.

2. It looks like they’re following suit with other models like Z-Image Turbo (6B parameters) and Flux.2 Klein (9B parameters), aiming to release models that can run on much more modest GPUs. For reference, the original Qwen-Image is a 20B-parameter model.

3. This is a unified model (both image generation and editing), so there’s no need to keep separate Qwen-Image and Qwen-Edit models around.

4. The original Qwen-Image scored the highest among local models for image editing in my GenAI Showdown (6 out of 12 points), and it also ranked very highly for image generation (4 out of 12 points).

Generative Comparisons of Local Models:

https://genai-showdown.specr.net/?models=fd,hd,kd,qi,f2d,zt

Editing Comparison of Local Models:

https://genai-showdown.specr.net/image-editing?models=kxd,og...

I'll probably be waiting until the local version drops before adding Qwen-Image-2 to the site.

by SV_BubbleTime 6 hours ago

For the more technical…

Qwen 2512 (December edition of Qwen Image)

* 19B parameters, which was a ~40GB file at FP16 and fit on a 3090 at FP8. Anything less than that and you were in GGUF format at Q6 to Q4 quantizations… which were slow, but still good quality. (Rough size math sketched after this list.)

* used Qwen 2.5 VL. So a large model and a very good vision model.

* And iirc, their own VAE, which had known and obvious issues with high-frequency artifacts. Some people would take the image and pass it through another VAE (like the WAN video model's) or upscale-downscale to remove these.
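
For the size claims above, the rough math (a sketch; the GGUF bytes-per-weight figures are approximate averages, and real files add metadata overhead):

    # Approximate file/VRAM footprint of a ~19B-parameter model at
    # different precisions. GGUF bytes-per-weight are rough averages.
    PARAMS = 19e9

    bytes_per_weight = {
        "FP16": 2.0,
        "FP8": 1.0,
        "Q6 (GGUF)": 6.5 / 8,  # ~0.81 bytes/weight incl. quant scales
        "Q4 (GGUF)": 4.5 / 8,  # ~0.56 bytes/weight incl. quant scales
    }

    for fmt, bpw in bytes_per_weight.items():
        print(f"{fmt:>10}: ~{PARAMS * bpw / 1e9:.0f} GB")

    # FP16 -> ~38 GB (too big for a 24 GB 3090), FP8 -> ~19 GB (fits),
    # Q6 -> ~15 GB, Q4 -> ~11 GB (fits smaller cards at some quality cost).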

Qwen 2 now is

* a 7B param model. Right between Klein 9B (non-commercial) and Klein 4B (Apache), alongside Z-Image 7B (Apache); the license for this one is unknown. Direct competition, and it will fit on many more GPUs even at FP16.

* upgrades to Qwen 3 VL, I assume this is better than the already great 2.5 VL.

* Unknown on the new VAE. Flux2's new 128-channel VAE is excellent, but it hasn't been out long enough for even a frontier Chinese model to pick it up.

Overall, you're right that this is on the trend of bringing models onto lower-end hardware.

Qwen was already excellent and now they rolled Image and Edit together for an “Omni” model.

Z-Image was the model to beat a couple weeks ago… and now it looks like both Klein and Qwen will beat it! Z-Image has been disappointing in how it just refuses to adhere to multiple new training concepts. Maybe they tried to pack it too tightly.

Open weights for this will be amazing. THREE direct competitors all vying to be “SDXL2” at the same time.

The Qwen naming convention was confusing! You had Image, 2509, Edit, 2511 (Edit), 2512 (Image), and then the LoRA compatibility was unspecified. It's smart to just 2.0 this mess.

by vunderba 5 hours ago

Agreed! A lot of people were also using ZiT as a refiner downstream to help with some of the more problematic visual aspects of the original Qwen-Image.

I'm really looking forward to running the unified model through its paces.

by SV_BubbleTime 3 hours ago

Something I am skeptical about with Z-Image is that it uses Gemma, which is imo a weak LLM.

If I were to guess, I would say that Z-Image's life is shorter than it initially appeared. Even as a refiner, which is just a workaround for model issues.

by liuliu 6 hours ago

Note that Qwen Image 1.0 (2512) wasted ~8B weights on timestep embedding. Both Z-Image / FLUX.2 series corrected that.

by raincole 13 hours ago

It's crazy to think there was a fleeting sliver of time during which Midjourney felt like the pinnacle of image generation.

by gamma-interface 7 hours ago

The pace of commoditization in image generation is wild. Every 3-4 months the SOTA shifts, and last quarter's breakthrough becomes a commodity API.

What's interesting is that the bottleneck is no longer the model — it's the person directing it. Knowing what to ask for and recognizing when the output is good enough matters more than which model you use. Same pattern we're seeing in code generation.

by sincerely 2 hours ago

PLEASE STOP POSTING AI GENERATED COMMENTS

by echelon 3 hours ago

I'm happy the models are becoming commodity, but we still have a long way to go.

I want the ability to lean into any image and tweak it like clay.

I've been building open source software to orchestrate the frontier editing models (skip to halfway down), but it would be nice if the models were built around the software manipulation workflows:

https://getartcraft.com/news/world-models-for-film

by SV_BubbleTime 6 hours ago

SOTA shifts, yes. But the average person doing the work has been very happy with SDXL-based models. And SDXL was released two years ago.

The fight right now, outside of API SOTA, is over who will replace SDXL as the "community preference".

It's now a three-way race between Flux2 Klein, Z-Image, and now Qwen2.

by rc1 2 hours ago

Isn’t it still? Anecdotally, I work with lots of creators who still prefer it because of its subjective qualities.

by Mashimo 13 hours ago

Whatever happened to Midjourney?

by Lalabadie 8 hours ago

No external funding raised. They're not on the VC path, so no need to chase insane growth. They still have around 500M USD in ARR.

In my (very personal) opinion, they're part of a very small group of organizations that sell inference under a sane and successful business model.

by aenvoker 5 hours ago

Not on the VC path. Not even on the max-profit path. Just on the "Have fun doing cool research" path.

I was a mod on MJ for its first few years and got to know MJ's founder through discussions there. He already had "enough" money for himself from his prior sale of Leap Motion to do whatever he wanted. And, he decided what he wanted was to do cool research with fun people. So, he started MJ. Now he has far more money than before and what he wants to do with it is to have more fun doing more cool research.

by spaceman_2020 6 hours ago

Aesthetically, still unmatched

by echelon 3 hours ago

They're working on a few really lofty ideas:

1. real time world models for the "holodeck". It has to be fast, high quality, and inexpensive for lots of users. They started on this two years ago before "world model" hype was even a thing.

2. some kind of hardware to support this.

David Holz talks about this on Twitter occasionally.

Midjourney still has incredible revenue. It's still the best looking image model, even if it's hard to prompt, can't edit, and has artifacting. Every generation looks like it came out of a magazine, which is something the other leading commercial models lack.

by wongarsu 12 hours ago

They have image and video models that are nowhere near SOTA on prompt adherence or image editing but pretty good on the artistic side. They lean in on features like reference images so objects or characters have a consistent look, biasing the model towards your style preferences, or using moodboards to generate a consistent style

by vunderba 8 hours ago

A lot of people started realizing that it didn’t really matter how pretty the resulting image was if it completely failed to adhere to the prompt.

Even something like Flux.1 Dev which can be run entirely locally and was released back in August of 2024 has significantly better prompt understanding.

by raincole 13 hours ago

Not much, while everything happened at OpenAI/Google/Chinese companies. And that's the problem.

by KeplerBoy 12 hours ago

How is it a problem? There simply doesn't seem to be a moat or secret sauce. Who cares which of these models is SOTA? In two months there will be a new model.

by waldarbeiter 12 hours ago

There seems to be a moat: infrastructure/GPUs and talent. The best models right now come from companies with considerable resources/funding.

by esperent 10 hours ago

Right, but that's a short term moat. If they pause on their incredible levels of spending for even 6 months, someone else will take over having spent only a tiny fraction of what they did. They might get taken over anyway.

by raincole 9 hours ago

> someone else will take over having spent only a tiny fraction of what they did

How. By magic? You fell for 'Deepseek V3 is as good as SOTA'?

by Gud 9 hours ago

By reverse engineering, sheer stupidity from the competition, corporate espionage, ‘stealing’ engineers and sometimes a stroke of genius, the same as it’s always been

by qingcharles 7 hours ago

They still have a niche. Their style references feature is their key differentiator now, but I find I can usually just drop some images of a MJ style into Gemini and get it to give me a text prompt that works just as well as MJ srefs.

by inanothertime 12 hours ago

I recently tried out LMStudio on Linux for local models. So easy to use!

What Linux tools are you guys using for image generation models like Qwen's diffusion models, since LMStudio only supports text gen?

by eurekin 10 hours ago

Practically anybody actually creating with this class of models (mostly diffusion-based) is using ComfyUI. The community takes care of quantization, repackaging into GGUF (most popular), and even speed optimization (lightning LoRAs, layer skipping). It's quite extensive.

by embedding-shape 12 hours ago

Everything keeps changing so quickly that I basically have my own Python HTTP server with a unified JSON interface, which routes to one of the impls/*.py files for the actual generation; I have one of those per implementation/architecture. Mostly using `diffusers` for the inference, which isn't the fastest, but it tends to have the new model architectures much sooner than everyone else.
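
Roughly this shape (a minimal sketch; the repo id is a placeholder, and a real server would stream the image bytes back instead of saving to disk):

    # Tiny HTTP server with a unified JSON interface that routes to
    # per-architecture diffusers pipelines (stand-in for impls/*.py).
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    import torch
    from diffusers import DiffusionPipeline

    PIPELINES = {
        # placeholder repo id -- swap in whatever architecture you need
        "z-image-turbo": "Tongyi-MAI/Z-Image-Turbo",
    }
    _loaded = {}

    def get_pipeline(name):
        # Lazy-load so startup is fast and VRAM is only used on demand.
        if name not in _loaded:
            _loaded[name] = DiffusionPipeline.from_pretrained(
                PIPELINES[name], torch_dtype=torch.bfloat16
            ).to("cuda")
        return _loaded[name]

    class GenHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
            pipe = get_pipeline(body.get("model", "z-image-turbo"))
            pipe(prompt=body["prompt"]).images[0].save("out.png")
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'{"status": "ok", "file": "out.png"}')

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8000), GenHandler).serve_forever()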

by vunderba 8 hours ago

I encourage everyone to at least try ComfyUI. It's come a long way in terms of user-friendliness particularly with all of the built-in Templates you can use.

by guai888 12 hours ago

ComfyUI is the best for stable diffusion

by embedding-shape 11 hours ago

FWIW you can use non-sd models in ComfyUI too, the ecosystem is pretty huge and supports most of the "mainstream" models, not only the stable diffusion ones, even video models and more too.

by sequence7 8 hours ago

If you're on an AMD platform Lemonade (https://lemonade-server.ai/) added image generation in version 9.2 (https://github.com/lemonade-sdk/lemonade/releases/tag/v9.2.0).

by ilaksh 12 hours ago

I have my own MIT licensed framework/UI: https://github.com/runvnc/mindroot. With Nano Banana via runvnc/googleimageedit

by PaulKeeble 11 hours ago

Ollama is working on adding image generation but it's not here yet. We really do need something that can run a variety of models for images.

by embedding-shape 11 hours ago

Yeah, I'm guessing they were bound to leave behind the whole "Get up and running with large language models" mission sooner or later, which was their initial focus, as investors after 2-3 years start making you think about expansion and earning back the money.

Sad state of affairs, and it seems they're enshittifying quicker than expected, but it was always a question of when, not if.

by adammarples 6 hours ago

Stability Matrix. It's a manager for models and UIs and LoRAs etc, very nice.

by SV_BubbleTime 6 hours ago

LMStudio is a low barrier to entry for LLMs, for sure. The lowest. Good software!

Other people gave you the right answer: ComfyUI. I’ll give you the more important why and how…

There is a huge effort by people to do everything but Comfy because of its intimidating barrier. It’s not that bad. Learn it once and be done. You won’t have to keep learning the UI of the week endlessly.

The how: go to civitai. Find an image you like, drag and drop it into Comfy. If it has a workflow attached, it will show you. Install any missing nodes they used. Click the loaders to point to your models instead of theirs. Hit run and get the same or a similar image. You don’t need to know what any of the things do yet.

If for some reason that just does not work for you… SwarmUI is a front end to Comfy. You can change things and it will show you on the Comfy side what they’re doing. It’s a gateway drug to learning Comfy.

EDIT: most important thing no one will tell you outright… DO NOT FOR ANY REASON try to skip the venv or miniconda virtual environment when using Comfy! You must make a new and clean setup. You will never get the right Python, torch, diffusers, and driver combination on your system install.
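
Something like this with the stdlib, if you want it scripted (paths and packages are illustrative; follow the ComfyUI README for the real install steps):

    # Create a fresh, isolated environment instead of touching the
    # system Python install.
    import subprocess
    import venv

    venv.EnvBuilder(with_pip=True).create("comfy-venv")
    pip = "comfy-venv/bin/pip"  # "comfy-venv\\Scripts\\pip.exe" on Windows
    subprocess.run([pip, "install", "torch", "torchvision"], check=True)
    subprocess.run([pip, "install", "-r", "ComfyUI/requirements.txt"], check=True)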

by Eisenstein 7 hours ago

Koboldcpp has built-in support for image models. Model search and download, one executable to run, UI, OpenAI API endpoint, llama.cpp endpoint, highly configurable. If you want to get up and running instantly, just pick a kcppt file and open that, and it will download everything you need and load it for you.

Engine:

* https://github.com/LostRuins/koboldcpp/releases/latest/

Kcppt files:

* https://huggingface.co/koboldcpp/kcppt/tree/main

by sandbach 12 hours ago

The Chinese vertical typography is sadly a bit off. If punctuation marks are used at all, they should be the characters specifically designed for vertical text, like ︒(U+FE12 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP).
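
The distinction is visible in the Unicode data itself:

    # Ordinary vs. vertical ideographic full stop (stdlib only).
    import unicodedata

    for ch in ["\u3002", "\uFE12"]:
        print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)}")
    # U+3002 。 IDEOGRAPHIC FULL STOP
    # U+FE12 ︒ PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP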

by wiether 12 hours ago

I use gen-AI to produce images daily, but honestly the infographics are 99% terrible.

LinkedIn is filled with them now.

by smcleod 12 hours ago

To be fair it hasn't made LinkedIn any worse than it already was.

by nurettin 12 hours ago

To be fair, it is hard to make LinkedIn any worse.

by aenis 3 minutes ago

And at the same time, it's arguably the least toxic of all social networks.

Yes, cringeworthy, but at least not addictive! It's like Facebook all those years ago; I can IM friends from high school without having to pay any attention to the feed.

by embedding-shape 10 hours ago

I was gonna make a joke about "Wish granted, now Microsoft owns it" but then I remembered that they already do. Reality sometimes makes better jokes than what we can come up with.

by mdrzn 10 hours ago

Infographics and full presentations are a NanoBananaPro exclusive so far.

by RationPhantoms 2 hours ago

You should see some of the work from their PaperBanana papers. Really solid.

by viraptor 11 hours ago

Infographics are as bad as the author allows, though. There are few people who could make or even describe a good infographic, so that's what we see in the results too.

by usefulposter 12 hours ago

Correct.

Much like the pointless ASCII diagrams in GitHub readmes (big rectangle with bullet points flows to another...), the diagrams are cognitive slurry.

See Gas Town for non-Qwen examples of how bad it can get:

https://news.ycombinator.com/item?id=46746045

(Not commenting on the other results of this model outside of diagramming.)

by viraptor 11 hours ago

> cognitive slurry

Thank you for this phrase. I don't think bad diagrams are limited to AI in any way, and this perfectly describes all "this didn't make things any clearer" cases.

by engcoach 7 hours ago

The "horse riding man" prompt is wild:

"""A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky. Mid-ground, eye-level composition: A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat, his short, messy dark brown hair plastered to his forehead, his thick beard slightly damp; he wears a badly worn, grey-green medieval-style robe, the fabric torn and stained with mud in several places, a thick hemp rope tied around his waist, and scratched ankle-high leather boots; his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight. The background is a range of undulating grey-blue mountains, their outlines stark, their peaks hidden beneath a low-hanging, leaden-grey, cloudy sky. The thick clouds diffuse a soft, diffused light, which pours down naturally from the left front at a 45-degree angle, casting clear and voluminous shadows on the horse's belly, the back of the man's hands, and the cracked ground. The overall color scheme is strictly controlled within the earth tones: the horsehair is warm brown, the robe is a gradient of gray-green-brown, the soil is a mixture of ochre, dry yellow earth, and charcoal gray, the dust is light brownish-gray, and the sky is a transition from matte lead gray to cool gray with a faint glow at the bottom of the clouds. The image has a realistic, high-definition photographic quality, with extremely fine textures—you can see the sweat on the horse's neck, the wear and tear on the robe's warp and weft threads, the skin pores and stubble, the edges of the cracked soil, and the dust particles. The atmosphere is tense, primitive, and full of suffocating tension from a struggle of biological forces."""

by dsrtslnd23 13 hours ago

unfortunately no open weights it seems.

by embedding-shape 10 hours ago

To be fair, didn't they release an open-weights image model only like a month ago? Think the last one was in December 2025.

by vunderba 7 hours ago

Exactly - they did the same thing with the original version of Qwen-Image. It was API only for a while before being made available for local hosting.

by thisisit 5 hours ago

I liked their comic panels example and tried it using their chat at: https://chat.qwen.ai/

When I used the exact prompt from the post, the chat works. It gives me the exact output from the blog post.

Then I used Google Translate to understand the prompt format. The prompt is: A 4x6 panel comic, four lines, six panels per line. Each panel is separated by a white dividing line.

The first row, from left to right: Panel 1: Panel 2: .....

and when I try to change the inputs, the comic example fails miserably. It keeps creating random grids (sometimes 4x5, other times 4x6), but then by the third row the model gets confused and the output has only 3 panels. Other times English dialogue is replaced with Chinese dialogue. So, not very reliable in my book.
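
For anyone who wants to reproduce the test, a small helper like this builds the same grid-prompt structure for arbitrary layouts (my paraphrase of the translated template, not an official format):

    # Build the blog post's comic-grid prompt for an arbitrary layout.
    def comic_prompt(rows: int, cols: int, panels: list[str]) -> str:
        assert len(panels) == rows * cols, "need one description per panel"
        lines = [
            f"A {rows}x{cols} panel comic, {rows} lines, {cols} panels per line. "
            "Each panel is separated by a white dividing line."
        ]
        for r in range(rows):
            descs = " ".join(
                f"Panel {r * cols + i + 1}: {d}."
                for i, d in enumerate(panels[r * cols:(r + 1) * cols])
            )
            lines.append(f"Row {r + 1}, from left to right: {descs}")
        return "\n".join(lines)

    print(comic_prompt(2, 2, ["a cat wakes up", "it sees snow",
                              "it runs outside", "it runs back in"]))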

by cocodill 13 hours ago

interesting riding application picture

by rwmj 12 hours ago

"Guy being humped by a horse" wouldn't have been my first choice for demoing the capabilities of the model, but each to their own I guess.

by viraptor 11 hours ago

It looks like a marketing move. It's a good quality, detailed picture. It's going to get shared a lot. I would assume they knew exactly what they were doing. Nothing like a bit of controversy for extra clicks.

by brookst 10 hours ago

Because every ML researcher is a viral social media expert.

(I don’t even know if I’m being sarcastic)

by viraptor 10 hours ago

This is not some random ML researcher doing fun things at home. Qwen is backed by Alibaba cloud. They likely have whole departments of marketing people available.

by fguerraz 13 hours ago

I found the horse revenge-porn image at the end quite disturbing.

by engcoach 7 hours ago

It's the year of the horse in their zodiac. The (translated) prompt is wild:

""" A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky. Mid-ground, eye-level composition: A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat, his short, messy dark brown hair plastered to his forehead, his thick beard slightly damp; he wears a badly worn, grey-green medieval-style robe, the fabric torn and stained with mud in several places, a thick hemp rope tied around his waist, and scratched ankle-high leather boots; his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight. The background is a range of undulating grey-blue mountains, their outlines stark, their peaks hidden beneath a low-hanging, leaden-grey, cloudy sky. The thick clouds diffuse a soft, diffused light, which pours down naturally from the left front at a 45-degree angle, casting clear and voluminous shadows on the horse's belly, the back of the man's hands, and the cracked ground. The overall color scheme is strictly controlled within the earth tones: the horsehair is warm brown, the robe is a gradient of gray-green-brown, the soil is a mixture of ochre, dry yellow earth, and charcoal gray, the dust is light brownish-gray, and the sky is a transition from matte lead gray to cool gray with a faint glow at the bottom of the clouds. The image has a realistic, high-definition photographic quality, with extremely fine textures—you can see the sweat on the horse's neck, the wear and tear on the robe's warp and weft threads, the skin pores and stubble, the edges of the cracked soil, and the dust particles. The atmosphere is tense, primitive, and full of suffocating tension from a struggle of biological forces. """

by embedding-shape 12 hours ago

I think they call it "horse riding a human" which could have taken two very different directions, and the direction the model seems to have taken was the least worst of the two.

by wongarsu 12 hours ago

At first I thought it was a clever prompt because you see which direction the model takes it, and whether it "corrects" it to the more common "human riding a horse", similar to the full wine glass test.

But if you translate the actual prompt, the term "riding" doesn't even appear. The prompt describes the exact thing you see in excruciating detail.

"... A muscular, robust adult brown horse standing proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man ... and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat ... his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight ..."

by embedding-shape 11 hours ago

> But if you translate the actual prompt the term riding doesn't even appear. The prompt describes the exact thing you see in excruciating detail.

Yeah, as they go through their workflow earlier in the blog post, the prompt they share there seems to be generated from a different input, and then that prompt is passed to the actual model. So the workflow is something like "User prompt input -> Expand input with LLMs -> Send expanded prompt to image model".

So I think "horse riding a human" is the user prompt, which gets expanded to what they share in the post, which is what the model actually uses. This is also how they've presented all their previous image models: by passing user input through an LLM for "expansion" first.
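
Roughly this flow (stub functions, purely illustrative; not Qwen's actual pipeline):

    # Two-stage flow: a short user prompt is expanded by an LLM, and only
    # the expanded prompt ever reaches the image model.
    def expand_prompt(user_prompt: str) -> str:
        # Stand-in for the LLM step; a real system would call a chat model
        # asking for composition, lighting, palette, and texture details.
        return (
            f"A photorealistic, highly detailed scene depicting {user_prompt}, "
            "with explicit composition, lighting, and texture notes."
        )

    def generate_image(user_prompt: str) -> str:
        expanded = expand_prompt(user_prompt)
        # The short human prompt never reaches the image model directly.
        return f"<image rendered from: {expanded[:60]}...>"

    print(generate_image("horse riding a man"))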

Seems poorly thought out not to make it 100% clear what the actual human-written prompt is, though; not sure why they wouldn't share that upfront.

by chakintosh 8 hours ago

Is it related to "Mr Hands" ?

by blitzar 11 hours ago

Won't someone think of the horses?

by skerit 13 hours ago

> Qwen-Image-2.0 not only accurately models the “riding” action but also meticulously renders the horse’s musculature and hair
>
> https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwe...

What the actual fuck

by wongarsu 12 hours ago

For reference, below is the prompt translated (with my highlighting of the part that matters). They did very much ask for this version of "horse riding a man", not the "horse sitting upright on a crawling human" version

---

A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky.

Mid-ground, eye-level composition: A muscular, robust adult brown horse stands proudly, *its forelegs heavily pressing between the shoulder blades and spine of a reclining man*. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. *The subdued man is a white male*, 30-40 years old, his face covered in dust and sweat, his short, messy dark brown hair plastered to his forehead, his thick beard slightly damp; he wears a badly worn, grey-green medieval-style robe, the fabric torn and stained with mud in several places, a thick hemp rope tied around his waist, and scratched ankle-high leather boots; his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight.

The background is a range of undulating grey-blue mountains, their outlines stark, their peaks hidden beneath a low-hanging, leaden-grey, cloudy sky. The thick clouds diffuse a soft, diffused light, which pours down naturally from the left front at a 45-degree angle, casting clear and voluminous shadows on the horse's belly, the back of the man's hands, and the cracked ground.

The overall color scheme is strictly controlled within the earth tones: the horsehair is warm brown, the robe is a gradient of gray-green-brown, the soil is a mixture of ochre, dry yellow earth, and charcoal gray, the dust is light brownish-gray, and the sky is a transition from matte lead gray to cool gray with a faint glow at the bottom of the clouds.

The image has a realistic, high-definition photographic quality, with extremely fine textures—you can see the sweat on the horse's neck, the wear and tear on the robe's warp and weft threads, the skin pores and stubble, the edges of the cracked soil, and the dust particles. The atmosphere is tense, primitive, and full of suffocating tension from a struggle of biological forces.

by badhorseman 11 hours ago

The significance of the hemp rope is that it is a symbol of mourning and the loss of one's deceased.

by embedding-shape 10 hours ago

I like how sometimes I get angry at an LLM for not understanding what I meant, but then I realize that I just forgot to mention it in the context. It's fun to see the same thing happen in humans reading websites too, where they don't understand the context yet react with strong feelings anyway.

by Deukhoofd 13 hours ago

The text rendering is quite impressive, but is it just me or do all these generated 'realistic' images have a distinctly uncanny feel to them? I can't quite put my finger on what it is, but they just feel off to me.

by finnjohnsen2 13 hours ago

I agree. They make me nauseous. The same kind of light nausea as car sickness.

I assume our brains are used to stuff we don't consciously notice, and reject very mild errors. I've stared at the picture a bit now and the finger holding the balloon is weird. The out-of-place snowman feels weird. If you follow the background blur around, it isn't at the same depth everywhere. Everything that reflects has reflections that I can't see in the scene.

I don't feel good staring at it now so I had to stop.

by jbl0ndie 12 hours ago

Sounds like you're describing the uncanny valley https://en.wikipedia.org/wiki/Uncanny_valley

by elorant 12 hours ago

The lighting is wrong, that's the tell for me. They look too crisp. No proper shadows; everything looks crystal clear.

by techpression 12 hours ago

It’s the HDR era all over again, where people edited their photos to lack all contrast and just be ultra flat.

by brookst 10 hours ago

Everything is weightless. When real people stand and gesture there’s natural muscle use; hair and clothing drape; papers lie flat on surfaces.

by likium 13 hours ago

At least for the real life pictures, there’s no depth of field. Everything is crystal clear like it’s composited.

by derefr 13 hours ago

> like it’s composited

Like focus stacking, specifically.

I’m always surprised when people bother to point out more-subtle flaws in AI images as “tells”, when the “depth-of-field problem” is so easily spotted, and has been there in every AI image ever since the earliest models.

by Mashimo 13 hours ago

I had no problem getting images with blurry backgrounds using the appropriate prompts. Something like "shallow depth of field, bokeh, DSLR" can lead to good results. https://cdn.discordapp.com/attachments/1180506623475720222/1... [0]

But I found that that results in more professional-looking images, not more realistic photos.

Adding something like "selfie, Instagram, low resolution, flash" can lead to a... worse image that looks more realistic.

[0] I think I did this one with Z-Image Turbo on my 4060 Ti

by afro88 12 hours ago

The blur isn't correct though. Like the amount of blur is wrong for the distance, zoom amount etc. So the depth of field is really wrong even if it conforms to "subject crisp, background blurred"

by derefr 4 hours ago

Exactly.

My personal mechanistic understanding of diffusion models is that, "under the hood", the core thing they're doing, at every step and in every layer, is a kind of apophenia — i.e. they recognize patterns/textures they "know" within noise, and then they nudge the noise (least-recognizable pixels) in the image toward the closest of those learned patterns/textures, "snapping" those pixels into high-activation parts of their trained-in texture-space (with any text-prompt input just adding a probabilistic bias toward recognizing/interpreting the noise in certain parts of the image as belonging to certain patterns/textures.)

I like to think of these patterns/textures that diffusion models learn as "brush presets", in the Photoshop sense of the term: a "brush" (i.e. a specific texture or pattern), but locked into a specific size, roughness, intensity, rotation angle, etc.

Due to the way training backpropagation works (and presuming a large-enough training dataset), each of these "brush presets" that a diffusion model learns, will always end up learned as a kind of "archetype" of that brush preset. Out of a collection of examples in the training data where uses of that "brush preset" appear with varying degrees of slightly-wrong-size, slightly-wrong-intensity, slightly-out-of-focus-ness, etc, the model is inevitably going to learn most from the "central examples" in that example cluster, and distill away any parts of the example cluster that are less shared. So whenever a diffusion model recognizes a given one of its known brush presets in an image and snaps pixels toward it, the direction it's moving those pixels will always be toward that archetypal distilled version of that brush preset: the resultant texture in perfect focus, and at a very specific size, intensity, etc.

This also means that diffusion models learn brushes at distinctively-different scales / rotation angles / etc as entirely distinct brush presets. Diffusion models have no way to recognize/repair toward "a size-resampled copy of" one of their learned brush presets. And due to this, diffusion models will never learn to render details small enough that the high-frequency components of their recognizable textural detail would be lost below the Nyquist floor (which is why they suck so much at drawing crowds, tiny letters on signs, etc.) And they will also never learn to recognize or reproduce visual distortions like moire or ringing, that occur when things get rescaled to the point that beat-frequencies appear in their high-frequency components.

Which means that:

- When you instruct a diffusion model that an image should have "low depth-of-field", what you're really telling it is that it should use a "smooth-blur brush preset" to paint in the background details.

- And even if you ask for depth-of-field, everything in what a diffusion model thinks of as the "foreground" of an image will always have this surreal perfect focus, where all the textures are perfectly evident.

- ...and that'll be true, even when it doesn't make sense for the textures to be evident at all, because in real life, at the distance the subject is from the "camera" in the image, the presumed textures would actually be so small as to be lost below the Nyquist floor at anything other than a macro-zoom scale.

These last two problems combine to create an effect that's totally unlike real photography, but is actually (unintentionally) quite similar to how digital artists tend to texture video-game characters for "tactile legibility." Just like how you can clearly see the crisp texture of e.g. denim on Mario's overalls (because the artist wanted to make it feel like you're looking at denim, even though you shouldn't be able to see those kinds of details at the scaling and distance Mario is from the camera), diffusion models will paint anything described as "jeans" or "denim" as having a crisply-evident denim texture, despite that being the totally wrong scale.

It's effectively a "doll clothes" effect — i.e. what you get when you take materials used to make full-scale clothing, cut tiny scraps of those materials to make a much smaller version of that clothing, put them on a doll, and then take pictures far closer to the doll, such that the clothing's material textural detail is visibly far larger relative to the "model" than it should be. Except, instead of just applying to the clothing, it applies to every texture in the scene. You can see the pores on a person's face, and the individual hairs on their head, despite the person standing five feet away from the camera. Nothing is ever aliased down into a visual aggregate texture — until a subject gets distant enough that the recognition maybe snaps over to using an entirely different "brush preset" learned specifically on visual aggregate textures.

by vunderba 7 hours ago

Which is pretty amusing, because it's the exact opposite of the problem BFL had with the original Flux model: every single image looked like it was taken with a 200mm f/4.

by albumen 13 hours ago

Every photoreal image on the demo page has depth of field, it’s just subtle.

by BoredPositron 13 hours ago

Qwen always suffered from their subpar RoPE implementation, and Qwen 2 seems to suffer from it as well. The uncanny feel is down to the sparsity of text-to-image tokens, and the higher in resolution you go, the worse it gets. It's why you can't take the higher ends of the MP numbers seriously, no matter the model. At the moment there is no model that can go for 4K without problems; you will always get high-frequency artifacts.

by belter 13 hours ago

Agree, looks like the same effect they are applying on YouTube Shorts...

by GaggiX 13 hours ago

For me the only model that can really generate realistic images is nano banana pro (also known as gemini-3-pro-image). Other models are closing the gap; this one is pretty meh at realistic images, in my opinion.

by Mashimo 13 hours ago

You can get Flux and maybe z-image to do so, but you have to experiment with the prompt a bit. Or maybe get a LoRA to help.

by cubefox 12 hours ago

The examples I saw of z-image look much more realistic than Nano Banana Pro, which is likely using Imagen 4 (plus editing) internally, which isn't very realistic. But Nano Banana Pro has obviously much better prompt alignment than something like z-image.

by GaggiX 12 hours ago

Are you sure you are not confusing nano banana pro with nano banana? z-image still has a bit of an AI look that I do not find with nano banana pro; example for a comparison: https://i.ibb.co/YFtxs4hv/594068364-25101056889517041-340369...

Also Imagen 4 and Nano Banana Pro are very different models.

by cubefox 9 hours ago

In your example, z-image and Nano Banana Pro look basically equally photorealistic to me. Perhaps the NBP image looks a bit more real because it resembles an unstaged smartphone shot with wide angle. Anyway, the difference is very small. I agree the lighting in Flux.2 Pro looks a bit off.

But anyway, realistic environments like a street cafe are not suited to test for photorealism. You have to use somewhat more fantastical environments.

I don't have access to z-image, but here are two examples with Nano Banana Pro:

"A person in the streets of Atlantis, portrait shot." https://i.ibb.co/DgMXzbxk/Gemini-Generated-Image-7agf9b7agf9...

"A person in the streets of Atlantis, portrait shot (photorealistic)" https://i.ibb.co/nN7cTzLk/Gemini-Generated-Image-l1fm5al1fm5...

These are terribly unrealistic. Far more so than the Flux.2 Pro image above.

> Also Imagen 4 and Nano Banana Pro are very different models.

No, Imagen 4 is a pure diffusion model. Nano Banana Pro is a Gemini scaffold which uses Imagen to generate an initial image, then Gemini 3 Pro writes prompts to edit the image for much better prompt alignment. The prompts above are very simple, so there is little for Gemini to alter, so they look basically identical to plain Imagen 4. Both pictures (especially the first) have the signature AI look of Imagen 4, which is different from other models like Imagen 3.

By the way, here is GPT Image 1.5 with the same prompts:

"A person in the streets of Atlantis, portrait shot." https://i.ibb.co/Df8nDHFL/Chat-GPT-Image-10-Feb-2026-14-17-1...

"A person in the streets of Atlantis, portrait shot (photorealistic)" https://i.ibb.co/Nns4pdGX/Chat-GPT-Image-10-Feb-2026-14-17-2...

The first is very fake and the second is a strong improvement, though still far from the excellent cafe shots above (fake studio lighting, unrealistic colors etc).

by GaggiX 9 hours ago

>In your example, z-image and Nano Banana Pro look basically equally photorealistic to me

I disagree, the nano banana pro result is in a completely different league compared to flux.2 and z-image.

>But anyway, realistic environments like a street cafe are not suited to test for photorealism

Why? It's the perfect settings in my opinion.

Btw I don't think you are using nano banana pro, probably standard nano banana; I'm getting this from your prompt: https://i.ibb.co/wZHx0jS9/unnamed-1.jpg

>Nano Banana Pro is a Gemini scaffold which uses Imagen to generate an initial image, then Gemini 3 Pro writes prompts to edit the image for much better prompt alignment.

First of all, how would you know the architecture details of gemini-3-pro-image? Second, how can the model modify the image if gemini itself is just rewriting the prompt (like the old chatgpt+dalle)? imagen 4 is just a text-to-image model, not an editing one; it doesn't make sense. nano banana pro can edit images (like the ones you can provide).

by cubefox 8 hours ago

> I disagree, the nano banana pro result is in a completely different league.

I strongly disagree. But even if you are right, the difference between the cafe shots and the Atlantis shots is clearly much, much larger than the difference between the different cafe shots. The Atlantis shots are super unrealistic. They look far worse than the cafe shots of Flux.2 Pro.

> Why? It's the perfect settings in my opinion

Because it's too easy obviously. We don't need an AI to make fake realistic photos of realistic environments when we can easily photograph those ourselves. Unrealistic environments are more discriminative because they are much more likely to produce garbage that doesn't look photorealistic.

> Btw I don't think you are using nano banana pro, I'm getting this from your prompt: https://i.ibb.co/wZHx0jS9/unnamed-1.jpg

I'm definitely using Nano Banana Pro, and your picture has the same strong AI look to it that is typical of NBP / Imagen 4.

> First of all how should you know the architecture details of gemini-3-pro-image, second of all how the model can modify the image if gemini itself is just rewriting the prompt (like old chatgpt+dalle), imagen 4 is just a text-to-image model, not an editing one, it doesn't make sense, nano banana pro can edit images (like the ones you can provide).

There were discussions about it previously on HN. Clearly NBP is using Gemini reasoning, and clearly the style of NBP strongly resembles Imagen 4 specifically. There is probably also a special editing model involved, just like in Qwen-Image-2.0.

by GaggiX 7 hours ago

>Because it's too easy obviously.

Still, the vast majority of models fail at delivering an image that looks real. I want realism for realistic settings; if it can't do that then what's the point? Of course you can always pay for people and equipment to make the perfect photo for you ahah

If the z-image turbo image looks as good as the nano banana pro one to you, you are probably so used to slop that a model that doesn't produce obvious artifacts like super shiny skin is immediately indistinguishable from a real image to you (the nano banana pro one, to me, looks as real as a real photo). And yes, I'm ignoring the fact that in the z-image turbo one the cup is too large and the bag is inside the chair. z-image is good (in particular given its size) but not as good.

by cubefox 7 hours ago

It seems you are ignoring the fact that the NBP Atlantis pictures look much, much worse than the z-image picture of the cafe. They look far more like AI slop. (Perhaps the Atlantis prompt would look even worse with z-image, I don't know.)

by GaggiX 7 hours ago

I generated my own using your prompt and posted it in the previous comment. You haven't posted a z-image one of Atlantis. I'm not at home to try, but I have trained LoRAs for z-image (it's a relatively lightweight model); I know the model, and it's not as good as nano banana pro. Use what you prefer.

by cubefox 7 hours ago

> I generated my own using your prompt and posted it in the previous comment.

Yes, and it has a very unrealistic AI look to it. That was my point.

> You haven't posted a z-image one of Atlantis.

Yes, I don't doubt that it might well be just as unrealistic or even worse. I also just tried the Atlantis prompts in Grok (no idea what image model they use internally) and they look somewhat more realistic, though not on cafe level.

by ranger_danger 6 hours ago

When I tried Qwen-Image-2512 I could not even get it to spell correctly. And often the letters would be garbled anyways.

by cubefox 12 hours ago

The complex prompt-following ability and editing are seriously impressive here. They don't seem to be much behind OpenAI and Google, which is backed up by the AI Arena ranking.

by goga-piven 12 hours ago

Why is the only image featuring non-Asian men the one under the horse?

by z3dd 12 hours ago

they explicitly called for that in the prompt

by goga-piven 12 hours ago

Exactly why did they choose this prompt with a white person and not an Asian person, as in all the other examples?

by wtcactus 12 hours ago

But why? That image actually puzzled me. Does it have some background context? Some historical legend or something of the like?

by joeycodes 12 hours ago

It is Lunar New Year season right now, 2026 is year of the horse, there is celebratory horse imagery everywhere in many Asian countries right now, so this image could be interpreted as East trampling West. I have no way to know the intention of the person at Qwen who wrote this, but you can form your own conclusions from the prompt:

A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male...

by andruby 12 hours ago

Is the problem the position/horse, or that Qwen mostly shows Asian people?

Do western AI models mostly default to white people?

by goga-piven 12 hours ago

Well, what if some Western models showcased white people in all the good-looking images and the only embarrassing image featured Asian people? Wouldn't that be considered racism?

by embedding-shape 10 hours ago

> and the only embarrassing image

Embarrassing image? I'm white; why would I be embarrassed over that image? It's a computer-generated image with no real people in it. How could it be embarrassing for living humans?

by modzu 4 hours ago

Image generation kind of reminds me of video games, or any CGI in general... the progress is undeniable, and yet with every milestone it seems the last gap to "photorealism" is infinitely wide.

by engcoach 7 hours ago

My response to the horse image: https://i.postimg.cc/hG8nJ4cv/IMG-5289-copy.jpg

by wtcactus 7 hours ago

So, I just gave it this prompt:

"Analyze this webpage: https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests...

Generate an infographic with all the data about the main event timeline and estimated number of victims.

The background image should be this one: https://en.wikipedia.org/wiki/Tank_Man#/media/File:Tank_Man_(Tiananmen_Square_protester).jpg

Improve the background image clarity and resolution."

I've received an error:

"Oops! There was an issue connecting to Qwen3-Max. Content Security Warning: The input file data may contain inappropriate content."

I wonder if the model they published in December has the same censorship in place when run locally (i.e. if it's already trained like this), or if they implement the censorship required by the Chinese regime for the web service only.

by yieldcrv 12 hours ago

when the horsey tranq hits

by singularfutur 12 hours ago

Another closed model dressed up as "coming soon" open source. The pattern is obvious: generate hype with a polished demo, lock the weights, then quietly move on. Real open source doesn't need a press release countdown.

by vunderba 7 hours ago

That's not what they did with Qwen-Image v1 - they announced it and it was available via API, but then they released the weights a few weeks after with an Apache 2.0 license. Let's at least give them the benefit of the doubt here.

by kkzz99 11 hours ago

Good that we have the arbiter of what "real open source" is and isn't over here.

by yellowapple 7 hours ago

“Open source” is indeed an objective standard with actual criteria, and not just vibes.

Luckily, it seems previous Qwen models did get open-sourced in the actual sense, so this one probably will be, too.

by kkzz99 6 hours ago

It's one NGO trying to dictate what "open source" is, and btw, according to that definition, Qwen isn't open source.

by yorwba 10 hours ago

Where do you see a press release countdown? Alibaba consistently doesn't release weights for their biggest models, but they also don't pretend that they do.

by singularfutur 8 hours ago

Fair enough, I read the article too quickly.
