4o Image Generation
479 comments
March 25, 2025 · blixt
Taek
> What's important about this new type of image generation that's happening with tokens rather than with diffusion, is that this is effectively reasoning in pixel space.
I do not think that this is correct. Prior to this release, 4o would generate images by calling out to a fully external model (DALL-E). After this release, 4o generates images by calling out to a multi-modal model that was trained alongside it.
You can ask 4o about this yourself. Here's what it said to me:
"So while I’m deeply multimodal in cognition (understanding and coordinating text + image), image generation is handled by a linked latent diffusion model, not an end-to-end token-unified architecture."
noosphr
>You can ask 4o about this yourself. Here's what it said to me:
>"So while I’m deeply multimodal in cognition (understanding and coordinating text + image), image generation is handled by a linked latent diffusion model, not an end-to-end token-unified architecture."
Models don't know anything about themselves. I have no idea why people keep doing this and expecting them to know anything more than a random con artist on the street.
Xmd5a
>Models don't know anything about themselves.
They can. Fine tune them on documents describing their identity, capabilities and background. Deepseek v3 used to present itself as ChatGPT. Not anymore.
>Like other AI models, I’m trained on diverse, legally compliant data sources, but not on proprietary outputs from models like ChatGPT-4. DeepSeek adheres to strict ethical and legal standards in AI development.
mgraczyk
This is overly cynical. Models typically do know what tools they have access to because the tool descriptions are in the prompt. Asking a model which tools it has is a perfectly reasonable way of learning what is effectively the content of the prompt.
Of course the model may hallucinate, but in this case it takes a few clicks in the dev tools to verify that this is not the case.
rickyhatespeas
You're incorrect. 4o was not trained on knowledge of itself, so it literally can't tell you that. What 4o is doing isn't even new either; Gemini 2.0 has the same capability.
Taek
Can you provide a link or screenshot that directly backs this up?
teaearlgraycold
The system prompt includes instructions on how to use tools like image generation. From that it could infer what the GP posted.
CooCooCaCha
Models are famously good at understanding themselves.
uh_uh
I hope you're joking. Sometimes they don't even know which company developed them. E.g. DeepSeek was claiming it was developed by OpenAI.
mgraczyk
I think this is actually correct even if the evidence is not right.
See this chat for example:
https://chatgpt.com/share/67e355df-9f60-8000-8f36-874f8c9a08...
low_tech_love
Honest question, do you believe something just because the bot tells you that?
sureIy
> truly generative UI, where the model produces the next frame of the app
Please sir step away from the keyboard now!
That is an absurd proposition and I hope I never get to use an app that dreams of the next frame. Apps are buggy as they are, I don't need every single action to be interpreted by LLM.
An existing example of this is that AI Minecraft demo and it's a literal nightmare.
koliber
First it will dream up the interaction frame by frame. Next, to improve efficiency, it will cache those interaction representations. What better way to do that than through a code representation.
While I think current AI can’t come close to anything remotely usable, this is a plausible direction for the future. Like you, I shudder.
blixt
This argument could be made for every level of abstraction we've added to software so far... yet here we are commenting about it from our buggy apps!
outworlder
Yeah, but the abstractions have been useful so far. The main advantage of our current buggy apps is that if it is buggy today, it will be exactly as buggy tomorrow. Conversely, if it is not currently buggy, it will behave the same way tomorrow.
I don't want an app that either works or does not work depending on the RNG seed, prompt and even data that's fed to it.
That's even ignoring all the absurd computing power that would be required.
FridgeSeal
Please, I don’t need my software experience to get any _worse_. It’s already a shitshow.
jjbinx007
It still can't generate a full glass of wine. Even in follow-up questions it failed to manipulate the image correctly.
meeton
https://i.imgur.com/xsFKqsI.png
"Draw a picture of a full glass of wine, ie a wine glass which is full to the brim with red wine and almost at the point of spilling over... Zoom out to show the full wine glass, and add a caption to the top which says "HELL YEAH". Keep the wine level of the glass exactly the same."
Stevvo
Can't replicate. Maybe the rollout is staggered? Using Plus from Europe, it's consistently giving me a half full glass.
eitland
Most interesting thing to me is the spelling is correct.
I'm not a heavy user of AI or image generation in general, so is this also part of the new release or has this been fixed silently since last I tried?
dghlsakjg
The head of foam on that glass of wine is perfect!
cruffle_duffle
Maybe the "HELL YEAH" added a "party implication" which shifted it's "thinking" into just correct enough latent space that it was able to actually hunt down some image somewhere in its training data of a truly full glass of wine.
I almost wonder if prompting it "similar to a full glass of beer" would get it shifted just enough.
yusufozkan
Are you sure you are using the new 4o image generation?
minimaxir
That is an unexpectedly literal definition of "full glass".
Imustaskforhelp
Looks amazing! Can you please also create an unconventional image, like a clock showing 2:35? I tried something like this with Gemini when some redditor asked for it and it failed, so I'm wondering if 4o can do it.
stevesearer
Can you do this with the prompt of a cow jumping over the moon?
I can’t ever seem to get it to make the cow appear to be above the moon. Always literally covering it or to the side etc.
tobr
Also still seems to have a hard time consistently drawing pentagons. But at least it does some of the time, which is an improvement since last time I tried, when it would only ever draw hexagons.
sfjailbird
They're glass-half-full type models.
blixt
Yeah, it seems like somewhere in the semantic space (which then gets turned into a high resolution image using a specialized model, probably) there is not enough space to hold all this kind of information. It becomes really obvious when you try to meaningfully modify a photo of yourself: it will lose your identity.
For Gemini it seems to me there's some kind of "retain old pixels" support in these models since simple image edits just look like a passthrough, in which case they do maintain your identity.
HellDunkel
I think it is not the AI but you who is wrong here. A full glass of wine is filled only up to the point of max radius, so that the surface exposed to air is maximized and the wine can breathe. This is what we taught the AI to consider "a full glass of wine" and it gets it perfectly right.
jasonjmcghee
I don't buy the meme or w/e that they can't produce an image with the full glass of wine. Just takes a little prompt engineering.
Using Dall-e / old model without too much effort (I'd call this "full".)
ASalazarMX
The true test was "full to the brim", as in almost overflowing.
xg15
> What's important about this new type of image generation that's happening with tokens rather than with diffusion
That sounds really interesting. Are there any write-ups how exactly this works?
fpgaminer
There are a few different approaches. Meta documents at least one approach quite well in one of their llama papers.
The general gist is that you have some kind of adapter layers/model that can take an image and encode it into tokens. You then train the model on a dataset that has interleaved text and images. Could be webpages, where images occur in-between blocks of text, chat logs where people send text messages and images back and forth, etc.
The LLM gets trained more-or-less like normal, predicting next token probabilities with minor adjustments for the image tokens depending on the exact architecture. Some approaches have the image generation be a separate "path" through the LLM, where a lot of weights are shared but some image token specific weights are activated. Some approaches do just next token prediction, others have the LLM predict the entire image at once.
As for encoding-decoding, some research has used things as simple as Stable Diffusion's VAE to encode the image, split up the output, and do a simple projection into token space. Others have used raw pixels. But I think the more common approach is to have a dedicated model trained at the same time that learns to encode and decode images to and from token space.
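To make that VAE-projection flavor concrete, here's a minimal sketch (the frozen VAE, patch size, and dimensions are illustrative assumptions, not anything OpenAI has confirmed):

    import torch
    import torch.nn as nn

    class ImageToTokens(nn.Module):
        """Turn VAE latents into a sequence of 'image token' embeddings."""
        def __init__(self, latent_channels=4, patch=2, d_model=4096):
            super().__init__()
            self.patch = patch
            # Each (patch x patch) block of latents becomes one image token.
            self.proj = nn.Linear(latent_channels * patch * patch, d_model)

        def forward(self, latents):  # latents: [B, C, H, W] from a VAE encoder
            p = self.patch
            B, C, H, W = latents.shape
            x = latents.unfold(2, p, p).unfold(3, p, p)      # [B, C, H/p, W/p, p, p]
            x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
            return self.proj(x)                              # [B, num_tokens, d_model]

Those embeddings then get interleaved with the text token embeddings and the whole sequence is trained with ordinary next-token prediction.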
For the latter approach, this can be a simple model, or it can be a diffusion model. For encoding you do something like a ViT. For decoding you train a diffusion model conditioned on the tokens, throughout the training of the LLM.
For the diffusion approach, you'd usually do post-training on the diffusion decoder to shrink down the number of diffusion steps needed.
The real crux of these models is the dataset. Pretraining on the internet is not bad, since there's often good correlation between the text and the images. But there aren't really good instruction datasets for this. Like, "here's an image, draw it like a comic book" type stuff. Given OpenAI's approach in the past, they may have just brute-forced the dataset using lots of human workers. That seems to be the most likely approach anyway, since no public vision models are quite good enough to do extensive RL against.
And as for OpenAI's architecture here, we can only speculate. The "loading from the top, starting from a blurry image" effect is either a direct result of their architecture or a gimmick to slow down requests. If the former, it means they are able to get a low resolution version of the image quickly, and then slowly generate the higher resolution "in order." Since it's top-to-bottom, that implies token-by-token decoding. My _guess_ is that the LLM's image token predictions are only "good enough." So they have a small, quick decoder take those and generate a very low resolution base image. Then they run a stronger decoding model, likely a token-by-token diffusion model. It takes as condition the image tokens and the low resolution image, and diffuses the first patch of the image. Then it takes as condition the same plus the decoded patch, and diffuses the next patch. And so forth.
A mixture of approaches like that allows the LLM to be truly multi-modal without the image tokens being too expensive, and the token-by-token diffusion approach helps offset memory cost of diffusing the whole image.
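As a rough sketch of that speculated decoding loop (every function here is a hypothetical stub, just to show the conditioning structure, not a real API):

    def quick_decode(image_tokens):
        """Cheap decoder: image tokens -> the low-res, blurry preview."""
        ...

    def diffuse_patch(image_tokens, low_res, decoded_patches):
        """Stronger diffusion decoder for the next patch, in raster order."""
        ...

    def stitch(patches):
        """Assemble decoded patches into the final image."""
        ...

    def decode_image(image_tokens, num_patches):
        low_res = quick_decode(image_tokens)       # shown to the user right away
        decoded = []
        for _ in range(num_patches):               # the top-to-bottom "reveal"
            patch = diffuse_patch(image_tokens, low_res, decoded)
            decoded.append(patch)                  # each patch conditions the next
        return stitch(decoded)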
I don't recall if I've seen token-by-token diffusion in a published paper, but it's feasible and is the best guess I have given the information we can see.
EDIT: I should note, I've been "fooled" in the past by OpenAI's API. When o* models first came out, they all behaved as if the output were generated "all at once." There was no streaming, and in the chat client the response would just show up once reasoning was done. This led me to believe they were doing an approach where the reasoning model would generate a response and refine it as it reasoned. But that's clearly not the case, since they enabled streaming :P So take my guesses with a huge grain of salt.
zaptrem
Token by token diffusion was done by MAR https://arxiv.org/abs/2406.11838 and Fluid (scaled up MAR) https://arxiv.org/abs/2410.13863
When the patch locations are picked randomly, they found it worked okay, but doing it in raster order (left to right, top to bottom) they found it didn't work as well. We tried it for music and found it was vulnerable to compounding error and lots of oddness relating to the fragility of continuous-space CFG (classifier-free guidance).
og_kalu
There is a more recent approach to auto-regressive image generation. Rather than predicting the next patch at the target resolution one by one, it predicts the next resolution. That is, the image at a small resolution followed by the image at a higher resolution and so on.
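Roughly, in pseudocode (predict_scale stands in for the transformer forward pass, and the scale schedule is made up):

    def generate_coarse_to_fine(predict_scale, scales=(4, 8, 16, 32, 64)):
        generated = []                    # token maps for each scale so far
        for res in scales:
            # One pass predicts the full res x res token map at this scale,
            # conditioned on everything generated at the coarser scales.
            tokens = predict_scale(generated, res)
            generated.append(tokens)
        return generated[-1]              # finest token map, decoded to pixels elsewhere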
lyu07282
Would be interested to know as well. As far as I know there is no public information about how this works exactly. This is all I could find:
> The system uses an autoregressive approach — generating images sequentially from left to right and top to bottom, similar to how text is written — rather than the diffusion model technique used by most image generators (like DALL-E) that create the entire image at once. Goh speculates that this technical difference could be what gives Images in ChatGPT better text rendering and binding capabilities.
https://www.theverge.com/openai/635118/chatgpt-sora-ai-image...
treis
I wonder how it'd work if the layers were more physically based. In other words, something like rough 3D shape -> details -> color -> perspective -> lighting.
Also wonder if you'd get better results in generating something like blender files and using its engine to render the result.
astrange
DALL-E was autoregressive; it's 2 and 3 that used diffusion and were much less intelligent as a result.
nine_k
It also would mean that the model can correctly split the image into layers, or segments, matching the entities described. The low-res layers can then be fed to other image-processing models, which would enhance them and fill in missing small details. The result could be a good-quality animation, for instance, and the "character" layers can even potentially be reusable.
DeathArrow
>You can also do very impressive information-conserving translations, such as changing the drawing style, but also stuff like "change day to night", or "put a hat on him", and so forth.
You can do that with diffusion, too. Just lock the parameters in ComfyUI.
blixt
Yeah I wasn’t very imaginative in my examples, with 4o you can also perform transformations like “rotate the camera 10 degrees to the left” which would be hard without a specialized model. Basically you can run arbitrary functions on the exact image contents but in latent space.
snickell
> truly generative UI, where the model produces the next frame of the app
I built this exact thing last month, demo: https://universal.oroborus.org (not viable on phone for this demo, fine on tablet or computer)
Also see discussion and code at: http://github.com/snickell/universal
I wasn't really planning to share/release it today, but, heck, why not.
I started with bitmap-style generative image models, but because they are still pretty bad at text (even this, although it’s dramatically better), for early-2025 it’s generating vector graphics instead. Each frame is an LLM response, either as an svg or static html/css. But all computation and transformation is done by the LLM. No code/js as an intermediary. You click, it tells the LLM where you clicked, the LLM hallucinates the next frame as another svg/static-html.
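The core loop is dead simple, something like this (call_llm and the prompt wording here are illustrative, not the actual universal code):

    def next_frame(call_llm, previous_frame_html, event):
        prompt = (
            "You are rendering an app as a single self-contained SVG/HTML frame.\n"
            f"Previous frame:\n{previous_frame_html}\n"
            f"User event: click at ({event['x']}, {event['y']}) on #{event['target_id']}.\n"
            "Respond with the complete next frame only."
        )
        return call_llm(prompt)  # no intermediate code: the model is the app logic

The browser just renders the returned markup, captures the next click, and repeats.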
If it ran 50x faster it'd be an absolutely jaw-dropping demo. Unlike "LLMs write code", this has depth. Like all programming, the "LLMs write code" model requires the programmer or LLM to anticipate every condition in advance. This makes LLM-written "vibe coded" apps either gigantic (and the LLM falls apart) or shallow.
In contrast, as you use universal, you can add or invent features ranging from small to big, and it will fill in the blanks on demand, fairly intelligently. If you don't like what it did, you can critique it, and the next frame improves.
It's agonizingly slow in 2025, but much smarter and in weird ways less error-prone than using the LLM to generate code that you then run: just run the computation via the LLM itself.
You can build pretty unbelievable things (with hallucinated state, granted) with a few descriptive sentences, far exceeding the capabilities you can "vibe code" with the same description. And it never gets lost in its rat's nest of self-generated garbage code because… there is no code to get lost in.
Code is a medium with a surprisingly strong grain. This demo is slow, but SO much more flexible and personally adaptable than anything I've used where the logic is implemented via a programming language.
I don’t love this as a programmer, but my own use of the demo makes me confident that programming languages as a category will have a shelf life if LLM hardware gets fast, cheap and energy efficient.
I suspect LLMs will generate not programming language code, but direct wasm or just machine code on the fly for things that need to react faster than they can draw a frame, but core logic will move out of programming languages (not even LLM-written code). Maybe similar to the way we bind to low-level fast languages but a huge percentage of "business" logic is written in relatively slower languages.
FYI, I may not be able to afford the credits if too many people visit; I put $1,000 of credits on this, we'll see if that lasts. This is Claude 3.7. I tried everything else, and only Claude had the visual intelligence today. IMO this is a much more compelling glimpse of the future than coding models. Unfortunately, generating an SVG per click is pricey: each click/frame costs me about $0.05. I'll fund this as far as I can so folks can play with it.
Anthropic? You there? Wanna throw some credits at an open source project doing something that literally only works on Claude today? Not just better, but "only Claude 3.7 can show this future today". I'd love for lots more people to see the demo, but I really could use an in-kind credit donation to make this viable. If anyone at Anthropic is inspired and wants to hook me up: snickell@alumni.stanford.edu. Very happy to rep Claude 3.7 even more than I already do.
I think it's great advertising for Claude. I believe the reason Claude seems to do SO much better at this task is, one, it shows far greater spatial intelligence, and two, I suspect they are the only state-of-the-art model intentionally training on SVG.
blixt
This is super cool! I think new kinds of experiences can be built with infinite generative UIs. Obviously there will need to be good memory capabilities, maybe through tool use.
If you end up taking this further and self hosting a model you might actually achieve a way faster “frame rate” with speculative decoding since I imagine many frames will reuse content from the last. Or maybe a DSL that allows big operations with little text. E.g. if it generates HTML/SVG today then use HAML/Slim/Pug: https://chatgpt.com/share/67e3a633-e834-8003-b301-7776f76e09...
snickell
What I'm currently doing is caveman: I ask the LLM to attach a unique id= to every element, and I gave it an attribute (data-use-cached) it can use to mark "the contents of this element should be loaded from the previous frame": https://github.com/snickell/universal/blob/47c5b5920db5b2082...
For example, this specifies that #my-div should be replaced with the value from the previous frame (which itself might have been cached): <div id="my-div" data-use-cached></div>
This lowers the render time /substantially/: for simple changes like "clicked here, pop open a menu" it can do it in 10s, vs. a full frame render which might be 2 minutes (obviously it varies with how much is on the screen!).
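The substitution step is roughly this (using BeautifulSoup as an illustrative parser; the real code may differ):

    from bs4 import BeautifulSoup

    def merge_cached(new_frame_html, previous_frame_html):
        new = BeautifulSoup(new_frame_html, "html.parser")
        old = BeautifulSoup(previous_frame_html, "html.parser")
        for el in new.find_all(attrs={"data-use-cached": True}):
            el_id = el.get("id")
            cached = old.find(id=el_id) if el_id else None
            if cached is not None:
                el.replace_with(cached)  # reuse the element rendered last frame
        return str(new)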
I think using HAML etc is an interesting idea, thanks for suggesting it, that might be something I'll experiment with.
The challenge I'm finding is that "fancy" also has a way of confusing the LLM. E.g. I originally had the LLM produce literal unified diffs between frames. I reasoned it had seen plenty of diffs of HTML in its training data set. It could actually do this, BUT image quality and intelligence were notably affected.
Part of the problem is that at the moment (well 1mo ago when I last benchmarked), only Claude is "past the bar" for being able to do this particular task, for whatever reason. Gemini Flash is the second closest. Everything else (including 4o, 4.5, o1, deepseek, etc) are total wipeouts.
What would be really amazing is if, say, Llama 4 turns out to be good in the visual domain the way Claude is, and you can run it on one of the LLM-on-silicon vendors (Cerebras, Groq, etc.) to get 10x the token rate.
LMK if you have other ideas, thanks for thinking about this and taking a look!
koliber
It’s like a lucid dream version of using and modifying the software at the same time.
Retr0id
Do you have any demo videos?
snickell
No, I wasn't planning to post this for a couple weeks, but I saw the comment and was like "eh, why not?".
You can watch "sped up" past sessions by other people who used this demo here, which is kind of like a demo video: https://universal.oroborus.org/gallery
But the gallery feature isn't really there today; it shows all the "one-click and bounce" sessions, and it's hard to find signal in the noise.
I'll probably submit a "Show HN" when I have the gallery more together, and I think it's a great idea to pick a multi-click gallery sequence and upload it as a video.
Mond_
Pretty sure the modern Gemini image models can already do token based image generation/editing and are significantly better and faster.
blixt
Yeah Gemini has had this for a few weeks, but much lower resolution. Not saying 4o is perfect, but my first few images with it are much more impressive than my first few images with Gemini.
yieldcrv
weeks, ya'll, weeks!
og_kalu
It's faster but it's definitely not better than what's being showcased here. The quality of Flash 2 image gens is generally pretty meh.
vunderba
Ran through some of my relatively complex prompts combined with using pure text prompts as the de-facto means of making adjustments to the images (in contrast to using something like img2img / inpainting / etc.)
https://mordenstar.com/blog/chatgpt-4o-images
It's definitely impressive, though it once again fell flat on the ability to render a 9-pointed star.
jimbo_joe
Didn't work for me on the first prompting (got a 10-pointed one), but after sending [this is 10 points, make it 9] it did render a 9-pointed one too
M4v3R
I’ve just tried it and oh wow it’s really good. I managed to create a birthday invitation card for my daughter in basically 1-shot, it nailed exactly the elements and style I wanted. Then I asked to retain everything but tweak the text to add more details about the date, venue etc. And it did. I’m in shock. Previous models would not be even halfway there.
swyx
share prompt minus identifying details?
M4v3R
> Draw a birthday invitation for a 4 year old girl [name here]. It should be whimsical, look like its hand-drawn with little drawings on the sides of stuff like dinosaurs, flowers, hearts, cats. The background should be light and the foreground elements should be red, pink, orange and blue.
Then I asked for some changes:
> That's almost perfect! Retain this style and the elements, but adjust the text to read:
> [refined text]
> And then below it should add the location and date details:
> [location details]
kh_hk
> Introducing 4o Image Generation: [...] our most advanced image generator yet
Then google:
> Gemini 2.5: Our most intelligent AI model
> Introducing Gemini 2.0 | Our most capable AI model yet
I could go on forever. I hope this trend dies and apple starts using something effective so all the other companies can start copying a new lexicon.
roenxi
We're in the middle of a massive and unprecedented boom in AI capabilities. It is hard to be upset about this phrasing - it is literally true and extremely accurate.
kh_hk
If that's so then there's no need to be hyperbolic about it. Why would they publish a model that is not their most advanced model?
roenxi
Most things aren't in a massive boom and most people aren't that involved in AI. This is a rare example of great communication in marketing - they're telling people who might not be across this field what is going on.
> Why would they publish a model that is not their most advanced model?
I dunno, I'm not sitting in the OpenAI meetings. That is why they need to tell us what they are doing - it is easy to imagine them releasing something that isn't their best model ever and so they clarify that this is, in fact, the new hotness.
ghshephard
o3 mini wasn't so much a most advanced model, as it was incredibly affordable for the IQ it was presenting at the time. Sometimes it's about efficiency and not being on the frontier.
CamperBob2
(Shrug) It's common for less-than-foundation-level models to be released every so often. This is done in order to provide new options, features, pricing, service levels, APIs or whatever that aren't yet incorporated into the main model, or that are never intended to be.
Just a consequence of how much time and money it takes to train a new foundation model. It's not going to happen every other week. When it does, it is reasonable to announce it with "Announcing our most powerful model yet."
sigmoid10
Has post-Jobs Apple ever come up with anything that would warrant this hope?
internetter
Every iPhone is their best iPhone yet
brianshaler
Even the 18 Pro Max Ultra with Apple Intelligence?
Obligatory Jobs monologue on marketing people:
azinman2
Not wrong though
layer8
Only the September ones. ;)
pell
Apple silicon chips
kh_hk
No, but I think they stopped with "our most" (since all other brainless corps adopted it) and just connect adjectives with dots.
Hotwheels: Fast. Furious. Spectacular.
sigmoid10
Maybe people also caught up to the fact that the "our most X product" for Apple usually means someone else already did X a long time ago and Apple is merely jumping on the wagon.
magicmicah85
When you keep improving, it's always going to be the best or most: https://www.youtube.com/watch?v=bPkso_6n0vs
caseyy
This is my latest and most advanced comment yet.
Buttons840
Every step of gradient descent is the best model yet!
hombre_fatal
Maybe it’s not useless. 1) it’s only comparing it to their own products and 2) it’s useful to know that the product is the current best in their offering as opposed to a new product that might offer new functionality but isn’t actually their most advanced.
Which is especially relevant when it's not obvious which product is the latest and best just looking at the names. Lots of tech naming fails this test from Xbox (Series X vs S) to OpenAI model names (4o vs o1-pro).
Here they claim 4o is their most capable image generator which is useful info. Especially when multiple models in their dropdown list will generate images for you.
Kiro
What's the problem?
kh_hk
It's a nitpick about the repetitive phrasing for announcements
<Product name>: Our most <superlative> <thing> yet|ever.
rachofsunshine
Speaking as someone who'd love to not speak that way in my own marketing - it's an unfortunate necessity in a world where people will give you literal milliseconds of their time. Marketing isn't there to tell you about the thing, it's there to get you to want to know more about the thing.
echelon
I hate modern marketing trends.
This one isn't even my biggest gripe. If I could eliminate any word from the English language forever, it would be "effortlessly".
alkonaut
The whiteboard image is insane. Even if it took more than 8 to find it, it's really impressive.
To think that a few years ago we had dreamy pictures with eyes everywhere. And not long ago we were always identifying the AI images by the 6 fingered people.
I wonder how well the physics is modeled internally. E.g. if you prompt it to model some difficult ray tracing scenario (a box with a separating wall and a light in one of the chambers which leaks through to the other chamber etc)?
Or if you have a reflective chrome ball in your scene, how well does it understand that the image reflected must be an exact projection of the visible environment?
minimaxir
OpenAI's livestream of GPT-4o Image Generation shows that it is slowwwwwwwwww (maybe 30 seconds per image, which Sam Altman had to spin as "it's slow but the generated images are worth it"). Instead of using a diffusion approach, it appears to be generating the image tokens and decoding them akin to the original DALL-E (https://openai.com/index/dall-e/), which allows for streaming partial generations from top to bottom. In contrast, Google's Gemini can generate images and make edits in seconds.
No API yet, and given the slowness I imagine it will cost much more than the $0.03+/image of competitors.
infecto
As a user, images feel slightly slower but comparable to the previous generation. Given the significant quality improvement, it's a fair trade-off. Overall, it feels snappy, and the value justifies a higher price.
kevmo314
Maybe this is this era's dial-up.
ijidak
Ha. That's a good analogy.
When I first read the parent comment, I thought, maybe this is a long-term architecture concern...
But your message reminded me that we've been here before.
asadm
Especially with the slow loading effect it has.
aurareturn
If you look at the examples given, this is the first time I've felt like AI generated images have passed the uncanny valley.
The results are groundbreaking in my opinion. How much longer until an AI can generate 30 successive images together and make an ultra-realistic movie?
Taek
> it appears to be generating the image tokens and decoding them akin to the original DALL-E
The animation is a lie. The new 4o with "native" image generating capabilities is a multi-modal model that is connected to a diffusion model. It's not generating images one token at a time, it's calling out to a multi-stage diffusion model that has upscalers.
You can ask 4o about this yourself, it seems to have a strong understanding of how the process works.
low_tech_love
Would it seem otherwise if it was a lie?
throwaway314155
Sorry but I think you may be mistaken if your only source is ChatGPT. It's not aware of its own creation processes beyond what is included in its system prompt.
cubefox
LLMs are autoregressive, so they can't be (multi-modality) integrated with diffusion image models, only with autoregressive image models (which generate an image via image tokens). Historically those had lower image fidelity than diffusion models. OpenAI now seems to have solved this problem somehow. More than that, they appear far ahead of any available diffusion model, including Midjourney and Imagen 3.
Gemini "integrates" Imagen 3 (a diffusion model) only via a tool that Gemini calls internally with the relevant prompt. So it's not a true multimodal integration, as it doesn't benefit from the advanced prompt understanding of the LLM.
Edit: Apparently Gemini also has an experimental native image generation ability.
SweetSoftPillow
Gemini added their multimodal Flash model to Google AI Studio some time ago. It does not use Imagen via a tool; it uses native capabilities to manipulate images, and it's free to try.
hansvm
> so they can't be integrated
That's overly pessimistic. Diffusion models take an input and produce an output. It's perfectly possible to auto-regressively analyze everything up to the image, use that context to produce a diffusion image, and incorporate the image into subsequent auto-regressive shenanigans. You'll preserve all the conditional probability factorizations the LLM needs while dropping a diffusion model in the middle.
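Sketch of that hybrid (all functions here are hypothetical stubs):

    def generate_with_images(llm_step, diffuse, encode_image, prompt_tokens):
        context = list(prompt_tokens)
        while True:
            token = llm_step(context)                 # ordinary next-token prediction
            if token == "<image>":
                image = diffuse(context)              # diffusion model conditioned on context so far
                context.extend(encode_image(image))   # image re-enters the autoregressive context
            elif token == "<eos>":
                return context
            else:
                context.append(token)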
summerlight
Your understanding seems outdated; I think people are referring to Gemini's native image generation.
argsnd
Is this the same for their gemini-2.0-flash-exp-image-generation model?
cubefox
No that seems to be indeed a native part of the multimodal Gemini model. I didn't know this existed, it's not available in the normal Gemini interface.
johntb86
Meta has experimented with a hybrid mode, where the LLM uses autoregressive mode for text, but within a set of delimiters will switch to diffusion mode to generate images. In principle it's the best of both worlds.
echelon
I expect the Chinese to have an open source answer for this soon.
They haven't been focusing attention on images because the most used image models have been open source. Now they might have a target to beat.
rfoo
ByteDance has been working on autoregressive image generation for a while (see VAR, NeurIPS 2024 best paper). Traditionally they weren't in the open-source gang though.
keeganpoppen
i find this “slow” complaint (/observation— i dont view this comment as a complaint, to be clear) to be quite confusing. slow… compared to what, exactly? you know what is slow? having to prompt and reprompt 15 times to get the stupid model to spell a word correctly and it not only refuses, but is also insistent that it has corrected the error this time. and afaict this is the exact kind of issue this change should address substantially.
im not going to get super hyperbolic and histrionic about “entitlement” and stuff like that, but… literally this technology did not exist until like two years ago, and yet i hear this all the time. “oh this codegen is pretty accurate but it’s slow”, “oh this model is faster and cheaper (oh yeah by the way the results are bad, but hey it’s the cheapest so it’s better)”. like, are we collectively forgetting that the whole point of any of this is correctness and accuracy? am i off-base here?
the value to me of a demonstrably wrong chat completion is essentially zero, and the value of a correct one that anticipates things i hadn’t considered myself is nearly infinite. or, at least, worth much, much more than they are charging, and even _could_ reasonably charge. it’s like people collectively grouse about low quality ai-generated junk out of one side of their mouths, and then complain about how expensive the slop is out of the other side.
hand this tech to someone from 2020 and i guarantee you the last thing you’d hear is that it’s too slow. and how could it be? yeah, everyone should find the best deals / price-value frontier tradeoff for their use case, but, like… what? we are all collectively devaluing that which we lament is being devalued by ai by setting such low standards: ourselves. the crazy thing is that the quickly-generated slop is so bad as to be practically useless, and yet it serves as the basis of comparison for… anything at all. it feels like that “web-scale /dev/null” meme all over again, but for all of human cognition.
cchance
i mean on free chat an image took maybe 2 seconds?
RobinL
It's very impressive. It feels like the text is a bit of a hack where they're somehow rendering the text separately and interpolating it into the image. Not always, I got it to render calligraphy with flourishes, but only for a handful of words.
For example, I asked it to render a few lines of text on a medieval scroll, and it basically looked like a picture of a gothic font written onto a background image of a scroll
aurareturn
First AI image generator to pass the uncanny valley test? Seems like it. This is the biggest leap in image generation quality I've ever seen.
How much longer until an AI that can generate 30 frames with this quality and make a movie?
About 1.5 years ago, I thought AI would eventually allow anyone with an idea to make a Hollywood quality movie. Seems like we're not too far off. Maybe 2-3 more years?
kgeist
>First AI image generator to pass the uncanny valley test?
Other image generators I've used lately often produced pretty good images of humans, as well [0]. It was DALLE that consistently generated incredibly awful images. Glad they're finally fixing it. I think what most AI image generators lack the most is good instruction following.
[0] YandexArt for the first prompt from the post: https://imgur.com/a/VvNbL7d The woman looks okay, but the text is garbled, and it didn't fully follow the instruction.
aurareturn
Do you have another example from YandexArt?
https://images.ctfassets.net/kftzwdyauwt9/7M8kf5SPYHBW2X9N46...
OpenAI's human faces look *almost* real.
GaggiX
Ideogram 2.0 and Recraft also create images that looks very much real.
For drawings, NovelAI's models are way beyond the uncanny valley now.
bb88
My experience with these announcements is that they're cherry-picking the best results from maybe several hundred or a thousand prompts.
I'm not saying that it's not true, it's just "wait and see" before you take their word as gold.
I think MS's claim on their quantum computing breakthrough is the latest form of this.
wes-k
The examples they show have little captions that say "best of #", like "best of 8" or "best of 4". Hopefully that truly represents the odds of generating the level of quality shown.
bb88
I'm not doubting it's an improvement, because it looks like it is.
I guess here's an example of a prompt I would like to see:
A flying spaghetti monster with a metal colander on its head flying above New York City, saving the world from a very very evil Pope.
I'm not anti/pro spaghetti monster or catholicism. But I can visualize it clearly in my head what that prompt might look like.
aqme28
Some of the prompts are pretty long. I'm curious how many iterations it took to get to that prompt for them to take the top 8 out of.
lottaFLOPS
it’s rolling out to users on all tiers, so no need to wait. I tried it and saw outputs from many others. it’s good. very good
bb88
Chat GPT requires logging in with an email. I hesitated on that.
That's why I prefer to wait.
kilroy123
Have you tried it? It's crazy good.
PufPufPuf
Can it be tried? ChatGPT still uses DALL-E for me.
bb88
No offense but after years of vaporware and announcements that seemed more plausible than implausible, I'll remain skeptical.
I will also not give them my email address just to try it out.
Maxion
Why are you making blanket statements on things that you haven't even tried? This is leaps and bounds better than before.
WithinReason
why not use a fake email address?
user3939382
I’ll just be happy with not everything having that oversaturated CG/cartoon style that you can't prompt your way out of.
alana314
I was relying on that to determine if images were AI though
LeoPanthera
Frustratingly the DALL-E API actually has an option for this, you can switch it from "vivid" to "realistic".
This option is not exposed in ChatGPT, it only uses vivid.
jjeaff
Is that an artifact of the training data? Where are all these original images with that cartoony look that it was trained on?
wongarsu
A large part of deviantart.com would fit that description. There are also a lot of cartoony or CG images in communities dedicated to fanart. Another component in there is probably the overly polished and clean look of stock images, like the front page results of shutterstock.
"Typical" AI images are this blend of the popular image styles of the internet. You always have a bit of digital drawing + cartoon image + oversaturated stock image + 3d render mixed in. Models trained on just one of these work quite well, but for a generalist model this blend of styles is an issue
astrange
> There are also a lot of cartoony or CG images in communities dedicated to fanart.
Asian artists don't color this way though; those neon oversaturated colors are a Western style.
(This is one of the easiest ways to tell a fake-anime western TV show, the colors are bad. The other way is that action scenes don't have any impact because they aren't any good at planning them.)
jl6
Wild speculation: video game engines. You want your model to understand what a car looks like from all angles, but it’s expensive to get photos of real cars from all angles, so instead you render a car model in UE5, generating hundreds of pictures of it, from many different angles, in many different colors and styles.
ToValueFunfetti
I've heard this is downstream of human feedback. If you ask someone which picture is better, they'll tend to pick the more saturated option. If you're doing post-training with humans, you'll bake that bias into your model.
minimaxir
Ever since Midjourney popularized it, image generation models are often posttrained on more "aesthetic" subsets of images to give them a more fantasy look. It also help obscure some of the imperfections of the AI.
HappMacDonald
.. either that or they are padding out their training data with scads of relatively inexpensive to produce 3d rendered images</speculation>
richardfulop
you really have to NOT try to end up with that result in MJ.
ilaksh
The new model in the drop down says something like "4o Create Image (Updated)". It is truly incredible. Far better than any other image generator as far as understanding and following complex prompts.
I was blown away when they showed this many months ago, and found it strange that more people weren't talking about it.
This is much more precise than the Gemini one that just came out recently.
MoonGhost
> found it strange that more people weren't talking about it.
Some simply dislike everything OpenAI. Just like everything Musk or Trump.
miletus
saw this thread on X. here are some incredible use cases of 4o image generation: https://x.com/0xmetaschool/status/1904804251148443873
blixt
What's important about this new type of image generation that's happening with tokens rather than with diffusion, is that this is effectively reasoning in pixel space.
Example: Ask it to draw a notepad with an empty tic-tac-toe, then tell it to make the first move, then you make a move, and so on.
You can also do very impressive information-conserving translations, such as changing the drawing style, but also stuff like "change day to night", or "put a hat on him", and so forth.
I get the feeling these models are quite restricted in resolution, and that more work in this space will let us do really wild things such as ask a model to create an app step by step first completely in images, essentially designing the whole app with text and all, then writing the code to reproduce it. And it also means that a model can take over from a really good diffusion model, so even if the original generations are not good, it can continue "reasoning" on an external image.
Finally, once these models become faster, you can imagine a truly generative UI, where the model produces the next frame of the app you are using based on events sent to the LLM (which can do all the normal things like using tools, thinking, etc). However, I also believe that diffusion models can do some of this, in a much faster way.