Create and edit images with Gemini 2.0 in preview
88 comments
· May 7, 2025
esperent
Your site is really useful, thanks for sharing. One issue is that the list of examples sticks to the top and covers more than half of the screen on mobile, could you add a way to hide it?
If you're looking for other suggestions a summary table showing which models are ahead would be great.
vunderba
Great point - when I started building it I think I only had about four test cases, but now the nav bar is eating 50% of the vertical display so I've removed it from mobile display!
Wrt the summary table, did you have a different metric in mind? The top of the display should already be showing a "Model Performance" chart with OpenAI 4o and Google Imagen 3 leading the pack.
ticulatedspline
Excellent site! OpenAI 4o is more than mildly frightening in its capability to understand the prompt. Seems mostly what's holding it back is a tendency away from photo-realism (or even typical digital art styles) and its own safeguards.
avereveard
It's a bit expensive/slow, but for styled requests I let it do the base image, and when I'm happy with the composition I ask it to remake it as a picture or in whatever style is needed.
troupo
I also find it weird how it defaults/devolves into this overall brown-ish style. Once you see it, you see it everywhere
echelon
Multimodal is the only image generation modality that matters going forward. Flux, HiDream, Stable Diffusion, and the like are going to be relegated to the past once multimodal becomes more common. Text-to-image sucks, and image-to-image with all the ControlNets and Comfy nodes is cumbersome in comparison to true multimodal instructiveness.
I hope that we get an open weights multimodal image gen model. I'm slightly concerned that if these things take tens to hundreds of millions of dollars to train, that only Google and OpenAI will provide them.
That said, the one weakness in multimodal models is that they don't let you structure the outputs yet. Multimodal + ControlNets would fix that, and that would be like literally painting with the mind.
The future, when these models are deeply refined and perfected, is going to be wild.
zaptrem
Good chance a future llama will output image tokens
liuliu
Do you mind sharing which HiDream-I1 model you are using? I am getting better results with these prompts from my implementation inside Draw Things.
vunderba
Sure - I was using "hidream-i1-dev", but if you're seeing better results I might rerun the HiDream tests with the "hidream-i1-full" model.
I've been thinking about possibly rerunning the Flux Dev prompts using the 1.1 Pro but I liked having a base reference for images that can be generated on consumer hardware.
pkulak
> That mermaid was quite the saucy tart.
Really now?
belter
Your shoot-out site is very useful. Could I suggest adding prompts that expose common failure modes?
For example, asking the models to show clocks set to a specific time or people drawing with their left hand. I think most, if not all, models will likely display every clock with the same time... and portray subjects drawing with their right hand.
vunderba
@belter / @crooked-v
Thanks for the suggestions. Most of the current prompts are a result of personal images that I wanted to generate, so I'll try to add some "classic GenAI failure modes". Musical instruments such as pianos also used to be a pretty big failure point.
troupo
For personal images I often play with woolly mammoths, and most models are incapable of generating anything but textbook images. Any deviation either becomes an elephant or an abomination (bull- or bear-like monsters).
crooked-v
Another I would suggest is buildings with specific unusual proportions and details (e.g. "the mansion's west wing is twice the height of the right wing and has only very wide windows"). I've yet to find a model that will do that kind of thing reliably; they seem to just fall back on the vibes of whatever painting or book cover is vaguely similar to what's described.
droopyEyelids
Generating a simple maze for kids is also not possible yet.
eminence32
This seems neat, I guess. But whenever I try tools like this, I often run into the limits of what I can describe in words. I might try something like "Add some clutter to the desk, including stacks of paper and notebooks" but when it doesn't quite look like what I want, I'm not sure what else to do except try slightly different wordings until the output happens to land on what I want.
I'm sure part of this is a lack of imagination on my part about how to describe the vague image in my own head. But I guess I have a lot of doubts about using a conversational interface for this kind of stuff
monster_truck
Chucking images at any model that supports image input and asking it to describe specific areas/things 'in extreme detail' is a decent way to get an idea of what it's expecting vs. what you want.
thornewolf
+1 to this flow. I use the exact same phrase "in extreme detail" as well haha. Additionally, I ask the model to describe what prompt it might write to produce some edit itself.
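For anyone who wants to script this flow, a minimal sketch against the generateContent endpoint used elsewhere in the thread; the model name, file name, and prompt wording here are illustrative:
# Send an image plus an "extreme detail" prompt and print the text parts.
# Assumes a PNG and GEMINI_API_KEY in the environment; swap in your own model/prompt.
IMG_B64=$(base64 -w0 desk.png)   # macOS: base64 -i desk.png

curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"contents\": [{
      \"parts\": [
        {\"inline_data\": {\"mime_type\": \"image/png\", \"data\": \"$IMG_B64\"}},
        {\"text\": \"Describe the desk area in extreme detail, then write the prompt you would use to add stacks of paper and notebooks.\"}
      ]
    }]
  }" | jq -r '.candidates[0].content.parts[] | select(.text) | .text'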
crooked-v
I just tried a couple of cases that ChatGPT is bad at (reproducing certain scenes/setpieces from classic tabletop RPG adventures, like the weird pyramid from classic D&D B4 The Lost City), and Gemini fails in just about the same way of getting architectural proportions and scenery details wrong even when given simple, broad rules about them. Adding more detail seems kind of pointless when it can't even get basics like "creature X is about as tall as the building around it" or "the pyramid is surrounded by ruined buildings" right.
BoorishBears
What's an example of a prompt you tried and it failed on?
metalrain
Exactly. For more complex compositions, lighting, and image enhancements/filters, there are so many things where you know how they look, but describing them such that an LLM gets it and will reproduce it is pretty difficult.
Sometimes sketching it could be helpful, but more abstract technical things like LUTs still feel out of reach.
qoez
Maybe that's how the future will unfold. There will be subtle things AI fails to learn, and there will be differences in how good people are at making AI do things, which will be a new skill in itself and will end up being a determining difference in pay in the future.
gowld
This is "Prompt Engineering"
betterThanTexas
> I'm sure part of this is a lack of imagination on my part about how to describe the vague image in my own head.
This is more related to our ability to articulate than is easy to demonstrate, in my experience. I can certainly produce images in my head I have difficulty reproducing well and consistently via linguistic description.
SketchySeaBeast
It's almost as if being able to create art accurate to our mental vision requires practice and skill, be it the ability to create an image or to describe it in words and evoke an image in others.
betterThanTexas
Absolutely! But this was surprising to me—my intuition says if I can firmly visualize something, I should be able to describe it. I think many people have this assumption and it's responsible for a lot of projection in our social lives.
bufferoverflow
In that scenario, if you can't describe what you want with words, a human designer can't read your mind either.
xbmcuser
Ask Gemini to word your thoughts better, then use those words to do the image editing.
Nevermark
Perhaps describe the types and styles of work associated with the desk, to create a coherent character to the clutter
mkl
> what the lamp from the second image would look like on the desk from the first image
The lamp is put on a different desk in a totally different room, with AI mush in the foreground. Props for not cherry-picking a first example, I guess. The sofa colour one is somehow much better, with a less specific instruction.
cyral
That one is an odd example, especially since image #3 does a similar task with excellent accuracy in keeping the old image intact. I've had the same issues when trying to make it visualize adding decor; it ends up changing the whole room or the furniture materials.
cush
The doodle demo is super fun
https://aistudio.google.com/apps/bundled/gemini-co-drawing?s...
simonw
Be a bit careful playing with this one. I tried this:
curl -s -X POST \
"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-preview-image-generation:generateContent?key=$GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"parts": [
{"text": "Provide a vegetarian recipe for butter chicken but with chickpeas not chicken and include many inline illustrations along the way"}
]
}],
"generationConfig":{"responseModalities":["TEXT","IMAGE"]}
}' > /tmp/out.json
And got back 41MB of JSON with 28 base64 images in it: https://gist.github.com/simonw/55894032b2c60b35f320b6a166ded... At 4c per image, that's more than a dollar for that single prompt.
I built this quick tool https://tools.simonwillison.net/gemini-image-json for pasting that JSON into to see it rendered.
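If you'd rather pull the images out of that JSON locally, something along these lines should work, assuming the usual candidates/content/parts/inlineData response shape (the file names are illustrative):
# Decode every inline image in the saved response to a numbered PNG.
# Images sit at candidates[].content.parts[].inlineData.data as base64;
# check inlineData.mimeType if they turn out not to be PNGs.
jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' /tmp/out.json \
  | nl -ba \
  | while read -r n b64; do
      printf '%s' "$b64" | base64 -d > "illustration-$n.png"
    done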
weird-eye-issue
I mean you did ask for "many illustrations"
Yiling-J
I generated 100 recipes with images using gemini-2.0-flash and gemini-2.0-flash-exp-image-generation as a demo of text+image generation in my open-source project: https://github.com/Yiling-J/tablepilot/tree/main/examples/10...
You can see the full table with images here: https://tabulator-ai.notion.site/1df2066c65b580e9ad76dbd12ae...
I think the results came out quite well. Be aware I don't generate a text prompt based on row data for image generation. Instead, the raw row data (ingredients, instructions...) and table metadata (column names and descriptions) are sent directly to gemini-2.0-flash-exp-image-generation.
minimaxir
Of note is that the per-image pricing for Gemini 2.0 image generation is $0.039 per image, which is more expensive than Imagen 3 ($0.03 per image): https://ai.google.dev/gemini-api/docs/pricing
The main difference is that Gemini allows incorporating a conversation into generating the image, as demoed here, while Imagen 3 is strictly text-in/image-out with optional mask-constrained edits, but likely allows for higher-quality images overall if you're skilled with prompt engineering. This is a nuance that is annoying to differentiate.
vunderba
Anecdotal, but from preliminary sandbox testing side-by-side with Gemini 2.0 Flash and Imagen 3.0, it definitely appears that that is the case: higher overall visual quality from Imagen 3.
ipsum2
> likely allows for higher-quality images overall
What makes you say that?
thornewolf
Model outputs look good-ish. I think they are neat. I updated my recent hack project https://lifestyle.photo to the new model. It's middling-to-good.
There are a lot of failure modes still, but what I want is a very large cookbook showing what known-good workflows are. Since this is just so directly downstream of (limited) training data, it might be that I am just prompting in an ever so slightly bad way.
sigmaisaletter
Re your project: I'd expect at least the demo to not have an obvious flaw. The "lifestyle" version of your bag has a handle that is nearly twice as long as the "product" version.
thornewolf
This is a fair critique. While I am merely a "LLM wrapper", I should put the product's best foot forward and pay more attention to my showcase examples.
nico
Love your project, great application of gen AI, very straightforward value proposition, excellent and clear messaging
Very well done!
thornewolf
Thank you for the kind words! I am looking forward to creating a Show HN next week alongside a Product Hunt announcement. I appreciate any and all feedback. You can provide it through the website directly or through the email I have attached in my bio.
mNovak
I'm getting mixed results with the co-drawing demo, in terms of understanding what stick figures are, which seems pretty important for the 99% of us who can't draw a realistic human. I was hoping to sketch a scene, and let the model "inflate" it, but I ended up with 3D rendered stick figures.
It seems to help if you explicitly describe the scene, but then the drawing-along aspect seems relatively pointless.
Tsarp
There are direct prompt tests and then there are tests with tooling.
If, for example, you use ControlNets, you can pretty much get very close to the style and composition you need with an open model like Flux, which will be far better. Flux also has a few successors coming up now.
emporas
I use Gemini to create covers for songs/albums I make, with beautiful typography. Something like this [1]. I was dying of curiosity about how Ideogram managed to create such gorgeous images. I figured it out 2 days ago.
I take an image with some desired colors or typography from an already existing music album or from Ideogram's poster section. I pass it to Gemini and give the command:
"describe the texture of the picture, all the element and their position in the picture, left side, center right side, up and down, the color using rgb, the artistic style and the calligraphy or font of the letters"
Then I take the result and pass it through a different LLM, because I don't like Gemini that much; I find it much less coherent than other models. I usually use qwen-qwq-32b. I take the description Gemini outputs and give it to Qwen:
"write a similar description, but this time i want a surreal painting with several imaginative colors. Follow the example of image description, add several new and beautiful shapes of all elements and give all details, every side which brushstrokes it uses, and rgb colors it uses, the color palette of the elements of the page, i want it to be a pastel painting like the example, and don't put bioluminesence. I want it to be old style retro style mystery sci fi. Also i want to have a title of "Song Title" and describe the artistic font it uses and it's position in the painting, it should be designed as a drum n bass album cover"
Then I take the result and give it back to Gemini with the command: "Create an image with text "Song Title" for an album cover: here is the description of the rest of the album"
If the resulting image is good, then it is time to add the font. I take the new image description and pass it through Qwen again, supposing the image description has the fields Title and Typography:
"rewrite the description and add full description of the letters and font of text, clean or distressed, jagged or fluid letters or any other property they might have, where they are overlayed, and make some new patterns about the letter appearance and how big they are and the material they are made of, rewrite the Title and Typography."
I replace the previous description's Title and Typography sections with the new description and create images with beautiful fonts.
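A rough sketch of that loop in shell, using the same generateContent endpoint shown earlier in the thread; the model names, file names, prompts, and the rewrite step are placeholders for whatever you actually use:
# 1. Ask Gemini to describe the reference cover (sent inline as base64).
DESCRIBE_URL="https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GEMINI_API_KEY"
IMAGE_URL="https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-preview-image-generation:generateContent?key=$GEMINI_API_KEY"

DESC=$(curl -s -X POST "$DESCRIBE_URL" -H "Content-Type: application/json" -d "{
  \"contents\": [{\"parts\": [
    {\"inline_data\": {\"mime_type\": \"image/png\", \"data\": \"$(base64 -w0 reference-cover.png)\"}},
    {\"text\": \"Describe the texture, element positions, RGB colors, artistic style and typography of this cover.\"}
  ]}]
}" | jq -r '.candidates[0].content.parts[] | select(.text) | .text')

# 2. Rewrite the description with a second LLM (qwen-qwq-32b in the comment
#    above); left abstract here, substitute whatever chat endpoint you use.
NEW_DESC=$(rewrite_description "$DESC")   # placeholder, not a real command

# 3. Feed the rewritten description back to Gemini to render the new cover.
jq -n --arg text "Create an image with text \"Song Title\" for an album cover: $NEW_DESC" \
  '{contents: [{parts: [{text: $text}]}],
    generationConfig: {responseModalities: ["TEXT", "IMAGE"]}}' \
  | curl -s -X POST "$IMAGE_URL" -H "Content-Type: application/json" -d @- > cover.json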
pentagrama
I want to take a step back and reflect on what this actually shows us. Look at the examples Google provides: it refers to the generated objects as "products", clearly pointing toward shopping or e-commerce use cases.
It seems like the real goal here, for Google and other AI companies, is a world flooded with endless AI-generated variants of objects that don’t even exist yet, crafted to be sold and marketed (probably by AI too) to hyper-targeted audiences. This feels like an incoming wave of "AI slop", mass-produced synthetic content, crashing against the small island of genuine human craftsmanship and real, existing objects.
hapticmonkey
It's sort of sad how these tools went from "godlike new era of human civilization" to "some commodity tools for marketing teams to sell stuff".
I get that they are trying to find some practical use cases for their tools. But there's no enlightenment in the product development here.
If this is already the part of the s-curve where these AI tools get diminishing returns...what a waste of everybody's time.
nly
Recently I've been seeing a lot of holiday lets on sites like Rightmove (UK) and Airbnb with clearly AI generated 'enhancements' to the photos.
It should be illegal in my view.
vunderba
Yeah - and honestly I don't really get this. Using GenAI for real-world products seems like a recipe for a slew of incoming fraudulent advertising lawsuits if the images are slightly different from the actual physical products yet presented as if they were real photographs.
nkozyra
The gating factor here is the pool of consumers. Once people have slop exhaustion there's nobody to sell this to.
Maybe this is why all of the future AI fiction has people dressed in the same bland clothing.
vunderba
I've added/tested this multimodal Gemini 2.0 to my shoot-out of SOTA image gen models (OpenAI 4o, Midjourney 7, Flux, etc.), which contains a collection of increasingly difficult prompts.
https://genai-showdown.specr.net
I don't know how much of Google's original Imagen 3.0 is incorporated into this new model, but the overall aesthetic quality unfortunately seems to be significantly worse.
The big "wins" are:
- Multimodal aspect in trying to keep parity with OpenAI's offerings.
- An order of magnitude faster than OpenAI 4o image gen