Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
2 comments

philipkglass · January 13, 2025
The differently styled images of "astronaut riding a horse" are great, but that has been a go-to example for image generation models for a while now. The introduction says that they train on 37 million real and synthetic images. Are astronauts riding horses now represented in the training data more than would have been possible 5 years ago?
If it's possible to get good, generalizable results from such (relatively) small datasets, I'd like to see what this approach can do when trained exclusively on non-synthetic, permissively licensed inputs. It might be possible to make a good "free of any possible future legal challenges" image generator from public domain content alone.
> The estimated training time for the end-to-end model on an 8×H100 machine is 2.6 days.
That's a $250,000 machine for the "micro budget." Or, if you don't want to run it locally, roughly $2,000 to rent someone else's machine to train the one model.
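Rough arithmetic behind that ~$2,000 figure, assuming a cloud rental rate of about $4 per H100 GPU-hour (the rate is an assumption; on-demand pricing varies widely by provider):

    # Back-of-the-envelope cloud cost for the reported 2.6-day run on 8x H100.
    days = 2.6
    gpus = 8
    usd_per_gpu_hour = 4.0  # assumed rental rate, not from the paper

    gpu_hours = days * 24 * gpus          # ~499 GPU-hours
    cost = gpu_hours * usd_per_gpu_hour   # ~$2,000

    print(f"{gpu_hours:.0f} GPU-hours -> ${cost:,.0f}")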