
How to Migrate from OpenAI to Cerebrium for Cost-Predictable AI Inference

dabedee

This isn't really about cost savings; it's about control. Self-hosting makes sense when you need data privacy, custom fine-tuning, specialized models, or predictable costs at scale. For most use cases requiring GPT-4o-mini quality, you'll pay more for self-hosting until you reach significant volume.

amelius

How to move from one service that is out of your control to another service that is out of your control.

Incipient

Having the ABILITY to move seamlessly and without significant cost is absolutely critical.

It gives you flexibility if the provider isn't keeping pace with the market and it prevents the provider from jacking prices relative to its competitors.

Vendor lock-in is awful. Hypothetically, imagine how stuffed you'd be if your core virtualisation provider jacked prices 500%! You'd be really hurting.

...ohwait.

kristianc

You're not really locked in in any meaningful way currently, you just switch the API you're using. Rather like what's being demonstrated here.

anonymousDan

I don't understand - what do they mean when they say you can run things on your own infrastructure then?

amelius

They say "serverless infrastructure", which is something else.

dist-epoch

Your own infrastructure in the same sense as your own AWS EC2 machines.

klabb3

> your own AWS EC2 machines

Not disagreeing, but this is quite an expression.

tomschwiha

The "not optimized" self hosted deployment is 3x slower and costs 34x the price using the cheapest GPU / a weak model.

I don't see the point in self-hosting unless you deploy a GPU in your own datacenter where you really have control. But that usually costs more for most use cases.

Incipient

Is there actually some scale magic that allows the 34x cost saving (over 100x when you include performance), or is it just insane investment allowing these companies to heavily subsidise cost to gain market share?

tomschwiha

Calculating without energy costs: the A10 GPU itself costs $3,200. Amortized over 3 years, that works out to about $0.002 per minute. The blog post charges $0.02 per minute, so a 10x premium. Even factoring in energy, if you can keep the GPU loaded at a minimum of 15-20%, self-hosting becomes cheaper. But you need to take care of your own infrastructure.

With larger purchases the GPU prices also drop so that is the scaling logic.
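
A rough back-of-the-envelope in Python. This is only a sketch: the $3,200 A10 price, 3-year amortization, and $0.02/min serverless rate come from the comment above; energy is ignored and the utilization framing is an assumption.

    # Owned-A10 amortization vs. a $0.02/min serverless rate.
    # Hardware price and serverless rate are from the comment above;
    # energy is ignored and the utilization framing is an assumption.
    GPU_PRICE_USD = 3200            # A10 purchase price
    AMORTIZATION_YEARS = 3
    SERVERLESS_USD_PER_MIN = 0.02   # rate quoted in the blog post

    minutes = AMORTIZATION_YEARS * 365 * 24 * 60
    owned_usd_per_min = GPU_PRICE_USD / minutes            # ~$0.002/min
    premium = SERVERLESS_USD_PER_MIN / owned_usd_per_min   # ~10x

    # Fraction of the time the owned GPU must be busy for its cost per
    # useful minute to match the serverless rate.
    break_even_utilization = owned_usd_per_min / SERVERLESS_USD_PER_MIN  # ~10%

    print(f"owned: ${owned_usd_per_min:.4f}/min, premium: {premium:.1f}x, "
          f"break-even utilization: {break_even_utilization:.0%}")

Folding in energy and overhead pushes the break-even point upward, toward the 15-20% figure in the comment above.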

ToucanLoucan

> I don't see the point in self-hosting unless you deploy a GPU in your own datacenter where you really have control. But that usually costs more for most use cases.

Not wanting to send tons of private data to a company whose foundation is exploiting data it didn't have permission to use?

benterix

To people from Cerebrium: why should I use your services when Runpod is cheaper? I mean, why did you decide to set your prices higher than an established company with a significant user base?

eloqdata

Why? Honestly, there are already tons of Model-as-a-Service (MaaS) platforms out there—big names like AWS Bedrock and Azure AI Foundry, plus a bunch of startups like Groq and fireflies.ai. I’m just not seeing what makes Cerebrium stand out from the crowd.

benterix

Well, they are announcing their $8.5m seed round and hope to attract the maximum number of users by giving away $30 in credits.

Incipient

Is this article just saying OpenAI is orders of magnitude cheaper than Cerebrium?

ivape

I’m trying to figure out the cost predictability angle here. It seems like they still have a cost per input/output token, so how is it any different? Also, do I have to assume one GPU instance will scale automatically as traffic goes up?

LLM pricing is pretty intense if you’re using anything beyond an 8B model, at least that’s what I’m noticing on OpenRouter. 3-4 calls can eat up close to $1 with bigger models, and certainly on frontier ones.

jameswhitford

Serverless setups (like Cerebrium) charge per second that the model is running; it's not token-based.
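
A minimal sketch of the difference, with made-up numbers (none of the rates or request shapes below are OpenAI's or Cerebrium's actual pricing):

    # Token-based billing vs. per-second GPU billing.
    # All rates and request shapes here are illustrative assumptions.
    def token_billed_cost(requests, in_tokens, out_tokens,
                          usd_per_m_in=0.15, usd_per_m_out=0.60):
        """Cost scales with tokens, no matter how long the GPU ran."""
        return requests * (in_tokens * usd_per_m_in + out_tokens * usd_per_m_out) / 1e6

    def second_billed_cost(requests, seconds_per_request, usd_per_second=0.0003):
        """Cost scales with GPU-seconds, no matter how many tokens came out."""
        return requests * seconds_per_request * usd_per_second

    # 10k requests, ~500 input / 200 output tokens, ~1.5 s of GPU time each.
    print(token_billed_cost(10_000, 500, 200))   # ~$1.95
    print(second_billed_cost(10_000, 1.5))       # ~$4.50

The bill then tracks how long your deployment runs rather than how many tokens the model happens to emit.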

BoorishBears

You're still paying more than the GPU typically costs on an hourly basis to take advantage of their per-second billing... and if you don't have enough utilization to saturate an hourly rental, your users are going to be constantly running into cold starts, which tend to be brutal for larger models.

Their A100 80GB is going for more than what I pay to rent H100s: if you really want to save money, getting the cheapest hourly rentals possible is the only way you have any hope of beating the major providers on price.

I think people vastly underestimate how much companies like OpenAI can do with inference efficiency between large nodes, large batch sizes, and hyper optimized inference stacks.
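
A rough sketch of that break-even (both rates below are hypothetical placeholders, not any provider's real prices):

    # Flat hourly rental vs. per-second "pay only while running" billing.
    # Both rates are hypothetical placeholders.
    HOURLY_RENTAL_USD = 2.00    # dedicated GPU, billed whether busy or idle
    PER_SECOND_USD = 0.0012     # serverless, billed only while requests run

    def cheaper_option(busy_seconds_per_hour):
        serverless = busy_seconds_per_hour * PER_SECOND_USD
        return "serverless" if serverless < HOURLY_RENTAL_USD else "hourly rental"

    # Break-even: 2.00 / 0.0012 ~= 1667 busy seconds, i.e. ~46% utilization.
    # Below that, per-second billing wins on price (cold starts aside);
    # above it, the flat rental is cheaper.
    for busy in (600, 1800, 3600):
        print(busy, cheaper_option(busy))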

ivape

I'll echo one of my original concerns: how is this supposed to scale? Am I responsible for that?

ivape

Ah, you’re right, I misread the OpenAI/Cerebrium pricing config variables.