Cloud Run GPUs, now GA, makes running AI workloads easier for everyone
183 comments · June 4, 2025
ashishb
I love Google Cloud Run and highly recommend it as the best option[1]. The Cloud Run GPU, however, is not something I can recommend. It is not cost effective (instance-based billing is expensive, as opposed to request-based billing), GPU choices are limited, and the general loading/unloading of models (gigabytes) from GPU memory makes it slow to use as serverless.
Once you compare the numbers, it is better to use a VM + GPU if the utilization of your service is even only 30% of the day.
1 - https://ashishb.net/programming/free-deployment-of-side-proj...
gabe_monroy
google vp here: we appreciate the feedback! i generally agree that if you have a strong understanding of your static capacity needs, pre-provisioning VMs is likely to be more cost efficient with today's pricing. cloud run GPUs are ideal for more bursty workloads -- maybe a new AI app that doesn't yet have PMF, where you really need that scale-to-zero + fast start for more sparse traffic patterns.
jakecodes
Appreciate the thoughtful response! I’m actually right in the ICP you described — I’ve run my own VMs in the past and recently switched to Cloud Run to simplify ops and take advantage of scale-to-zero. In my case, I was running a few inference jobs and expected a ~$100 bill. But due to the instance-based behavior, it stayed up the whole time, and I ended up with a $1,000 charge for relatively little usage.
I’m fairly experienced with GCP, but even then, the billing model here caught me off guard. When you’re dealing with machines that can run up to $64K/month, small missteps get expensive quickly. Predictability is key, and I’d love to see more safeguards or clearer cost modeling tooling around these types of workloads.
gabe_monroy
Apologies for the surprise charge there. It sounds like your workload pattern might be sitting in the middle of the VM vs. Serverless spectrum. Feel free to email me at (first)(last)@google.com and I can get you some better answers.
ashishb
> But due to the instance-based behavior, it stayed up the whole time, and I ended up with a $1,000 charge for relatively little usage.
Indeed. IIRC, if you get a single request every 15 mins (~100 requests a day), you will pay for Cloud Run GPU for the full day.
Sn0wCoder
Has this changed? When I looked pre-GA, the requirement was that you had to pay for the CPU 24x7 to attach a GPU, so that's not really scaling to zero. Unless this requirement has changed...
ashishb
Speaking from my experience, it does scale to zero, except you pay for 15 minutes after the last request.
So if you get all your requests in a 2-hour window, that's great: it will scale to zero for the remaining 22 hours.
However, if you get at least one request every 15 minutes, then you will pay for 24 hours, and it is ~3X more expensive than an equivalent VM on Google Cloud.
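To make the math concrete, here's a minimal sketch of that billing behavior (my assumption from the above: an instance keeps billing until 15 idle minutes have passed since its last request; the real scale-down heuristics may differ):

    # Estimate billed time from request timestamps (seconds), assuming
    # billing runs until 15 idle minutes after the last request in a burst.
    def billed_seconds(request_times, idle_window=15 * 60):
        total = 0
        window_start = window_end = None
        for t in sorted(request_times):
            if window_end is None or t > window_end:
                if window_end is not None:
                    total += window_end - window_start  # close previous window
                window_start = t
            window_end = t + idle_window
        if window_end is not None:
            total += window_end - window_start
        return total

    every_15_min = [i * 900 for i in range(96)]   # one request per 15 min
    print(billed_seconds(every_15_min) / 3600)    # 24.0 -- billed the full day

    burst = [i * 75 for i in range(96)]           # same 96 requests in ~2 hours
    print(billed_seconds(burst) / 3600)           # ~2.2 -- scales to zero after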
icedchai
Cloud Run is a great service. I find it much easier to work with than AWS's equivalent (ECS/Fargate.)
psanford
AWS AppRunner is the closest equivalent to Cloud Run. It's really not close though; AppRunner is an unloved service at AWS and is missing a lot of the features that make Cloud Run nice.
vrosas
AppRunner was Amazon's answer to AppEngine a full decade+ later. Cloud Run is miles ahead.
romanhn
I agree with the unloved part. It was a great middle ground between Lambda and Fargate (zero cold start, reasonable pricing), but has seemingly been in maintenance mode for quite a while now. Really sad to see.
gabe_monroy
i am biased, but i agree :)
icedchai
hah. I looked at your comments and saw you were a google VP! I've migrated some small systems from AWS to GCP for various POCs and prototypes, mostly Lambda and ECS to Cloud Run, and find GCP provides a better developer experience overall.
ashishb
Yeah, anyone who uses GCP and AWS thoroughly will agree that GCP is a superior developer experience.
The problem is continuous product churn. This was discussed at length at https://news.ycombinator.com/item?id=41614795
AChampaign
I think Lambda is more or less the AWS equivalent.
icedchai
It's not. Cloud Run can be longer-running: you can have batch jobs and services. Lambda is closer to Cloud Functions.
ZeroCool2u
I think Cloud Run Functions would be the direct equivalent to Lambda.
shiftyck
Eh idk Cloud Run is much better suited to long running instances than Lambda. You would use Cloud Functions for those types of workloads in GCP.
mountainriver
The problem is you can't reliably get VMs on GCP.
All the major clouds are suffering from this. On AWS you can't ever get an 80GB GPU without a long-term reservation, and even then it's wildly expensive. On GCP you sometimes can, but it's also insanely expensive.
These companies claim to be "startup friendly"; they are anything but. All the neo-clouds somehow manage to do this well (RunPod, Nebius, Lambda), but the big clouds are just milking enterprise customers who won't leave and in the process screwing over the startups.
This is a massive mistake they are making, which will hurt their long term growth significantly.
covi
To massively increase the reliability of getting GPUs, you can use something like SkyPilot (https://github.com/skypilot-org/skypilot) to fall back across regions, clouds, or GPU choices. E.g.,
$ sky launch --gpus H100
will fall back across GCP regions, AWS, your clusters, etc. There are options to say try either H100 or H200 or A100 or <insert>.
Essentially the way you deal with it is to increase the infra search space.
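In a task YAML you can also give an explicit set of fallback candidates (a sketch; the multi-resource `any_of` syntax is from SkyPilot's docs and may vary by version, and the task/script names are hypothetical):

    # task.yaml -- try an H100 first, then fall back across GPU types/clouds
    resources:
      any_of:
        - accelerators: H100:1
        - accelerators: H200:1
        - accelerators: A100-80GB:1
    run: python serve.py

    $ sky launch task.yaml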
rendaw
We've run into this a lot lately too, even on AWS. "Elastic" compute, but all the elasticity's gone. It's especially bitter since splitting the cost of spare capacity is the major benefit of scale here...
mountainriver
Enterprises are just gobbling up all the supply on reserves so they see no need to lower the price.
All the while saying they are "startup friendly".
dconden
Agreed. Pricing is insane and availability generally sucks.
If anyone is curious about these neo-clouds, a YC startup called Shadeform has their availability and pricing in a live database here: https://www.shadeform.ai/instances
They have a platform where you can deploy VMs and bare metal from 20 or so popular ones like Lambda, Nebius, Scaleway, etc.
bodantogat
I had the opposite experience with cloud run. Mysterious scale outs/restarts - I had to buy a paid subscription to cloud support to get answers and found none. Moved to self managed VMs. Maybe things have changed now.
PaulMest
Sadly this is still the case. Cloud Run helped us get off the ground. But we've had two outages where Google Enhanced Support could give us no suggestion other than "increase the maximum instances" (not minimum instances). We were doing something like 13 requests/min on this instance at the time. The resource utilization looked just fine. But somehow we had a blip in any containers being available. It even dropped below our min containers. The fix was to manually redeploy the latest revision.
We're now investigating moving to Kubernetes where we will have more control over our destiny. Thankfully a couple people on the team have experience with this.
Something like this never happened with Fargate in the years my previous team had used that.
ajayvk
https://github.com/claceio/clace is a project I am building which gives a Cloud Run-type deployment experience on your own VMs. For each app, it supports scaling down to zero containers (scaling up beyond one is being built).
The authorization and auditing features are designed for internal tools, but otherwise any app can be deployed.
holografix
Have a look at Knative
Bombthecat
You don't go to cloud services because they are cheaper.
You go there because you are already there or have contracts etc etc
JoshTriplett
Does Cloud Run still use a fake Linux kernel emulated by Go, rather than a real VM?
Does Cloud Run give you root?
seabrookmx
You're thinking of gVisor. But no, the "gen2" runtime is a microVM a la Firecracker and performs a lot better as a result.
JoshTriplett
Ah, that's great.
And it looks like Cloud Run can do something Lambda can't: https://cloud.google.com/run/docs/create-jobs . "Unlike a Cloud Run service, which listens for and serves requests, a Cloud Run job only runs its tasks and exits when finished. A job does not listen for or serve requests."
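For instance, creating and running a job from the CLI looks roughly like this (a sketch; the job and image names are hypothetical):

    $ gcloud run jobs create nightly-batch \
        --image us-docker.pkg.dev/my-project/app/batch:latest \
        --tasks 10 --max-retries 1
    $ gcloud run jobs execute nightly-batch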
pryz
https://github.com/cloud-hypervisor/cloud-hypervisor or something else?
rpei
We (I work on Cloud Run) are working on root access. If you'd like to know more, you can reach me at rpei@google.com.
JoshTriplett
Awesome! I'll reach out to you, thank you.
dig1
> I love Google Cloud Run and highly recommend it as the best option
I'd love to see the numbers for Cloud Run. It's nice for toy projects, but it's a money sink for anything serious, at least in my experience. On one project, we had a long-standing issue with G regarding autoscaling: scaling to zero sounds nice on paper, but they won't mention the warmup phases, where CR can spin up multiple containers for a single request and keep them around for a while. And good luck hunting down inexplicably running containers when there's no apparent CPU or network usage (G will happily charge you for this).
Additionally, startup time is often abysmal with Java and Python projects (it might perform better with Go/C++/Rust projects, but I don't have experience running those on CR).
tylertreat
> It's nice for toy projects, but it's a money sink for anything serious, at least from my experience.
This is really not my experience with Cloud Run at all. We've found it to actually be quite cost effective for a lot of different types of systems. For example, we ended up helping a customer migrate a ~$5B/year ecommerce platform onto it (mostly Java/Spring and Typescript services). We originally told them they should target GKE but they were adamant about serverless and it ended up being a perfect fit. They were paying like $5k/mo which is absurdly cheap for a platform generating that kind of revenue.
I guess it depends on the nature of each workload, but for businesses that tend to "follow the sun" I've found it to be a great solution, especially when you consider how little operations overhead there is with it.
ivape
Maybe I just don't know, but I really don't think most people here could even point to a cloud GPU serving 1000 concurrent users without ending up with a million-dollar bill.
isoprophlex
All the cruft of a big cloud provider, AND the joy of uncapped yolo billing that has the potential to drain your credit card overnight. No thanks, I'll personally stick with Modal and vast.ai.
montebicyclelo
Not providing a cap on spending is a major flaw of GCP for individuals / small projects.
With Cloud Run, AFAIK, spending can effectively be capped by limiting concurrency plus limiting the max number of instances it can scale to. (But this is not as good as GCP having a proper cap.)
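Concretely, something like this (a sketch with a hypothetical service name; check the gcloud docs for the exact flags):

    $ gcloud run services update my-service \
        --max-instances 5 --concurrency 80

Worst-case spend is then bounded by five instances running flat out, rather than unbounded scale-out.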
brutus1213
Amazon is the same, I think? I live in constant fear that we will have a runaway job one day. I get daily emails to myself (as a manager) and to my finance person. We had one instance where a team member forgot to turn off a machine for a few months :(
I get why it is a business strategy to not have limits... but I wonder if providers would get more usage if people had more trust in cost predictability.
anonymousab
I remember going out to dinner, years ago, with a fairly senior AWS billing engineer. An acquaintance of a coworker.
He looked completely surprised when I asked about runaway billing and why there weren't any simple options to cap a given resource to prevent those cases.
His response was that they didn't build that because none of their customers wanted anything like that, as far as he was aware.
coredog64
There's a coarse option: set up a budget and then a budget action. While ECS doesn't have GPU capabilities, the equivalent here would be "a budget action sets an IAM deny on the expensive service's IAM actions" (an SCP is also available, but that requires an AWS Org, at which point you've probably got a team that already knows this).
It's coarse because it's daily and not hourly. However, you could also do some of this yourself by mapping CloudWatch metrics to a cost and then having an alarm action.
https://aws.amazon.com/blogs/mt/manage-cost-overruns-part-1/
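The budget half of that is roughly the following (a sketch; the account ID, amount, and email are hypothetical, and wiring up the budget action is a separate step):

    $ aws budgets create-budget --account-id 111111111111 \
        --budget '{"BudgetName": "gpu-cap", "BudgetType": "COST",
                   "TimeUnit": "MONTHLY",
                   "BudgetLimit": {"Amount": "100", "Unit": "USD"}}' \
        --notifications-with-subscribers '[{
          "Notification": {"NotificationType": "ACTUAL",
                           "ComparisonOperator": "GREATER_THAN",
                           "Threshold": 80, "ThresholdType": "PERCENTAGE"},
          "Subscribers": [{"SubscriptionType": "EMAIL",
                           "Address": "ops@example.com"}]}]'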
tmoertel
> I get why it is a business strategy to not have limits...
What is the strategy? Is it purely market segmentation? (As in: "If you need to worry about spending too much, you're not the big-money kind of enterprise customer we want"?)
yarri
[edit - Gabe responded]. See this Cloud Run spending cap recommendation [0] to disable billing, which potentially irreversibly deletes resources but does cap spend!
[0] https://cloud.google.com/billing/docs/how-to/disable-billing...
badrequest
Sure, but why post a tutorial of how to spin this up in GCP instead of...productizing it in GCP?
gabe_monroy
Heard on this feedback. While not quite a hard cap, I'd also point to https://cloud.google.com/billing/docs/how-to/budgets which many customers are having success with for this use case.
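For reference, setting one up from the CLI is roughly (a sketch; the billing account ID is hypothetical, and note that budgets alert rather than stop spend):

    $ gcloud billing budgets create \
        --billing-account 0X0X0X-0X0X0X-0X0X0X \
        --display-name "gpu-budget" \
        --budget-amount 100USD \
        --threshold-rule percent=0.5 \
        --threshold-rule percent=0.9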
advisedwang
It's a rock and a hard place for the cloud providers.
Cap billing, and you have created an outage waiting to happen, one that will be triggered if the customer ever has sudden, successful growth.
Don't cap billing, and you have created a bankruptcy waiting to happen.
delfinom
Flaw? Nah
Feature for Google's profits.
kamranjon
I dunno, the scale to zero and pay per second features seemed super useful to me after forgetting to shut down some training instances with AWS. Also the fast startup ability, if it actually works as well as they say, would be amazing for a lot of the type of workloads that I have.
isoprophlex
Agreed, but runpod or modal offer the same. Happy to use big cloud for a client if they pay the bills, but for personal quests... too scary.
decimalenough
You can set max instances in Cloud Run, which is an effective limit on how much you'll spend.
Also, hard dollar caps are rarely if ever the right choice. App Engine used to have these, and the practical effect was that your website would completely stop working exactly when you least want it to (posted on HN etc).
It's better to set billing alerts and make the call yourself if they go off.
rustc
> Also, hard dollar caps are rarely if ever the right choice.
Depends on if you're a big business or an individual. There is absolutely no reason I would ever pay $100k for a traffic burst on my personal site or side project (like the $100k Netlify case a few months ago).
> It's better to set billing alerts and make the call yourself if they go off.
Billing alerts are not instant and neither is anyone online 24x7 monitoring the alerts.
brutus1213
100% agreed. This can be solved with technology... let users set a soft and a hard threshold, for example. Runaway costs are the problem here.
ipaddr
One bad actor / misconfiguration / attack can put you out of business. It's not the safest strategy to allow unlimited liability, in business or for personal projects.
petesergeant
I've abandoned DataDog in production for just this reason. Is the amount of money they make on dinging people who screw up really worth the ill-will and people who decide they're just not going to start projects on these platforms?
geodel
> Is the amount of money they make on dinging people who screw up really worth the ill-will
I think it is.
1) They make money for the services they provided, instead of having to look into what the customer actually wanted.
2) Small-time customers move away, so they can concentrate their energy on big enterprise sales.
Not justifying anything here, but it kind of makes business sense for them.
petesergeant
Definitely possible. I wonder over what time period you miss out on small customers who become big customers and go on that journey with you; perhaps that would be minimal anyway.
weinzierl
I never used Modal or vast.ai, and from their pages it was not obvious how they solve the yolo billing issue. Are they pre-paid, or do they support caps?
thundergolfer
Engineer from Modal here: we support caps. They kick in within ~2s if your usage exceeds the configured limit.
oldandboring
> uncapped yolo billing
This made me laugh out loud, thank you for this!
rikafurude21
that's what billing limits are for
isoprophlex
Unless something changed, GCP only does billing alerts, not billing limits.
spacecadet
Runpod is pretty great. I wrote some generic endpoint script that I can deploy in seconds, download the models to the pod, and I'm ready to go. Plus, I forgot and left a pod running, but down, for a week, and it was like $0.60, and they emailed me like 3 times reminding me of the pod.
mythz
The pricing doesn't look that compelling, here are the hourly rate comparisons vs runpod.io vs vast.ai:
1x L4 24GB: google: $0.71; runpod.io: $0.43, spot: $0.22
4x L4 24GB: google: $4.00; runpod.io: $1.72, spot: $0.88
1x A100 80GB: google: $5.07; runpod.io: $1.64, spot: $0.82; vast.ai $0.880, spot: $0.501
1x H100 80GB: google: $11.06; runpod.io: $2.79, spot: $1.65; vast.ai $1.535, spot: $0.473
8x H200 141GB: google: $88.08; runpod.io: $31.92; vast.ai $15.470, spot: $14.563
Google's pricing also assumes you're running it 24/7 for an entire month, whereas this is just the hourly price for runpod.io or vast.ai, which both bill per second. I wasn't able to find Google's spot pricing for GPUs.
otherjason
Where did you get the pricing for vast.ai here? Looking at their pricing page, I don't see any 8xH200 options for less than $21.65 an hour (and most are more than that).
zackangelo
I think it’s a typo, looks pretty close to their 8xH100 prices.
progbits
You can just go to "create compute instance" to see the spot pricing.
E.g., the GCP price for a spot 1x H100 is $2.55/hr, lower with sustained use discounts. But only hobbyists pay these prices; any company is going to ask for a discount and will get it.
steren
> Google's pricing also assumes you're running it 24/7 for an entire month
What makes you think that?
Cloud Run's [pricing page](https://cloud.google.com/run/pricing) explicitly says: "charge you only for the resources you use, rounded up to the nearest 100 millisecond".
Also, Cloud Run's [autoscaling](https://cloud.google.com/run/docs/about-instance-autoscaling) is in effect, scaling down idle instances after a maximum of 15 minutes.
(Cloud Run PM)
mythz
Because the pricing when creating an instance shows me the cost for the entire month, then works out the average hourly price based on that. This is just creating a GPU VM instance; I don't see how to see the cost of different Nvidia GPUs without it.
If you wanted to show hourly pricing, you would show that first, then calculate the monthly price from the hourly rate. I've no idea if the monthly cost includes a sustained usage discount, or what the cost would be for just running it for an hour.
counters
Nothing but 1x L4 is even offered on Cloud Run GPUs, is it?
ZiiS
I think the Google prices are billed per second, so for under 20 minutes you'd be better off on Google?
mythz
RunPod also charges per second [1]. Also, this is Google's expected average cost per hour after running it 24/7 for an entire month; I couldn't find an hourly cost for each GPU.
When you need under 1 hour, you can go with RunPod's spot pricing, which is ~4-7x cheaper than Google; even 20 minutes on Google would cost more than 1 hour on RunPod.
thousand_nights
runpod is billed by the minute
bts4
Technically we bill Pods by the millisecond. Pennies matter :)
jbarrow
I’m personally a huge fan of Modal, and have been using their serverless scale-to-zero GPUs for a while. We’ve seen some nice cost reductions from using them, while also being able to scale WAY UP when needed. All with minimal development effort.
Interesting to see a big provider entering this space. Originally swapped to Modal because big providers weren’t offering this (e.g. AWS lambdas can’t run on GPU instances). Assuming all providers are going to start moving towards offering this?
scj13
Modal is great, they even released a deep dive into their LP solver for how they're able to get GPUs so quickly (and cheaply).
Coiled is another option worth looking at if you're a Python developer. Not nearly as fast on cold start as Modal, but similarly easy to use and great for spinning up GPU-backed VMs for bursty workloads. Everything runs in your cloud account. The built-in package sync is also pretty nice, it auto-installs CUDA drivers and Python dependencies from your local dev context.
(Disclaimer: I work with Coiled, but genuinely think it's a good option for GPU serverless-ish workflows. )
AndresSRG
I’m also a big fan.
Modal has the fastest cold-start I’ve seen for 10GB+ models.
dr_kiszonka
Thanks for sharing! They even support running HIPAA-compliant workloads, which I didn't anticipate.
chrishare
Modal documentation is also very good.
montebicyclelo
The reason Cloud Run is so nice compared to other providers is that it has autoscaling, with scaling to 0, meaning it can cost basically 0 if it's not being used. You can also set a cap on the scaling, e.g. 5 instances max, which caps the max cost of the service too. (Note: I only have experience with the CPU version of Cloud Run, which is very reliable / easy.)
rvnx
Even regular Cloud Run can take a lot of time to boot (~3 to 30 seconds), so this can be a problem when scaling to 0
gizzlon
That's not my experience, using Go. I've never measured, but it goes to 0 all the time, so I would definitely have noticed more than a couple of seconds.
827a
It depends on whether you're on gen1 or gen2 Cloud Run; the default execution environment is `default`, which means "you have no idea, because GCP selects for you" (not joking).
Counterintuitively (again, not joking): gen2 suffers from really bad startup speeds, because it's more like a full-on Linux VM/container than whatever weird shim environment gen1 runs. My gen2 containers basically never start up faster than 3 seconds. Gen1 is much faster.
Note that gen1 and gen2 Cloud Run execution environments are an entirely different concept from first-generation and second-generation Cloud Functions. First-gen Cloud Functions are their own thing. Second-generation Cloud Functions can be either first-generation or second-generation Cloud Run workloads, because they default to the default execution environment. Believe it or not, humans made this.
lexandstuff
Not to mention, if it's an ML workload, you'll also have to factor in downloading the weights and loading them into memory, which can double that time or more.
rvnx
According to the press release, "we achieved an impressive Time-to-First-Token of approximately 19 seconds for a gemma3:4b model"
Imagine, you have a very small weak model, and you have to wait 20 seconds for your request.
mdhb
I'm looking at logs for a service I run on Cloud Run right now which scales to zero. Boot times are approximately 200ms for a Dart backend.
huksley
A small and independent EU GPU cloud provider, DataCrunch (I am not affiliated), offers VMs with Nvidia GPUs even cheaper than RunPod, etc.:
1x A100 80GB: 1.37€/hour
1x H100 80GB: 2.19€/hour
sigmoid10
That's funny. You can get a 1x H100 80GB VM at lambda.ai for $2.49/hour. At the current exchange rate, that's exactly 2.19€. Coincidence, or is this actually some kind of ceiling?
diggan
Or go P2P with Vast.ai; the cheapest A100 right now is a setup with 2x A100 for $0.8/hour (so $0.4 per A100). Not affiliated with them, but a mostly happy user. Be wary of network speeds though: some hosts are clearly on shared bandwidth, and reported numbers don't always line up with reality, which kind of sucks when you're trying to shuffle around 100GB of data.
triknomeister
You really need NVL for some performance.
diggan
OK, did you check the instance list? There are a bunch of 8x H200 NVL available.
gabe_monroy
i'm the vp/gm responsible for cloud run and GKE. great to see the interest in this! happy to answer questions on this thread.
albeebe1
Oh, this is great news. After a $1000 bill from running a model on vertex.ai continuously for a little test I forgot to shut down, this will be my go-to now. I've been using Cloud Run for years, running production microservices and little hobby projects, and I've found it simple and cost effective.
felix_tech
I've been using this for daily/weekly ETL tasks, which saves quite a lot of money vs. having an instance on all the time, but it's been clunky.
The main issue is that despite there being a 60-minute timeout available, the API will just straight up not return a response code if your request takes more than ~5 minutes in most cases, so you have to make sure you can poll wherever the data is being stored and let the client time out.
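The workaround looks roughly like this (a minimal sketch; the endpoint, bucket, and object names are hypothetical):

    # Fire the long-running request without trusting the HTTP response,
    # then poll the output location instead.
    import time
    import requests
    from google.cloud import storage

    try:
        requests.post("https://etl-job-xyz.a.run.app/run", timeout=10)
    except requests.exceptions.Timeout:
        pass  # expected: let the client time out; the job keeps running

    blob = storage.Client().bucket("my-etl-output").blob("daily/output.parquet")
    while not blob.exists():
        time.sleep(30)  # poll until the job has written its result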
covi
Take a look at SkyPilot. Good for running these batch workloads. You can use spot instances to save costs.
lemming
If I understand this correctly, I should be able to stand up an API running arbitrary models (e.g. from Hugging Face), and it’s not quite charged by the token but should be very cheap if my usage is sporadic. Is that correct? Seems pretty huge if so, most of the providers I looked at required a monthly fee to run a custom model.
lexandstuff
Yes, that's basically correct, except be warned that the cold start times can be huge (30-60 seconds). So scaling to 0 doesn't really work in practice, unless your users are happy to wait from time to time. You also have to pay a small monthly fee for container storage (and a few other charges, IIRC).
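For reference, the deploy is roughly this (a sketch; service and image names are hypothetical, and the exact GPU flags and CPU/memory minimums are per the Cloud Run GPU docs, so double-check them):

    $ gcloud run deploy llm-api \
        --image us-docker.pkg.dev/my-project/app/vllm:latest \
        --gpu 1 --gpu-type nvidia-l4 \
        --cpu 4 --memory 16Gi \
        --no-cpu-throttling \
        --max-instances 1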
jjuliano
I'm the developer of kdeps.com, and I really like Google Cloud Run; I've been using it since the beta. Kdeps outputs Dockerized full-stack AI agent apps that run open-source LLMs locally, and my project works very well with GCR.
m1
Love Cloud Run, and this looks like a great addition. The only things I wish for from Cloud Run: being able to run self-hosted GitHub runners on it (last time I checked this wasn't possible, as it requires root), and while the new worker pool feature seems great in principle, it looks like you have to write the scaler yourself rather than it being built in.
aniruddhc
Hi! I'm the Eng Manager responsible for Autoscaling for Serverless and Worker Pools.
We're actively defining our roadmap, and understanding your use case would be incredibly valuable. If you're open to it, please email me at <my HN username>@google.com. I'd love to learn more about how you'd use worker pools and what kind of workloads you need to scale.