The Cost of Being Crawled: LLM Bots and Vercel Image API Pricing
121 comments · April 14, 2025
greatgib
A single $5 vps should be able to handle easily tens of thousands of requests...
Not that much for simple thumbnails in addition. So sad that the trend of "fullstack" engineers being just frontend js/ts devs took off, with thousands of companies having no clue at all about how to serve websites, backends and server engineering...
bigiain
It's 1999 or 2000, and "proper" web developers, who wrote Perl (as God intended) or possibly C (if they were contributors to the Apache project), started to notice the trend of Graphic Designers over-reaching from their place as html jockeys, and running whole dynamic websites using some abomination called PHP.
History repeats itself...
XCSme
I still use PHP, is your point that everyone will happily use NextJS in 10 years? I doubt it.
Grimblewald
I really hope not, because I really hate how JS has fucked the internet. Just look at how shit the experience on www.reddit.com is compared to old.reddit.com. Old Reddit uses a fair amount of JS now, which has made the experience a touch worse, but it still mostly serves a static HTML page. It loads quickly, renders quickly, and lets me do useful preference-based things on my end.
I hate what JS has done to the internet, and I think it plays a heavy hand in the internet's enshittification.
threatofrain
The proper comparison is with Laravel. How will Laravel fare vs Next in 10 years? Hard to say, they could both be equally legacy by then.
navs
> started to notice the trend of Graphic Designers over-reaching from their place as html jockeys, and running whole dynamic websites using some abomination called PHP.
Your point is they shouldn't?
bigiain
Nah, my point was that people who ignored the incumbent "wisdom" in the late 90's, actually took over the web.
As much as some "technical" people deride PHP and the sort of self taught developers that were using it back then, WordPress pretty much is "the web" if you exclude Facebook and other global scale centralised web platforms, and the bits of the non FAANG et al owned web that aren't WordPress are very likely to be PHP too. Hell, even Facebook might still count as a PHP site.
In 30 years time, it won't be the most elegant or pure language or framework choices that dominate, it'll be the language/frameworks that people who don't care about elegance or purity end up using to get their idea onto the internet. If I had to guess, it'll likely be LLM written Python - deeply influenced and full of idioms from publicly available 2018-2024 era open source Python code that the AI grifters hoovered up and trained their initial models on.
majorchord
> A single $5 vps should be able to handle easily tens of thousands of requests...
Source:
harrisi
ranger_danger
1300 req/sec is not tens of thousands
leerob
(I work at Vercel) While it's good that our spend limits worked, it clearly was not obvious how to block or challenge AI crawlers¹ with our firewall (which it seems you eventually found manually). We'll surface this better in the UI, and we also have more bot protection features coming soon. I'm also glad our improved image optimization pricing² would have helped. Open to other feedback as well, thanks for sharing.
¹: https://vercel.com/templates/vercel-firewall/block-ai-bots-f...
²: https://vercel.com/changelog/faster-transformations-and-redu...
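For readers who want a stopgap before the firewall UI improves, the same idea can be sketched in Next.js middleware. This is only an illustration, not Vercel's actual implementation, and the user-agent list below is a small made-up sample rather than anything exhaustive:

    // middleware.ts: hedged sketch that blocks a few self-identifying AI crawlers by user agent.
    // The pattern list is illustrative only; a real deployment would maintain a fuller, updated list.
    import { NextRequest, NextResponse } from 'next/server';

    const BLOCKED_BOTS = [/GPTBot/i, /ClaudeBot/i, /CCBot/i, /Bytespider/i, /PerplexityBot/i];

    export function middleware(request: NextRequest) {
      const userAgent = request.headers.get('user-agent') ?? '';
      if (BLOCKED_BOTS.some((pattern) => pattern.test(userAgent))) {
        return new NextResponse('Forbidden', { status: 403 });
      }
      return NextResponse.next();
    }

    // Apply to everything except Next.js internals and static assets.
    export const config = { matcher: ['/((?!_next/static|_next/image|favicon.ico).*)'] };

Note that this only catches bots that identify themselves; crawlers spoofing browser user agents need challenge-based or IP-based filtering instead.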
ilyabez
Hi, I'm the author of the blog (though I didn't post it on HN).
1) Our biggest issue right now is unidentified crawlers with user agents resembling regular users. We get hundreds of thousands of requests from those daily and I'm not sure how to block them on Vercel.
I'd love them to be challenged. If a bot doesn't identify itself, we don't want to let it in.
2) While we fixed the Image Optimization part and optimized caching, we're now struggling with ISR Write costs. We deploy often and the ISR cache is reset on each deploy.
We are about to put Cloudflare in front of the site, so that we can set Cache-Control headers and cache SSR pages (rather than using ISR) independently.
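For context, setting those headers from a server-rendered route is only a few lines. Here is a minimal sketch with the pages router, assuming Cloudflare is configured with a cache rule that caches HTML (it does not by default); the route and cache durations are just examples:

    // pages/episodes/[id].tsx: sketch of an SSR page made cacheable by an upstream CDN.
    import type { GetServerSideProps } from 'next';

    export const getServerSideProps: GetServerSideProps = async ({ res, params }) => {
      // Cache at the edge for an hour; serve stale for a day while revalidating in the background.
      res.setHeader('Cache-Control', 'public, s-maxage=3600, stale-while-revalidate=86400');
      return { props: { episodeId: String(params?.id ?? '') } };
    };

    export default function Episode({ episodeId }: { episodeId: string }) {
      return <h1>Episode {episodeId}</h1>;
    }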
leerob
Not sure if you will see the reply, but please reach out lee at vercel.com and I'm happy to help.
zamalek
I'm sure what you can share is limited, as I'm guessing this is cat and mouse. That being said, is there anything you can share about your implementation?
leerob
We’re working on a bot filtering system that blocks all non-browser traffic by default. Alongside that, we’re building a directory of verified bots, and you’ll be able to opt in to allow traffic only from those trusted sources. Hopefully shipping soon.
sroussey
Verified bots? You mean the companies that got big reading your info, so now you know who they are, while newcomers get blocked, and the people who were taking the data all this time get rewarded by having the competition killed off for them? lol.
cratermoon
> it clearly was not obvious how to block or challenge AI crawlers
majorchord
Setting the user-agent to curl (and maybe others) completely bypasses Anubis.
bhouston
The issue is Vercel Image API is ridiculously expensive and also not efficient.
I would recommend using Thumbor instead: https://thumbor.readthedocs.io/en/latest/. You could have ChatGPT write up a React image wrapper pretty quickly for this.
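A wrapper along these lines is roughly what that would look like; this is a hypothetical sketch that assumes a self-hosted Thumbor instance running in unsafe (unsigned-URL) mode, and a production setup should sign the URLs instead:

    // ThumborImage.tsx: hypothetical wrapper; THUMBOR_URL is an assumed self-hosted endpoint.
    const THUMBOR_URL = 'https://thumbor.example.com';

    type ThumborImageProps = {
      src: string;      // original image URL
      width: number;
      height?: number;  // 0 lets Thumbor keep the aspect ratio
      alt: string;
    };

    export function ThumborImage({ src, width, height = 0, alt }: ThumborImageProps) {
      // Unsafe-mode URL layout: /unsafe/<width>x<height>/<encoded source URL>
      const resized = `${THUMBOR_URL}/unsafe/${width}x${height}/${encodeURIComponent(src)}`;
      return <img src={resized} width={width} alt={alt} loading="lazy" />;
    }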
styfle
The article explains that they were using the old Vercel price and that the new price is much cheaper.
> On Feb 18, 2025, just a few days after we published this blog post, Vercel changed their image optimization pricing. With the new pricing we'd not have faced a huge bill.
qudat
We use imgproxy at https://pico.sh
Works great for us
omnimus
Maybe at least link to the project https://imgproxy.net/ next time? Your comment is basically an ad for your product. I am sure many clicked the link expecting there to be some image resizing proxy solution…
gngoo
I once sat down to calculate the costs of my app if it ever went viral while hosted on Vercel. That has put me off hosting anything on Vercel, or even touching Next.js. It feels like total vendor lock-in once you have something running there, and you kind of end up paying them 10x more than if you had taken the extra time to deploy it yourself.
arkh
> you kind of end up paying them 10x more than if you had taken the extra time to deploy it yourself
The lengths to which many devs will go to not learn server management (or SQL).
einsteinx2
See also the entire job of "AWS Cloud Engineer", aka "I want to spend years learning how to manage proprietary infrastructure instead of just learning Linux server management", and the companies that hire them, aka "we don't have money to hire sysadmins to run servers, that's crazy! Instead let's pay the same salaries for a team of cloud engineers and be locked into a single vendor paying 10x the price for infra!" It's honestly mind-boggling to me.
colonial
Server management has gotten vastly easier over time as well, especially if you're just looking to host stuff "for fun."
Even without fancy orchestration tools, it's very easy to put together a few containers on your dev machine (something like Caddy for easy TLS and routing + hand rolled images for your projects) and just shotgun them onto the cheapest server you can find. At that point the host is just a bootloader for Podman and can be made maximally idiot-proof (see Fedora CoreOS.)
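As a concrete example of how little configuration that can mean, a Caddyfile that terminates TLS and proxies to a single app container can be as short as this (the domain and upstream are placeholders):

    # Caddyfile: hypothetical minimal setup, automatic HTTPS plus a reverse proxy to one container
    example.com {
        reverse_proxy app:3000
    }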
sharps_xp
i also do the sit-down-and-calculate exercise. i always end up down a rabbit hole of how to make a viral site as cheaply as possible. it always ends up in the same place: redis, sqlite, SSE, on suspended fly machines, and a CDN.
jhgg
$5 to resize 1,000 images is ridiculously expensive.
At my last job we resized a very large amount of images every day, and did so for significantly cheaper (a fraction of a cent for a thousand images).
Am I missing something here?
jsheard
It's the usual PaaS convenience tax: you end up paying an order of magnitude or so premium for the underlying bandwidth and compute. AIUI Vercel runs on AWS, so in their case it's a compound platform tax; AWS is expensive even before Vercel adds their own margin on top.
cachedthing0
I would call it an ignorance tax; PaaS can be fine if you know what you are doing.
leerob
(I work at Vercel) We moved to a transformation-based price: https://x.com/TheBuildLog/status/1892308957865111918
jhgg
Sweet! That's much more reasonable!
Banditoz
Yeah, curious too.
Can't the `convert` CLI tool resize images? Can that not be used here instead?
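For what it's worth, ImageMagick's convert can do the resize and re-encode in one command; the filenames and quality here are just placeholders:

    # Resize to 640px wide (preserving aspect ratio) and re-encode as WebP at quality 80
    convert input.jpg -resize 640 -quality 80 output.webp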
giantrobot
Whoa there Boomer, that doesn't sound like it uses enough npm packages! It also doesn't sound very web scale. /s
mvdtnz
You're not missing anything. A generation of programmers has been raised to believe platforms like Vercel / Next.js are not only normal, but ideal.
BonoboIO
Absolutely insane pricing. Maybe it works for small blogs, but didn't they think this through?
Millions of episodes: of course they will be visited, and the optimization will run.
ilyabez
Hi, I'm the author of the blog (though I didn't post it on HN).
The site was originally built by a contractor and was secondary to our business. We didn't pay much attention to it until we actually added the episode pages and the bots discovered them.
I saw a lot of disparaging comments here. It's definitely our fault for not understanding the implications of what the code was doing. We didn't mention the contractor in the post, because we didn't want to throw them under the bus. The accountability is all ours.
ashishb
As someone who maintains a Music+Podcast app as a hobby project, I intentionally have no servers for it.
You don't need one. You can fetch RSS feeds directly on mobile devices; it is faster, less work to maintain, and has a smaller attack surface for rogue bots.
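For illustration, fetching and parsing a feed on the client really is only a few lines. This sketch assumes a browser or WebView context; native mobile clients avoid the CORS restrictions that can bite a pure web app doing this:

    // Hedged sketch: fetch and parse a podcast RSS feed entirely on the client.
    async function fetchEpisodes(feedUrl: string) {
      const response = await fetch(feedUrl);
      const xml = new DOMParser().parseFromString(await response.text(), 'application/xml');
      return Array.from(xml.querySelectorAll('item')).map((item) => ({
        title: item.querySelector('title')?.textContent ?? '',
        audioUrl: item.querySelector('enclosure')?.getAttribute('url') ?? '',
        published: item.querySelector('pubDate')?.textContent ?? '',
      }));
    }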
bn-l
If you want to do something interesting with the feeds it would be harder.
ashishb
> If you want to do something interesting with the feeds it would be harder.
I am curious: What do you do with the feeds that can't be done in a client-side app? An aggregation across all users or recommendation system is one thing, but it can even be done via the clients sending analytics data back to the servers.
ilyabez
Hi, I'm the author of the blog (though I didn't post it on HN).
If you want to have a cross-platform experience (mobile + web), you'd have to have a server component.
We do transcript, chapter, and summary extraction on the server (they are reused across customers); RSS fetching is optimized (so we don't hit the hosts from all the clients independently); and our playlists are server-side (so they can be shared across platforms). As we build out the app, features like push notifications will require a server component too.
I agree with you that a podcast app can be built entirely client-side, but that will be limiting for more advanced and/or expensive use cases (like using LLMs).
VladVladikoff
Death by stupid microservices. Even at 1.5 million pages, and the traffic they are talking about, this could easily be hosted on a fixed $80/month Linode.
KennyBlanken
This isn't specific to microservices. I've seen two organizations with a lot of content have their website brought to its knees because multiple AI crawlers were hitting it.
One of them was pretending to be a very specific version of Microsoft Edge, coming from an Alibaba datacenter. Suuuuuuuuuuuuuuuuuure. Blocked its IP range and about ten minutes later a different subnet was hammering away again. I ended up just blocking based off the first two octets; the client didn't care, none of their visitors are from China.
All of this was sailing right through Cloudflare.
VladVladikoff
I’ve dealt with AI crawlers. I’ve even seen 8 different AI crawlers at once. And yes some have been very aggressive, and I have even blocked some who are particularly bad (ignoring robots.txt rules). But their traffic is a tiny fraction of what my infrastructure sees on a regular basis. A well optimized platform, with good caching, shouldn’t really struggle with a few crawlers.
afarah1
Honest question: why is rate limiting insufficient?
It can be done in two lines in nginx, which is not just a common web server but is also used as an API gateway or proxy.
You can rate limit by IP pretty aggressively without affecting human traffic.
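The two nginx lines in question look roughly like this; the zone name, memory size, rate, and burst are illustrative values, not a recommendation:

    # In the http block: track clients by IP, allowing 5 requests/second on average per IP
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

    # In the server or location block: apply the limit, absorbing short bursts of up to 10 requests
    limit_req zone=perip burst=10 nodelay;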
Aeolun
One /24 (256 addresses) hammering your website at a rate-limited 2 rps each is still a combined ~500 requests/s. I'm not sure many sites can sustain that.
GodelNumbering
Wow, this is interesting. I launched my site about a week ago and only submitted it to Google, but all the crawlers (especially the SEO bots) mentioned in the article were heavily crawling it within a few days.
Interestingly, the OpenAI crawler visited over 1,000 times, many of them as "ChatGPT-User/1.0", which is supposed to be used when a user searches in ChatGPT. Not a single referred visitor, though. Makes me wonder if it's at all beneficial to content publishers to allow bot crawls.
I ended up banning every SEO bot in robots.txt, plus a bunch of other bots.
marcusb
I've seen a bunch of requests with forged ChatGPT-related user agent headers (at least, I believe many are forged - I don't think OpenAI uses Chinese residential IPs or Tencent cloud for their data crawling activities.)
Some of the LLM bots will switch to user agent headers that match real browsers if blocked outright.
GodelNumbering
I checked the IPs on those; they belonged to MSFT.
hansvm
Does it suffice to load the content with JS or WASM to keep them out, or are they using some sort of emulated/headless browser?
If they're running JS or WASM, can the JS run a few calls likely to break (e.g., something in the WebGPU API set, since they likely aren't paying for GPUs in their scraping farm)?
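As a rough sketch of that idea (and only a heuristic, since headless Chrome can expose a software-rendered WebGPU adapter), a probe might look like this; the function name is made up:

    // Hypothetical heuristic: request a WebGPU adapter and treat failure as "probably not a real browser".
    // Headless or emulated clients without GPU support tend to return null or throw here.
    async function looksLikeRealBrowser(): Promise<boolean> {
      const gpu = (navigator as unknown as { gpu?: { requestAdapter(): Promise<unknown | null> } }).gpu;
      if (!gpu) return false;
      try {
        return (await gpu.requestAdapter()) !== null;
      } catch {
        return false;
      }
    }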
marcusb
I haven't tested that behavior, sorry.
nullorempty
Yeah, AI crawlers: add that to my list of phobias. Though for a bootstrapped startup, why not look to cut all recurring expenses and just deploy ImageMagick, which I'm sure will do the trick for less?
outloudvi
Vercel has a fairly generous free quota and a non-negligibly high pricing scheme beyond it. I think people still remember https://service-markup.vercel.app/ .
For the crawl problem, I want to wait and see whether robots.txt proves enough to stop GenAI bots from crawling, since I confidently believe these GenAI companies are too "well-behaved" to respect robots.txt.
otherme123
This is my experience with AI bots. This is my robots.txt:
    User-agent: *
    Crawl-Delay: 20
Clear enough. Google, Bing and others respect the limits, and while about half my traffic are bots, they never DoS the site.
When a very well known AI bot crawled my site in August, they fired up everything: fail2ban put them temporarily in jail multiple times, the nginx per-IP request limit was serving 426 and 444 to more than half of their requests (but they kept hammering the same URLs), and some human users contacted me complaining about the site returning 503. I had to block the bot's IPs at the firewall. They ignore (if they even read) the robots.txt.
dvrj101
Nope, they have been ignoring robots.txt since the start. There are multiple posts about it all over the internet.
randunel
> Optimizing an image meant that Next.js downloaded the image from one of those hosts to Vercel first, optimized it, then served to the users.
So Metacast generates bot traffic on other websites, presumably to "borrow" their content and serve it to their own users, but they don't like it when others do the same to them.
ilyabez
Hi, I'm the author of the blog (though I didn't post it on HN).
I'd encourage you to read up on how the podcast ecosystem works.
Podcasts are distributed via RSS feeds hosted all over the internet, but mostly on specialized hosting providers like Transistor, Megaphone, Omny Studio, etc. that are designed to handle huge amounts of traffic.
All podcast apps (literally, all of them) like Apple Podcasts, Spotify, YouTube Music, Overcast, Pocket Casts, etc. constantly crawl and download RSS feeds, artwork images and mp3s from podcast hosts.
This is how podcasts have been distributed since Apple introduced them in the early 2000s, and it is why podcasting still remains an open, decentralized ecosystem.
randunel
Do you or do you not visit and respect "robots.txt" on the hosts you've mentioned in your blog post as downloading from via Next.js?
mediumsmart
Don’t feed the bots. Why a pixel image? Take an svg and make it pulse while playing.