A look at Cloudflare's AI-coded OAuth library

afro88

> What this interaction shows is how much knowledge you need to bring when you interact with an LLM. The “one big flaw” Claude produced in the middle would probably not have been spotted by someone less experienced with crypto code than this engineer obviously is. And likewise, many people would probably not have questioned the weird choice to move to PBKDF2 as a response

For me this is the key takeaway. You gain proper efficiency using LLMs when you are a competent reviewer, and for lack of a better word, leader. If you don't know the subject matter as well as the LLM, you better be doing something non-critical, or have the time to not trust it and verify everything.

donatj

My question is, in this brave new world, where do the domain experts come from? Who's going to know this stuff?

svara

LLMs make learning new material easier than ever. I use them a lot and I am learning new things at an insane pace in different domains.

The maximalists and skeptics both are confusing the debate by setting up this straw man that people will be delegating to LLMs blindly.

The idea that someone clueless about OAuth should develop an OAuth lib with LLM support without learning a lot about the topic is... Just wrong. Don't do that.

But if you're willing to learn, this is rocket fuel.

junon

On the flip side, I wanted to see what common 8 layer PCB stackups were yesterday. ChatGPT wasn't giving me an answer that really made sense. After googling a bit, I realized almost all of the top results were AI generated, and also had very little in the way of real experience or advice.

It was extremely frustrating.

elvis10ten

> LLMs make learning new material easier than ever. I use them a lot and I am learning new things at an insane pace in different domains.

With learning, aren’t you exposed to the same risks? Such that if there were a typical blind spot for the LLM, it would show up in the learning assistance and in the development assistance, thus canceling out (i.e. unknown unknowns)?

Or am I thinking about it wrongly?

blibble

how do you gain anything useful from a sycophantic tutor that agrees with everything you say, having been trained to behave as if the sun shines out of your rear end?

making mistakes is how we learn, and if they are never pointed out...

belter

> But if you're willing to learn, this is rocket fuel.

LLMs will tell you 1 or 2 lies for every 20 facts. It's a hard way to learn. They can't even get their URLs right...

maegul

This, for me, has been the question since the beginning. I’ve yet to see anyone talk or think about the issue head-on, either. And whenever I’ve asked someone about it, they’ve not had any substantial thoughts.

PUSH_AX

Engineers will still exist and people will vibe code all kinds of things into existence. Some will break in spectacular ways, some of those projects will die, some will hire a real engineer to fix things.

I cannot see us living in a world of ignorance where there are literally zero engineers and no one on the planet understands what's been generated. Weirdly we could end up in a place where engineering skills are niche and extremely lucrative.

shswkna

Most important question on this entire topic.

Fast forward 30 years and modern civilisation is entirely dependent on our AI’s.

Will deep insight and innovation from a human perspective perhaps come to a stop?

Earw0rm

No. Even with power tools, construction and joinery are physical work and require strength and skill.

What is new is that you'll need the wisdom to figure out when the tool can do the whole job, and where you need to intervene and supervise it closely.

So humans won't be doing any less thinking, rather they'll be putting their thinking to work in better ways.

qzw

No, but it'll become a hobby or artistic pursuit, just like running, playing chess, or blacksmithing. But I personally think it's going to take longer than 30 years.

risyachka

Use it or lose it.

Experts will become those who use LLMs to learn, not to write code or solve tasks for them, so they can build that skill.

kypro

In a few years hopefully the AI reviewers will be far more reliable than even the best human experts. This is generally how competency progresses in AI...

For example, at one point a human + computer would have been the strongest combo in chess; now you'd be insane to let a human critique a chess bot, because they're so unlikely to add value, and statistically a human in the loop would be far more likely to introduce error. Similar things can be said in fields like machine vision, etc.

Software is about to become much higher quality and be written at much, much lower cost.

sarchertech

My prediction is that for that to happen we’ll need to figure out a way to measure software quality in the way we can measure a chess game, so that we can use synthetic data to continue improving the models.

I don’t think we are anywhere close to doing that.

marcusb

I’m puzzled when I hear people say ‘oh, I only use LLMs for things I don’t understand well. If I’m an expert, I’d rather do it myself.’

In addition to the ability to review output effectively, I find the more closely I’m able to describe what I want in the way another expert in that domain would, the better the LLM output. Which isn’t really that surprising for a statistical text generation engine.

diggan

I guess it depends. In some cases, you don't have to understand the black box code it gives you, just that it works within your requirements.

For example, I'm horrible at math, always been, so writing math-heavy code is difficult for me, I'll confess to not understanding math well enough. If I'm coding with an LLM and making it write math-heavy code, I write a bunch of unit tests to describe what I expect the function to return, write a short description and give it to the LLM. Once the function is written, run the tests and if it passes, great.
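
Roughly what that looks like, as a hypothetical sketch (the `lerp` function and Node's built-in test runner here are just stand-ins, not anything I actually shipped): the tests pin down what I expect, and the LLM-written implementation has to make them pass.

    // Hypothetical example: the tests describe the behaviour I expect up front,
    // then the LLM-written implementation has to make them pass.
    import { test } from "node:test";
    import assert from "node:assert/strict";

    // Linear interpolation between a and b by t in [0, 1] -- the body is what I'd hand to the LLM.
    function lerp(a: number, b: number, t: number): number {
        return a + (b - a) * t;
    }

    test("lerp hits the endpoints at t=0 and t=1", () => {
        assert.equal(lerp(0, 10, 0), 0);
        assert.equal(lerp(0, 10, 1), 10);
    });

    test("lerp returns the midpoint at t=0.5", () => {
        assert.equal(lerp(-5, 5, 0.5), 0);
    });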

I might not 100% understand what the function does internally, and it's not used for any life-preserving stuff either (typically end up having to deal with math for games), but I do understand what it outputs, and what I need to input, and in many cases that's good enough. Working in a company/with people smarter than you tends to make you end up in this situation anyways, LLMs or not.

Though if in the future I end up needing to change the math-heavy stuff in the function, I'm kind of locked into using LLMs for understanding and changing it, which obviously feels less good. But the alternative is not doing it at all, so another tradeoff I suppose.

I still wouldn't use this approach for essential/"important" stuff, but more like utility functions.

_heimdall

That's why we outsource most other things in our lives, though; why would it be different with LLMs?

People don't learn how a car works before buying one, they just take it to a mechanic when it breaks. Most people don't know how to build a house, they have someone else build it and assume it was done well.

I fully expect people to similarly have LLMs do what the person doesn't know how and assume the machine knew what to do.

bradly

I've found LLMs are very quick to add defaults, fallbacks, and rescues, which all make it very easy for code to look like it is working when it is not or will not. I call this out in three different places in my CLAUDE.md trying to adjust for this, and still occasionally get it.
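
A hypothetical illustration of the pattern (names made up), since it's easier to show than describe: the silent default that makes code look like it works, versus the loud failure I actually ask for in CLAUDE.md.

    // The kind of silent fallback that creeps in: a missing config value gets
    // papered over with a default, so the code "works" in a demo and then
    // misbehaves somewhere unrelated later.
    function getApiBaseUrl(): string {
        return process.env.API_BASE_URL ?? "http://localhost:3000"; // hides a misconfiguration
    }

    // What I actually want: fail loudly at startup instead.
    function requireEnv(name: string): string {
        const value = process.env[name];
        if (!value) {
            throw new Error(`Missing required environment variable: ${name}`);
        }
        return value;
    }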

ajmurmann

I've been using an LLM to do much of a k8s deployment for me. It's quick to get something working, but I've had to constantly remind it to use secrets instead of committing credentials in clear text. A dangerous way to fail. I wonder if in my case this is caused by the training data having lots of examples from online tutorials that omit security concerns to focus on the basics.
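
To show the shape of it (hypothetical names, not my actual manifests or code): the model keeps wanting to inline the credential, when the deployment should inject it from a Secret and the application should only ever read the environment.

    // Hypothetical sketch, seen from the application side.
    // What the LLM kept producing (credential committed in clear text):
    //   const db = connect({ user: "admin", password: "hunter2" });
    // What I want instead: the password lives in a Kubernetes Secret, gets injected
    // into the pod as an env var (e.g. via secretKeyRef), and the code only reads it.
    const dbPassword = process.env.DB_PASSWORD;
    if (!dbPassword) {
        throw new Error("DB_PASSWORD is not set -- check the Secret / env wiring");
    }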

diggan

> my case this is caused by the training data having

I think it's caused by you not having a strong enough system prompt. Once you've built up a slightly reusable system prompt for coding or for infra work, where you bit by bit build it up while using a specific model (since different models respond differently to prompts), you end up getting better and better responses.

So if you notice it putting plaintext credentials in the code, add to the system prompt to not do that. With LLMs you really get what you ask for, and if you fail to specify something, the LLM will do whatever the probabilities tell it to, but you can steer this by being more specific.

Imagine you're talking to a very literal and pedantic engineer who argues a lot on HN and having to be very precise with your words, and you're like 80% of the way there :)

ajmurmann

Yes, you are definitely right on that. I still find it a concerning failure mode. That said, maybe it's no worse than a junior copying from online examples without reading all the text around the code, which of course has been very common as well.

ants_everywhere

> It's quick to get something working but I've had to constantly remind it to use secrets instead of committing credentials in clear text.

This is going to be a powerful feedback loop which you might call regression to the intellectual mean.

On any task, most training data is going to represent the middle (or beginning) of knowledge about a topic. Most k8s examples will skip best practices, most react apps will be from people just learning react, etc.

If you want the LLM to do best practices in every knowledge domain (assuming best practices can be consistently well defined), then you have to push it away from the mean of every knowledge domain simultaneously (or else work with specialized fine tuned models).

As you continue to add training data it will tend to regress toward the middle because that's where most people are on most topics.

jstummbillig

You will always trust domain experts at some junction; you can't build a company otherwise. The question is: Can LLMs provide that domain expertise? I would argue, yes, clearly, given the development of the past 2 years, but obviously not on a straight line.

sarchertech

I just finished writing a Kafka consumer to migrate data with heavy AI help. This was basically a best-case scenario for AI. It’s throwaway greenfield code in a language I know pretty well (Go) but haven’t used daily in a decade.

For complicated reasons the whole database is coming through on 1 topic, so I’m doing some fairly complicated parallelization to squeeze out enough performance.
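
The general shape of it, as a rough sketch only (not the real code, and in TypeScript rather than Go): fan records out by key so that work for different keys runs concurrently while records for the same key stay in order.

    // Rough sketch of the pattern: chain work on a promise per key, so the same key
    // is processed in order while different keys proceed in parallel.
    type Msg = { key: string; value: string };

    const tails = new Map<string, Promise<void>>();

    function processPerKey(msg: Msg, handle: (m: Msg) => Promise<void>): Promise<void> {
        const prev = tails.get(msg.key) ?? Promise.resolve();
        const next = prev
            .then(() => handle(msg))
            .catch((err) => {
                // Subtle-bug territory: swallowing errors here means offsets can be
                // committed for records that were never actually processed.
                console.error(`failed to process key=${msg.key}`, err);
            });
        tails.set(msg.key, next); // also subtle: this map grows without bound unless cleaned up
        return next;
    }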

I’d say overall the AI was close to a 2x speed-up. It mostly saved me time when I forgot the Go syntax for something vs looking it up.

However, there were at least 4 subtle bugs (and many more unsubtle ones) that I think anyone who wasn’t very familiar with Kafka or multithreaded programming would have pushed to prod. As it is, they took me a while to uncover.

On larger longer lived codebases, I’ve seen something closer to a 10-20% improvement.

All of this is using the latest models.

Overall this is at best the kind of productivity boost we got from moving to memory managed languages. Definitely not something that is going to replace engineers with PMs vibe coding anytime soon (based on rate of change I’ve seen over the last 3 years).

My real worry is that this is going to make mid level technical tornadoes, who in my experience are the most damaging kind of programmer, 10x as productive because they won’t know how to spot or care about stopping subtle bugs.

I don’t see how senior and staff engineers are going to be able to keep up with the inevitable flood of reviews.

I also worry about the junior-to-senior pipeline in a world where it’s even easier to get something up that mostly works—we already have this problem today with copy-paste programmers, but we’ve just made copy-paste programming even easier.

I think the market will eventually sort this all out, but I worry that it could take decades.

aiono

I agree with the last paragraph about doing this yourself. Humans have a tendency to take shortcuts while thinking. If you see something resembling what you expect for the end product, you will be much less critical of it. Looks and aesthetics matter a lot in finding problems in a piece of code you are reading. You can verify this by injecting bugs into your code changes and seeing if reviewers can find them.

On the other hand, when you have to write something yourself you drop down to a slow, deliberate state where you pay attention to details a lot more. This means that you will catch bugs you wouldn't otherwise think of. That's why people recommend writing toy versions of the tools you are using: writing it yourself teaches a lot better than just reading materials about it. This is related to how our cognition works.

ape4

The article says there aren't too many useless comments but the code has:

    // Get the Origin header from the request
    const origin = request.headers.get('Origin');

HocusLocus

I suggest they freeze a branch of it, then spawn some AIs to introduce and attempt to hide vulnerabilities, and others to spot and fix them. Every commit is a move; try to model the human evolution of chess.

menzoic

LLMs are like power tools. You still need to understand the architecture, do the right measurements, and apply the right screw to the right spot.

epolanski

Part of me thinks this "written by LLM" angle has been a way to get attention on the codebase and plenty of free reviews from domain-expert skeptics, among its other goals (pushing AI efficiency to investors, experimenting, etc.).

dweekly

An approach I don't see discussed here is having different agents, using different models, critique architecture and test coverage, and author tests to vet the other model's work, including reviewing commits. Certainly no replacement for a human in the loop, but it will catch a lot of goofy "you said to only check in when all the tests pass, so I disabled testing because I couldn't figure out how to fix the tests".

roxolotl

> Many of these same mistakes can be found in popular Stack Overflow answers, which is probably where Claude learnt them from too.

This is what keeps me up at night. Not that security holes will inevitably be introduced, or that the models will make mistakes, but that the knowledge and information we have as a society is basically going to get frozen in time to what was popular on the internet before LLMs.

tuxone

> This is what keeps me up at night.

Same here. For some of the services I pay for, say the e-mail provider, the fact that they openly deny using LLMs for coding would be a plus for me.

djoldman

> At ForgeRock, we had hundreds of security bugs in our OAuth implementation, and that was despite having 100s of thousands of automated tests run on every commit, threat modelling, top-flight SAST/DAST, and extremely careful security review by experts.

Wow. Anecdotally it's my understanding that OAuth is ... tricky ... but wow.

Some would say it's a dumpster fire. I've never read the spec or implemented it.

jofzar

OAuth is so annoying; there is so much niche detail to it.

stuaxo

The times I've been involved with implementations it's been really horrible.

bandoti

Honestly, new code always has bugs though. That’s pretty much a guarantee—especially if it’s somewhat complex.

That’s why companies go for things that are “battle tested” like vibe coding. ;)

Joke aside—I like how Anthropic is using their own product in a pragmatic fashion. I’m wondering if they’ll use it for their MCP authentication API.

jajko

Hundreds of thousands of tests? That sounds like quantity > quality or outright llm-generated ones, who even maintains them?

nmadden

This was before LLMs. It was a combination of unit and end-to-end tests and tests written to comprehensively test every combination of parameters (eg test this security property holds for every single JWT algorithm we support etc). Also bear in mind that the product did a lot more than just OAuth.

kcatskcolbdi

Really interesting breakdown. What jumped out to me wasn’t just the bugs (CORS wide open, incorrect Basic auth, weak token randomness), but how much the human devs seemed to lean on Claude’s output even when it was clearly off base. That “implicit grant for public clients” bit is wild; it’s deprecated in OAuth 2.1, and Claude just tossed it in like it was fine, and then it stuck.
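
On the token randomness point specifically, the contrast is roughly this (illustrative only, not the library's actual code):

    // Weak: Math.random() is not a CSPRNG, so tokens derived from it can be predicted.
    function weakToken(): string {
        return Math.random().toString(36).slice(2);
    }

    // Stronger: use the Web Crypto API's CSPRNG and hex-encode the bytes.
    function strongToken(byteLength = 32): string {
        const buf = new Uint8Array(byteLength);
        crypto.getRandomValues(buf);
        return Array.from(buf, (b) => b.toString(16).padStart(2, "0")).join("");
    }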

keybored

Oh another one,[1] cautious somewhat-skeptic edition.

[1] https://news.ycombinator.com/item?id=44205697