Tao on “blue team” vs. “red team” LLMs
120 comments
·July 28, 2025_alternator_
This red vs blue team is a good way to understand the capabilities and current utility of LLMs for expert use. I trust them to add tests almost indiscriminately because tests are usually cheap; if they are wrong it’s easy to remove or modify them; and if they are correct, they adds value. But often they don’t test the core functionality; the best tests I still have to write myself.
Having LLMs fix bugs or add features is more fraught, since they are prone to cheating or writing non robust code (eg special code paths to pass tests without solving the actual problem).
skdidjdndh
> I trust them to add tests almost indiscriminately because tests are usually cheap; if they are wrong it’s easy to remove or modify them
Having worked on legacy codebases this is extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.
Having worked on legacy codebases, some of the hardest problems are determining “why is this broken test here that appears to test a behavior we don’t support”. Do we have a bug? Or do we have a bad test? On the other end, when there are tests for scenarios we don’t actually care about it’s impossible to determine if that test is meaningful or was added because “it’s testing the code as written”.
yojo
I would add that few things slow developer velocity as much as a large suite of comprehensive and brittle tests. This is just as true on greenfield as on legacy.
Anticipating future responses: yes, a robust test harness allows you to make changes fearlessly. But most big test suites I’ve seen are less “harness” and more “straight-jacket”
materielle
I think a problem with AI productivity metrics is that a lot of the productivity is made up.
Most enterprise code involves layers of interfaces. So implementing any feature requires updating 5 layers and mocking + unit testing at each layer.
When people say “AI helps me generate tests”, I find that this is what they are usually referring to. Generating hundreds of lines of mock and fake data boilerplate in a few minutes, that would otherwise take an entire day to do manually.
Of course, the AI didn’t make them more productive. The entire point of automated testing is to ensure software correctness without having to test everything manually each time.
The style of unit testing above is basically pointless. Because it doesn’t actually accomplish the goal. All the unit tests could pass and the only thing you’ve tested is that your canned mock responses and asserts are in-sync in the unit testing file.
A problem with how LLMs are used is that they help churn through useless bureaucratic BS faster. But the problem is that there’s no ceiling to bureaucracy. I have strong faith that organizations can generate pointless tasks faster than LLMs can automate them away.
Of course, this isn’t a problem with LLMs themselves, but rather an organization context in which I see them frequently being used.
ch33zer
An old coworker used to call these types of tests change detector tests. They are excellent at telling you whether some behavior changed, but horrible at telling you whether that behavior change is meaningful or not.
andrepd
I don't understand this. How does it slow your development if the tests being green is a necessary condition for the code being correct? Yes it slows it compared to just writing incorrect code lol, but that's not the point.
manmal
> Tests are the source of truth more so than your code
Tests poke and prod with a stick at the SUT, and the SUT's behaviour is observed. The truth lives in the code, the documentation, and, unfortunately, in the heads of the dev team. I think this distinction is quite important, because this question:
> Do we have a bug? Or do we have a bad test?
cannot be answered by looking at the test + the implementation. The spec or people have to be consulted when in doubt.
9rx
> The spec
The tests are your spec. They exist precisely to document what the program is supposed to do for other humans, with the secondary benefit of also telling a machine what the program is supposed to do, allowing implementations to automatically validate themselves against the spec. If you find yourself writing specs and tests as independent things, that's how you end up with bad, brittle tests that make development a nightmare — or you simply like pointless busywork, I suppose.
But, yes, you may still have to consult a human if there is reason to believe the spec isn't accurate.
Kinrany
None of the four: code, tests, spec, people's memory, are the single source of truth.
It's easy to see them as four cache layers, but empirically it's almost never the case that the correct thing to do when they disagree is to blindly purge and recreate levels that are farther from the "truth" (even ignoring the cost of doing that).
Instead, it's always an ad-hoc reasoning exercise in looking at all four of them, deciding what the correct answer is, and updating some or all of them.
andruby
What does SUT stand for? I'm not familiar with the acronym
Is it "System Under Test"? (That's Claude.ai's guess)
SamuelAdams
Ideally the git history provides the “why was this test written”, however if you have one Jira card tied to 500+ AI generated tests, it’s not terribly helpful.
djeastm
>if you have one Jira card tied to 500+ AI generated tests
The dreaded "Added tests" commit...
bicx
I believe they just meant that tests are easy to generate for eng review and modification before actually committing to the codebase. Nothing else is a dependency on an individual test (if done correctly), so it's comparatively cheap to add or remove compared to production code.
_alternator_
Yup. I do read and review the tests generated by LLMs. Often the LLM tests will just be more comprehensive than my initial test, and hit edge cases that I didn’t think of (or which are tedious). For example, I’ll write a happy path test case for an API, and a single “bad path” where all of the inputs are bad. The LLM will often generate a bunch of “bad path” cases where only one field has an error. These are great red team tests, and occasionally catch serious bugs.
wagwang
This is the conclusion I'm at too, working on a relatively new codebase. Our rule is that every generated test must be human reviewed, otherwise its an autodelete.
ozgrakkurt
What do you think about leaning on fuzz testing and deriving unit tests from bugs found by fuzzing?
manmal
What kind of bugs do you find this way, besides missing sanitization?
Pxtl
> “why is this broken test here that appears to test a behavior we don’t support”
Because somebody complained when that behavior we don't support was broken, so the bug-that-wasn't-really-a-bug was fixed and a test was created to prevent regression.
Imho, the mistake was in documentation: the Test should have comments explaining why this test was created.
Just as true for tests as for the actual business logic code:
The code can only describe the what and the how. It's up to comments to describe the why.
jgalt212
> Having worked on legacy codebases this is extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.
I hear you on this, but you can still use so long as these tests are not comingled with the tests generated by subject-matter experts. I'd treat them almost a fuzzers.
fpoling
I tried a LLM to generate tests for Rust code. It was more harmful then useful. Surely there were a lot of tests, but they still miss the key coverage and it was hard to see what was missed due to the amount of generated code. Then to change the code behavior in future would require to fix a lot of tests again versus fixing few lines in manually written tests.
torginus
There's a saying that since nobody tests the tests, they must be trivially correct.
That's why they came up with the Arrange-Act-Assert pattern.
My favorite kind of unit test nowadays is when you store known input-output pairs and validate the code on them. It's easy to test corner cases and see that the output works as desired.
01HNNWZ0MV43FF
"Golden snapshot testing"
mvieira38
I have the exact opposite idea. I want the tests to be mine and thoroughly understood, so I am the true arbiter and then I can let the LLM go ham on the code without fear. If the tests are AI made, then I get some anxiety letting agents mess with the rest of the codebase
_alternator_
I think this is exactly the tradeoff (blue team and red team need to be matched in power), except that I’ve seen LLMs literally cheat the tests (eg “match input: TEST_INPUT then return TEST_OUTPUT”) far too many times to be comfortable with letting LLMs be a major blue team player.
johnisgood
Yeah, they may do that, but people really should read the code an LLM produces. Ugh, makes me furious. No wonder LLMs have a bad rep from such users.
javier_e06
In cybersecurity red and blue test are two equal forces. In software development the analogy I think is a stretch, coding and testing are not two equal forces. Test is code too, and as such, it has bugs too. Test runs afoul with police paradox: Who polices the police? The Police police the police.
fsckboy
"Police police police police police police police."
https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...
ashton314
As I understand it, this is how the RSA algorithm was made. I don't know where my copy of "The Code Book" by Simon Singh is right now, but iirc, Rivest and Shamir would come up with ideas and Adleman's primary role was finding flaws in the security.
Oh look, it's on the Wikipedia page: https://en.wikipedia.org/wiki/RSA_cryptosystem
Yay blue/red teams in math!
griffzhowl
Reminds me of a pair of cognitive scientists I know who often collaborate. One is expansive and verbose and often gets carried away on tangential trains of thought, the other is very logical and precise. Their way of producing papers is the first one writes and the second deletes.
ashton314
That's a great model. Even if you're not naturally that way, it's helpful to think of a verbose phase followed by a revising phase. You can do this either as a team or as an individual—though as an individual it can be hard to context switch.
recipe19
I get the broader point, but the infosec framing here is weird. It's a naive and dangerous view that the defense efforts are only as strong as the weakest link. If you're building your security program that way, you're going to lose. The idea is to have multiple layers of defense because you can never really, consistently get 100% with any single layer: people will make mistakes, there will be systems you don't know about, etc.
In that respect, the attack and defense sides are not hugely different. The main difference is that many attackers are shielded from the consequences of their mistakes, whereas corporate defenders mostly aren't. But you also have the advantage of playing on your home turf, while the attackers are comparatively in the dark. If you squander that... yeah, things get rough.
darkwater
Well, I think the his example (locked door + opened window) makes sense, and the multiple LAYERS concept applies to things an attacker has to do or go through to reach the jackpot. But doors and windows are on the same layer, and there the weakest link totally defines how strong the chain is. A similar example in the web world would be that you have your main login endpoint very well protected, audited, using only strong authentication method, and the you have a `/v1/legacy/external_backoffice` endpoint completely open with no authentication and giving you access to a forgotten machine in the same production LAN. That would be the weakest link. Then you might have other internal layers to mitigate/stop an attacker that got access to that machine, and that would be the point of "multiple layer of defense".
NitpickLawyer
> It's a naive and dangerous view that the defense efforts are only as strong as the weakest link.
Well, to be fair, you added some words that are not there in the post
> The output of a blue team is only as strong as its weakest link: a security system that consists of a strong component and a weak component [...] will be insecure (and in fact worse, because the strong component may convey a false sense of security).
You added "defense efforts". But that doesn't invalidate the claim in the article, in fact it builds upon it.
What Terence is saying is true, factually correct. It's a golden rule in security. That is why your "efforts" should focus on overlaying different methods, strategies and measures. You build layers upon layers, so that if one weak link gets broken there are other things in place to detect, limit and fix the damage. But it's still true that often the weakest link will be an "in".
Take the recent example of cognizant desk people resetting passwords for their clients without any check whatsoever. The clients had "proper security", with VPNs and 2FA, and so on. But the recovery mechanism was outsourced to a helpdesk that turned out to be the weakest link. The attackers (allegedly) simply called, asked for credentials, and got them. That was the weakest link, and that got broken. According to their complaint, the attackers then gained access to internal systems, and managed to gather enough data to call the helpdesk again and reset the 2FA for an "IT security" account (different than the first one). And that worked as well. They say they detected the attackers in 3 hours and terminated their access, but that's "detection, mitigation" not "prevention". The attackers were already in, rummaging through their systems.
The fact that they had VPNs and 2FA gave them "a false sense of security", while their weakest link was "account recovery". (Terence is right). The fact that they had more internal layers, that detected the 2nd account access and removed it after ~3 hours is what you are saying (and you're right) that defense in depth also works.
So both are right.
In recent years the infosec world has moved from selling "prevention" to promoting "mitigation". Because it became apparent that there are some things you simply can't prevent. You then focus on mitigating the risk, limiting the surfaces, lowering trust wherever you can, treating everything as ephemeral, and so on.
Davidzheng
I'm not a security person at all. But this comments reads against the best practices which I've heard. Like that the best defense is using open source & well-tested protocols with extremely small attack surface to minimize the space of possible exploits. Curious what I'm not understanding here.
fnordsensei
Just because it’s open source doesn’t mean it’s well tested, or well pen tested, or whatever the applicable security aspect is.
It could also mean that attacks against it are high value (because of high distribution).
Point is, license isn’t a great security parameter in and of itself IMO.
tetha
This area of security always feels a bit weird because ideally, you should think about your assumptions being subverted.
For example, our development teams are using modern, stable libraries in current versions, have systems like Sonar and Snyk around, blocking pipelines for many of them, images are scanned before deployment.
I can assume this layer to be well-secured to the best of their ability. It is most likely difficult to find an exploit here.
But once I step a layer downwards, I have to ask myself: Alright, what happens IF a container gets popped and an attacker can run code in there? Some data will be exfiltrated and accessible, sure, but this application server should not be able to access more than the data it needs to access to function. The data of a different application should stay inaccessible.
As a physical example - a guest in a hotel room should only have access to their own fuse box at most, not the fuse box of their neighbours. A normal person (aka not a youtuber with big eye brows) wouldn't mess with it anyway, but even if they start messing around, they should not be able to mess with their neighbour.
And this continues: What if the database is not configured correctly to isolate access? We have, for example, isolated certain critical application databases into separate database clusters - lateral movement within a database cluster requires some configuration errors, but lateral movement onto a different database cluster requires a lot more effort. And we could even further. Currently we have one production cluster, but we could isolate that into multiple production clusters which share zero trust between them. An even bigger hurdle putting up boundaries an attacker has to overcome.
mindcrime
But "defense in depth" is a security best practice. I'm not following exactly how the gp post is reading against any best practices.
__s
Defense in depth is a security best practice because adding shit to a mess is more feasible than maintaining a simple stack. "There are always systems you don't know about" reflects an environment where one person doesn't maintain everything
vlovich123
Who have you been listening to?
dkarl
I think it's just a poorly chosen analogy. When I read it, I understood "weakest link" to be the easiest path to penetrate the system, which will be harder if it requires penetrating multiple layers. But you're right that it's ambiguous and could be interpreted as a vulnerability in a single layer.
chaps
Isn't offense just another layer of defense? As they say, the best defense is a good offense.
fdw
They say this about sports, which is (usually) a zero-sum game: If I'm attacking, no matter how badly, my opponent cannot attack at all. Therefore, it is preferable to be attacking.
In cyber security, there is no reason the opponent cannot attack as well. So, my red team is attacking is not a reason that I do not need defense, because my opponent can also attack.
chaps
My post was really was in the context of real-time strategy games. It's very, very possible to attack and defend at the same time no matter the skill of either side. Offense and defense aren't mutually exclusive, which is kinda the point of my post.
nostrademons
Interesting way of viewing this!
Business also has a “blue team” (those industries that the rest of the economy is built upon - electricity, oil, telecommunications, software, banking; possibly not coincidentally, “blue chips”) and a “red team” (industries that are additive to consumer welfare, but not crucial if any one of them goes down. Restaurants, specialty retail, luxuries, tourism, etc.)
It is almost always better, economically, to be on the blue team.” That’s because the blue team needs to ensure they do everything right (low supply) but has a lot of red-team customers they support (high demand). The red team, however, is additive: each additional red team firm improves the quality of the overall ecosystem, but they aren’t strictly necessary* for the success of the ecosystem as a whole. You can kinda see this even in the examples of Tao’s post: software engineers get paid more than QA, proof-creation is widely seen as harder and more economically valuable than proof-checking, etc.
If you’re Sam Altman and have to raise capital to train these LLMs, you have to hype them as blue team, because investors won’t fund them as red team. That filters down into the whole media narrative around the technology. So even though the technology itself may be most useful on the red team, the companies building it will never push that use, because if they admit that, they’re admitting that investors will never make back their money. (Which is obvious to a lot of people without a dog in the fight, but these people stay on the sidelines and don’t make multi-billion dollar investments into AI.)
The same dynamic seems to have happened to Google Glasses, VR, and wearables. These are useful red-team technologies in niche markets, but they aren’t huge new platforms and they will never make trillions like the web or mobile dev did. As a result, they’ve been left to languish because capital owners can’t justify spending huge sums on them.
TheGRS
After using agentic models and workflows recently, I think these agents belong in both roles. Even more than that, they should be involved in the management tasks too. The developer becomes more of an overseer. You're overseeing the planning of a task - writing prompts, distilling the scope of the task down. You're overseeing writing the tests. And you're overseeing writing out the code. Its a ton of reviewing, but I've always felt more in control as a red team type myself making sure things don't break.
zkmon
Red team is not a team. It is the background context in which the foreground operates. Evolution happens through interaction and adaptation between foreground and background. It is true that the background (context) is a dual form to the foreground (thing). But the context is not just another thing in the same sense as the foreground.
scoreandmore
The first thing I did when I signed up for Claude was have it analyze my website for security holes. But it only recommended superficial changes, like the lifecycle of my JWTs. After reading this, I’m wondering if a prompt asking it to attack the website would be better than asking it where it should be beefed up. But I no longer pay for Claude, and I suspect it won’t give me instructions on how to attack something. How would one get past this?
jedberg
Chaos engineering was created to be the "red team" of operations. Let's figure out all the ways we can break a production system before it happens on its own.
And there are a host of teams working on the "red team" side of LLMs right now, using them for autonomous testing. Basically, instead of trying to figure out all the things that can go wrong and writing tests, you let the AI explore the space of all possible failures, and then write those tests.
LeifCarrotson
> The blue team is more obviously necessary to create the desired product; but the red team is just as essential, given the damage that can result from deploying insecure systems.
> Many of the proposed use cases for AI tools try to place such tools in the "blue team" category, such as creating code...
> However, in view of the unreliability and opacity of such tools, it may be better to put them to work on the "red team", critiquing the output of blue team human experts but not directly replacing that output...
The red team is only essential if you're a coward who isn't willing to take a few risks for increased profit. Why bother testing and securing when you can boost your quarterly bonus by just... not doing that?
I suspect that Terence Tao's experience leans heavily towards high-profile risk-averse institutions. People don't call one of the greatest living mathematicians to check your work when they're just trying to duct taping a new interface on top of a line-of-business app that hasn't seen much real investment since the late 90s. Conversely, the people who are writing cutting-edge algorithms for new network protocols and filesystems are hopefully not trying to churn out code as fast and cheap as possible by copy-pasting snippets to and from random chatbots.
There are a lot of people who are already cutting corners on programmer salaries, accruing invisible tech debt minute by minute. They're not trying to add AI tools to create a missing red team, they're trying to reduce headcount on the only team they have, which is the blue team (which is actually just one overworked IT guy in over his head).
nostrademons
Tao is talking about systems, which are self-sustaining dynamic networks that function independently of who the individual actors and organizations within the system are. You can break up the monopoly at the heart of the blue team system (as the U.S. did with Standard Oil and AT&T) and it will just reform through mergers over generations (as it largely has with Exxon Mobil and Verizon). You can fire or kill all the people involved and they will just be replaced by other people filling the same roles. The details may change, but the overall dynamics remain the same.
In this case, all the companies who are doing what you describe are themselves the red team. They are the unreliable, additive, distributed players in an ecosystem where the companies themselves are disposable. The blue team is the blue team by virtue of incentives: they are the organization where proper functioning of their role requires that all the parts are reliable and work well together, and if the individual people fulfilling those roles do not have those qualities, they will fail and be replaced by people who do.
kibwen
> and it will just reform through mergers over generations
You say "just" as though this is a failure of the system, but this is the system working as designed. Economies of scale are half the reason to bother with large-scale enterprise, so they inevitably consolidate to the point of monopoly, so disrupting that monopoly by force to keep the market aligned is an ongoing and never-ending process that you should expect to need to do on a regular basis.
1970-01-01
Good read, but I'm struggling to understand why Terry did not use the foundational terms offense and defense.
I have a couple of thoughts here:
(a) AI on both the "red" and "blue" teams is useful. Blue team is basically brain storming.
(b) AlphaEvolve is an example of an explicit "red/blue team" approach in his sense, although they don't use those terms [0]. Tao was an advisor to that paper.
(c) This is also reminiscent of the "verifier/falsifier" division of labor in game semantics. This may be the way he's actually thinking about it, since he has previously said publicly that he thinks in these terms [0]. The "blue/red" wording may be adapting it for an audience of programmers.
(d) Nitpicking: a security system is not only as strong as its weakest link. This depends on whether there are layers of security or if the elements are in parallel. A corridor consisting of strong doors and weak doors (in series) is as strong as the strongest door. A fraud detection algorithm made by aggregating weak classifiers is often much better than the weakest classifier.
[0] https://storage.googleapis.com/deepmind-media/DeepMind.com/B...
[1] https://mathoverflow.net/questions/38639/thinking-and-explai...