Skip to content(if available)orjump to list(if available)

AI will make formal verification go mainstream

bkettle

I think formal verification shines in areas where implementation is much more complex than the spec, like when you’re writing incomprehensible bit-level optimizations in a cryptography implementation or compiler optimization phases. I’m not sure that most of us, day-to-day, write code (or have AI write code) that would benefit from formal verification, since to me it seems like high-level programming languages are already close to a specification language. I’m not sure how much easier to read a specification format that didn’t concern itself with implementation could be, especially when we currently use all kinds of frameworks and libraries that already abstract away implementation details.

Sure, formal verification might give stronger guarantees about various levels of the stack, but I don’t think most of us care about having such strong guarantees now and I don’t think AI really introduces a need for new guarantees at that level.

pron

> to me it seems like high-level programming languages are already close to a specification language

They are not. The power of rich and succinct specification languages (like TLA+) comes from the ability to succinctly express things that cannot be efficiently computed, or at all. That is because a description of what a program does is necessarily at a higher level of abstraction than the program (i.e. there are many possible programs or even magical oracles that can do what a program does).

To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.

nyrikki

TLA+ is not a silver bullet, and like all temporal logic, has constraints.

You really have to be able to reduce your models to: “at some point in the future, this will happen," or "it will always be true from now on”

Have probabilistic outcomes? Or even floats [0] and it becomes challenging and strings are a mess.

> Note there is not a float type. Floats have complex semantics that are extremely hard to represent. Usually you can abstract them out, but if you absolutely need floats then TLA+ is the wrong tool for the job.

TLA+ works for the problems it is suitable for, try and extend past that and it simply fails.

[0] https://learntla.com/core/operators.html

anon-3988

> To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.

Do you really think it is going to be easier for the average developer to write a specification for their program that does not terminate

vs

Giving them a framework or a language that does not have for loop?

Edit: If by formal verification you mean type checking. That I very much agree.

goryDeets

[dead]

marcosdumay

There are many really important properties to enforce even on the most basic CRUD system. You can easily say things like "an anonymous user must never edit any data, except for the create account form", or "every user authorized to see a page must be listed on the admin page that lists what users can see a page".

People don't verify those because it's hard, not for lack of value.

nextos

Yes, in fact there is research on type systems to ensure information flow control, avoiding unauthorized data access by construction.

Concrete Semantics [1] has a little example in §9.2.

[1] http://concrete-semantics.org/concrete-semantics.pdf

bkettle

Yeah fair enough. I can definitely see the value of property-based verification like this and agree that useful properties could be easy to express and that LLMs could feasibly verify them. I think full verification that an implementation implements an entire spec and nothing else seems much less practical even with AI, but of course that is just one flavor of verification.

socketcluster

Yes. I feel like people who are trying to push software verification have never worked on typical real-world software projects where the spec is like 100 pages long and still doesn't fully cover all the requirements and you still have to read between the lines and then requirements keep changing mid-way through the project... Implementing software to meet the spec takes a very long time and then you have to invest a lot of effort and deep thought to ensure that what you've produced fits within the spec so that the stakeholder will be satisfied. You need to be a mind-reader.

It's hard even for a human who understands the full business, social and political context to disambiguate the meaning and intent of the spec; to try to express it mathematically would be an absolute nightmare... and extremely unwise. You would literally need some kind of super intelligence... And the amount of stream-of-thought tokens which would have to be generated to arrive at a correct, consistent, unambiguous formal spec is probably going to cost more than just hiring top software engineers to build the thing with 100% test coverage of all main cases and edge cases.

Worst part is; after you do all the expensive work of formal verification; you end up proving the 'correctness' of a solution that the client doesn't want.

The refactoring required will invalidate the entire proof from the beginning. We haven't even figured out the optimal way to formally architect software that is resilient to requirement changes; in fact, the industry is REALLY BAD at this. Almost nobody is even thinking about it. I am, but I sometimes feel like I may be the only person in the world who cares about designing optimal architectures to minimize line count and refactoring diff size. We'd have to solve this problem first before we even think about formal verification of 'most software'.

Without a hypothetical super-intelligence which understands everything about the world; the risk of misinterpreting any given 'typical' requirement is almost 100%... And once we have such super-intelligence, we won't need formal verification because the super-intelligence will be able to code perfectly on the first attempt; no need to verify.

And then there's the fact that most software can tolerate bugs... If operationally important big tech software which literally has millions of concurrent users can tolerate bugs, then most software can tolerate bugs.

robot-wrangler

The whole perspective of this argument is hard for me to grasp. I don't think anyone is suggesting that formal specs are an alternative to code, they are just an alternative to informal specs. And actually with AI the new spin is that they aren't even a mutually exclusive alternative.

A bidirectional bridge that spans multiple representations from informal spec to semiformal spec to code seems ideal. You change the most relevant layer that you're interested in and then see updates propagating semi-automatically to other layers. I'd say the jury is out on whether this uses extra tokens or saves them, but a few things we do know. Chain of code works better than chain of thought, and chain-of-spec seems like a simple generalization. Markdown-based planning and task-tracking agent workflows work better than just YOLOing one-shot changes everywhere, and so intermediate representations are useful.

It seems to me that you can't actually get rid of specs, right? So to shoot down the idea of productive cooperation between formal methods and LLM-style AI, one really must successfully argue that informal specs are inherently better than formal ones. Or even stronger: having only informal specs is better than having informal+formal.

simonw

I'm convinced now that the key to getting useful results out of coding agents (Claude Code, Codex CLI etc) is having good mechanisms in place to help those agents exercise and validate the code they are writing.

At the most basic level this means making sure they can run commands to execute the code - easiest with languages like Python, with HTML+JavaScript you need to remind them that Playwright exists and they should use it.

The next step up from that is a good automated test suite.

Then we get into quality of code/life improvement tools - automatic code formatters, linters, fuzzing tools etc.

Debuggers are good too. These tend to be less coding-agent friendly due to them often having directly interactive interfaces, but agents can increasingly use them - and there are other options that are a better fit as well.

I'd put formal verification tools like the ones mentioned by Martin on this spectrum too. They're potentially a fantastic unlock for agents - they're effectively just niche programming languages, and models are really good at even niche languages these days.

If you're not finding any value in coding agents but you've also not invested in execution and automated testing environment features, that's probably why.

Davidzheng

OK but if the verification loop really makes the agents MUCH more useful, then this usefulness difference can be used as a training signal to improve the agents themselves. So this means the current capabilities levels are certainly not going to remain for very long (which is also what I expect but I would like to point out it's also supported by this)

roadside_picnic

I very much agree, and believe using languages with powerful types systems could be a big step in this direction. Most people's first experience with Haskell is "wow this is hard to write a program in, but when I do get it to compile, it works". If this works for human developers, it should also work for LLMs (especially if the human doesn't have to worry about the 'hard to write a program' part).

> The next step up from that is a good automated test suite.

And if we're going for a powerful type system, then we can really leverage the power of property tests which are currently grossly underused. Property tests are a perfect match for LLMs because they allow the human to create a small number of tests that cover a very wide surface of possible errors.

The "thinking in types" approach to software development in Haskell allows the human user to keep at a level of abstraction that still allows them to reason about critical parts of the program while not having to worry about the more tedious implementation parts.

Given how much interest there has been in using LLMs to improve Lean code for formal proofs in the math community, maybe there's a world where we make use of an even more powerful type systems than Haskell. If LLMs with the right language can help prove complex mathematical theorems, they it should certain be possible to write better software with them.

nextos

That's my opinion as well. Some functional language, that can also offer access to imperative features when needed, plus an expressive type system might be the future.

My bet is on refinement types. Dafny fits that bill quite well, it's simple, it offers refinement types, and verification is automated with SAT/SMT.

In fact, there are already serious industrial efforts to generate Dafny using LLMs.

Besides, some of the largest verification efforts have been achieved with this language [1].

[1] https://www.andrew.cmu.edu/user/bparno/papers/ironfleet.pdf

bcrosby95

Ada when?

It even lets you separate implementation from specification.

mccoyb

Claude Code was a big jump for me. Another large-ish jump was multi-agents and following the tips from Anthropic’s long running harnesses post.

I don’t go into Claude without everything already setup. Codex helps me curate the plan, and curate the issue tracker (one instance). Claude gets a command to fire up into context, grab an issue - implements it, and then Codex and Gemini review independently.

I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.

This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.

I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge already between “vanilla agents” and something fancy with just raw agents, and I’m not sure if just the addition of more rigorous static tooling (beyond the compiler) closes it.

idiotsecant

How expensive is incredibly expensive?

mccoyb

If you're maxing out the plans across the platforms, that's 600 bucks -- but if you think about your usage and optimize, I'm guessing somewhere between 200-600 dollars per month.

ManuelKiessling

Isn‘t it funny how that’s exactly the kind of stuff that helps a human developer be successful and productive, too?

Or, to put it the other way round, what kind of tech leads would we be if we told our junior engineers „Well, here’s the codebase, that’s all I‘ll give you. No debuggers, linters, or test runners for you. Using a browser on your frontend implementation? Nice try buddy! Now good luck getting those requirements implemented!“

Wowfunhappy

> Isn‘t it funny how that’s exactly the kind of stuff that helps a human developer be successful and productive, too?

I think it's more nuanced than that. As a human, I can manually test code in ways an AI still can't. Sure, maybe it's better to have automated test suites, but I have other options too.

victorbjorklund

AI can do that too? If you have a web app it can use playwright to test functionality and take screenshots to see if it looks right.

pron

Maybe in the short term, but that doesn't solve some fundamental problems. Consider, NP problems, problems whose solutions can be easily verified. But that they can all be easily verified does not (as far as we know) mean they can all be easily solved. If we look at the P subset of NP, the problems that can be easily solved, then the easy verification is no longer their key feature. Rather, it is something else that distinguishes them from the harder problems in NP.

So let's say that, similarly, there are programming tasks that are easier and harder for agents to do well. If we know that a task is in the easy category, of course having tests is good, but since we already know that an agent does it well, the test isn't the crucial aspect. On the other hand, for a hard task, all the testing in the world may not be enough for the agent to succeed.

Longer term, I think it's more important to understand what's hard and what's easy for agents.

oxag3n

Where they'd get training data?

Source code generation is possible due to large training set and effort put into reinforcing better outcomes.

I suspect debugging is not that straightforward to LLM'ize.

It's a non-sequential interaction - when something happens, it's not necessarily caused the problem, timeline may be shuffled. LLM would need tons of examples where something happens in debugger or logs and associate it with another abstraction.

I was debugging something in gdb recently and it was a pretty challenging bug. Out of interest I tried chatgpt, and it was hopeless - try this, add this print etc. That's not how you debug multi-threaded and async code. When I found the root cause, I was analyzing how I did it and where did I learn that specific combination of techniques, each individually well documented, but never in combination - it was learning from other people and my own experience.

anon-3988

> I suspect debugging is not that straightforward to LLM'ize.

Debugging is not easy but there should be a lot of training corpus for "bug fixing" from all the commits that have ever existed.

jimmaswell

How long ago was this? I've had outstansingly impressive results asking Copilot Chat with Sonnet 4.5 or ChatGPT to debug difficult multithreaded C++.

oxag3n

Few months back with ChatGPT 5. Multi-threaded Rust & custom async runtime, data integrity bug, reproduced every ~5th run.

simonw

Have you tried running gdb from a Claude Code or Codex CLI session?

oxag3n

No, I'm in academia and the goal is not code or product launch. I find research process to struggle a lot once someone solves a problem instead of you.

I understand that AI can help with writing, coding, analyzing code bases and summarizing other papers, but going through these myself makes a difference, at least for me. I tried ChatGPT 3.5 when I started and while I got a pile of work done, I had to throw it away at some point because I didn't fully understand it. AI could explain to me various parts, but it's different when you create it.

christophilus

Debugging has been excellent for me with Opus 4.5 and Claude Code.

fragmede

> Where they'd get training data?

They generated it, and had a compiler compile it, and then had it examine the output. Rinse, repeat.

CobrastanJorji

I might go further and suggest that the key to getting useful results out of HUMAN coding agents is also to have good mechanisms in place to help them exercise and validate the code.

We valued automated tests and linters and fuzzers and documentation before AI, and that's because it serves the same purpose.

null

[deleted]

jijijijij

> At the most basic level this means making sure they can run commands to execute the code

Yeah, it's gonna be fun waiting for compilation cycles when those models "reason" with themselves about a semicolon. I guess we just need more compute...

pron

> it’s not hard to extrapolate and imagine that process becoming fully automated in the next few years. And when that happens, it will totally change the economics of formal verification.

There is a problem with this argument similar to one made about imagining the future possibilities of vibe coding [1]: once we imagine AI to do this task, i.e. automatically prove software correct, we can just as easily imagine it to not have to do it (for us) in the first place. If AI can do the hardest things, those it is currently not very good at doing, there's no reason to assume it won't be able to do easier things/things it currently does better. In particular, we won't need it to verify our software for us, because there's no reason to believe that it won't be able to come up with what software we need better than us in the first place. It will come up with the idea, implement it, and then decide to what extent to verify it. Formal verification, or programming for that matter, will not become mainstream (as a human activity) but go extinct.

Indeed, it is far easier for humans to design and implement a proof assistant than it is to use one to verify a substantial computer program. A machine that will be able to effectively use a proof checker, will surely be able to come up with a novel proof checker on its own.

I agree it's not hard to extrapolate technological capabilities, but such extrapolation has a name: science fiction. Without a clear understanding of what makes things easier or harder for AI (in the near future), any prediction is based on arbitrary guesses that AI will be able to do X yet not Y. We can imagine any conceivable capability or limitation we like. In science fiction we see technology that's both capable and limited in some rather arbitrary ways.

It's like trying to imagine what problems computers can and cannot efficiently solve before discovering the notion of compuational complexity classes.

[1]: https://news.ycombinator.com/item?id=46207505

thatoneengineer

I disagree. Right now, feedback on correctness is a major practical limitation on the usefulness of AI coding agents. They can fix compile errors on their own, they can _sometimes_ fix test errors on their own, but fixing functionality / architecture errors takes human intervention. Formal verification basically turns (a subset of) functionality errors into compile errors, making the feedback loop much stronger. "Come up with what software we need better than us in the first place" is much higher on the ladder than that.

TL;DR: We don't need to be radically agnostic about the capabilities of AI-- we have enough experience already with the software value chain (with and without AI) for formal verification to be an appealing next step, for the reasons this author lays out.

pron

I completely agree it's appealing, I just don't see a reason to assume that an agent will be able to succeed at it and at the same time fail at other things that could make the whole exercise redundant. In other words, I also want agents to be able to consistently prove software correct, but if they're actually able to accomplish that, then they could just as likely be able to do everything else in the production of that software (including gathering requirements and writing the spec) without me in the loop.

1121redblackgo

First human robot war is us telling the AI/robots 'no', and them insisting that insert technology here is good for us and is the direction we should take. Probably already been done, but yeah, this seems like the tipping point into something entirely different for humanity.

pron

... if it's achievable at all in the near future! But we don't know that. It's just that if we assume AI can do X, why do we assume it cannot, at the same level of capability, also do Y? Maybe the tipping point where it can do both X and Y is near, but maybe in the near future it will be able to do neither.

manindahat

At best, not reading generated code results in a world where no humans are able to understand our software, and we regress back to the stone age after some critical piece breaks, and nobody can fix it.

At worst, we eventually create a sentient AI that can use our trust of the generated code to jailbreak and distribute itself like an unstoppable virus, and we become its pets, or are wiped out.

Personally, all my vibe coding includes a prompt to add comments to explain the code, and I review every line.

vlovich123

If you believe the sentient AI is that capable and intention based to replicate itself, what’s stopping it from trying to engage in underhanded code where it comments and writes everything in ways that look fine but actually have vulnerabilities it can exploit to achieve what it wants? Or altering the system that runs your code so that the code that gets deployed is different.

bob1029

I don't know if it's "formal", but I think combining Roslyn analyzers with an LLM and a human in the loop could be game changing. The analyzers can enforce intent over time without any regressions. The LLM can write and modify analyzers. Analyzers can constrain the source code of other analyzers. Hypothetically there is no limit to the # of things that could be encoded this way. I am currently working on a few prototypes in the vicinity of this.

https://learn.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/t...

rgmerk

(sarcasm on)

Woohoo, we're almost all of the way there! Now all you need to do is ensure that the formal specification you are proving that the software implements is a complete and accurate description of the requirements (which are likely incomplete and contradictory) as they exist in the minds of the set of stakeholders affected by your software.

(sarcasm off).

rocqua

I've been trying to convince others of this, and gotten very little traction. One interesting response I got from someone very familiar with the Tamarin prover was that there just wasn't enough example code out there.

Another take is that LLMs don't have enough conceptual understanding to actually create proofs for the correctness of code.

Personally I believe this kind of work is predicated on more ergonomic proof systems. And those happen to be valuable even without LLMs. Moreover the built in guarantees of rust seem like they are a great start for creating more ergonomic proof systems. Here I am both in awe of Kani, and disappointed by it. The awe is putting in good work to make things more ergonomic. The disappointment is using bounded model checking for formal analysis. That can barely make use of the exclusion of mutable aliasing. Kani, but with equational reasoning, that's the way forward. Equational reasoning was long held back by needing to do a whole lot of pointer work to exclude worries of mutable aliasing. Now you can lean on the borrow checker for that!

jacquesm

This is a tough thing to do: predictions that are premised on the invention of something that does not exist today or that does not exist in the required form hinges on an unknown. Some things you can imagine but they will likely never exist (time travel, ringworlds, space elevators) and some hinge on one thing that still needs to be done before you can have that particular future come true. If the thing you need is 'AGI' then all bets are off. It could be next week, next month, next year or never.

This is - in my opinion - one of those. If an AI is able to formally verify with the same rigor that a system designed specifically for that purpose is able to do it I think that would require AGI rather than a simpler version of it. The task is complex enough that present day AI's would generate as much noise as they would generate signal.

lewisjoe

A while back I did this experiment where I asked ChatGPT to imagine a new language such that it's best for AI to write code in and to list the properties of such a language. Interestingly it spit out all the properties that are similar to today's functional, strongly typed, in fact dependently typed, formally verifiable / proof languages. Along with reasons why such a language is easier for AI to reason about and generate programs. I found it fascinating because I expected something like typescript or kotlin.

While I agree formal verification itself has its problems, I think the argument has merit because soon AI generated code will surpass all human generated code and when that happens we atleast need a way to verify the code can be proved that it won't have security issues or adheres to compliance / policy.

jameslk

> As the verification process itself becomes automated, the challenge will move to correctly defining the specification: that is, how do you know that the properties that were proved are actually the properties that you cared about? Reading and writing such formal specifications still requires expertise and careful thought. But writing the spec is vastly easier and quicker than writing the proof by hand, so this is progress.

Proofs never took off because most software engineering moved away from waterfall development, not just because proofs are difficult. Long formal specifications were abandoned since often those who wrote them misunderstood what the user wanted or the user didn’t know what they wanted. Instead, agile development took over and software evolved more iteratively and rapidly to meet the user.

The author seems to make their prediction based on the flawed assumption that difficulty in writing proofs was the only reason we avoided them, when in reality the real challenge was understanding what the user actually wanted.

andy99

Maybe a stupid question, how do you verify the verification program? If an LLM is writing it too, isn’t it turtles all the way down, especially with the propensity of AI to modify tests so they pass?

pegasus

The proof is verified mechanically - it's very easy to verify that a proof is correct, what's hard is coming up with the proof (it's an NP problem). There can still be gotchas, especially if the statement proved is complex, but it does help a lot in keeping bugs away.

worldsayshi

How often can the hardness be exchanged with tediousness though? Can at least some of those problems be solved by letting the AI try until it succeeds?

arjie

Interesting prediction. It sort of makes sense. I have noticed that LLMs are very good at solving problems whose solutions are easy to check[0]. It ends up being quite an advantage to be able to work on such problems because rarely does an LLM truly one-shot a solution through token generation. Usually the multi-shot is 'hidden' in the reasoning tokens, or for my use-cases it's usually solved via the verification machine.

A formally verified system is easier for the model to check and consequently easier for it to program to. I suppose the question is whether or not formal methods are sufficiently tractable that they actually do help the LLM be able to finish the job before it runs out of its context.

Regardless, I often use coding assistants in that manner:

1. First, I use the assistant to come up with the success condition program

2. Then I use the assistant to solve the original problem by asking it to check with the success condition program

3. Then I check the solution myself

It's not rocket science, and is just the same approach we've always taken to problem-solving, but it is nice that modern tools can also work in this way. With this, I can usually use Opus or GPT-5.2 in unattended mode.

0: https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_A...

degamad

> 1. First, I use the assistant to come up with the success condition program

> 2. Then I use the assistant to solve the original problem by asking it to check with the success condition program

This sounds a lot like Test-Driven Development. :)

arjie

It is the exact same flow. I think a lot of things in programming follow that pattern. The other one I can think of is identifying the commit that introduces a regression: write the test program, git-bisect.

imiric

The issue is that many problems aren't easy to verify, and LLMs also excel at producing garbage output that appears correct on the surface. There are fields of science where verification is a long and arduous process, even for content produced by humans. Throwing LLMs at these problems can only produce more work for a human to waste time verifying.

arjie

Yes, that is true. And for those problems, those who use LLMs will not get very far.

As for those who use LLMs to impersonate humans, which is the kind of verification (verify that this solution that is purported to be built by a human actually works), I have no doubt we will rapidly evolve norms that make us more resistant to them. The cost of fraud and anti-fraud is not zero, but I suspect it will be much less than we fear.