Semantic unit testing: test code without executing it
35 comments
· May 3, 2025
motorest
> That doesn't mean that we cannot use LLMs to build that software, but it does mean that in the end every line of code must be validated to make sure there's no issues injected by the LLM tools that inherently (...)
The problem with your assertion is that it fails to understand that today's software, where every single line of code was typed in by real flesh-and-bone humans, already fails to have adequate test coverage, let alone be validated.
The main problem with output from LLMs is that they were trained on code written by humans, and thus they accurately reflect the quality of the code that's found in the wild. Consequently, your line of reasoning actually criticizes LLMs for outputting the same unreliable code that people write.
Counterintuitively, LLMs end up generating better output because at least they are designed to simplify the task of automatically generating tests.
PeterStuer
If you are working with natural language, it is by definition 'fuzzy' unless you reduce it to simple templates. So to evaluate whether an output is semantically reasonable, e.g. a sensible answer to an input where non-templated natural verbalization is needed, you need something that 'tests' the output, and that something is not going to be purely 'logical'.
Will that test be perfect? No. But what is the alternative?
darawk
This particular person seems to be using LLMs for code review, not generation. I agree that the problem is compounded if you use an LLM (esp. the same model) on both sides. However, it seems reasonable and useful to use it as an adjunct to other forms of testing, though not necessarily a replacement for them. That said, the degree to which it can be a replacement is a function of the level of the technology, and it is currently at the level where it can probably replace some traditional testing methods, though it's hard to know which ones ex ante.
edit: of course, maybe that means we need a meta-suite, that uses a different LLM to tell you which tests you should write yourself and which tests you can safely leave to LLM review.
RainyDayTmrw
I'm skeptical. Most of us maintaining medium sized codebases or larger are constantly fighting nondeterminism in the form of flaky tests. I can't imagine choosing a design that starts with nondeterminism baked in.
And if you're really dead-set on paying with nondeterminism to get more coverage, property-based testing has existed for a long time and has a comparatively solid track record.
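For concreteness, here's roughly what that looks like in Python with hypothesis (the multiply function is just a stand-in for the article's running example):
```python
# pip install hypothesis -- a minimal property-based test sketch
from hypothesis import given, strategies as st

def multiply(a: int, b: int) -> int:
    """Multiply two positive integers."""
    return a * b

@given(st.integers(min_value=1), st.integers(min_value=1))
def test_multiply_properties(a, b):
    # Many generated inputs, but the properties themselves are deterministic.
    assert multiply(a, b) == multiply(b, a)
    assert multiply(a, b) >= a  # holds because b >= 1
```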
IshKebab
I agree. I want this as a code review tool to check if people forgot to update comments - "it looks like this now adds instead of multiplies, but the comment says otherwise; did you forget to update it?".
Seems of dubious value as unit tests. LLMs don't seem to be quite smart enough for that in my experience, unless your bugs are really as trivial as adding instead of multiplying, in which case god help you.
mrkeen
Couldn't put it better myself.
I have the toughest time trying to communicate why f(x) should equal f(x) in the general case.
Garlef
Hm... I think you have a good point.
Maybe the non-determinism can be reduced by caching: Just reevaluate the spec if the code actually changes?
I think there are also other problems (inlining a verbal description makes the codebase verbose; writing a precise, non-ambiguous verbal description might be more work than writing unit tests).
carlmr
>Maybe the non-determinism can be reduced by caching: Just reevaluate the spec if the code actually changes?
That would be good anyway to keep the costs reasonable.
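A rough sketch of that caching idea (all names hypothetical; llm_review stands in for whatever call the tool actually makes):
```python
import hashlib
import inspect
import json
from pathlib import Path

CACHE = Path(".semantic_test_cache.json")

def cached_llm_review(func, llm_review):
    """Re-run the expensive, nondeterministic LLM check only when the
    function's source (including its docstring) actually changes."""
    source = inspect.getsource(func)
    key = hashlib.sha256(source.encode()).hexdigest()
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    if key not in cache:
        cache[key] = llm_review(source)
        CACHE.write_text(json.dumps(cache))
    return cache[key]
```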
jonathanlydall
If you’re stuck with dynamically typed languages, then tests like this can make a lot of sense.
In statically typed languages this happens for free at compile time.
I've often heard proponents of dynamically typed languages say how all the typing and boilerplate required by statically typed languages feels like such a waste of time, and on a small enough system maybe they are right.
But on any significant sized code bases, they pay dividends over and over by saving you from having to make tests like this.
They also allow trivial refactoring that people using dynamically typed languages wouldn’t even consider due to the risk being so high.
So keep this all in mind when you next choose your language for a new project.
motorest
> But on any significant sized code bases, they pay dividends over and over by saving you from having to make tests like this.
I firmly believe that the group of people who laud dynamically typed languages as efficient time-savers that help shed the drudge work of typing is tightly correlated with the group of people who fail to establish any form of quality assurance or testing, often using the same arguments as justification.
0xDEAFBEAD
The question I find interesting is whether type systems are an efficient way to buy reliability relative to other ways to purchase reliability, such as writing tests, doing code review, or enforcing immutability.
Of course, some programmers just don't care about purchasing reliability. Those are the ones who eschew type systems and tests, and produce unreliable software, about like you'd expect. But for my purposes, this is beside the point.
globular-toast
Rubbish, in my experience. People who understand dynamic languages know they need to write tests because it's the only thing asserting correctness. I could just as easily say static people don't write tests because they think the type system is enough. A type system is laughably bad at asserting correct behaviour.
Personally I do use type hinting and mypy for much of my Python code. But I'll most certainly omit it for throwaway scripts and trivial stuff. I'm still not convinced it's really worth the effort, though. I've had a few occasions where the type checker has caught something important, but most of the time it's an autist trap where you spend ages making it correct "just because".
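For what it's worth, here's the kind of thing the type checker does and doesn't catch (made-up example):
```python
def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent (0-100)."""
    return price * (1 - percent / 100)

apply_discount("19.99", 10)  # mypy flags this: str is not compatible with float
apply_discount(10, 19.99)    # swapped arguments type-check fine; only a test
                             # or a reviewer will notice the wrong behaviour
```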
motorest
> Rubbish, in my experience. People who understand dynamic languages know they need to write tests because it's the only thing asserting correctness.
Tests don't assert correctness. At best they verify specific invariants.
Statically typed languages lean on the compiler to automatically verify some classes of invariants (e.g., can I call this method on this object?).
With dynamically typed languages, you cannot lean on the compiler to verify these invariants. Developers must fill in this void by writing their own tests.
It's true that they "need" to do it to avoid some classes of runtime errors that are only possible in dynamically typed languages. But that's not the point. The point is that those who complain that statically typed languages are too cumbersome because they require boilerplate code for things like compile-time type checking are also correlated with the set of developers who fail to invest any time adding or maintaining automated test suites, for the same reasons.
> I could just as easily say static people don't write tests because they think the type system is enough. A type system is laughably bad at asserting correct behaviour.
No, you can't. Developers who use statically typed languages don't even think of type checking as a concern, let alone a quality assurance issue.
0xDEAFBEAD
Dan Luu looked at the literature and concluded that the evidence for the benefit of types is underwhelming:
https://danluu.com/empirical-pl/
>But on any significant sized code bases, they pay dividends over and over by saving you from having to make tests like this.
OK, but if the alternative to tests is spending more time on a reliability method (type annotations) which buys you less reliability compared to writing tests... it's hardly a win.
It fundamentally seems to me that there are plenty of bugs that types can simply never catch. For example, if I have a "divide" function and I accidentally swap the numerator and divisor arguments, I can't think of any realistic type system which will help me. Other methods for achieving reliability, like writing tests or doing code review, don't seem to have the same limitations.
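Concretely (toy example): the two arguments have the same type, so no annotation distinguishes them, but a one-line test catches the swap immediately:
```python
def divide(numerator: float, divisor: float) -> float:
    return divisor / numerator  # bug: arguments swapped, yet it type-checks

def test_divide():
    assert divide(10, 2) == 5  # fails, catching what the type system can't
```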
ngruhn
I think at least some people who say this think of Java-esque type systems. And there I agree: it is a boilerplate nightmare.
dragonwriter
This is more of "LLM code review" than any kind of testing, and calling it "testing" is just badly misleading.
spiddy
this. Let’s not confuse meanings. There are multiple ways to improve the quality of code. Testing is one, code review is another. This belongs to the latter.
IshKebab
Yeah this sounds like a good way to detect out of date comments. I would have focused on that.
anself
Agree, it's not testing. The problem is here: "In a typical testing workflow, you write some basic tests to check the core functionality. When a bug inevitably shows up—usually after deployment—you go back and add more tests to cover it. This process is reactive, time-consuming, and frankly, a bit tedious."
This is exactly the problem that TDD solves. One of the most compelling reasons for test-first is that "running the code in your head" does not actually work well in practice, leading to the issues cited above. This is just another variant of "running the code in your head", except an LLM is doing it. Strong TDD practices (don't write any code without a test to support it) will close those gaps. It may feel tedious at first, but the safety it creates will leave you never wanting to go back.
Where this could be safe and useful: Find gaps in the test-set. Places where the code was never written because there wasn't a test to drive it out. This is one of the hardest parts of TDD, and where LLMs could really help.
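In pytest terms, the discipline is simply to write something like this first, watch it fail, and only then write the code (names hypothetical):
```python
# tests/test_pricing.py -- written before pricing.py exists
from pricing import total_price  # hypothetical module driven out by this test

def test_bulk_discount_applies_at_ten_items():
    assert total_price(unit=2.0, quantity=10) == 18.0  # 10% off at 10+ items
    assert total_price(unit=2.0, quantity=5) == 10.0   # no discount below 10
```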
yuliyp
Did the author do any analysis of the effectiveness of their tool on something beyond multiplication? Did they look to see if it caught any bugs in any codebases? What's the false positive rate? False negative?
As is, it's neat that they wrote some code to generate some prompts for an LLM, but there's no indication that it actually works.
motorest
> Did the author do any analysis of the effectiveness of their tool on something beyond multiplication? Did they look to see if it caught any bugs in any codebases? What's the false positive rate? False negative?
I would also add the concern about whether the tests are actually deterministic.
The premise is also dubious, as docstring comments typically hold only very high-level descriptions of the implementation and often aren't even maintained. Writing a specification of what a function is expected to do is what writing tests is all about, and with LLMs these are a terse prompt away.
masklinn
> But here’s the catch: you’re missing some edge cases. What about negative inputs?
The docstring literally says it only works with positive integers, and the LLM is supposed to follow the docstring (per previous assertions).
> The problem is that traditional tests can only cover a narrow slice of your function’s behavior.
Property tests? Fuzzers? Symbolic execution?
> Just because a high percentage of tests pass doesn’t mean your code is bug-free.
Neither does this thing. If you want your code to be bug-free what you're looking for is a proof assistant not vibe-reviewing.
Also
> One of the reasons to use suite is its seamless integration with pytest.
Exposing a predicate is not "seamless integration with pytest", it's just exposing a predicate.
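To spell out what it amounts to (hypothetical names, not the library's actual API): a boolean check wrapped in an ordinary test function, which any library can offer:
```python
# Hypothetical sketch of "pytest integration" via an exposed predicate
from semantic_checker import matches_docstring  # hypothetical predicate
from mymodule import multiply                   # hypothetical code under review

def test_multiply_matches_docstring():
    assert matches_docstring(multiply)
```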
stoical1
Test driving a car by looking at it
simianwords
I was a bit skeptical at first but I think this is a good idea. Although I'm not convinced with the usage of max_depth parameter. In real life you rarely know what type your dependencies are if they are loaded at run time. This is kind of why we explicitly mock our dependencies.
On a side note: I have wondered whether LLMs are particularly good with functional languages. Imagine if your code consisted entirely of pure functions with no side effects. You pass in all required parameters and do not use static methods/variables or OOP concepts like inheritance. I imagine every program can be converted in such a way, the tradeoff being human readability.
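Roughly what I mean (toy example):
```python
# Stateful/OOP style: behaviour depends on hidden, mutable state
class Cart:
    def __init__(self):
        self.items = []

    def add(self, price):
        self.items.append(price)

    def total(self):
        return sum(self.items)

# Pure style: everything is passed in, nothing is mutated
def add_item(items: tuple[float, ...], price: float) -> tuple[float, ...]:
    return items + (price,)

def cart_total(items: tuple[float, ...]) -> float:
    return sum(items)
```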
cerpins
It sounds like it might be a good use case for testing documentation - verifying whether what documentation describes is actually in accordance with the code, and then you can act on it. With that in mind, it's also probably pointless to re-run if relevant code or documentation hasn't changed.
rollulus
I wonder if the random component of the LLM makes every test flaky by definition.
gnabgib
This seems to be your site, @op. Your CSS needs attention. On a narrower screen (i.e. portrait) the text is enormous, and worse, zooming out shrinks the quantity of words (increases the font size), which is surely the opposite of expected? It's basically unusable.
Your CSS seems to assume all portrait screens (whether 80" or 3") deserve the same treatment.
stephantul
This is cool! I think that, in general, generating test cases “offline” using an LLM and then running them using regular unit testing also solves this particular issue.
It also might be more transparent and cheaper.
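Something along these lines, maybe (sketch; assumes the OpenAI Python client, and the model name and file paths are placeholders): generate once, review the output by hand, commit it, and from then on it runs as ordinary, deterministic pytest:
```python
# generate_tests.py -- run occasionally; the output is reviewed and committed
import inspect
from openai import OpenAI  # assumes the openai package; any client would do

from mymodule import multiply  # hypothetical function under test

client = OpenAI()
prompt = ("Write pytest unit tests for this function, covering edge cases:\n"
          + inspect.getsource(multiply))
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
with open("tests/test_multiply_generated.py", "w") as f:
    f.write(resp.choices[0].message.content)
```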
vouwfietsman
Maybe someone can help me out here:
I always get the feeling that fundamentally our software should be built on a foundation of sound logic and reasoning. That doesn't mean that we cannot use LLMs to build that software, but it does mean that in the end every line of code must be validated to make sure there's no issues injected by the LLM tools that inherently lack logic and reasoning, or at least such validation must be on par with human authored code + review. Because of this, the validation cannot be done by an LLM, as it would just compound the problem.
Unless we get a drastic change in the level of error detection and self-validation that can be done by an LLM, this remains a problem for the foreseeable future.
How is it then that people build tooling where the LLM validates the code they write? Or claim 2x speedups for code written by LLMs? Is there some kind of false positive/negative tradeoff I'm missing that allows people to extract robust software from an inherently not-robust generation process?
I'm not talking about search and documentation, where I'm already seeing a lot of benefit from LLMs today, because between the LLM output and the code is me, sanity checking and filtering everything. What I'm asking about is the: "LLM take the wheel!" type engineering.