Using LLMs to enhance our testing practices

renegade-otter

In every single system I have worked on, tests were not just tests - they were their own parallel application, one that required careful architecture and constant refactoring to keep it from getting out of hand.

"More tests" is not the goal - you need to write high impact tests, you need to think about how to test the most of your app surface with least amount of test code. Sometimes I spend more time on the test code than the actual code (probably normal).

Also, I feel like people would be inclined to go with whatever the LLM gives them, as opposed to really sitting down and thinking about all the unhappy paths and edge cases of UX. Using an autocomplete to "bang it out" seems foolish.

swatcoder

Fully agreed.

It's bad enough when human team members are submitting useless, brittle tests with their PRs just to satisfy some org pressure to write them. The lazy ones provide a false sense of security even though they neglect critical scenarios, the unstable ones undermine trust in the test output because they intermittently raise false alarms that nobody has time to debug, and the pointless ones do nothing but reify the current architecture, so it becomes too laborious to refactor anything.

As contextually aware generators, LLMs doubtless have good uses in test development, but (as with many other domains) they threaten to amplify an already troubling problem with low-quality, high-volume content spam.

iambateman

I did this for Laravel a few months ago and it’s great. It’s basically the same as the article describes, and it has definitely increased the number of tests I write.

Happy to open-source it if anyone is interested.

simonw

If you add "white-space: pre-wrap" to the elements containing those prompt examples you'll avoid the horizontal scrollbar (which I'm getting even on desktop) and make them easier to read.

johnjwang

Thanks for the suggestion -- I'll look into adding this!

apwell23

I would love to use it to change code in ways that still compile and see if the tests fail. Coverage metrics sometimes don't really tell you whether a piece of code is actually covered or not.

sesm

Coverage metrics can tell you whether lines of code were executed, but they can't tell you whether the results of that execution were checked.
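
(What apwell23 describes is essentially mutation testing; tools like mutmut can automate it for Python.) A minimal sketch of the gap, with a made-up function: the first test executes every line, so coverage reports it as covered, yet nothing is asserted and a broken mutant would still pass; only the second test would catch it.

    # Made-up function under test.
    def apply_discount(price: float, percent: float) -> float:
        return price * (1 - percent / 100)

    def test_discount_covered_but_unchecked():
        # Every line of apply_discount runs, so line coverage reads 100%...
        apply_discount(200.0, 50.0)
        # ...but nothing is asserted, so a mutant such as `1 + percent / 100`
        # would still pass.

    def test_discount_checked():
        # This is the assertion a mutated implementation would trip over.
        assert apply_discount(200.0, 50.0) == 100.0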

satisfice

Like nearly all the articles about AI doing "testing" or any other skilled activity, the last part of it admits that it is an unreliable method. What I don't see in this article -- which I suspect is because they haven't done any -- is any description of a competent and reasonably complete testing process for this method of writing "tests." What they probably did is try this, feel good about it (because testing is not their passion, so they are easily impressed), and then mark it off in their minds as a solved problem.

The retort by AI fanboys is always "humans are unreliable, too." Yes, they are. But they have other important qualities: accountability, humility, legibility, and the ability to learn experientially as well as conceptually.

LLMs are good at instantiating typical or normal patterns (based on their training data). Skilled testing cannot be limited to typicality, although that's a start. What I'd say is that this is an interesting idea that has an important hazard associated with it: complacency on the part of the developer who uses this method, which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.

johnjwang

Author here: Yes, there are certain functions where writing good tests will be difficult for an LLM, but in my experience I've found that the majority of functions that I write don't need anything out of the ordinary and are relatively straightforward.

Using LLMs allows us to have much higher coverage than we would otherwise. To me and our engineering team, this is a pretty good thing: in the time-prioritization matrix, if I can get a higher-quality code base with higher test coverage for minimal extra work, I will definitely take it (and in fact it's something I encourage our engineering teams to do).

Most of the base tests that we use were created originally by some of our best engineers. The patterns they developed are used throughout our code base and LLMs can take these and make our code very consistent, which I also view as a plus.
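
For illustration only (a generic sketch, not code from the article), a shared fixture is the kind of pattern that keeps generated tests consistent:

    # conftest.py -- a shared fixture written once by the team (hypothetical).
    import pytest

    @pytest.fixture
    def anonymous_client():
        from myapp.testing import make_test_client  # hypothetical helper
        return make_test_client(user=None)

    # test_orders.py -- a generated test reuses the fixture, so every new
    # test follows the same setup and assertion shape.
    def test_orders_requires_login(anonymous_client):
        response = anonymous_client.get("/orders")
        assert response.status_code == 401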

re: Complacency: We actually haven't found this to be the case. In fact, we've seen more tests being written with this method. Just think about how much easier it is to review a PR and make edits than to write one. You can actually spend your time enforcing higher-quality tests because you don't have to write most of the boilerplate for the test yourself.

youoy

I would say that the complacency lies in equating good tests with good coverage. I agree that writing tests is one of the best use cases for LLMs, and it definitely saves engineers a lot of time. But if you follow them too blindly, it is easy to get carried away by how easy it is to write tests that chase coverage instead of actually testing the things that matter. Which is what the previous comment was pointing at:

> which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.

satisfice

Have you systematically tested this approach? It sounds like you are reporting on your good vibes. Your writing is strictly anecdotal.

I’ve been working with AI, too. I see what I’m guessing is the same unreliability that you admit in the last part of your article. For some reason, you are sanguine about it, whereas I see it as a serious problem.

You say you aren’t complacent, but your words don’t seem to address the complacency issue. “More tests” does not mean better testing, or even good enough testing.

Google “automation bias” and tell me what policies, procedures, or training are in place to avoid it.

wenc

I do use LLMs to bootstrap my unit testing (because there is a lot of boilerplate in unit tests and mocks), but I tend to finish the unit tests myself. This gives me confidence that my tests are accurate to the best of my knowledge.
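
A rough sketch of that split, with made-up names: the mock wiring is the boilerplate an LLM drafts well, while the final assertion is the part worth writing, or at least verifying, by hand.

    from unittest.mock import Mock, patch

    from billing import invoice_service  # hypothetical module under test

    def test_send_invoice_emails_customer():
        # Boilerplate an LLM can draft: constructing and patching the mocks.
        customer = Mock(email="a@example.com")
        with patch.object(invoice_service, "send_email") as send_email:
            invoice_service.send_invoice(customer, amount_cents=1250)

        # The part worth finishing by hand: the behaviour being asserted.
        send_email.assert_called_once_with("a@example.com", amount_cents=1250)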

Having good tests allows me to be more liberal with LLMs on implementation. I still only use LLMs to bootstrap the implementation, and I finish it myself. LLMs, being generative, are really good for ideating different implementations (they propose implementations that I would never have thought of), but I never take any implementation as-is -- I always step through it and finish it off manually.

Some might argue that it'd be faster if I wrote the entire thing myself, but it depends on the problem domain. So much of what I do involves implementing code for unsolved problems (I'm not writing CRUD apps, for instance) that I really do get a speed-up from LLMs.

I imagine folks writing conventional code might spend more time fixing LLM mistakes and thus think that LLMs slow them down. But this is not true for my problem domain.

simonw

The answer to this is code review. If an LLM writes code for you - be it implementation or tests - you review it before you land it.

If you don't understand how the code works, don't approve it.

Sure, complacent developers will get burned. They'll find plenty of other non-AI ways to burn themselves too.

hitradostava

100% agree. We don't expect human developers to be perfect, so why should we expect AI assistants to be? Code going to production should go through review.

I do think that LLMs will increase the volume of bad code, though. I use Cursor a lot, and occasionally it will produce perfect code, but often I need to direct and refine it, and sometimes throw it away. But I'm sure many devs will get lazy and just push once they've got the thing working...

sdesol

> 100% agree. We don't expect human developers to be perfect, so why should we expect AI assistants to be?

I think the issue is that we are currently being sold on the idea that they are. I'm blown away by how useful AI is, and by how stupid it can be at the same time. Take a look at the following example:

https://app.gitsense.com/?doc=f7419bfb27c896&highlight=&othe...

If you click on the sentence, you can see how dumb Sonnet-3.5 and GPT-4 can be. Each model was asked to spell-check and grammar-check the sentence 5 times, and you can see that GPT-4o-mini was the only one that got it right all 5 times. The other models mostly got it comically wrong.

I believe LLMs are going to change things for the better for developers, but we need to set expectations properly. I suspect this will be difficult, since a lot of VC money is being pumped into AI.

I also think a lot of mistakes can be prevented if you ask the model, in your prompt, to explain how and why it did what it did. For example, the prompt that was used in the blog post should also include "After writing the test, summarize how each rule was applied."
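
A hedged sketch of what that addition might look like at the end of a test-generation prompt (the actual prompt from the post is not reproduced here):

    [existing instructions and testing rules from the prompt go here]

    After writing the tests, summarize how each rule was applied and why,
    so the reviewer can quickly spot any rule that was skipped.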
