Using LLMs to enhance our testing practices

renegade-otter

In every single system I have worked on, tests were not just tests - they were their own parallel application, one that required careful architecture and constant refactoring to keep it from getting out of hand.

"More tests" is not the goal - you need to write high impact tests, you need to think about how to test the most of your app surface with least amount of test code. Sometimes I spend more time on the test code than the actual code (probably normal).

Also, I feel like people would be inclined to go with whatever the LLM gives them, as opposed to really sitting down and thinking about all the unhappy paths and edge cases of UX. Using an autocomplete to "bang it out" seems foolish.

swatcoder

Fully agreed.

It's bad enough when human team members are submitting useless, brittle tests with their PRs just to satisfy some org pressure to write them. The lazy ones provide a false sense of security even though they neglect critical scenarios, the unstable ones undermine trust in the test output because they intermittently raise false alarms that nobody has time to debug, and the pointless ones do nothing but reify the current architecture, so it becomes too laborious to refactor anything.

As contextually aware generators, LLMs doubtless have good uses in test development, but (as with many other domains) they threaten to amplify an already troubling problem with low-quality, high-volume content spam.

iambateman

I did this for Laravel a few months ago and it’s great. It’s basically the same as the article describes, and it has definitely increased the number of tests I write.

Happy to open-source it if anyone is interested.

simonw

If you add "white-space: pre-wrap" to the elements containing those prompt examples you'll avoid the horizontal scrollbar (which I'm getting even on desktop) and make them easier to read.

johnjwang

Thanks for the suggestion -- I'll look into adding this!

apwell23

I would love to use it to change code in ways that still compile and see if the tests fail. Coverage metrics sometimes don't really tell you whether a piece of code is actually covered or not.

sesm

Coverage metrics can tell you whether lines of code were executed, but they can't tell you whether the results of that execution were checked.
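
(What apwell23 describes is essentially mutation testing; tools like mutmut can automate it for Python.) A minimal sketch of the gap, with a made-up function: the first test executes every line, so coverage reports it as covered, yet nothing is asserted and a broken mutant would still pass; only the second test would catch it.

    # Made-up function under test.
    def apply_discount(price: float, percent: float) -> float:
        return price * (1 - percent / 100)

    def test_discount_covered_but_unchecked():
        # Every line of apply_discount runs, so line coverage reads 100%...
        apply_discount(200.0, 50.0)
        # ...but nothing is asserted, so a mutant such as `1 + percent / 100`
        # would still pass.

    def test_discount_checked():
        # This is the assertion a mutated implementation would trip over.
        assert apply_discount(200.0, 50.0) == 100.0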

satisfice

Like nearly all the articles about AI doing "testing" or any other skilled activity, the last part of it admits that it is an unreliable method. What I don't see in this article -- which I suspect is because they haven't done any -- is any description of a competent and reasonably complete testing process for this method of writing "tests." What they probably did is try this, feel good about it (because testing is not their passion, so they are easily impressed), and then mark it off in their minds as a solved problem.

The retort by AI fanboys is always "humans are unreliable, too." Yes, they are. But they have other important qualities: accountability, humility, legibility, and the ability to learn experientially as well as conceptually.

LLMs are good at instantiating typical or normal patterns (based on their training data). Skilled testing cannot be limited to typicality, although that's a start. What I'd say is that this is an interesting idea that has an important hazard associated with it: complacency on the part of the developer who uses this method, which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.

johnjwang

Author here: Yes, there are certain functions where writing good tests will be difficult for an LLM, but in my experience I've found that the majority of functions that I write don't need anything out of the ordinary and are relatively straightforward.

Using LLMs allows us to have much higher coverage than we would otherwise. To me and our engineering team, this is a pretty good thing: in the time-prioritization matrix, if I can get a higher-quality code base with higher test coverage for minimal extra work, I will definitely take it (and in fact it's something I encourage our engineering teams to do).

Most of the base tests that we use were created originally by some of our best engineers. The patterns they developed are used throughout our code base and LLMs can take these and make our code very consistent, which I also view as a plus.
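
For illustration only (a generic sketch, not code from the article), a shared fixture is the kind of pattern that keeps generated tests consistent:

    # conftest.py -- a shared fixture written once by the team (hypothetical).
    import pytest

    @pytest.fixture
    def anonymous_client():
        from myapp.testing import make_test_client  # hypothetical helper
        return make_test_client(user=None)

    # test_orders.py -- a generated test reuses the fixture, so every new
    # test follows the same setup and assertion shape.
    def test_orders_requires_login(anonymous_client):
        response = anonymous_client.get("/orders")
        assert response.status_code == 401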

re: Complacency: We actually haven't found this to be the case. In fact, we've seen more tests being written with this method. Just think about how much easier it is to review a PR and make edits than to write one. You can actually spend your time enforcing higher-quality tests because you don't have to write most of the boilerplate for the test yourself.

youoy

I would say that the complacency lies in equating good tests with good coverage. I agree that writing tests is one of the best use cases for LLMs, and it definitely saves engineers a lot of time. But if you follow them too blindly, it is easy to get carried away by how easy it is to write tests that chase coverage instead of actually testing the things that matter. Which is what the previous comment was pointing at:

> which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.

satisfice

Have you systematically tested this approach? It sounds like you are reporting on your good vibes. Your writing is strictly anecdotal.

I’ve been working with AI, too. I see what I’m guessing is the same unreliability that you admit in the last part of your article. For some reason, you are sanguine about it, whereas I see it as a serious problem.

You say you aren’t complacent, but your words don’t seem to address the complacency issue. “More tests” does not mean better testing, or even good enough testing.

Google “automation bias” and tell me what policies, procedures, or training are in place to avoid it.

wenc

I do use LLMs to bootstrap my unit testing (because there is a lot of boilerplate in unit tests and mocks), but I tend to finish the unit tests myself. This gives me confidence that my tests are accurate to the best of my knowledge.
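
A rough sketch of that split, with made-up names: the mock wiring is the boilerplate an LLM drafts well, while the final assertion is the part worth writing, or at least verifying, by hand.

    from unittest.mock import Mock, patch

    from billing import invoice_service  # hypothetical module under test

    def test_send_invoice_emails_customer():
        # Boilerplate an LLM can draft: constructing and patching the mocks.
        customer = Mock(email="a@example.com")
        with patch.object(invoice_service, "send_email") as send_email:
            invoice_service.send_invoice(customer, amount_cents=1250)

        # The part worth finishing by hand: the behaviour being asserted.
        send_email.assert_called_once_with("a@example.com", amount_cents=1250)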

Having good tests allows me to be more liberal with LLMs on implementation. I still only use LLMs to bootstrap the implementation, and I finish it myself. LLMs, being generative, are really good for ideating different implementations (they propose implementations that I would never have thought of), but I never take any implementation as-is -- I always step through it and finish it off manually.

Some might argue that it'd be faster if I wrote the entire thing myself, but it depends on the problem domain. So much of what I do involves implementing code for unsolved problems (I'm not writing CRUD apps, for instance) that I really do get a speed-up from LLMs.

I imagine folks writing conventional code might spend more time fixing LLM mistakes and thus think that LLMs slow them down. But this is not true for my problem domain.

simonw

The answer to this is code review. If an LLM writes code for you - be it implementation or tests - you review it before you land it.

If you don't understand how the code works, don't approve it.

Sure, complacent developers will get burned. They'll find plenty of other non-AI ways to burn themselves too.

hitradostava

100% agree. We don't expect human developers to be perfect, so why should we expect AI assistants to be? Code going to production should go through review.

I do think that LLMs will increase the volume of bad code, though. I use Cursor a lot, and occasionally it will produce perfect code, but often I need to direct and refine it, and sometimes throw it away. But I'm sure many devs will get lazy and just push once they've got the thing working...

sdesol

> 100% agree. We don't expect human developers to be perfect, so why should we expect AI assistants to be?

I think the issue is that we are currently being sold on the idea that they are. I'm blown away by how useful AI is, and by how stupid it can be at the same time. Take a look at the following example:

https://app.gitsense.com/?doc=f7419bfb27c896&highlight=&othe...

If you click on the sentence, you can see how dumb Sonnet-3.5 and GPT-4 can be. Each model was asked to spell-check and grammar-check the sentence 5 times, and you can see that GPT-4o-mini was the only one that got it right all 5 times. The other models mostly got it comically wrong.

I believe LLMs are going to change things for the better for developers, but we need to set expectations properly. I suspect this will be difficult, since a lot of VC money is being pumped into AI.

I also think a lot of mistakes can be prevented if you ask the model, in your prompt, to explain how and why it did what it did. For example, the prompt that was used in the blog post should also include "After writing the test, summarize how each rule was applied."
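
A hedged sketch of what that addition might look like at the end of a test-generation prompt (the actual prompt from the post is not reproduced here):

    [existing instructions and testing rules from the prompt go here]

    After writing the tests, summarize how each rule was applied and why,
    so the reviewer can quickly spot any rule that was skipped.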
