As an experienced LLM user, I don't use generative LLMs often
146 comments
·May 5, 2025
darepublic
So in my interactions with GPT, o3, and o4-mini, I am the organic middleman who copies and pastes code into the REPL and reports the output back to the model if anything goes wrong. And for me, past a certain point, even if you continually report back problems, it doesn't get any better in its new suggestions. It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process. Maybe the LLMs you are using are better than the ones I tried this with?
Specifically, I was researching a lesser known kafka-mqtt connector: https://docs.lenses.io/latest/connectors/kafka-connectors/si..., and o1 was hallucinating the configuration needed to support dynamic topics. The docs said one thing, and I even pointed out to o1 that the docs contradicted it, but it would stick to its guns. If I mentioned that the code wouldn't compile, it would start suggesting very implausible scenarios -- did you spell this correctly? Responses like that indicate you've reached a dead end. I'm curious how/if the "structured LLM interactions" you mention overcome this.
diggan
> And for me, past a certain point, even if you continually report back problems it doesn't get any better in its new suggestions. It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process.
It sucks, but the trick is to always restart the conversation/chat with a new message. I never go beyond one reply, and I also copy-paste a bunch. I got tired of copy-pasting, so I wrote something like a prompting manager (https://github.com/victorb/prompta) to make it easier and to avoid having to neatly format code blocks and so on.
Basically, make one message; if the reply is wrong, iterate on the prompt itself and start fresh, always. Don't try to correct it by adding another message; update the initial prompt to make it clearer and steer harder. (A rough sketch of that loop is below.)
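A rough sketch of that pattern (not prompta itself; chat() is a stand-in for whatever single-message API call you use, and the group_by check is a crude placeholder for actually running the generated code):

def chat(prompt: str) -> str:
    return "stub reply"  # replace with a real single-message API call

prompt = "Write a polars snippet that groups df by 'user' and sums 'amount'."
for attempt in range(3):
    reply = chat(prompt)
    if "group_by" in reply:  # stand-in for running/checking the code
        break
    # Don't send "that's wrong" as a second message in the same thread;
    # fold the correction into a clearer first message and start fresh.
    prompt += "\nUse polars' group_by (not pandas' groupby) and return a DataFrame."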
But I've noticed that every model degrades really quickly past the initial reply, no matter the length of each individual message. The companies seem to keep increasing the theoretical and practical context limits, but the quality degrades a lot faster even within those limits, and they don't seem to be trying to address that (nor do they have a way of measuring it).
mr_toad
> If I mentioned that the code wouldn't compile it would start suggesting very implausible scenarios
I have to chuckle at that because it reminds me of a typical response on technical forums long before LLMs were invented.
Maybe the LLM has actually learned from those responses and is imitating them.
nikita2206
You can have the agent search the web for documentation and then provide it to the LLM. That's why Context7 is currently very popular among AI users.
entropie
I used o4 to generate NixOS config files from the pasted modules' source files. At first it produced outdated config, but with context files it worked very well.
dingnuts
Kagi Assistant can do this too but I find it's mostly useful because the traditional search function can find the pages the LLM loaded into its context before it started to output bullshit.
It's nice when the LLM outputs bullshit, which is frequent.
jimbokun
I wonder if LLMs have been seen claiming “THERE’S A BUG IN THE COMPILER!”
A stage every developer goes through early in their development.
AlexCoventry
He does partially address this elsewhere in the blog post. It seems that he's mostly concerned about surprise costs:
> On paper, coding agents should be able to address my complaints with LLM-generated code reliability since it inherently double-checks itself and it’s able to incorporate the context of an entire code project. However, I have also heard the horror stories of people spending hundreds of dollars by accident and not getting anything that solves their coding problems. There’s a fine line between experimenting with code generation and gambling with code generation.
minimaxir
Less surprise costs, more wasting money and not getting proportionate value out of it.
zoogeny
For several moments in the article I had to struggle to continue. He is literally saying "as an experienced LLM user I have no experience with the latest tools". He gives a rationale as to why he hasn't used the latest tools which is basically that he doesn't believe they will help and doesn't want to pay the cost to find out.
I think if you are going to claim you have an opinion based on experience you should probably, at the least, experience the thing you are trying to state your opinion on. It's probably not enough to imagine the experience you would have and then go with that.
satvikpendem
Cursor also can read and store documentation so it's always up to date [0]. Surprised that many people I talk to about Cursor don't know about this, it's one of its biggest strengths compared to other tools.
vunderba
That sort of "REPL" system is why I really liked when they integrated a Python VM into ChatGPT - it wasn't perfect, but it could at least catch itself when the code didn't execute properly.
tptacek
Sure. But it's 2025 and however you want to get this feature, be it as something integrated into VSCode (Cursor, Windsurf, Copilot), or a command line Python thing (aider), or a command line Node thing (OpenAI codex and Claude Code), with a specific frontier coding model or with an abstracted multi-model thingy, even as an Emacs library, it's available now.
I see people getting LLMs to generate code in isolation and like pasting it into a text editor and trying it, and then getting frustrated, and it's like, that's not how you're supposed to be doing it anymore. That's 2024 praxis.
gnatolf
The churn of staying on top of this means to me that we'll also chew through experts of specific times much faster. Gone are the days of established, trusted top performers, as every other week somebody creates a newer, better way of doing things. Everybody is going to drop off the hot tech at some point. Very exhausting.
arthurcolle
I like using Jupyter Console as a primary interpreter, and then dropping into SQLite/duckdb to save data
Easy to script/autogenerate code and build out pipelines this way.
red_hare
It is a little crazy how fast this has changed in the past year. I got VSCode's agent mode to write, run, and read the output of unit tests the other day and boy it's a game changer.
surgical_fire
This has been my experience with any LLM I use as a code assistant. Currently I mostly use Claude 3.5, although I sometimes use Deepseek or Gemini.
The more prominent and widely used a language/library/framework, and the more "common" what you are attempting, the more accurate LLMs tend to be. The more you deviate from mainstream paths, the more you will hit such problems.
Which is why I find them most useful to help me build things when I am very familiar with the subject matter, because at that point I can quickly spot misconceptions, errors, bugs, etc.
That's when they hit the sweet spot of being a productivity tool, really improving the speed with which I write code (and sometimes the quality of what I write, by incorporating good practices I was unaware of).
steveklabnik
> The more prominent and widely used a language/library/framework, and the more "common" what you are attempting, the more accurate LLMs tend to be. The more you deviate from mainstream paths, the more you will hit such problems.
One very interesting variant of this: I've been experimenting with LLMs in a react-router based project. There's an interesting bit of development history: there was another project called Remix, and later versions of react-router effectively ate it; that is, as of December of last year, react-router 7 is effectively also Remix v3: https://remix.run/blog/merging-remix-and-react-router
Sometimes, the LLM will be like "oh, I didn't realize you were using remix" and start importing from it, when I in fact want the same imports, but from react-router.
All of this happened so recently, it doesn't surprise me that it's a bit wonky at this, but it's also kind of amusing.
zoogeny
In addition to choosing languages, patterns and frameworks that the LLM is likely to be well trained in, I also just ask it how it wants to do things.
For example, I don't like ORMs. There are reasons which aren't super important but I tend to prefer SQL directly or a simple query builder pattern. But I did a chain of messages with LLMs asking which would be better for LLM based development. The LLM made a compelling case as to why an ORM with a schema that generated a typed client would be better if I expected LLM coding agents to write a significant amount of the business logic that accessed the DB.
My dislike of ORMs is something I hold lightly. If I was writing 100% of the code myself then I would have breezed past that decision. But with the agentic code assistants as my partners, I can make decisions that make their job easier from their point of view.
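A rough illustration of the trade-off being described, assuming SQLAlchemy as the ORM and a made-up users table: with raw SQL, column names are just strings, while a declared schema gives an agent (and any type checker or lint step) something to catch hallucinated fields against.

import sqlite3
from sqlalchemy import create_engine, select, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, Session

# Raw SQL: terse, but a misspelled column only fails at runtime.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
rows = conn.execute("SELECT email FROM users WHERE id = ?", (1,)).fetchall()

# ORM with a declared schema: columns become typed attributes.
class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    email: Mapped[str] = mapped_column(String)

engine = create_engine("sqlite+pysqlite:///:memory:")
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(User(email="a@example.com"))
    session.commit()
    result = session.scalars(select(User).where(User.id == 1)).all()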
aerhardt
> the program doesn't compile
The issue you are addressing refers specifically to Python, which is not compiled... Are you referring to this workflow in another language, or by "compile" do you mean something else, such as using static checkers or tests?
Also, what tooling do you use to implement this workflow? Cursor, aider, something else?
dragonwriter
Python is, in fact, compiled (to bytecode, not native code); while this is mostly invisible, syntax errors will cause it to fail to compile, but the circumstances described (hallucinating a function) will not, because function calls are resolved by runtime lookup, not at compile time.
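A minimal, self-contained illustration of that distinction (made_up_function is, of course, made up):

# Syntax error: caught at compile time, before anything executes.
try:
    compile("def broken(:\n    pass", "<example>", "exec")
except SyntaxError as e:
    print("compile-time failure:", e)

# Hallucinated function: compiles cleanly, fails only when the call runs.
code = compile("made_up_function()", "<example>", "exec")
try:
    exec(code, {})
except NameError as e:
    print("runtime failure:", e)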
aerhardt
I get that, and in that sense most languages are compiled, but generally speaking, I've always understood "compiled" as compiled-ahead-of-time - Python certainly doesn't do that and the official docs call it an interpreted language.
In the context we are talking about (hallucinating Polars methods), if I'm not mistaken the compilation step won't catch that; Python will actually throw the error at runtime, post-compilation.
So my question still stands on what OP means by "won't compile".
mountainriver
Yes but it gets feedback from the IDE. Cursor is the best here
andy99
Re vibe coding, I agree with your comments, but where I've used it is when I needed to mock up a UI or a website. I have no front-end experience, so making an 80% (probably 20%) but live demo is still a valuable thing to show to others to get the point across, obviously not to deploy. It's a replacement for drawing a picture of what I think the UI should look like. I feel like this is an under-appreciated use. LLM coding is not remotely ready for real products but it's great for mock-ups that further internal discussions.
vunderba
Same. As somebody who doesn't really enjoy frontend work at all, they are surprisingly good at being able to spit out something that is relatively visually appealing - even if I'll end up rewriting the vast majority of react spaghetti code in Svelte.
NetOpWibby
In the settings for Claude, I tell it to use Svelte and TypeScript whenever possible because I got tired of telling it I don't use React.
leptons
I love front-end work, and I'm really good at it, but I now let the AI do CSS coding for me. It seems to make nice looking buttons and other design choices that are good enough for development. My designer has their own opinions, so they always will change the CSS when they get their hands on it, but at least I'm not wasting my time creating really ugly styles that always get replaced anymore. The rest of the coding is better if I do it, but sometimes the AI surprises me - though most often it gets it completely wrong, and then I'm wasting time letting it try and that just feels counterproductive. It's like a really stupid intern that almost never pays attention to what the goal is.
mattmanser
They're pretty good at following direction. For example you can say:
'Use React, typescript, materialUi, prefer functions over const, don't use unnecessary semicolons, 4 spaces for tabs, build me a UI that looks like this sketch'
And it'll do all that.
65
I think it would be faster/easier to use a website builder or various templating libraries to build a quick UI rather than having to babysit an LLM with prompts over and over again.
simonw
> However, for more complex code questions particularly around less popular libraries which have fewer code examples scraped from Stack Overflow and GitHub, I am more cautious of the LLM’s outputs.
That's changed for me in the past couple of months. I've been using the ChatGPT interface to o3 and o4-mini for a bunch of code questions against more recent libraries and finding that they're surprisingly good at using their search tool to look up new details. Best version of that so far:
"This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it."
This actually worked! https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...
The other trick I've been using a lot is pasting the documentation or even the entire codebase of a new library directly into a long-context model as part of my prompt. This works great for any library under about 50,000 tokens total - more than that and you usually have to manually select the most relevant pieces, though Gemini 2.5 Pro can crunch through hundreds of thousands of tokens pretty well without getting distracted.
Here's an example of that from yesterday: https://simonwillison.net/2025/May/5/llm-video-frames/#how-i...
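A rough sketch of the paste-the-whole-library idea, separate from any particular tool; the directory path is a placeholder and the character cap is a crude stand-in for a real token count:

from pathlib import Path

def build_context(repo_dir: str, max_chars: int = 200_000) -> str:
    # Concatenate a library's source files with path headers so the model
    # can see the actual API rather than relying on its training data.
    parts = []
    for path in sorted(Path(repo_dir).rglob("*.py")):
        parts.append(f"### {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)[:max_chars]

prompt = (
    build_context("some-new-library/")
    + "\n\nUsing only the library source above, show how to extract one frame "
      "per second from a video."
)
print(len(prompt), "characters of context")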
zoogeny
I think they might have made a change to Cursor recently as well. A few times I've caught it using an old API of popular libraries that have updates. Shout out to all the library developers that are logging deprecations and known incorrect usages, that has been a huge win with LLMs. In most cases I can paste the deprecation warning back into the agent and it will say "Oh, looks like that API changed in vX.Y.Z, we should be doing <other thing>, let me fix that ..."
So it is capable of integrating new API usage, it just isn't a part of the default "memory" of the LLM. Given how quickly JS libraries tend to change (even on the API side) that isn't ideal. And given that the typical JS server project has dozens of libs, including the most recent documentation for each is not really feasible. So for now, I am just looking out for runtime deprecation errors.
But I give the LLM some slack here, because even if I was programming myself using a library I've used in the past, I'm likely to make the same mistake.
satvikpendem
You can just use @Docs [0] to import the correct documentation for your libraries.
ziml77
I like that the author included the chat logs. I know there's a lot of times where people can't share them because they'd expose too much info, but I really think it's important when people make big claims about what they've gotten an LLM to do that they back it up.
minimaxir
That is a relatively new workflow for me since getting the logs out of the Claude UI is a more copy/paste manual process. I'm likely going to work on something to automate it a bit.
simonw
I use this:
llm -m claude-3.7-sonnet "prompt"
llm logs -c | pbcopy
Then paste into a Gist. Gets me things like this: https://gist.github.com/simonw/0a5337d1de7f77b36d488fdd7651b...
fudged71
Isn't your Observable notebook more applicable to what he's talking about (scraping the Claude UI)? https://x.com/simonw/status/1821649481001267651
behnamoh
> I never use ChatGPT.com or other normal-person frontends for accessing LLMs because they are harder to control. Instead, I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality which also makes it easy to port to code if necessary.
Yes, I also often use the "studio" of each LLM for better results, because in my experience OpenAI "nerfs" models in the ChatGPT UI (models keep forgetting things, probably due to a limited context length set by OpenAI to reduce costs; the model is generally less chatty, again probably to reduce their costs; etc.). But I've noticed Gemini 2.5 Pro is the same in the studio and the Gemini app.
> Any modern LLM interface that does not let you explicitly set a system prompt is most likely using their own system prompt which you can’t control: for example, when ChatGPT.com had an issue where...
ChatGPT does have system prompts but Claude doesn't (one of its many, many UI shortcomings which Anthropic never addressed).
That said, I've found system prompts less and less useful with newer models. I can simply preface my own prompt with the instructions and the model follows them very well.
> Specifying specific constraints for the generated text such as “keep it to no more than 30 words” or “never use the word ‘delve’” tends to be more effective in the system prompt than putting them in the user prompt as you would with ChatGPT.com.
I get that LLMs have a vague idea of what 30 words looks like, but they never do a good job at these tasks for me.
Jerry2
> I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality
Hey Max, do you use a custom wrapper to interface with the API or is there some already established client you like to use?
If anyone else has a suggestion please let me know too.
simonw
I'm going to plug my own LLM CLI project here: I use it on a daily basis now for coding tasks like this one:
llm -m o4-mini -f github:simonw/llm-hacker-news -s 'write a new plugin called llm_video_frames.py which takes video:path-to-video.mp4 and creates a temporary directory which it then populates with one frame per second of that video using ffmpeg - then it returns a list of [llm.Attachment(path="path-to-frame1.jpg"), ...] - it should also support passing video:video.mp4?fps=2 to increase to two frames per second, and if you pass ?timestamps=1 or &timestamps=1 then it should add a text timestamp to the bottom right corner of each image with the mm:ss timestamp of that frame (or hh:mm:ss if more than one hour in) and the filename of the video without the path as well.' -o reasoning_effort high
Any time I use it like that the prompt and response are logged to a local SQLite database.
More on that example here: https://simonwillison.net/2025/May/5/llm-video-frames/#how-i...
minimaxir
I was developing an open-source library for interfacing with LLMs agnostically (https://github.com/minimaxir/simpleaichat) and although it still works, I haven't had the time to maintain it unfortunately.
Nowadays for writing code to interface with LLMs, I don't use client SDKs unless required, instead just hitting HTTP endpoints with libraries such as requests and httpx. It's also easier to upgrade to async if needed.
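For example, a minimal sketch of that approach with httpx against OpenAI's chat completions endpoint (the model name is just an example; expects OPENAI_API_KEY in the environment):

import os
import httpx

resp = httpx.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You are a terse coding assistant."},
            {"role": "user", "content": "Write a one-line Python hello world."},
        ],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])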
asabla
Most services have a "studio mode" for the models they serve.
As an alternative you could always use OpenWebUI
danenania
I built an open source CLI coding agent for this purpose[1]. It combines Claude/Gemini/OpenAI models in a single agent, using the best/most cost effective model for different steps in the workflow and different context sizes. You might find it interesting.
It uses OpenRouter for the API layer to simplify use of APIs from multiple providers, though I'm also working on direct integration of model provider API keys—should release it this week.
Oras
JSON responses don't always work as expected unless you only have a few items to return. In Max's example, it's classification.
For anyone trying to return consistent JSON, check out structured outputs, where you define a JSON schema with required fields and it returns the same structure every time.
I have tested it with high success using GPT-4o-mini.
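A hedged sketch of what that looks like against OpenAI's API (the classification schema is made up; with strict mode every field is required and the response always matches the shape):

import os
import httpx

schema = {
    "name": "classification",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "confidence": {"type": "number"},
        },
        "required": ["label", "confidence"],
        "additionalProperties": False,
    },
}

resp = httpx.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Classify: 'this library is great'"}],
        "response_format": {"type": "json_schema", "json_schema": schema},
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])  # always matches the schema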
qoez
I've tried it out a ton, but the only thing I end up using it for these days is teaching me new things (which I largely implement myself; it can rarely one-shot it anyway). Or occasionally to make short throwaway scripts for things like file handling or ffmpeg.
Beijinger
"To that end, I never use ChatGPT.com or other normal-person frontends for accessing LLMs because they are harder to control. Instead, I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality which also makes it easy to port to code if necessary."
How do you do this? Do you have to be on a paid plan for this?
diggan
I think they're talking about the Sandbox/Playground/Editor thingy that almost all companies who expose APIs also offer to quickly try out API features. For OpenAI it's https://platform.openai.com/playground/prompts?models=gpt-4...., Anthropic has https://console.anthropic.com/workbench and so on.
minimaxir
If you log into the API backend, there is usually a link to the UI. For OpenAI/ChatGPT, it's https://platform.openai.com/playground
This is independent of ChatGPT+. You do need to have a credit card attached but you only pay for your usage.
danbrooks
As a data scientist, this mirrors my experience. Prompt engineering is surprisingly important for getting the expected output, and LLM POCs have quick turnaround times.
rfonseca
This was an interesting quote from the blog post: "There is one silly technique I discovered to allow a LLM to improve my writing without having it do my writing: feed it the text of my mostly-complete blog post, and ask the LLM to pretend to be a cynical Hacker News commenter and write five distinct comments based on the blog post."
vunderba
I do a good deal of my blog posts while walking my husky and just dictating using speech-to-text on my phone. The problem is that it's an unformed blob of clay and really needs to be shaped on the wheel.
I then feed this into an LLM with the following prompt:
You are a professional editor. You will be provided paragraphs of text that may
contain spelling errors, grammatical issues, continuity errors, structural
problems, word repetition, etc. You will correct any of these issues while
still preserving the original writing style. Do not sanitize the user. If they
use profanities in their text, they are used for emphasis and you should not
omit them.
Do NOT try to introduce your own style to their text. Preserve their writing
style to the absolute best of your ability. You are absolutely forbidden from
adding new sentences.
It's basically Grammarly on steroids and works very well.
meowzero
I do something similar. But I make sure the LLM doesn't know I wrote the post. That way the LLM is not sycophantic.
kixiQu
What roleplayed feedback providers have people had best and worst luck with? I can imagine asking for the personality could help the LLM come up with different kinds of criticisms...
Snuggly73
Emmm... why has Claude 'improved' the code by setting SQLite to be threadsafe and then adding locks on every db operation? (You can argue that maybe the callbacks are invoked from multiple threads, but they are not thread safe themselves).
dboreham
Interns don't understand concurrency either.
daxfohl
But if you teach them the right way to do it today and have them fix it, they won't go and do it the wrong way again tomorrow and the next day and every day for the rest of the summer.
chowells
With concurrency, they still might get it wrong the rest of the summer. It's a hard topic. But at least they might learn they need to ask for feedback when they're doing something that looks similar to stuff that's previously caused problems.
tptacek
There's a thru-line to commentary from experienced programmers on working with LLMs, and it's confusing to me:
> Although pandas is the standard for manipulating tabular data in Python and has been around since 2008, I’ve been using the relatively new polars library exclusively, and I’ve noticed that LLMs tend to hallucinate polars functions as if they were pandas functions which requires documentation deep dives to confirm which became annoying.
The post does later touch on coding agents (Max doesn't use them because "they're distracting", which, as a person who can't even stand autocomplete, is a position I'm sympathetic to), but still: coding agents solve the core problem he just described. "Raw" LLMs set loose on coding tasks throwing code onto a blank page hallucinate stuff. But agenty LLM configurations aren't just the LLM; they're also code that structures the LLM interactions. When the LLM behind a coding agent hallucinates a function, the program doesn't compile, the agent notices it, and the LLM iterates. You don't even notice it's happening unless you're watching very carefully.