Show HN: My LLM CLI tool can run tools now, from Python code or plugins

kristopolous

It's worth noting the streaming markdown renderer I wrote just for this tool: https://github.com/day50-dev/Streamdown

More background: https://github.com/simonw/llm/issues/12

(Also check out https://github.com/day50-dev/llmehelp which features a tmux tool I built on top of Simon's llm. I use it every day. Really. It's become indispensable)

kristopolous

Also, I forgot to mention one other tool built on llm.

This one is a ZSH plugin that uses zle to translate your English to shell commands with a keystroke.

https://github.com/day50-dev/Zummoner

It's been life changing for me. Here's one I wrote today:

    $ git find out if abcdefg is a descendent of hijklmnop 
In fact I used it in one of these comments

    $ for i in $(seq 1 6); do 
      printf "%${i}sh${i}\n\n-----\n" | tr " " "#"; 
    done | pv -bqL 30 
Was originally

    $ for i in $(seq 1 6); do 
      printf "(# $i times)\n\n-----\n"
    done | pv (30 bps and quietly)
I did my trusty ctrl-x x and the buffer got sent off through openrouter and got swapped out with the proper syntax in under a second.

kazinator

The brace expansion syntax in Bash and Zsh expands integer ranges: {1..6}; no calling out to an external command.

It's also intelligent about inferring leading zeros without needing to be told with options, e.g. {001..995}.

vicek22

This is fantastic! Thank you for that.

I use fish, but the language change is straightforward https://github.com/viktomas/dotfiles/blob/master/fish/.confi...

I'll use this daily

CGamesPlay

I built a similar one to this one: https://github.com/CGamesPlay/llm-cmd-comp

Looks from the demo like mine's a little less automatic and more iterative than yours.

kristopolous

Interesting! I like it!

The conversational context is nice. The ongoing command building is convenient and the # syntax carryover makes a lot of sense!

My next step is recursion and composability. I want to be able to do things contextualized. Stuff like this:

   $ echo PUBLIC_KEY=(( get the users public key pertaining to the private key for this repo )) >> .env
or some other contextually complex thing that is actually fairly simple, just tedious to code. Then I want that <as the code> so people collectively program and revise stuff <at that level as the language>.

Then you can do this through composability like so:

    with ((find the variable store for this repo by looking in the .gitignore)) as m:
      ((write in the format of m))SSH_PUBLICKEY=(( get the users public key pertaining to the private key for this repo ))
or even recursively:

    (( 
      (( 
        ((rsync, rclone, or similar)) with compression 
      ))  
        $HOME exclude ((find directories with secrets))         
        ((read the backup.md and find the server)) 
        ((make sure it goes to the right path))
    ));
it's not a fully formed syntax yet but then people will be able to do something like:

    $ llm-compile --format terraform --context my_infra script.llm > some_code.tf
and compile publicly shared snippets as specific to their context and you get abstract infra management at a fractional complexity.

It's basically GCC's RTL but for LLMs.

The point of this approach is that your building blocks remain fairly atomic, simple, dumb things that even a 1b model can reliably handle - kinda like the guarantee of the RTL.

Then if you want to move from terraform to opentofu or whatever, who cares ... your stuff is in the llm metalanguage ... it's just a different compile target.

It's kinda like PHP. You just go along like normal and occasionally break form for the special metalanguage whenever you hit a point of contextual variance.

rglynn

Ah, this is great. In combo with something like superwhisper you can use voice for longer queries.

rcarmo

Okay, this is very cool.

simonw

Wow, that library is looking really great!

I think I want a plugin hook that lets plugins take over the display of content by the tool.

Just filed an issue: https://github.com/simonw/llm/issues/1112

Would love to get your feedback on it, I included a few design options but none of them feel 100% right to me yet.

kristopolous

The real solution is semantic routing. You want to be able to define routing rules based on something like mdast (https://github.com/syntax-tree/mdast). I've built a few hacked versions. This would not only allow for things like terminal rendering but is also a great complement to tool calling. For a future where Cerebras-like speeds become more common, being able to siphon and multiplex inputs through dynamic, configurable stream routing will unlock quite a few more use cases.

We have cost, latency, context window and model routing but I haven't seen anything semantic yet. Someone's going to do it, might as well be me.
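
As a toy sketch of the idea (using markdown-it-py's token stream to stand in for mdast; the routing table and handlers are hypothetical, just to show the shape):

    from markdown_it import MarkdownIt

    # Hypothetical routing table: semantic node type -> destination handler
    ROUTES = {
        "fence": lambda tok: print(f"[code renderer] lang={tok.info}"),
        "heading_open": lambda tok: print(f"[layout engine] level={tok.tag}"),
    }

    def route(chunk: str) -> None:
        # Parse an incoming chunk and dispatch each token by its semantic type
        for tok in MarkdownIt().parse(chunk):
            handler = ROUTES.get(tok.type)
            if handler:
                handler(tok)

    route("# Title\n\nSome prose.\n\n```python\nprint('hi')\n```\n")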

rpeden

Neat! I've written streaming Markdown renderers in a couple of languages for quickly displaying streaming LLM output. Nice to see I'm not the only one! :)

kristopolous

It's a wildly nontrivial problem if you're trying to only be forward moving and want to minimize your buffer.

That's why everybody else either rerenders (such as rich) or relies on the whole buffer (such as glow).

I didn't write Streamdown for fun - there were genuinely no suitable tools that did what I needed.

Also various models have various ideas of what markdown should be and coding against CommonMark doesn't get you there.

Then there are other things. You have to check individual character width and the language family type to do proper word wrap. I've seen a number of interesting tmux and alacritty bugs in doing multi-language support.
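
A minimal sketch of just the width part, assuming Python's unicodedata (real terminals, emoji, and tmux panes add plenty of edge cases on top of this):

    import unicodedata

    def cell_width(text: str) -> int:
        # Fullwidth and Wide characters (e.g. CJK) occupy two terminal cells
        return sum(2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
                   for ch in text)

    print(cell_width("hello"))       # 5 cells
    print(cell_width("こんにちは"))   # 10 cells, 5 characters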

The only real break I do is I render h6 (######) as muted grey.

Compare:

    for i in $(seq 1 6); do 
      printf "%${i}sh${i}\n\n-----\n" | tr " " "#"; 
    done | pv -bqL 30 | sd -w 30
to swapping out `sd` with `glow`. You'll see glow's lag - waiting for that EOF is annoying.

Also try sd -b 0.4 or even -b 0.7,0.8,0.8 for a nice blue. It's a bit easier to configure than the usual catalog of themes that requires a compilation after modification like with pygments.

icarito

That's right, this is a nontrivial problem that I struggled with too for gtk-llm-chat! I resolved it using the streaming markdown-it-py library.

hanatanaka1984

Interesting, I will be sure to check into this. I have been using llm and bat with syntax highlighting.

kristopolous

Do you just do

| bat --language=markdown --force-colorization ?

hanatanaka1984

A simple bash script provides quick command line access to the tool. Output is paged syntax highlighted markdown.

  echo "$@" | llm "Provide a brief response to the question, if the question is related to command provide the command and short description" | bat --plain -l md
Launch as:

  llmquick "why is the sky blue?"

hanatanaka1984

| bat -p -l md

simple and works well.

nbbaier

Ohh I've wanted this so much! Thank you!

tantalor

This greatly opens up the risk of footguns.

The doc [1] warns about prompt injection, but I think a more likely scenario is self-inflicted harm. For instance, you give a tool access to your brokerage account to automate trading. Even without prompt injection, there's nothing preventing the bot from making stupid trades.

[1] https://llm.datasette.io/en/stable/tools.html

simonw

> This greatly opens up the risk of footguns.

Yeah, it really does.

There are so many ways things can go wrong once you start plugging tools into an LLM, especially if those tool calls are authenticated and can take actions on your behalf.

The MCP world is speed-running this right now, see the GitHub MCP story from yesterday: https://news.ycombinator.com/item?id=44097390

I stuck a big warning in the documentation and I've been careful not to release any initial tool plugins that can cause any damage - hence my QuickJS sandbox one and SQLite plugin being read-only - but it's a dangerous space to be exploring.

(Super fun and fascinating though.)

kbelder

If you hook an llm up to your brokerage account, someone is being stupid, but it ain't the bot.

isaacremuant

You think "senior leadership/boards of directors" aren't thinking of going all in with AI to "save money" and "grow faster and cheaper"?

This is absolutely going to happen at a large scale and then we'll have "cautionary tales" and a lot of "compliance" rules.

zaik

Let it happen. Just don't bail them out using tax money again.

theptip

Evolution in action. This is what the free market is good for.

mike_hearn

Yes, sandboxing will be crucial. On macOS it's not that hard, but there aren't good, easy-to-use tools available for it right now. Claude Code has started using Seatbelt a bit to optimize the UX.

arendtio

I think the whole footgun discussion misses the point. Yes, you can shoot yourself in the foot (and probably will), but not evaluating the possibilities is also a risk. Regular people tend to underestimate the footgun potential (probably driven by fear of missing out) and technical people tend to underestimate the risk of not learning the new possibilities.

Even a year ago I let LLMs execute local commands on my laptop. I think it is somewhat risky, but nothing harmful happened. You also have to consider what you are prompting. So when I prompt 'find out where I am and what weather it is going to be', it is possible that it will execute rm -rf / but very unlikely.

However, speaking of letting an LLM trade stocks without understanding how the LLM will come to a decision... too risky for my taste ;-)

shepherdjerred

Any tool can be misused

yard2010

This is not misuse. This is equivalent to a drill that in some cases drills into the hand holding it.

theptip

A band saw is more dangerous than a spoon.

tantalor

You're missing the point. Most tools are deployed by humans. If they do something bad, we can blame the human for using the tool badly. And we can predict when a bad choice by the human operator will lead to a bad outcome.

Letting the LLM run the tool unsupervised is another thing entirely. We do not understand the choices the machines are making. They are unpredictable and you can't root-cause their decisions.

LLM tool use is a new thing we haven't had before, which means tool misuse is a whole new class of FUBAR waiting to happen.

johnisgood

But why can we not hold humans responsible in the case of LLM? You do have to go out of your way to do all of these things with an LLM. And it is the human that does it. It is the humans that give it the permission to act on their behalf. We can definitely hold humans responsible. The question is: are we going to?

abc-1

[flagged]

dang

Could you please stop posting shallow dismissals and putdowns of other people and their work? It's against the site guidelines, and your account has unfortunately been doing a lot of it:

https://news.ycombinator.com/item?id=44073456

https://news.ycombinator.com/item?id=44073413

https://news.ycombinator.com/item?id=44070923

https://news.ycombinator.com/item?id=44070514

https://news.ycombinator.com/item?id=44010921

https://news.ycombinator.com/item?id=43970274

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.

abc-1

[dead]

tantalor

People are already doing that!

ekianjo

Natural selection at play

icarito

For all of you using `llm` - perhaps take a look at [Gtk-llm-chat](https://github.com/icarito/gtk-llm-chat).

I put a lot of effort into it - it integrates with `llm` command line tool and with your desktop, via a tray icon and nice chat window.

I recently released 3.0.0 with packages for all three major desktop operating systems.

kristopolous

Interesting. What do you use it for beyond the normal chatting?

icarito

I sometimes use llm from the command line, for instance with a fragment, or piping a resource from the web with curl, and then pick up the cid with `llm gtk-chat --cid MYCID`.

kristopolous

I'm actually planning on abandoning Simon's infra soon. I want a multi-stream, routing based solution that is more aware of the modern API advancements.

The Unix shell is good at being the glue between programs. We've increased the dimensionality with LLMs.

Some kind of ports based system like named pipes with consumers and producers.

Maybe something like gRPC or NATS (https://github.com/nats-io). MQTT might also work. Network transparent would be great.

howmayiannoyyou

This is pretty great.

nlh

Ok this is great and perfect timing -- I've been playing around with Warp (the terminal) and while I love the idea of their terminal-based "agent" (eg tool loop), I don't love the whole Cursor-esque model of "trust us we'll make a good prompt and LLM calls for you" (and charge you for it), so I was hoping for a simple CLI-based terminal agent just to solve for my lack of shell-fu.

I am keenly aware this is a major footgun here, but it seems that a terminal tool + llm would be a perfect lightweight solution.

Is there a way to have llm get permission for each tool call the way other "agents" do this? ("llm would like to call `rm -rf ./*` press Y to confirm...")

Would be a decent way to prevent letting an llm run wild on my terminal and still provide some measure of protection.

andresnds

Isn’t that the default way the codex CLI runs? I.e. without passing --full-auto

losvedir

Very cool!

I've wondered how exactly, say, Claude Code knows about and uses tools. Obviously, an LLM can be "told" about tools and how to use them, and the harness can kind of manage that. But I assumed Claude Code has a very specific expectation around the tool call "API" that the harness uses, probably reinforced very heavily by some post-training / fine tuning.

Do you think your 3rd party tool-calling framework using Claude is at any disadvantage to Anthropic's own framework because of this?

Separately, on that other HN post about the GitHub MCP "attack", I made the point that LLMs can be tricked into using up to the full potential of the credential. GitHub has fine-grained auth credentials, and my own company does as well. I would love for someone to take a stab at a credential protocol that the harness can use to generate fine-grained credentials to hand to the LLM. I'm envisioning something where the application (e.g. your `llm` CLI tool) is given a more powerful credential, and the underlying LLM is taught how to "ask for permission" for certain actions/resources, which the user can grant. When that happens the framework gets the scoped credential from the service, which the LLM can then use in tool calls.

simonw

That credentials trick is possible right now using LLM's tool support. You'd have to write a pretty elaborate tool setup which exposes the "ask for more credentials" tool and then prompts the user when that's called. The tool should keep the credentials and never pass the actual tokens back to the LLM, but it can pass e.g. a symbol "creds1" and tell the LLM to request to make calls with "creds1" in future requests.
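
A rough sketch of that shape (the function names and in-memory store here are hypothetical, just to show where the token lives):

    # The model only ever sees opaque symbols like "creds1"; the real tokens
    # never leave this process.
    CREDENTIALS = {}

    def request_credentials(scope: str) -> str:
        "Ask the user to grant a credential for a scope; return a symbol, not the token."
        token = input(f"Grant a token scoped to {scope!r} (blank to deny): ")
        if not token:
            return "denied"
        symbol = f"creds{len(CREDENTIALS) + 1}"
        CREDENTIALS[symbol] = token
        return symbol

    def call_api(symbol: str, url: str) -> str:
        "Make an authenticated request using a previously granted symbol."
        token = CREDENTIALS.get(symbol)
        if token is None:
            return f"error: unknown credential symbol {symbol!r}"
        # ...use `token` for the real request; only the response text goes back to the model
        return f"(fetched {url} using {symbol})"

Both functions would be registered as tools (e.g. via --functions or the Python tools=[...] argument), so the conversation only ever contains the symbols.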

prettyblocks

I've been trying to maintain a (mostly vibe-coded) zsh/omz plugin for tab completions on your LLM cli and the rate at which you release new features makes it tough to keep up!

Fortunately this gets me 90% of the way there:

llm -f README.md -f llm.plugin.zsh -f completions/_llm -f https://simonwillison.net/2025/May/27/llm-tools/ "implement tab completions for the new tool plugins feature"

My repo is here:

https://github.com/eliyastein/llm-zsh-plugin

And again, it's a bit of a mess, because I'm trying to get as many options and their flags as I can. I wouldn't mind if anyone has any feedback for me.

sillysaurusx

Kind of crazy this isn’t sci-fi, it’s just how coding is done now. Future generations are going to wonder how we ever got anything done, the same way we wonder how assembly programmers managed to.

kristopolous

it makes simple things easy but hard things impossible. We'll see.

therouwboat

I'm wondering why you need LLM to show you how to use variables in shell scripts when you apparently make shell scripts everyday.

It's like if you use english everyday, but don't bother to learn the language because you have google translate (and now AI).

xk_id

The transition from assembly to C was to a different layer of abstraction within the same context of deterministic computation. The transition from programming to LLM prompting is to a qualitatively different context, because the process is no longer deterministic, nor debuggable. So your analogy fails to apply in a meaningful way to this situation.

sillysaurusx

Ultimately it’s about being able to create features within a certain period of time, not just to write code. And in that context the shift from assembly to C and the shift from deterministic programming to LLM prompting seem to have similar magnitudes of impact.

pollinations

Why isn't it debuggable?

oliviergg

Thank you for this release. I believe your library is a key component to unlocking the potential of LLMs without the limitations/restrictions of existing clients.

Since you released version 0.26 alpha, I've been trying to create a plugin to interact with some MCP servers, but it's a bit too challenging for me. So far, I've managed to connect and dynamically retrieve and use tools, but I'm not yet able to pass parameters.

simonw

Yeah I had a bit of an experiment with MCP this morning, to see if I could get a quick plugin demo out for it. It's a bit tricky! The official mcp Python library really wants you to run asyncio and connect to the server and introspect the available tools.

mihau

Hi Simon!

I'm a heavy user of the llm tool, so as soon as I saw your post, I started tinkering with MCP.

I’ve just published an alpha version that works with stdio-based MCP servers (tested with @modelcontextprotocol/server-filesystem) - https://github.com/Virtuslab/llm-tools-mcp. Very early stage, so please make sure to use with --ta option (Manually approve every tool execution).

The code is still messy and there are a couple of TODOs in the README.md, but I plan to work on it full-time until the end of the week.

Some questions:

Where do you think mcp.json should be stored? Also, it might be a bit inconvenient to specify tools one by one with -T. Do you think adding a --all-tools flag or supporting glob patterns like -T name-prefix* in llm would be a good idea?

simonw

OK this looks like a very promising start!

You're using function-based tools at the moment, hence why you have to register each one individually.

The alternative to doing that is to use what I call a "toolbox", described here: https://llm.datasette.io/en/stable/python-api.html#python-ap...

Those get you two things you need:

1. A single class can have multiple tool methods in it, you just have to specify it once

2. Toolboxes can take configuration

With a Toolbox, your plugin could work like this:

  llm -T 'MCP("path/to/mcp.json")' ...
You might even be able to design it such that you don't need a mcp.json at all, and everything gets passed to that constructor.
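
To illustrate the shape of it, something like this sketch (the method names are placeholders, not the real plugin):

    import llm

    class MCP(llm.Toolbox):
        # Configuration arrives through the constructor, which is what lets
        # `llm -T 'MCP("path/to/mcp.json")'` work from the CLI.
        def __init__(self, config_path: str):
            self.config_path = config_path

        # Each method on the class is exposed to the model as a tool.
        def list_tools(self) -> str:
            "List the tools advertised by the configured MCP servers."
            return f"(read {self.config_path} and introspect the servers here)"

        def call_tool(self, name: str, arguments: str) -> str:
            "Invoke a named MCP tool with JSON-encoded arguments."
            return f"(forward {name}({arguments}) to the matching server here)"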

There's one catch: currently you would have to dynamically create the class with methods for each tool, which is possible in Python but a bit messy. I have an open issue to make that better here: https://github.com/simonw/llm/issues/1111

consumer451

Hello Simon, sorry for asking about this tangent here, but have you seen this paper? Is it as important as it appears to be? Should this metric be on all system cards?

> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.

https://arxiv.org/abs/2502.05167

simonw

I had not seen that one! That's really interesting. I'd love to see them run that against Gemini 2.5 Pro and Gemini 2.5 Flash, to my understanding they're way ahead of other models on the needle in a haystack tests these days.

consumer451

Yes, I wish their methodology was run against new models, in an on-going fashion.

rahimnathwani

Thanks for making this. I used it (0.26a0) last week to create a demo for a customer-facing chatbot using proprietary data.

The key elements I had to write:

- The system prompt

- Tools to pull external data

- Tools to do some calculations

Your library made the core functionality very easy.

Most of the effort for the demo was to get the plumbing working (a nice-looking web UI for the chatbot that would persist the conversation, update nicely if the user refreshed their browser due to a connection issue, and allow the user to start a new chat session).

I didn't know about `after_call=print`. So I'm glad I read this blog post!
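
For anyone curious, the core loop was not much more than this sketch (the model, tool bodies, and system prompt are stand-ins for the real ones, and I'm assuming chain() accepts system= the same way prompt() does):

    import llm

    def fetch_orders(customer_id: str) -> str:
        "Stand-in for a tool that pulls external data."
        return '[{"id": 1, "total": 42.0}]'

    def compute_discount(total: float) -> str:
        "Stand-in for a calculation tool."
        return f"{total * 0.1:.2f}"

    model = llm.get_model("gpt-4.1-mini")
    chain = model.chain(
        "What discount does customer 7 get on their latest order?",
        system="You are a support assistant. Use the tools; do not guess numbers.",
        tools=[fetch_orders, compute_discount],
        after_call=print,  # log every tool call as it happens
    )
    print(chain.text())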

hanatanaka1984

Great work Simon! I use your tool daily. Pipes and easy model switching for local (ollama) and remote makes this very easy to work with.

ttul

GPT-4.1 is a capable model, especially for structured outputs and tool calling. I’ve been using LLMs for my day to day grunt work for two years now and this is my goto as a great combination of cheap and capable.

simonw

I'm honestly really impressed with GPT-4.1 mini. It is my default for messing around with their API because it is unbelievably inexpensive and genuinely capable at most of the things I throw at it.

I'll switch to o4-mini when I'm writing code, but otherwise 4.1-mini usually does a great job.

Fun example from earlier today:

  llm -f https://raw.githubusercontent.com/BenjaminAster/CSS-Minecraft/refs/heads/main/main.css \
    -s 'explain all the tricks used by this CSS'
That's piping the CSS from that incredible CSS Minecraft demo - https://news.ycombinator.com/item?id=44100148 - into GPT-4.1 mini and asking it for an explanation.

The code is clearly written but entirely uncommented: https://github.com/BenjaminAster/CSS-Minecraft/blob/main/mai...

GPT-4.1 mini's explanation is genuinely excellent: https://gist.github.com/simonw/cafd612b3982e3ad463788dd50287... - it correctly identifies "This CSS uses modern CSS features at an expert level to create a 3D interactive voxel-style UI while minimizing or eliminating JavaScript" and explains a bunch of tricks I hadn't figured out.

And it used 3,813 input tokens and 1,291 output tokens - https://www.llm-prices.com/#it=3813&ot=1291&ic=0.4&oc=1.6 - that's 0.3591 cents (around a third of a cent).

yangikan

Thanks for this. I am planning to cancel my ChatGPT plus subscription and use something like the llm tool with the API key. For regular interactions, how do you handle context? For example, the UI allows me to ask a question and then a follow-up, and the context is kind of automatically handled.

yangikan

I should have RTFM https://llm.datasette.io/en/stable/usage.html#starting-an-in...

Are you aware of any user interfaces that expose some limited ChatGPT functionality and internally use llm? This is for my non-techie wife.

puttycat

> while minimizing or eliminating JavaScript

How come it doesn't know for sure?

simonw

Because I only showed it the CSS! It doesn't even get the HTML, it's guessed all of that exclusively from what's in the (uncommented) CSS code.

Though it's worth noting that CSS Minecraft was first released three years ago, so there's a chance it has hints about it in the training data already. This is not a meticulous experiment.

(I've had a search around though and the most detailed explanation I could find of how that code works is the one I posted on my blog yesterday - my hunch is that it figured it out from the code alone.)

swyx

nice one simon - i'm guessing this is mildly related to your observation that everyone is converging on the same set of tools? https://x.com/simonw/status/1927378768873550310

simonw

Actually a total coincidence! I have been trying to ship this for weeks.