
Enough AI copilots, we need AI HUDs

piker

Absolutely agree, and spellchecker is a great analogy.

I've recently been snoozing Copilot for hours at a time in VS Code because some recent update seems to have added a ton of latency to my keystrokes. Instead, it turns out that `rust_analyzer` is actually all that I need. Go-to definition and hover-over give me exactly what the article describes: extra senses.

ChatGPT and Claude are great as assistants for strategizing about problems, but even the typeahead value seems negligible to me in a large enough project.

This is the thesis for Tritium[1], going back to its founding. Lawyers' time and attention (like fighter pilots') are critical, but they're still required in the cockpit. And there's some argument they always will be.

[1] https://news.ycombinator.com/item?id=44256765 ("an all-in-one drafting cockpit")

furyofantares

I'm very curious whether a toggle would be useful that displays a heatmap of a source file showing how surprising each token is to the model. Red tokens would be more likely to be errors, bad names, or wrong comments.

nextaccountic

Even if something is surprising just because it's a novel algorithm, it warrants better documentation - and commenting the code to explain how it works will make the code itself less surprising!

In short, it's probably possible (and maybe good engineering practice) to structure the source such that no specific part is really surprising.

It reminds me of how LLMs finally made people care about having good documentation - if not for other people, then for the AIs to read and understand the system.

Kichererbsen

I often find myself leaving review comments on pull requests where I was surprised. I'll state as much: "This surprised me - I was expecting XYZ at this point," or "I wasn't expecting X to be in charge of Y."

dclowd9901

That's a really cool idea. The inverse, where suggestions from the AI are similarly heat-mapped for confidence, would also be extremely useful.

ijk

I want that in an editor. It's also a good way to check whether your writing is too predictable or clichéd.

The perplexity calculation isn't difficult; it just needs to be incorporated into the editor interface.
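A minimal sketch of that per-token surprisal calculation, assuming a local causal LM loaded through Hugging Face transformers (the model choice, function name, and bits-based scale are illustrative, not any specific editor plugin's API):

```python
# Hedged sketch: score each token of a source file by how surprising it is
# to a small causal LM, as a basis for an editor heatmap.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; a code-tuned model would suit real source files better
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def token_surprisal(source: str):
    """Return (token, surprisal in bits) pairs; higher means 'redder'."""
    ids = tok(source, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each actual token given the preceding tokens.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    picked = logprobs[torch.arange(targets.numel()), targets]
    bits = (-picked / math.log(2)).tolist()
    return list(zip(tok.convert_ids_to_tokens(targets.tolist()), bits))

# An editor integration would map these scores to background colors.
for token, b in token_surprisal("def add(a, b):\n    return a - b\n"):
    print(f"{token!r}: {b:.2f} bits")
```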

digdugdirk

Interesting! I've often felt that we aren't fully utilizing the "low hanging fruit" from the early days of the LLM craze. This seems like one of those ideas.

jama211

That’s actually a fantastic idea.

WithinReason

Previously undefined variable and function names would be red as well.

utf_8x

You know that feature in JetBrains (and possibly other) IDEs that highlights non-errors, like code that could be optimized for speed or readability (inverting ifs, using LINQ instead of a foreach, and so on)? As far as I can tell, these are just heuristics, and it feels like the perfect place for an “AI HUD.”

I don’t use Copilot or other coding AIs directly in the IDE because, most of the time, they just get in the way. I mainly use ChatGPT as a more powerful search engine, and this feels like exactly the kind of IDE integration that would fit well with my workflow.

cadamsdotcom

Love the idea & am spitballing ways to generalize it to coding...

Thought experiment: as you write code, an LLM generates tests for it & the IDE runs those tests as you type, showing which ones are passing & failing, updating in real time. Imagine 10-100 tests that take <1ms to run, being rerun with every keystroke, and the result being shown in a non-intrusive way.

The tests could appear in a separated panel next to your code, and pass/fail status in the gutter of that panel. As simple as red and green dots for tests that passed or failed in the last run.

The presence or absence and content of certain tests, plus their pass/fail state, tell you what the code you’re writing does from an outside perspective. Not seeing the LLM write a test you think you’ll need? Either your test-generator prompt is wrong, or the code you’re writing doesn’t do what you think it does!

Making it realtime helps you shape the code.

Or if you want to do traditional TDD, the tooling could be reversed so you write the tests and the LLM makes them pass as soon as you stop typing by writing the code.
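The LLM-generated-tests part aside, the realtime loop itself is easy to sketch today. A hedged example, assuming pytest plus the watchdog package, with the watched paths and debounce window as placeholders (true keystroke-level reruns would need in-process, sub-millisecond tests rather than a subprocess):

```python
# Rough sketch: rerun a test suite whenever a source file changes and report
# a single pass/fail signal - the kind of thing a HUD would render in a gutter.
import subprocess
import time
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class RerunTests(FileSystemEventHandler):
    def __init__(self, test_dir: str):
        self.test_dir = test_dir
        self.last_run = 0.0

    def on_modified(self, event):
        if event.is_directory or not str(event.src_path).endswith(".py"):
            return
        if time.time() - self.last_run < 0.5:  # crude debounce
            return
        self.last_run = time.time()
        result = subprocess.run(["pytest", self.test_dir, "-q"], capture_output=True)
        print("PASS" if result.returncode == 0 else "FAIL")

observer = Observer()
observer.schedule(RerunTests("tests/"), path="src/", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```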

callc

Humans writing the tests first and an LLM writing the code is much better than the reverse. That's because tests are simply the “truth” and “intention” of the code, as a contract.

When you give up the work of deciding what the expected inputs and outputs of the code/program are, you are no longer in the driver's seat.

JimDabell

> When you give up the work of deciding what the expected inputs and outputs of the code/program are, you are no longer in the driver's seat.

You don’t need to write tests for that, you need to write acceptance criteria.

ThunderSizzle

As in, a developer would write something in e.g. gherkin, and AI would automatically create the matching unit tests and the production code?

That would be interesting. Of course, gherkin tends to just be transpiled into generated code that is customized for the particular test, so I'm not sure how AI can really abstract it away too much.

motorest

> You don’t need to write tests for that, you need to write acceptance criteria.

Sir, those are called tests.

kamaal

>>Humans writing the test first and LLM writing the code is much better than the reverse.

Isn't that logic programming/Prolog?

You basically write the sequence of conditions (i.e., tests in our lingo) that have to be true, and the compiler (now the AI) generates code for you.

Perhaps logic programming deserves a fresh look in the modern era to make this more seamless.
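As a hedged sketch of what those "conditions" might look like in practice, the human could own nothing but a plain test file and ask the AI to produce an implementation that satisfies it (the `dedupe_emails` function and its behaviour are purely illustrative; the inline implementation stands in for whatever the AI would generate):

```python
# Spec-as-tests sketch: the human writes the conditions that must hold;
# producing dedupe_emails() is the AI's job. The version below is just a
# stand-in so the example runs under pytest.

def dedupe_emails(emails: list[str]) -> list[str]:
    """Stand-in for AI-generated code: keep first occurrences, compare case-insensitively."""
    seen, out = set(), []
    for e in emails:
        if e.lower() not in seen:
            seen.add(e.lower())
            out.append(e)
    return out

# Human-owned "truth": these must keep passing, whatever the AI writes.
def test_preserves_first_occurrence_order():
    assert dedupe_emails(["a@x.com", "b@x.com", "a@x.com"]) == ["a@x.com", "b@x.com"]

def test_case_insensitive():
    assert dedupe_emails(["A@X.com", "a@x.com"]) == ["A@X.com"]

def test_empty_input():
    assert dedupe_emails([]) == []
```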

William_BB

There's no way this would work for any serious C++ codebase. Compile times alone make this impossible.

I'm also not sure how an LLM could guess what the tests should be without having written all of the code - imagine, e.g., writing code for a new data structure.

motorest

> There's no way this would work for any serious C++ codebase. Compile times alone make this impossible

There's nothing in C++ that prevents this. If build times are your bogeyman, you'd be pleased to know that all mainstream build systems support incremental builds.

William_BB

The original example was (paraphrasing) "rerunning 10-100 tests that take 1ms after each keystroke".

Even with incremental builds, that surely does not sound plausible? I only mentioned C++ because that's my main working language, but this wouldn't sound reasonable for Rust either, no?

motorest

> Thought experiment: as you write code, an LLM generates tests for it & the IDE runs those tests as you type, showing which ones are passing & failing, updating in real time. Imagine 10-100 tests that take <1ms to run, being rerun with every keystroke, and the result being shown in a non-intrusive way.

I think this is a bad approach. Tests enforce invariants, and they are exactly the type of code we don't want LLMs to touch willy-nilly.

You want your tests to only change if you explicitly want them to, and even then only the tests should change.

Once you adopt that constraint, you'll quickly realize every single detail of your thought experiment is already a mundane workflow in any developer's day-to-day activities.

Consider the fact that watch mode is a staple of any JavaScript testing framework, and watch modes even found their way into .NET a couple of years ago.

So, your thought experiment is something professional software developers have been doing for what? A decade now?

cadamsdotcom

I think tests should be rewritten as much as needed. But to counter the invariant part, maybe let the user zoom back and forth through past revisions and pull in whatever they want to the current version, in case something important is deleted? And then allow “pinning” of some stuff so it can’t be changed? Would that address your concerns?

cjonas

Then do you need tests to validate that your tests are correct? Otherwise the LLM might just generate passing code even if the test is bad, or write code that games the system because it's easier to hardcode an output value than to do the actual work.

There probably is a setup where this works well, but the LLM and humans need to be able to move across the respective boundaries fluidly...

Writing clear requirements and letting the AI take care of the bulk of both sides seems more streamlined and productive.

cellis

The harder part is “test invalidation”. For instance, if a feature no longer makes sense, the human/test validator must painstakingly go through and delete obsolete specs. An idea I’d like to try is to “separate” the concerns: only QA agents can delete specs, while engineer agents must conform to the suite and make a strong case to the QA agent for deletion.

andsoitis

> Imagine 10-100 tests that take <1ms to run, being rerun with every keystroke, and the result being shown in a non-intrusive way.

It doesn’t seem like high ROI to run the full suite of tests on each keystroke. Most keystrokes yield an incomplete program, so you want to be smarter about when you run the tests to get a reasonably good trade-off.

scottgg

WallabyJS does something along these lines, although I don’t think it contextually understands which tests to highlight.

https://wallabyjs.com/

hnthrowaway121

Yes, the reverse makes much more sense to me: AI helps spec out the software, and then the code has an accepted definition of correctness. People focus on this way less than they should, I think.

kn81198

About a decade back, Bret Victor [1] talked about how his guiding principle is to reduce the delay in feedback: faster iteration cycles not only help you do things (like coding) better but also contribute to new creative insights. He built a bunch of examples to showcase alternative ways of coding, which come very close to being HUDs - one example shown in the OP is very similar to the one he presents for stepping through time to figure out how the code works.

[1]: https://www.youtube.com/watch?v=PUv66718DII

Animats

Nobody mentioned Manna [1] yet? That suggests a mostly audio headset giving orders. There is a real-world version using AR glasses.[2]

[1] https://marshallbrain.com/manna1

[2] https://www.six-15.com/vision-picking

hi_hi

Doesn't it all come down to "what is the ideal interface for humans to deal with digital information"?

We're getting more and more information thrown at us each day, and the AIs are adding to that, not reducing it. The ability to summarise dense, specialist information (I'm thinking error logs, but it could be anything really) just means more ways for people who previously wouldn't have engaged with that information to access and view it.

How do we, as individuals, best deal with all this information efficiently? Currently we have a variety of interfaces: websites, dashboards, emails, chat. Are all of these necessary anymore? They might be now, but what about the next 10 years? Do I even need to visit a company's website if I can get the same information from some single chat interface?

The fact that we have AIs building us websites, apps, and web UIs just seems so... redundant.

AlotOfReading

Websites were a way to get authoritative information about a company, from that company (or another trusted source like Wikipedia). That trust is powerful, which is why we collectively spent so much time trying to educate users about the "line of death" in browsers, drawing padlock icons, chasing down impersonator sites, mitigating homoglyph attacks, etc. This all rested on the assumption that certain sites were authoritative sources of information worth seeking out.

I'm not really sure what trust means in a world where everyone relies uncritically on LLM output. Even if the information from the LLM is usually accurate, can I rely on that in some particularly important instance?

hi_hi

You raise a good point, and one I rarely see discussed.

I still believe it fundamentally comes down to an interface issue, but how trust gets decoupled from the interface (as you said, the padlock shown in the browser and certs to validate a website's source) - that's an interesting one to think about :-)

energy123

The designers of 6th gen fighter jets are confronting the same challenge. The cockpit, which is an interface between the pilot and the airframe, will be optionally manned. If the cockpit is manned, the pilot will take on a reduced set of roles focused on higher-level decision making.

By the 7th generation it's hard to see how humans will still be value-add, unless it's for international law reasons to keep a human in the loop before executing the kill chain, or to reduce Skynet-like tail risks in line with Paul Christiano's arms race doom scenario.

Perhaps interfaces in every domain will evolve this way. The interface will shrink in complexity, until it's only humans describing what they want to the system, at higher and higher levels of abstraction. That doesn't necessarily have to be an English-language interface if precision in specification is required.

guardiang

Every human is different; don't generalize the interface. Dynamically customize it on the fly.

sipjca

Yep, I think this is the fundamental question as well; everything else is intermediate.

moomoo11

I like the smartphone. It’s honestly perfect and underutilized.

sothatsit

AI building complex visualisations for you on-the-fly seems like a great use-case.

For example, if you are debugging memory leaks in a specific code path, you could get AI to write a visualisation of all the memory allocations and frees under that code path to help you identify the problem. This opens up an interesting new direction where building visualisations to debug specific problems is probably becoming viable.

This idea reminds me of Jonathan Blow's recent talk at LambdaConf. In it, he shows a tool he made to visualise his programs in different ways to help with identifying potential problems. I could imagine AI being good at building these. The talk: https://youtu.be/IdpD5QIVOKQ?si=roTcCcHHMqCPzqSh&t=1108
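As a toy version of the memory-allocation example above, here is roughly the kind of throwaway instrumentation such a generated visualisation could start from, assuming Python's standard-library tracemalloc (the `suspect_code_path` function is hypothetical, and a real HUD would render the delta graphically rather than printing it):

```python
# Sketch: snapshot allocations before and after one code path and show the
# top allocation sites by source line.
import tracemalloc

def suspect_code_path():
    # Stand-in for the code path being debugged.
    return [bytes(1024) for _ in range(10_000)]

tracemalloc.start(25)  # keep up to 25 stack frames per allocation
before = tracemalloc.take_snapshot()
result = suspect_code_path()  # keep a reference so the allocations stay live
after = tracemalloc.take_snapshot()

# An AI-generated visualisation might turn this delta into a flame graph or
# editor-gutter annotations instead of text.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```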

aantix

I would love to see a HUD that allows me to see the change that corresponds to Claude Code's TODO item.

I don't want inline comments, as those accumulate and don't get cleaned up appropriately by the LLM.

ravila4

I think one key reason HUDs haven’t taken off more broadly is the fundamental limitation of our current display medium - computer screens and mobile devices are terrible at providing ambient, peripheral information without being intrusive.

When I launch an AI agent to fix a bug or handle a complex task, there’s this awkward wait time where it takes too long for me to sit there staring at the screen waiting for output, but it’s too short for me to disengage and do something else meaningful. A HUD approach would give me a much shorter feedback loop. I could see what the AI is doing in my peripheral vision and decide moment-to-moment whether to jump in and take over the coding myself, or let the agent continue while I work on something else. Instead of being locked into either “full attention on the agent” or “completely disengaged,” I’d have that ambient awareness that lets me dynamically choose my level of involvement.

This makes me think VR/AR could be the killer application for AI HUDs. Spatial computing gives us the display paradigm where AI assistance can be truly ambient rather than demanding your full visual attention on a 2D screen. I picture that this would be especially helpful for more physical tasks, such as cooking or fixing a bike.

elliotec

You just described what I do with my ultrawide monitor and laptop screen.

I can be fully immersed in a game or anything and keep Claude in a corner of a tmux window next to a browser on the other monitor and jump in whenever I see it get to the next step or whatever.

ravila4

It’s a similar idea, but imagine you could fire off a task, and go for a run, or do the dishes. Then be notified when it completes, and have the option to review the changes, or see a summary of tests that are failing, without having to be at your workstation.

bigyabai

I kinda do this today, with Alpaca[0]'s sandboxed terminal runner and GSConnect[1] syncing the response notifications to my phone over LAN.

[0] https://jeffser.com/alpaca/

[1] https://github.com/GSConnect/gnome-shell-extension-gsconnect

ankit219

The current paradigm is driven by two factors: one is the reliability of the models, which constrains how much autonomy you can give to an agent. The second is chat as a medium, which everyone went to because ChatGPT became a thing.

I see the value in HUDs, but only when you can be sure the output is correct. If that number is only 80% or so, copilots work better, so that humans in the loop can review and course-correct - the pair programmer/worker. This is not to say we need AI to reach higher levels of correctness inherently, just that deployed systems need to do so before they display information on a HUD.

psychoslave

This is missing the addictive/engaging part of a conversational interface for most people out there, which is in line with the critiques highlighted in the fine article.

Just because most people are fond of it doesn't actually mean it improves their lives, goals, and productivity.

benjaminwootton

I think there is a third and distinct model, which is AI that runs in the background autonomously over a long period and pushes things to you.

It can detect situations intelligently, do the filtering, summarise what’s happening, and possibly make a recommendation.

This feels a lot more natural to me, especially in a business context where you want to monitor for 100 situations across thousands of customers.

skydhash

Actually defining those situations and collecting the data (which should help identify those situations) are the hard parts. Having an autonomous system that does it has been solved for ages.