Claude Sonnet 4 now supports 1M tokens of context
507 comments
· August 12, 2025
aliljet
This is definitely one of my CORE problems as I use these tools for "professional software engineering." I really desperately need LLMs to maintain extremely effective context, and it's not actually that interesting to see a new model that's marginally better than the next one (for my day-to-day).
However. Price is king. Allowing me to flood the context window with my code base is great, but given that the price has substantially increased, it makes sense to manage what goes into the context window more carefully. Flooding their context window is great value for them, but short of evals that look at how effectively Sonnet stays on track, it's not clear if the value actually exists here.
ants_everywhere
> I really desperately need LLMs to maintain extremely effective context
The context is in the repo. An LLM will never have the context you need to solve all problems. Large enough repos don't fit on a single machine.
There's a tradeoff just like in humans where getting a specific task done requires removing distractions. A context window that contains everything makes focus harder.
For a long time context windows were too small, and they probably still are. But they have to get better at understanding the repo by asking the right questions.
sdesol
> But they have to get better at understanding the repo by asking the right questions.
How I am tackling this problem is by making it dead simple for users to create analyzers that are designed to enrich text data. You can read more about how it would be used in a search at https://github.com/gitsense/chat/blob/main/packages/chat/wid...
The basic idea is, users would construct analyzers with the help of LLMs to extract the proper metadata that can be semantically searched. So when the user does an AI Assisted search with my tool, I would load all the analyzers (description and schema) into the system prompt and the LLM can determine which analyzers can be used to answer the question.
A very simple analyzer might just make it easy to identify backend and frontend code, so you can use the command `!ask find all frontend files` and the LLM will construct a deterministic search that knows how to match frontend files.
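To make that concrete, here is a rough Python sketch of the idea (the names and schema here are hypothetical, not the actual format my tool uses): the analyzer describes itself and the metadata it emits, and the LLM maps a query like `!ask find all frontend files` onto that metadata.

    # Hypothetical analyzer definition -- illustration only, not the real gitsense format.
    frontend_analyzer = {
        "name": "frontend-backend-classifier",
        "description": "Tags each file as frontend or backend so a query like "
                       "'!ask find all frontend files' becomes a deterministic search.",
        "schema": {"layer": ["frontend", "backend", "shared"], "framework": "string"},
    }

    def classify(path: str) -> dict:
        # Toy heuristic; an LLM-assisted analyzer would refine this per repo.
        frontend_markers = (".jsx", ".tsx", ".vue", ".css", ".scss")
        layer = "frontend" if path.endswith(frontend_markers) else "backend"
        return {"layer": layer, "framework": "unknown"}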
onion2k
> Large enough repos don't fit on a single machine.
I don't believe any human can understand a problem if they need to fit the entire problem domain in their head, let alone a domain whose scope doesn't fit on a single computer. You have to break it down into a manageable amount of information and tackle it in chunks.
If a person can do that, so can an LLM prompted to do that by a person.
mock-possum
> The context is in the repo
Agreed but that’s a bit different from “the context is the repo”
It’s been my experience that usually just picking a couple files out to add to the context is enough - Claude seems capable of following imports and finding what it needs, in most cases.
I’m sure it depends on the task, and the structure of the codebase.
stuartjohnson12
> An LLM will never have the context you need to solve all problems.
How often do you need more than 10 million tokens to answer your query?
ants_everywhere
I exhaust the 1 million context windows on multiple models multiple times per day.
I haven't used the Llama 4 10 million context window so I don't know how it performs in practice compared to the major non-open-source offerings that have smaller context windows.
But there is an induced demand effect where as the context window increases it opens up more possibilities, and those possibilities can get bottlenecked on requiring an even bigger context window size.
For example, consider the idea of storing all Hollywood films on your computer. In the 1980s this was impossible. If you store them in DVD or Blu-ray quality you could probably do it in a few terabytes. If you store them in full quality you may be talking about petabytes.
We recently struggled to get a full file into a context window. Now a lot of people feel a bit like "just take the whole repo, it's only a few MB".
rootnod3
Flooding the context also increases the likelihood of the LLM confusing itself, mainly because of the longer context: it derails along the way without a reset.
Wowfunhappy
I keep reading this, but with Claude Code in particular, I consistently find it gets smarter the longer my conversations go on, peaking right at the point where it auto-compacts and everything goes to crap.
This isn't always true--some conversations go poorly and it's better to reset and start over--but it usually is.
aliljet
How do you know that?
giancarlostoro
Here's a paper from MIT that covers how this could be resolved in an interesting fashion:
https://hanlab.mit.edu/blog/streamingllm
The AI field is reusing existing CS concepts for AI that we never had hardware for, and now these people are learning how applied Software Engineering can make their theoretical models more efficient. It's kind of funny, I've seen this in tech over and over. People discover new thing, then optimize using known thing.
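Very roughly, the trick in that paper (StreamingLLM) is to keep a handful of initial "attention sink" tokens plus a sliding window of recent tokens in the KV cache and evict everything in between. A simplified sketch of that eviction policy, not the actual implementation:

    def evict_kv_cache(cache, num_sinks=4, window=1020):
        # cache: list of per-token key/value entries, oldest first.
        # Keep the first `num_sinks` entries (attention sinks) plus the
        # most recent `window` entries; drop the middle.
        if len(cache) <= num_sinks + window:
            return cache
        return cache[:num_sinks] + cache[-window:]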
rootnod3
The longer the context and the discussion goes on, the more it can get confused, especially if you have to refine the conversation or code you are building on.
Remember, in its core it's basically a text prediction engine. So the more varying context there is, the more likely it is to make a mess of it.
Short context: the conversation leaves the context window and it loses context. Long context: it can mess with the model. So the trick is to strike a balance. But if it's an online model, you have fuck all to control. If it's a local model, you have some say in the parameters.
anonz4FWNqnX
I've had similar experiences. I've gone back and forth between running models locally and using the commercial models. The local models can be incredibly useful (gemma, qwen), but they need more patience and work to get them to work.
One advantage to running locally[1] is that you can set the context length manually and see how well the LLM uses it. I don't have an exact experience to relay, but it's not unusual for models to allow longer contexts but ignore that context.
Just making the context big doesn't mean the LLM is going to use it well.
[1] I've been using LM Studio on both a MacBook Air and a MacBook Pro. Even a MacBook Air with 16G can run pretty decent models.
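For example, with llama.cpp's Python bindings the context length is just a constructor argument, which makes it easy to test whether a model actually uses a bigger window (model path and sizes below are placeholders):

    from llama_cpp import Llama

    # n_ctx is the context window in tokens; raising it costs memory and
    # doesn't guarantee the model will actually attend to the extra context.
    llm = Llama(model_path="models/qwen2.5-7b-instruct-q4_k_m.gguf", n_ctx=32768)

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize the notes below..."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])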
F7F7F7
What do you think happens when things start falling outside of its context window? It loses access to parts of your conversation.
And that’s why it will gladly rebuild the same feature over and over again.
alexchamberlain
I'm not sure how, and maybe some of the coding agents are doing this, but we need to teach the AI to use abstractions, rather than the whole code base, for context. We as humans don't hold the whole codebase in our head, and we shouldn't expect the AI to either.
LinXitoW
They already do, or at least Claude Code does. It will search for a method name, then only load a chunk of that file to get the method signature, for example.
It will use the general information you give it to make educated guesses about where things are. If it knows the code is Vue based and it has to do something with "users", it might search for "src/*/User.vue".
This is also the reason why the quality of your code makes such a large difference. The more consistent the naming of files and classes, the better the AI is at finding them.
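A rough sketch of that retrieval pattern -- grep for a symbol, then load only a small window around the match instead of the whole file (an illustration of the idea, not Claude Code's actual implementation):

    from pathlib import Path

    def find_snippets(root: str, symbol: str, ext: str = ".vue", radius: int = 20):
        # Return (path, snippet) pairs: a few lines around each match, so an
        # agent can read signatures without loading entire files into context.
        results = []
        for path in Path(root).rglob(f"*{ext}"):
            lines = path.read_text(errors="ignore").splitlines()
            for i, line in enumerate(lines):
                if symbol in line:
                    snippet = "\n".join(lines[max(0, i - radius): i + radius])
                    results.append((str(path), snippet))
                    break  # one snippet per file keeps the context small
        return results

    # e.g. find_snippets("src", "User", ext=".vue")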
anthonypasq
The fact we can't keep the repo in our working memory is a flaw of our brains. I can't see how you could possibly make the argument that if you were somehow able to keep the entire codebase in your head it would be a disadvantage.
SkyBelow
Information tradeoff. Even if you could keep the entire code base in memory, if something else has to be left out of memory, then you have to consider the value of an abstraction versus whatever other information is lost. Abstractions also apply to the business domain and work the same way.
You also have time tradeoffs. Like time to access memory and time to process that memory to achieve some outcome.
There is also quality. If you can keep the entire code base in memory but with some chance of confusion, while abstractions will allow less chance of confusion, then the tradeoff of abstractions might be worth it still.
Even if we assume a memory that has no limits, can access and process all information at constant speed, and no quality loss, there is still communication limitations to worry about. Energy consumption is yet another.
sdesol
LLMs (in their current implementation) are probabilistic, so they really need the actual code to predict the most likely next tokens. Now, loading the whole code base can be a problem in itself, since other files may negatively affect the next token.
photon_lines
Sorry -- I keep seeing this being used but I'm not entirely sure how it differs from most human thinking. Most human 'reasoning' is probabilistic as well, and we rely on 'associative' networks to ingest information. In a similar manner, LLMs use association as well -- and not only that, but they are capable of figuring out patterns based on examples (just like humans are) -- read this paper for context: https://arxiv.org/pdf/2005.14165. In other words, they are capable of grokking patterns from simple data (just like humans are).
I've given various LLMs my requirements and they produced working solutions for me by simply 1) including all of the requirements in my prompt and 2) asking them to think through and 'reason' through their suggestions, and the products have always been superior to what most humans have produced.
The 'LLMs are probabilistic predictors' comments, though, keep appearing on threads and I'm not quite sure I understand them -- yes, LLMs don't have 'human context', i.e. the data needed to understand human beings, since they have not directly been fed human experiences, but for the most part LLMs are not the simple 'statistical predictors' everyone brands them to be. You can see a thorough write-up I did of what GPT is / was here if you're interested: https://photonlines.substack.com/p/intuitive-and-visual-guid...
nomel
No, it doesn’t, nor do we. It’s why abstractions and documentations exist.
If you know what a function achieves, and you trust it to do that, you don’t need to see/hold its exact implementation in your head.
siwatanejo
I do think AIs are already using abstractions, otherwise you would be submitting all the source code of your dependencies into the context.
TheOtherHobbes
I think they're recognising patterns, which is not the same thing.
Abstractions are stable, they're explicit in their domains, good abstractions cross multiple domains, and they typically come with a symbolic algebra of available operations.
Math is made of abstractions.
Patterns are a weaker form of cognition. They're implicit, heavily context-dependent, and there's no algebra. You have to poke at them crudely in the hope you can make them do something useful.
Using LLMs feels more like the latter than the former.
If LLMs were generating true abstractions they'd be finding meta-descriptions for code and language and making them accessible directly.
AGI - or ASI - may be be able to do that some day, but it's not doing that now.
F7F7F7
There are a billion and one repos that claim to help do this. Let us know when you find one.
throwaway314155
/compact in Claude Code is effectively this.
brulard
Compact is a reasonable default way to do that, but quite often it discards important details. It's better to have CC to store important details, decisions and reasons in a document where it can be reviewed and modified if needed.
benterix
> it's not clear if the value actually exists here.
Having spent a couple of weeks on Claude Code recently, I arrived to the conclusion that the net value for me from agentic AI is actually negative.
I will give it another run in 6-8 months though.
ericmcer
Agreed, daily Cursor user.
Just got out of a 15m huddle with someone trying to understand what they were doing in a PR before they admitted Claude generated everything and it worked but they weren't sure why... Ended up ripping about 200 LoC out because what Claude "fixed" wasn't even broken.
So never let it generate code, but the autocomplete is absolutely killer. If you understand how to code in 2+ languages you can make assumptions about how to do things in many others and let the AI autofill the syntax in. I have been able to swap to languages I have almost no experience in and work fairly well because memorizing syntax is irrelevant.
daymanstep
> I have been able to swap to languages I have almost no experience in and work fairly well because memorizing syntax is irrelevant.
I do wonder whether your code does what you think it does. Similar-sounding keywords in different languages can have completely different meanings. E.g. the volatile keyword in Java vs C++. You don't know what you don't know, right? How do you know that the AI generated code does what you think it does?
qingcharles
The other day I caught it changing the grammar and spelling in a bunch of static strings in a totally different part of a project, for no sane reason.
senko
> Just got out of a 15m huddle with someone trying to understand what they were doing in a PR before they admitted Claude generated everything and it worked but they weren't sure why...
But .. that's not the AI's fault. If people submit any PRs (including AI-generated or AI-assisted) without completely understanding them, I'd treat it as a serious breach of professional conduct and (gently, for first-timers) stress that this is not acceptable.
As someone hitting the "Create PR" (or equivalent) button, you accept responsibility for the code in question. If you submit slop, it's 100% on you, not on any tool used.
epolanski
You're blaming the tool and not the tool user.
cambaceres
For me it’s meant a huge increase in productivity, at least 3X.
Since so many claim the opposite, I’m curious to what you do more specifically? I guess different roles/technologies benefit more from agents than others.
I build full stack web applications in node/.net/react, more importantly (I think) is that I work on a small startup and manage 3 applications myself.
wiremine
> Having spent a couple of weeks on Claude Code recently, I arrived to the conclusion that the net value for me from agentic AI is actually negative.
> For me it’s meant a huge increase in productivity, at least 3X.
How do we reconcile these two comments? I think that's a core question of the industry right now.
My take, as a CTO, is this: we're giving people new tools, and very little training on the techniques that make those tools effective.
It's sort of like we're dropping trucks and airplanes on a generation that only knows walking and bicycles.
If you've never driven a truck before, you're going to crash a few times. Then it's easy to say "See, I told you, this new fangled truck is rubbish."
Those who practice with the truck are going to get the hang of it, and figure out two things:
1. How to drive the truck effectively, and
2. When NOT to use the truck... when walking or the bike is actually the better way to go.
We need to shift the conversation to techniques, and away from the tools. Until we do that, we're going to be forever comparing apples to oranges and talking around each other.
rs186
3X if not 10X if you are starting a new project with Next.js, React, and Tailwind CSS for full-stack website development, solving an everyday problem. Yeah, I just witnessed that yesterday when creating a toy project.
For my company's codebase, where we use internal tools and proprietary technology, solving a problem that does not exist outside the specific domain, on a codebase of over 1000 files? No way. Even locating the correct file to edit is non trivial for a new (human) developer.
elevatortrim
I think there are two broad cases where ai coding is beneficial:
1. You are a good coder but working on a project that is new to you, building a new project, or working with a technology you are not familiar with. This is where AI is hugely beneficial. It does not only accelerate you, it lets you do things you could not do otherwise.
2. You have spent a lot of time on engineering your context and learning what AI is good at, and using it very strategically where you know it will save time and not bother otherwise.
If you are a really good coder, really familiar with the project, and mostly changing its bits and pieces rather than building new functionality, AI won’t accelerate you much. Especially if you did not invest the time to make it work well.
acedTrex
I have yet to get it to generate code past 10ish lines that I am willing to accept. I read stuff like this and wonder how low yall's standards are, or if you are working on projects that just do not matter in any real world sense.
nicce
> I build full stack web applications in node/.net/react, more importantly (I think) is that I work on a small startup and manage 3 applications myself.
I think this is your answer. For example, React and JavaScript are extremely popular and mature. Are you using TypeScript and trying to get the most out of the types, or are you accepting whatever JavaScript the LLM gives you? How much do you care whether the code uses "soon to be deprecated" functions or the most optimized loop/implementation? How about the project structure?
In other cases, the more precision you need, the less effective the LLM is.
thanhhaimai
I work across the stack (frontend, backend, ML)
- For FrontEnd or easy code, it's a speed up. I think it's more like 2x instead of 3x.
- For my backend (hard trading algo), it has like 90% failure rate so far. There is just so much for it to reason through (balance sheet, lots, wash, etc). All agents I have tried, even on Max mode, couldn't reason through all the cases correctly. They end up thrashing back and forth. Gemini most of the time will go into the "depressed" mode on the code base.
One thing I notice is that the Max mode on Cursor is not worth it for my particular use case. The problem is either easy (frontend), which means any agent can solve it, or it's hard, and Max mode can't solve it. I tend to pick the fast model over strong model.
epolanski
> Since so many claim the opposite
The overwhelming majority of those claiming the opposite are a mixture of:
- users with wrong expectations, such as AI's ability to do the job on its own with minimal effort from the user. They have marketers to blame.
- users that have AI skill issues: they simply don't understand/know how to use the tools appropriately. I could provide countless examples from the importance of quality prompting, good guidelines, context management, and many others. They have only their laziness or lack of interest to blame.
- users that are very defensive about their job/skills. Many feel threatened by AI taking their jobs or diminishing it, so their default stance is negative. They have their ego to blame.
bcrosby95
My current guess is it's how the programmer solves problems in their head. This isn't something we talk about much.
People seem to find LLMs do well with well-spec'd features. But for me, creating a good spec doesn't take any less time than creating the code. The problem for me is the translation layer that turns the model in my head into something more concrete. As such, creating a spec for the LLM doesn't save me any time over writing the code myself.
So if it's a one shot with a vague spec and that works that's cool. But if it's well spec'd to the point the LLM won't fuck it up then I may as well write it myself.
flowerthoughts
What type of work do you do? And how do you measure value?
Last week I was using Claude Code for web development. This week, I used it to write ESP32 firmware and a Linux kernel driver. Sure, it made mistakes, but the net was still very positive in terms of efficiency.
verall
> This week, I used it to write ESP32 firmware and a Linux kernel driver.
I'm not meaning to be negative at all, but was this for a toy/hobby or for a commercial project?
I find that LLMs do very well on small greenfield toy/hobby projects but basically fall over when brought into commercial projects that often have bespoke requirements and standards (i.e. has to cross compile on qcc, comply with autosar, in-house build system, tons of legacy code laying around maybe maybe not used).
So no shade - I'm just really curious what kind of project you were able get such good results writing ESP32 FW and kernel drivers for :)
9cb14c1ec0
The more I use Claude Code, the more aware I become of its limitations. On the whole, it's a useful tool, but the bigger the codebase the less useful. I've noticed a big difference on its performance on projects with 20k lines of code versus 100k. (Yes, I know. A 100k line project is still very small in the big picture)
Aeolun
I think one of the big issues with CC is that it'll read the first occurrence of something, and then think it's found it. Never mind that there are 17 instances spread throughout the codebase.
I have to be really vigilant and tell it to search the codebase for any duplication, then resolve it, if I want it to stay good at what it does.
greenie_beans
same. agents are good with easy stuff and debugging but extremely bad with complexity. has no clue about Chesterton's fence, and it's hard to parse the results, especially when it creates massive diffs. creates a ton of abandoned/cargo-cult code. lots of misdirection with OOP.
chatting with claude and copy/pasting code between my IDE and claude is still the most effective for more complex stuff, at least for me.
mikepurvis
For a bit more nuance, I think I would say my overall net is about break even. But I don't take that as "it's not worth it at all, abandon ship" but rather that I need to hone my instinct of what is and is not a good task for AI involvement, and what that involvement should look like.
Throwing together a GHA workflow? Sure, make a ticket, assign it to copilot, check in later to give a little feedback and we're golden. Half a day of labour turned into fifteen minutes.
But there are a lot of tasks that are far too nuanced where trying to take that approach just results in frustration and wasted time. There it's better to rely on editor completion or maybe the chat interface, like "hey I want to do X and Y, what approach makes sense for this?" and treat it like a rubber duck session with a junior colleague.
mark_l_watson
I am sort of with you. I am down to asking Gemini Pro a couple of questions a day, use ChatGPT just a few times a week, and about once a week use gemini-cli (either a short free session, or a longer session where I provide my API key.)
That said I spend (waste?) an absurdly large amount of time each week experimenting with local models (sometimes practical applications, sometimes ‘research’).
meowtimemania
For me it depends on the task. For some tasks (maybe things that don't have good existing examples in my codebase?) I'll spend 3x the time repeatedly asking Claude to do something for me.
seanmmward
The primary use case isn't just about shoving more code in context, although depending on the task, there is an irreducible minimum context needed for it to capture all the needed understanding. The 1M context model is a unique beast in terms of how you need to feed it, and its real power is being able to tackle long-horizon tasks which require iterative exploration, in-context learning, and resynthesis. I.e., some problems are breadth (go fix an API change in 100 files); others require depth (go learn from trying 15 different ways to solve this problem). 1M Sonnet is unique in its capabilities for the latter in particular.
hinkley
Sounds to me like your problem has shifted from how much the AI tool costs per hour to how much it costs per token because resetting a model happens often enough that the price doesn't amortize out per hour. That giant spike every ?? months overshadows the average cost per day.
I wonder if this will become more universal, and if we won't see a 'tick-tock' pattern like Intel used, where they tweak the existing architecture one or more times between major design work. The 'tick' is about keeping you competitive and the 'tock' is about keeping you relevant.
sdesol
> I really desperately need LLMs to maintain extremely effective context
I actually built this. I'm still not ready to say "use the tool" yet, but you can learn more about it at https://github.com/gitsense/chat.
The demo link is not up yet as I need to finalize an admin tool, but you should be able to follow the npm instructions to play around with it.
The basic idea is, you should be able to load your entire repo or repos and use the context builder to help you refine it. Or you can create custom analyzers that you can do 'AI Assisted' searches with, like executing `!ask find all frontend code that does [this]`, and because the analyzer knows how to extract the correct metadata to support that query, you'll be able to easily build the context using it.
msikora
Why not build this as an MCP so that people can plug it into their favorite platform?
sdesol
An MCP is definitely on the roadmap. My objective is to become the context engine for LLMs, so having an MCP is required. However, there are things from a UX perspective that you'll lose out on if you just use the MCP.
hirako2000
Not clear how it gets around what is, ultimately, a context limit.
I've been fiddling with some process too; it would be good if you shared the how. The readme looks like yet another full-fledged app.
sdesol
Yes there is a context window limit, but I've found for most frontier models, you can generate very effective code if the context window is under 75,000 tokens provided the context is consistent. You have to think of everything from a probability point of view and the more logical the context, the greater the chances of better code.
For example, if the frontend doesn't need to know the backend code (other than the interface), not including the backend code when solving a frontend problem can reduce context size and improve the chances of the expected output. You just need to ensure you include the necessary interface documentation.
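To make that concrete, here is a minimal sketch of the kind of budgeting I mean, using the rough ~4 characters per token estimate rather than a real tokenizer (the file list and the 75,000 figure are just examples):

    def build_context(files: list[str], budget_tokens: int = 75_000) -> str:
        # Concatenate curated files until a rough token budget is reached.
        # len(text) // 4 is a crude token estimate, not a real tokenizer.
        parts, used = [], 0
        for path in files:
            text = open(path, encoding="utf-8", errors="ignore").read()
            est = len(text) // 4
            if used + est > budget_tokens:
                break  # stop before the context gets too large to stay consistent
            parts.append(f"// file: {path}\n{text}")
            used += est
        return "\n\n".join(parts)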
As for the full-fledged app, I think you raised a good point and I should add a 'No lock-in' section explaining why to use it. The app has a message tool that lets you pick and choose which messages to copy. Once you've copied the context (including any conversation messages that can help the LLM), you can use the context wherever you want.
My strategy with the app is to be the first place you go to start a conversation before you even generate code, so my focus is helping you construct contexts (the smaller the better) to feed into LLMs.
handfuloflight
Doesn't Claude Code do all of this automatically?
sdesol
I haven't looked at Claude Code, so I don't know if it has analyzers that understand how to extract any type of data other than the specific coding data it is trained on. Based on the runtime for some tasks, I would not be surprised if it is going through all the files and asking "is this relevant".
My tool is mainly targeted at massive code bases and enterprise as I still believe the most efficient way to build accurate context is by domain experts.
Right now, I would say 95% of my code is AI generated (98% human architectured) and I am spending about $2 a day on LLM costs and the code generation part usually never runs more than 30 seconds for most tasks.
kvirani
Wait that's not how Cursor etc work? (I made assumptions)
sdesol
I don't use Cursor so I can't say, but based on what I've read, they optimize for smaller context to reduce cost and probably for performance. The issue is, I think this is severely flawed as LLMs are insanely context sensitive and forgetting to include a reference file can lead to undesirable code.
I am obviously biased, but I still think to get the best results, the context needs to be human curated to ensure everything the LLM needs will be present. LLMs are probabilistic, so the more relevant context, the greater the chances the final output is the most desired.
trenchpilgrim
Dunno about Cursor but this is exactly how I use Zed to navigate groups of projects
TZubiri
"However. Price is king. Allowing me to flood the context window with my code base is great"
I don't vibe code, but in general having to know all of the codebase to be able to do something is a smell: it's spaghetti, it's a lack of encapsulation.
When I program I cannot think about the whole codebase, I have a couple of files open at most and I think about the code in those files.
This issue of having to understand the whole codebase, complaining about abstractions, microservices, and OOP, and wanting everything to be in a "simple" monorepo or a monolith, is something that I see juniors do, almost exclusively.
gdudeman
A tip for those who both use Claude Code and are worried about token use (which you should be if you're stuffing 400k tokens into context even if you're on 20x Max):
1. Build context for the work you're doing. Put lots of your codebase into the context window.
2. Do work, but at each logical stopping point hit double escape to rewind to the context-filled checkpoint. You do not spend those tokens to rewind to that point.
3. Tell Claude your developer finished XYZ, have it read it into context and give high level and low level feedback (Claude will find more problems with your developer's work than with yours).
If you want to have multiple chats running, use /resume and pull up the same thread. Hit double escape to the point where Claude has rich context, but has not started down a specific rabbit hole.
nojs
In my experience jumping back like this is risky unless you explicitly tell it you made changes, otherwise they will get clobbered because it will update files based on the old context.
Telling it to “re-read” xyz files before starting works though.
FajitaNachos
What's the benefit to using claude code CLI directly over something like Cursor?
oars
I tell Claude that it wrote XYZ in another session (I wrote it) then use that context to ask questions or make changes.
Wowfunhappy
I do this all the time and it sometimes works, but it's not a silver bullet. Sometimes Claude benefits from having the full conversation.
gdudeman
I'll note this saves a lot of wait time as well! No sitting there while a new Claude builds context from scratch.
i_have_an_idea
This sounds like the programmer equivalent of astrology.
> Build context for the work you're doing. Put lots of your codebase into the context window.
If you don’t say that, what do you think happens as the agent works on your codebase?
seperman
Very interesting. Why does Claude find more problems if we mention the code is written by another developer?
mcintyre1994
Total guess, but maybe it breaks it out of the sycophancy that most models seem to exhibit?
I wonder if they’d also be better at things like telling you an idea is dumb if you tell it it’s from someone else and you’re just assessing it.
bgilly
In my experience, Claude will criticize others more than it will criticize itself. Seems similar to how LLMs in general tend to say yes to things or call anything a good idea by default.
I find it to be an entertaining reflection of the cultural nuances embedded into training data and reinforcement learning processes.
gdudeman
Claude is very agreeable and is an eager helper.
It gives you the benefit of the doubt if you're coding.
It also gives you the benefit of the doubt if you're looking for feedback on your developers work. If you give it a hint of distrust "my developer says they completed this, can you check and make sure, give them feedback....?" Claude will look out for you.
rvnx
Thank you for the tips. Do you know how to roll back the latest changes? Trying very hard to do it, but it seems like Git is the only way?
rtuin
Quick tip when working with Claude Code and Git: When you're happy with an intermediate result, stage the changes by running `git add` (no commit). That makes it possible to always go back to the staged changes when Claude messes up. You can then just discard the unstaged changes and don't have to roll back to the latest commit.
SparkyMcUnicorn
I haven't used it, but saw this the other day: https://github.com/RonitSachdev/ccundo
lpa22
One of the most helpful usages of CC so far is when I simply ask:
"Are there any bugs in the current diff"
It analyzes the changes very thoroughly, often finds very subtle bugs that would cost hours of time/deployments down the line, and points out a bunch of things to think through for correctness.
KTibow
I'm surprised that works even without telling it to think/think hard/think harder/ultrathink.
bertil
That matches my experience with non-coding tasks: it’s not very creative, but it’s a comprehensive critical reader.
swyx
maybe want to reify that as a claude code hook!
not_that_d
My experience with the current tools so far:
1. It helps to get me going with new languages, frameworks, utilities, or full green-field stuff. After that I spend a lot of time parsing the code to understand what it wrote; I kind of "trust" it because checking is too tedious, but "it works".
2. When working with languages or frameworks that I know, I find it makes me unproductive: the amount of time I spend writing a good enough prompt with the correct context is almost the same as or more than if I wrote the stuff myself, and to be honest the solution it gives me works for the specific case but looks like junior code, with pitfalls that are not that obvious unless you have the experience to know them.
I used it with Typescript, Kotlin, Java and C++, for different scenarios, like websites, ESPHome components (ESP32), backend APIs, node scripts etc.
Bottom line: useful for hobby projects, scripts, and prototypes, but for enterprise-level code it is not there.
brulard
For me it was like this for like a year (using Cline + Sonnet & Gemini) until Claude Code came out and until I learned how to keep context real clean. The key breakthrough was treating AI as an architect/implementer rather than a code generator.
Most recently I first ask CC to create a design document for what we are going to do. He has instructions to look into the relevant parts of the code and docs to reference them. I review it, and after a few back-and-forths we have defined what we want to do. The next step is to chunk it into stages, and even those into smaller steps. All this may take a few hours, but after it is well defined, I clear the context. I then let him read the docs and implement one stage. This goes mostly well, and if it doesn't I either try to steer him to correct it or, if it's too bad, I improve the docs and start the stage over. After a stage is complete, we commit, clear context, and proceed to the next stage.
This way I spend maybe a day creating a feature that would otherwise take me maybe 2-3. And at the end we have a document, unit tests, storybook pages, and things that often get overlooked, like accessibility, aria attributes, etc.
At the very end I like another model to make a code review.
Even if this didn't make me faster now, I would consider it future-proofing myself as a software engineer as these tools are improving quickly
aatd86
For me it's the opposite. As long as I ask for small tasks, or error checking, it can help. But I'd rather think of the overall design myself because I tend to figure out corner cases or superlinear complexities much better. I develop better mental models than the NNs. That's somewhat of a relief.
Also the longer the conversation goes, the less effective it gets. (saturated context window?)
brulard
I don't think thats the opposite. I have an idea what I want and to some extent how I want it to be done. The design document starts with a brainstorming where I throw all my ideas at the agent and we iterate together.
> Also the longer the conversation goes, the less effective it gets. (saturated context window?)
Yes, this is exactly why I said the breakthrough came for me when I learned how to keep the context clean. That means multiple times in the process I ask the model to put the relevant parts of our discussion into an MD document, I may review and edit it and I reset the context with /clear. Then I have him read just the relevant things from MD docs and we continue.
john-tells-all
I've seen this referred to as Chain of Thought. I've used it with great success a few times.
ramshanker
Same here. A small variation: I explicitly use website to manage what context it gets to see.
brulard
What do you mean by website? An HTML doc?
imiric
This is a common workflow that most advanced users are familiar with.
Yet even following it to a T, and being really careful with how you manage context, the LLM will still hallucinate, generate non-working code, steer you into wrong directions and dead ends, and just waste your time in most scenarios. There's no magical workflow or workaround for avoiding this. These issues are inherent to the technology, and have been since its inception. The tools have certainly gotten more capable, and the ecosystem has matured greatly in the last couple of years, but these issues remain unsolved. The idea that people who experience them are not using the tools correctly is insulting.
I'm not saying that the current generation of this tech isn't useful. I've found it very useful for the same scenarios GP mentioned. But the above issues prevent me from relying on it for anything more sophisticated than that.
brulard
> These issues are inherent to the technology
That's simply false. Even if LLMs don't produce correct and valid code on the first shot 100% of the time, if you use an agent it's simply a matter of iterations. I have Claude Code connected to Playwright and to context7 for docs, so it can iterate by itself if there are syntax errors, runtime errors, or problems with the data on the backend side. Currently I have near zero cases where it does not produce valid working code. If it is incorrect in some aspect, it is not that hard to steer it to a better solution or to fix it yourself.
And even if it failed in implementing most of the stages of the plan, it's not all wasted time. I brainstormed ideas, formed the requirements and feature specifications, and have clear documentation and a plan of the implementation, unit tests, etc., and I can use that to code it myself. So even in the worst case scenario my development workflow is improved.
viccis
I agree. For me it's a modern version of that good ol "rails new" scaffolding with Ruby on Rails that got you started with a project structure. It makes sense because LLMs are particularly good at tasks that require little more knowledge than just a near perfect knowledge of the documentation of the tooling involved, and creating a well organized scaffold for a greenfield project falls squarely in that area.
For legacy systems, especially ones in which a lot of the things they do are because of requirements from external services (whether that's tech debt or just normal growing complexity in a large connected system), it's less useful.
And for tooling that moves fast and breaks things (looking at you, Databricks), it's basically worthless. People have already brought attention to the fact that it will only be as current as its training data was, and so if a bunch of terminology, features, and syntax have changed since then (ahem, Databricks), you would have to do some kind of prompt engineering with up to date docs for it to have any hope of succeeding.
pvorb
I'm wondering what exact issue you are referring to with Databricks? I can't remember a time I had to change a line I wrote during the past 2.5 years I've been using it. Or are you talking about non-breaking changes?
jeremywho
My workflow is to use Claude desktop with the filesystem mcp server.
I give claude the full path to a couple of relevant files related to the task at hand, ie where the new code should hook into or where the current problem is.
Then I ask it to solve the task.
Claude will read the files, determine what should be done and it will edit/add relevant files. There's typically a couple of build errors I will paste back in and have it correct.
Current code patterns & style will be maintained in the new code. It's been quite impressive.
This has been with Typescript and C#.
I don't agree that what it has produced for me is hobby-grade only...
taberiand
I've been using it the same way. One approach that's worked well for me is to start a project and first ask it to analyse and make a plan with phases for what needs to be done, save that plan into the project, then get it to do each phase in sequence. Once it completes a phase, have it review the code to confirm if the phase is complete. Each phase of work and review is a new chat.
This way helps ensure it works on manageable amounts of code at a time and doesn't overload its context, but also keeps the bigger picture and goal in sight.
mnky9800n
I find that sometimes this works great, and sometimes it happily tells you everything works while your code fails successfully, and if you aren't reading all the code you would never know. It's kind of strange actually. I don't have a good feeling for when it will get everything correct and when it will fail, and that's what is disconcerting. I would be happy to be given advice on how to untangle when it's good and when it's not. I love chatting with Claude Code about code. It's annoying that it doesn't always get it right and also doesn't really interact with failure like a human would. At least in my experience anyways.
JyB
That's exactly how you should do it. You can also plug in an MCP for your CI or mention cli.github.com in your prompt to also make it iterate on CI failures.
Next you use Claude Code instead and have several instances work on their own clones, in their own workspaces and branches, in the background, so you can still iterate yourself on some other topic on your personal clone.
Then you check out its tab from time to time and optionally checkout its branch if you'd rather do some updates yourself. It's so ingrained in my day-to-day flow now it's been super impressive.
hamandcheese
Any particular reason you prefer that over Claude code?
jeremywho
I'm on windows. Claude Code via WSL hasn't been as smooth a ride.
nwatson
One can also integrate with, say, a running PyCharm with the Jetbrains IDE MCP server. Claude Desktop can then interact directly with PyCharm.
pqs
I'm not a programmer, but I need to write python and bash programs to do my work. I also have a few websites and other personal projects. Claude Code helps me implement those little projects I've been wanting to do for a very long time, but I couldn't due to the lack of coding experience and time. Now I'm doing them. Also now I can improve my emacs environment, because I can create lisp functions with ease. For me, this is the perfect tool, because now I can do those little projects I couldn't do before, making my life easier.
chamomeal
LLMs totally kick ass for making bash scripts
dboreham
Strong agree. Bash is so annoying that there have been many scripts that I wanted to have, but just didn't write (did the thing manually instead) rather than go down the rabbit hole of Bash nonsense. LLMs turn this on its head. I probably have LLMs write 1-2 bash scripts a week now, that I commit to git for use now and later.
zingar
Big +1 to customizing emacs! Used to feel so out of reach, but now I basically rolled my own cursor.
dekhn
For context I'm a principal software engineer who has worked in and out of machine learning for decades (along with a bunch of tech infra, high performance scientific computing, and a bunch of hobby projects).
In the few weeks since I've started using Gemini/ChatGPT/Claude, I've
1. had it read my undergrad thesis and the paper it's based on, implementing correct pytorch code for featurization and training, along with some aspects of the original paper that I didn't include in my thesis. I had been waiting until retirement to take on this task.
2. had it write a bunch of different scripts for automating tasks (typically scripting a few cloud APIs) which I then ran, cleaning up a long backlog of activities I had been putting off.
3. had it write a yahtzee game and implement a decent "pick a good move" feature. It took a few tries but then it output a fully functional PyQt5 desktop app that played the game. It beat my top score of all time in the first few plays.
4. tried to convert the yahtzee game to an android app so my son and I could play. This has continually failed on every chat agent I've tried- typically getting stuck with gradle or the android SDK. This matches my own personal experience with android.
5. had it write python and web-based g-code senders that allowed me to replace some tools I didn't like (UGS). Adding real-time vis of the toolpath and objects wasn't that hard either. Took about 10 minutes and it cleaned up a number of issues I saw with my own previous implementations (multithreading). It was stunning how quickly it can create fully capable web applications using javascript and external libraries.
6. had it implement a gcode toolpath generator for basic operations. At first I asked it to write Rust code, which turned out to be an issue (mainly because the opencascade bindings are incomplete), it generated mostly functional code but left it to me to implement the core algorithm. I asked it to switch to C++ and it spit out the correct code the first time. I spent more time getting cmake working on my system than I did writing the prompt and waiting for the code.
7. had it Write a script to extract subtitles from a movie, translate them into my language, and re-mux them back into the video. I was able to watch the movie less than an hour after having the idea- and most of that time was just customizing my prompt to get several refinements.
8. had it write a fully functional chemistry structure variational autoencoder that trains faster and more accurate than any I previously implemented.
9. various other scientific/imaging/photography related codes, like implementing multi-camera rectification, so I can view obscured objects head-on from two angled cameras.
With a few caveats (Android projects, Rust-based toolpath generation), I have been absolutely blown away with how effective the tools are (especially used in a agent which has terminal and file read/write capabilities). It's like having a mini-renaissance in my garage, unblocking things that would have taken me a while, or been so frustrating I'd give up.
I've also found that AI summaries in google search are often good enough that I don't click on links to pages (wikipedia, papers, tutorials etc). The more experience I get, the more limitations I see, but many of those limitations are simply due to the extraordinary level of unnecessary complexity required to do nearly anything on a modern computer (see my comments above about Android apps & gradle).
MangoCoffee
At the end of the day, all tools are made to make their users' lives easier.
I use GitHub Copilot. I recently did a vibe code hobby project for a command line tool that can display my computer's IP, hard drive, hard drive space, CPU, etc. GPT 4.1 did coding and Claude did the bug fixing.
The code it wrote worked, and I even asked it to create a PowerShell script to build the project for release
epolanski
I really find your experience strikingly different than mine, I'll share you my flow:
- step A: ask the AI to write a featureA-requirements.md file at the root of the project. I give it a general description of the task, then have it ask me as many questions as possible to refine user stories and requirements. It generally comes up with a dozen or more questions, several of which I would not have thought of and would only have found out about much later. Time: between 5 and 40 minutes. It's very detailed.
- step B: after we refine the requirements (functional and non functional) we write together a todo plan as featureA-todo.md. I refine the plan again, this is generally shorter than the requirements and I'm generally done in less than 10 minutes.
- step C: implementation phase. Again the AI does most of the job; I correct it at each edit and point out flaws. Are there cases where I would've done it faster myself? Maybe. I can still jump into the editor and make the changes I want. This step in general includes comprehensive tests for all the requirements and edge cases we found in step A: functional, integration, and E2Es. This really varies, but it is highly tied to the quality of phases A and B. It can be as little as a few minutes (especially when we come up with the most effective plan) and as much as a few hours.
- step D: documentation and PR description. With all of this context (in requirements and todos), updating any relevant documentation and writing the PR description is at this point a very quick exercise.
In all of that: I have textual files with precise coding style guidelines, comprehensive readmes to give precise context, etc that get referenced in the context.
Bottom line: you might be doing something profoundly wrong, because in my case all of this planning, requirements gathering, testing, documenting, etc. is pushing me to deliver much higher quality engineering work.
hoppp
I used it with Typescript and Go, SQL, Rust.
Using it with Rust is just horrible imho. Lots and lots of errors; I can't wait to stop this Rust project already. But the project itself is quite complex.
Go on the other hand is super productive, mainly because the language is already very simple. I can move 2x as fast.
Typescript is fine, I use it for React components and it will do animations I'm too lazy to do...
SQL and PostgreSQL are fine, I can do it without it too, I just don't like writing stored functions because of the boilerplatey syntax; a little speedup saves me from carpal tunnel.
apimade
Many who say LLMs produce “enterprise-grade” code haven’t worked in mid-tier or traditional companies, where projects are held together by duct tape, requirements are outdated, and testing barely exists. In those environments, enterprise-ready code is rare even without AI.
For developers deeply familiar with a codebase they’ve worked on for years, LLMs can be a game-changer. But in most other cases, they’re best for brainstorming, creating small tests, or prototyping. When mid-level or junior developers lean heavily on them, the output may look useful.. until a third-party review reveals security flaws, performance issues, and built-in legacy debt.
That might be fine for quick fixes or internal tooling, but it’s a poor fit for enterprise.
bityard
I work in the enterprise, although not as a programmer, but I get to see how the sausage is made. And describing code as "enterprise grade" would not be a compliment in my book. Very analogous to "contractor grade" when describing home furnishings.
Aeolun
Umm, Claude Code's output is a lot better than a lot of the enterprise-grade code I see. And it actually learns from mistakes with a properly crafted instruction xD
drums8787
My experience is the opposite I guess. I am having a great time using claude to quickly implement little "filler features" that require a good amount of typing and pulling from/editing different sources. Nothing that requires much brainpower beyond remembering the details of some sub system, finding the right files, and typing.
Once the code is written, review, test and done. And on to more fun things.
Maybe what has made it work is that these tasks have all fit comfortably within existing code patterns.
My next step is to break down bigger & more complex changes into claude friendly bites to save me more grunt work.
unlikelytomato
I wish I shared this experience. There are virtually no filler features for me to work on. When things feel like filler on my team, it's generally a sign of tech debt and we wouldn't want to have it generate all the code it would take. What are some examples of filler features for you?
On the other hand, it does cost me about 8 hours a week debugging issues created by bad autocompletes from my team. The last 6 months have gotten really bad with that. But that is a different issue.
tankenmate
It is definitely good to have this as an option, but at the same time having more context can reduce the quality of the output because it's easier for the LLM to get "distracted". So I wonder what will happen to the quality of code produced by tools like Claude Code if users don't properly understand the trade-off being made (if they leave it in auto mode and code right up to the auto-compact).
bachittle
As of now it's not integrated into Claude Code. "We’re also exploring how to bring long context to other Claude products". I'm sure they already know about this issue and are trying to think of solutions before letting users incur more costs on their monthly plans.
PickledJesus
Seems to be for me, I came to look at HN because I saw it was the default in CC
novaleaf
where do you see it in CC?
jasonthorsness
What do you recommend doing instead? I've been using Claude Code a lot but am still pretty novice at the best practices around this.
TheDong
Have the AI produce a plan that spans multiple files (like "01 create frontend.md", "02 create backend.md", "03 test frontend and backend running together.md"), and then create a fresh context for each step if it looks like re-using the same context is leading it to confusion.
Also, commit frequently, and if the AI constantly goes down the wrong path ("I can't create X so I'll stub it out with Y, we'll fix it later"), you can update the original plan with wording to tell it not to take that path ("Do not ever stub out X, we must make X work"), and then start a fresh session with an older and simpler version of the code and see if that fresh context ends up down a better path.
You can also run multiple attempts in parallel if you use tooling that supports that (containers + git worktrees is one way)
F7F7F7
Inevitably the files become a mess of their own. Changes and learnings from one part of the plan often don't result in adaptation to impacted plans down the chain.
In the end you have a mish-mash of half-implemented plans, and now you've lost context too. Which leads to blowing tokens on trying to figure out what's been implemented, what's half-baked, and what was completely ignored.
Any links to anyone who’s built something at scale using this method? It always sounds good on paper.
I’d love to find a system that works.
wongarsu
Changing the prompt and rerunning is something where Cursor still has a clear edge over Claude Code. It's such a powerful technique for keeping the context small because it keeps the context clear of back-and-forths and dead ends. I wish it was more universally supported
agotterer
I use the main Claude code thread (I don’t know what to call it) for planning and then explicitly tell Claude to delegate certain standalone tasks out to subagents. The subagents don’t consume the main threads context window. Even just delegating testing, debugging, and building will save a ton context.
sixothree
/clear often is really the first tool for management. Do this when you finish a task.
dbreunig
The team at Chroma is currently looking into this and should have some figures.
nojs
Currently the quality seems to degrade long before the context limit is reached, as the context becomes “polluted”.
Should we expect the higher limit to also increase the practical context size proportionally?
firasd
A big problem with the chat apps (ChatGPT; Claude.ai) is the weird context window hijinks. Especially ChatGPT does wild stuff.. sudden truncation; summarization; reinjecting 'ghost snippets' etc
I was thinking this should be up to the user (do you want to continue this conversation with context rolling out of the window or start a new chat) but now I realized that this is inevitable given the way pricing tiers and limited computation works. Like the only way to have full context is use developer tools like Google AI Studio or use a chat app that wraps the API
With a custom chat app that wraps the API you can even inject the current timestamp into each message and just ask the LLM: by the way, every 10 minutes make a new row in a markdown table that summarizes that 10-minute chunk.
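A minimal sketch of that kind of wrapper using the Anthropic Python SDK (the model id and the 10-minute rule are placeholders, and the summary-table behavior comes entirely from the standing instruction in the system prompt):

    import time
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    history = []

    def chat(user_text: str) -> str:
        # Inject the current timestamp into every message so the model can
        # maintain its own time-bucketed summary table when asked to.
        stamped = f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {user_text}"
        history.append({"role": "user", "content": stamped})
        reply = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder model id
            max_tokens=1024,
            system="Every ~10 minutes of conversation, add a new row to a "
                   "markdown table summarizing that chunk of the chat.",
            messages=history,
        )
        text = reply.content[0].text
        history.append({"role": "assistant", "content": text})
        return text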
cruffle_duffle
> btw every 10 minutes just make a new row in a markdown table that summarizes every 10 min chunk
Why make it time based instead of "message based"... like "every 10 messages, summarize to blah-blah.md"?
firasd
Sure. But you'd want to help out the LLM with a message count like this is message 40, this is message 41... so when it hits message 50 it's like ahh time for a new summary and call the memory_table function (cause it's executing the earlier standing order in your prompt)
dev0p
Probably it's more cost effective and less error prone to just dump the message log rather than actively rethink the context window, costing resources and potentially losing information in the process. As the models gets better, this might change.
psyclobe
Isn’t Opus better? Whenever I run out of Opus tokens and get kicked down to Sonnet it’s quite a shock sometimes.
But man I’m at the perfect stage in my career for these tools. I know a lot, I understand a lot, I have a lot of great ideas-but I’m getting kinda tired of hammering out code all day long. Now with Claude I am just busting ass executing in all these ideas and tests and fixes-never going back!
howinator
I could be wrong, but I think this pricing is the first to admit that cost scales quadratically with number of tokens. It’s the first time I’ve seen nonlinear pricing from an LLM provider which implicitly mirrors the inference scaling laws I think we're all aware of.
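The published structure is tiered: cross the 200K threshold and the whole prompt is billed at a higher rate, which is the nonlinearity being referred to. A quick sketch with the input figures from the announcement ($3/MTok up to 200K, $6/MTok beyond; treat these numbers as assumptions):

    def input_cost_usd(prompt_tokens: int) -> float:
        # Assumed tiered input pricing: $3 per million tokens for prompts up to
        # 200K tokens, $6 per million once the prompt exceeds 200K.
        rate = 3.0 if prompt_tokens <= 200_000 else 6.0
        return prompt_tokens / 1_000_000 * rate

    print(input_cost_usd(200_000))    # 0.6
    print(input_cost_usd(1_000_000))  # 6.0 -- matches the $6/1M figure quoted elsewhere in the thread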
jpau
Google[1] also has a "long context" pricing structure. OpenAI may be considering offering similar since they do not offer their priority processing SLAs[2] for context >128K.
[1] https://cloud.google.com/vertex-ai/generative-ai/pricing
aledalgrande
I hope they are going to put something in Claude Code to display whether you're entering the expensive window. Sometimes I just keep the conversation going. I wouldn't want that to burn my Max credits 2x faster.
terminalshort
Yeah, that 1 MM tokens is a $15 (IIRC) API call. That's gonna add up quick! My favorite hypothetical AI failure scenario is that LLM agents eventually achieve human level general intelligence, but have to burn so many tokens to do it that they actually become more expensive than a human.
dang
Related ongoing thread:
Claude vs. Gemini: Testing on 1M Tokens of Context - https://news.ycombinator.com/item?id=44878999 - Aug 2025 (9 comments)
isoprophlex
1M of input... at $6/1M input tokens. Better hope it can one-shot your answer.