Claude can now create and edit files
203 comments
· September 9, 2025
pmx
They need to focus on fixing reliability first. Their systems constantly go down, and it appears they are having to quantise the models to keep up with demand, reducing intelligence significantly. New features like this feel pointless when the underlying model is becoming unusable.
mh-
This can't be overstated. I started using it heavily earlier this summer and it felt like magic. Someone signing up now based on how I described my personal experiences with it then would think I was out of my mind. For technical tasks it has been a net negative for me for the last several weeks.
(Speaking of both Claude Code and the desktop app, both Sonnet and Opus >=4, on the Max plan.)
data-ottawa
I don’t think you’re crazy, something is off in their models.
As an example I’ve been using an MCP tool to provide table schemas to Claude for months.
There was a point in early August where it stopped recognizing the tool unless it was explicitly mentioned. Maybe that's related to their degraded quality issue.
This morning after pulling the correct schema info Sonnet started hallucinating columns (from Shopify’s API docs) and added them to my query.
That’s a use case I’ve been doing daily for months and in the last few weeks has gone from consistent low supervision to flaky and low quality.
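For reference, the tool itself is nothing exotic - roughly this shape, using the Python MCP SDK's FastMCP helper (the names and hardcoded schema are illustrative, not my actual setup):

    # Sketch of a schema-providing MCP tool. The schema is hardcoded
    # to keep the sketch self-contained; a real server would query
    # the warehouse's information_schema.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("schema-provider")

    @mcp.tool()
    def get_table_schema(table_name: str) -> str:
        """Return column names and types for a warehouse table."""
        schemas = {
            "orders": "id BIGINT, created_at TIMESTAMP, total_price NUMERIC",
        }
        return schemas.get(table_name, f"unknown table: {table_name}")

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default

The tool hasn't changed, so when Claude stops calling it or overrides its output with hallucinated columns, that points at a model-side regression.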
I don't know what's going on. Sonnet has definitely felt worse, and the timeline matches their status page incident, but it's clearly not resolved.
Opus 4.1 also feels flaky; it seems less consistent about recalling earlier prompt details than 4.0.
I personally am frustrated that there’s no refund or anything after a month of degraded performance, and they’ve had a lot of downtime.
reissbaker
FWIW I strongly recommend using some of the recent, good Chinese OSS models. I love GLM-4.5, and Kimi K2 0905 is quite good as well.
8note
I've been thinking it's that my company MCP has blown up in context size, but even using Claude without Claude Code, I get context window overflows constantly now.
Another possibility could be a system prompt change that made it too long?
dingnuts
I have read so many anecdotes about so many models that "were great" and aren't now.
I actually think this is psychological bias. It got a few things right early on, and that's what you remember. As time passes, the errors add up, until the memory doesn't match reality. The "new shiny" feeling goes away, and you perceive it for what it really is: a kind of shitty slot machine
> personally am frustrated that there’s no refund or anything after a month of degraded performance
lol, LMAO. A company operates a shitty slot machine at a loss and you're surprised they have "issues" that reduce your usage?
I'm not paying for any of this shit until these companies figure out how to align incentives. If they make more by applying limits, or charge me when the machine makes errors, that's good for them and bad for me! Why should I continue to pay to pull on the slot machine lever?
It's a waste of time and money. I'll be richer and more productive if I just write the code myself, and the result will be better too.
trunnell
https://status.anthropic.com/incidents/72f99lh1cj2c
They recently resolved two bugs affecting model quality, one of which was in production Aug 5-Sep 4. They also wrote:
Importantly, we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.
Sibling comments are claiming the opposite, attributing malice where the company itself says it was a screw-up. Perhaps we should take Anthropic at its word, and also recognize that model performance will follow a probability distribution even for similar tasks, even without bugs making things worse.
kiratp
> Importantly, we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.
Things they could do that would not technically contradict that:
- Quantize KV cache
- Data-aware model quantization, where their own evals show "equivalent perf" but overall model quality suffers (toy sketch below).
Simple fact is that it takes longer to deploy physical compute but somehow they are able to serve more and more inference from a slowly growing pool of hardware. Something has to give...
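To make the quantization point concrete, here's a toy NumPy sketch - purely illustrative, with no claim that this is what Anthropic's serving stack does - of how int8 quantization buys a 4x memory saving at the cost of small rounding errors, exactly the kind of change that can pass narrow evals while subtly shifting outputs:

    import numpy as np

    # Toy per-tensor int8 quantization: saves memory/bandwidth at
    # serving time, but every value picks up a small rounding error.
    def quantize_int8(x):
        scale = np.abs(x).max() / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    rng = np.random.default_rng(0)
    w = rng.normal(size=4096).astype(np.float32)  # stand-in for weights or a KV cache
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)

    print("max abs error:", np.abs(w - w_hat).max())  # small, but nonzero
    print("bytes:", w.nbytes, "->", q.nbytes)         # 4x smaller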
mh-
The problem is twofold:
- They're reporting that it only impacted Haiku 3.5 and Sonnet 4. I used neither model during the time period I'm concerned with.
- It took them a month to publicly acknowledge that issue, so now we lack confidence there isn't another underlying issue going undetected (or undisclosed, less charitably) that affects Opus.
claude_ya_
Does anyone know if this also affected Claude Sonnet models running in AWS Bedrock, or if it was just when using the model via Anthropic’s API?
pc86
I hesitate to use phrases like "bait and switch", but it seems like every model gets released and is borderline awe-inspiring; then, as adoption and load increase, it's like it gets hit in the head with a hammer and becomes basically useless for anything beyond a multi-step Google search.
dingnuts
I think it's a psychological bias of some sort. When the feeling of newness wears off and you realize the model is still kind of shit, you have an imperfect memory of the first few uses when you were excited and have repressed the failures from that period. As the hype wears off you become more critical and correctly evaluate the model
otabdeveloper4
No, that's just the normal slope of the hype curve as you start figuring out how the man behind the curtain operates.
rootnod3
AI is not useful in the long term and is unsustainable. News at 11.
j45
It's important to jump on new models super early, before the guardrails get put in.
Anyone remember GPT4 the day it launched? :)
pqdbr
Same here. Even with Opus in Claude Code I'm getting terrible results, sometimes feeling like we went back to the GPT-3.5 eon. And it seems they are implementing heavy token-saving measures: the model does not read context anymore unless you force it to, making up method calls as it goes.
mh-
The simplest thing I frequently ask of regular Claude (not Code) in the desktop app:
"Use your web search tool to find me the go-to component for doing xyz in $language $framework. Always link the GitHub repo in your response."
Previously Sonnet 4 would return a good answer to this at least 80% of the time.
Now even Opus 4.1 with extended thinking frequently ignores my ask for it to use the search tool, which allows it to hallucinate a component in a library. Or maybe an entire repo.
It's gone backwards severely.
(If someone from Anthropic sees this, feel free to reach out for chat IDs/share links. I have dozens.)
j45
I’m running into this as well.
Might be Claude optimizing for general use cases over code, and that affecting the code side?
Feels strange, because the Claude API isn't the same as the web tool, so I didn't expect Claude Code to behave the same.
It might be a case of having to read Claude's best-practice docs and keep up with them. Normally I'd have Claude read them itself and update its approach accordingly. Not sure that works as well anymore.
alvis
And lest we forget, Opus was accidentally dumber last week! https://status.anthropic.com/incidents/72f99lh1cj2c
allisdust
Yup. Opus 4.1 has been feeling like absolute dog shit and it has made me give up in frustration several times. They really did downgrade their models. The Max plan is a joke now. I'm barely using Pro-level tokens since it's a net negative on my productivity. Enshittification is now truly in place.
OtomotO
This, so much this...
I signed up for Claude over a week ago and I totally regret it!
Previously I was using it and some ChatGPT here and there (also had a subscription in the past) and I felt like Claude added some more value.
But it's getting so unstable. It generates code, I see it doing that, and then it throws the code away and gives me the previous version of something 1:1 as a new version.
And then I have to waste CO2 telling it to please not do that, and then sometimes it generates what I want, and sometimes it just generates it again, only to throw it away immediately...
This is soooooooo annoying and the reason I canceled my subscription!
brandon272
> But it's getting so unstable. It generates code, I see it doing that, and then it throws the code away and gives me the previous version of something 1:1 as a new version.
I've had the same experience. Totally unreliable.
yazanobeidi
Have you run into the bug where claude acts as if it updated the artifact, but it didn’t? You can see the changes in real time, but then suddenly it’s all deleted character by character as if the backspace was held down, you’re left with the previous version, but claude carries on as if everything is fine. If you point it out, it will acknowledge this, try again, and… same thing. The only reliable fix I’ve seen is to ask it to generate a new artifact with that content and the updates. Talk about wasting tokens, and no refunds, no support, you’re on your own entirely. It’s unclear how they can seriously talk about releasing this feature when there are fundamental issues with their existing artifact creation and editing abilities.
mh-
Yes, just had it happen a couple nights ago with a simple one-pager I asked it to generate from some text in a project. It couldn't edit the existing artifact (I could see in the CoT that it was confused as to why the update wasn't taking), so it made a new version for every incremental edit. Which of course means there were other changes too, since it was generating from scratch each time.
paranoidrobot
Yes, this was so frustrating.
I had to keep prompting it to generate new artifacts all the time.
Thankfully that is mostly gone with Claude Code.
j45
Yes, this has been happening a lot more the past 8 weeks.
From troubleshooting Claude by reviewing its performance and digging multiple times into why it did what it did, it seems useful to make the first sentence a clearer and more complete instruction instead of breaking it up.
As models optimize resources, prompt engineering seems to become relevant again.
ACCount37
Anthropic claims that they don't degrade models under load, and the performance issues were a result of a system error:
https://status.anthropic.com/incidents/72f99lh1cj2c
That being said, they still have capacity issues on any day of the week that ends in Y. No clue how long that will take to resolve.
mh-
Not nitpicking, but they said:
> we never intentionally degrade model quality as a result of demand or other factors
Fully giving them the benefit of the doubt, I still think that allows for a scenario like "we may [switch to quantized models|tune parameters], but our internal testing showed that these interventions didn't materially affect end-user experience".
I hate to parse their words this way, because I don't know how they could have phrased it in a way that closed the door on this concern, but all the anecdata (personal and otherwise) suggests something is happening.
ACCount37
"Anecdata" is notoriously unreliable when it comes to estimating AI performance over time.
Sure, people complain about Anthropic's AI models getting worse over time. As well as OpenAI's models getting worse over time. But guess what? If you serve them open weights models, they also complain about models getting worse over time. Same exact checkpoint, same exact settings, same exact hardware.
Relative LMArena metrics, however, are fairly consistent across time.
The takeaway is that users are not reliable LLM evaluators.
My hypothesis is that users have a "learning curve", and get better at spotting LLM mistakes over time - both overall and for a specific model checkpoint. Resulting in increasingly critical evaluations over time.
SparkyMcUnicorn
"or other factors" is pretty catch-all in my opinion.
> I don't know how they could have phrased it that closed the door on this concern
Agreed. A full legal document would probably be the only way to convince everyone.
j45
Wording definitely could be clearer.
Intentionally might mean manually, or maybe the system does it on its own when it thinks it's best.
fragmede
> Last week, we opened an incident to investigate degraded quality in some Claude model responses. We found two separate issues that we’ve now resolved.
pmx
Frankly, I don't believe their claims that they don't degrade the models. I know we see models as less intelligent as we get used to them and their novelty wears off but I've had to entirely give up on Claude as a coding assistant because it seems to be incapable of following instructions anymore.
SparkyMcUnicorn
I'd believe a lot of other claims before believing model degradation was happening.
- They admittedly go off of "vibes" for system prompt updates[0]
- I've seen my coworkers making a lot of bad config and CLAUDE.md updates, MCP server spam, etc. and claiming the model got worse. After running it with a clean slate, they retracted their claims.
ncrtower
The same experience here: Claude on the Pro plan over the summer was really doing a good job. The last 4 weeks? Constant slowdowns or API errors, more hallucinating than before, and many mistakes. It appears to me that they are throttling to handle loads that they can't actually handle.
j45
The last 4 weeks have been awful. I have barely used my Max plan in comparison to the month before, and it's an active deterrent to use it because you don't know whether it's going to work or hit an unpredictable limit before you get something working.
I don't feel Claude would do this intentionally, and I'm reminded why I kept Claude for some things but not for general use.
furyofantares
Some of this has gotta be people asking more of it than they did before, and some has gotta be people who happened to use it for things it's good at to begin with and are now asking it things it's bad at (not necessarily harder things, just harder for the model).
However, there have been some bugs causing performance degradation acknowledged (and fixed) by Anthropic as well, so I would guess there's a good amount of real degradation if people are still seeing issues.
I've seen a lot of people switching to Codex CLI, and yesterday I did too; for now my $200/mo goes to OpenAI. It's quite good and I recommend it.
rapind
What makes it particularly tricky to evaluate is that there could still be other bugs, given how long these went unacknowledged, and they did state they are still looking into potential Opus issues.
I'll probably come back and try a Claude Code subscription again, but I'm good for the time being with the alternative I found. I also kind of suspect the subscription model isn't going to work for me long term; a pay-per-use approach (possibly with reserved capacity, like we have for cloud compute) where I can swap models with low friction is far more appealing.
data-ottawa
Benchmarks are too expensive for ordinary users to run, but it would be useful if providers published benchmarks run against prod over time; that would expose degradations in a more objective manner.
Of course there's always the problem of teaching to the test and out-of-test degradations, but presumably bugs would be independent of that.
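A bare-bones version of the idea is cheap to sketch, even if running it at a statistically useful scale isn't - something like a daily canary prompt against the production API, appended to a CSV. (The model id, prompt, and pass criterion below are placeholders; one prompt proves nothing on its own.)

    # Daily "canary" eval against the production API. Assumes the
    # official anthropic SDK and ANTHROPIC_API_KEY in the environment;
    # a real benchmark needs many tasks and repetitions to mean anything.
    import datetime
    import anthropic

    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=64,
        messages=[{"role": "user",
                   "content": "What is 17 * 23? Reply with only the number."}],
    )
    answer = resp.content[0].text.strip()
    print(f"{datetime.date.today()},{answer == '391'},{answer!r}")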
bobbylarrybobby
Their iOS app could use some serious love. Not only does it have no offline capabilities (you can't even read your previous chats), if you're using the app and go offline, it puts up a big “connection lost; retry” alert and won't let you interact with the app until you get internet again. That means if you're mid prompt, you're blocked from editing further, and if you're reading a response, you have to wait until you get cell service again to continue reading.
It's one thing to not cache things for offline use, but it's quite another to intentionally unload items currently in use just because the internet connection dropped!
FitchApps
Time to revisit the infamous "3 to 6 months, AI will be writing 90% of the code" statement. I wonder how the team is doing and what % of code is being written by AI at Anthropic.
https://www.businessinsider.com/anthropic-ceo-ai-90-percent-...
syntaxing
I wonder if their API model is different from the subscription model. People called me crazy for saying GitHub Copilot is better than Claude Code, but since I started using Claude Code these past 3 weeks, time and time again, Copilot + Claude Sonnet 4 is better.
typpilol
Agreed. Copilot is way better.
j45
API has always been a little different.
Might be worth trying Claude through Amazon as well.
simonw
I just published an extensive review of the new feature, which is actually Claude Code Interpreter (the official name, bafflingly, is Upgraded file creation and analysis - that's what you turn on in the features page at least).
I reverse-engineered it a bit, figured out its container specs, used it to render a PDF join diagram for a SQLite database and then re-ran a much more complex "recreate this chart from this screenshot and XLSX file" example that I previously ran against ChatGPT Code Interpreter last night.
Here's my review: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
mdaniel
> Version Control
> github.com
Pour one out for the GitLab-hosted projects, and their less popular friends hosted on Bitbucket, Codeberg, Forgejo, SourceForge, SourceHut, et al. So dumb.
spike021
For the past two to three weeks I've noticed Claude just consistently lagging or potentially even being throttled for pretty minor coding or CLI tasks. It'll basically stop showing any progress for at least a couple minutes. Sometimes exiting the query and re-trying gets it to work but other times it keeps happening. I pay for Pro so I don't think it's just API rate limiting.
Would appreciate if that could be fixed but of course new features are more interesting for them to prioritize.
lordnacho
This will either result in a lot of people being able to sleep more, or an absolute avalanche of crap is about to be released upon society.
A lot of the people I graduated with spent their 20s making powerpoint and excel. There would be people with a master's in engineering getting phone calls at 1am, with an instruction to change the fonts on slide 75, or to slightly modify some calculation. Most of the real decision making was, funnily enough, not based on these documents. But it still meant people were working 100 hour weeks.
I could see this resulting in the same work being done in a few minutes. But I could also see it resulting in the MDs asking for 10x the number of slide decks.
bobbylarrybobby
When the word processor was first invented, people didn't end up printing less because of how easy it was to edit and view documents live — they printed more because of how little friction there was (compared to a typewriter) between making a change and pressing print.
I think we're going to see the same thing with document creation. Could LLMs help make a small number of high quality documents? Yes, with some coaching and planning from the user. But instead people will use them to quickly create a crappy document, get feedback that it's crappy, and then immediately create an only slightly less crappy doc.
thatfrenchguy
10x as much useless work, guaranteed. Remind me in ten years :)
grim_io
I guess it will decrease the need for custom software. If this is reliable, Excel will be "enough" for longer.
mattnewton
The global economy has been down the rabbit hole and through the looking glass into the land of the red queen as far as I’ve known.
“Now here you see, it takes all the running you can do, to keep in the same place” as she says.
I fully believe any slack this creates will get gobbled up in competition in a few years.
devinprater
Maybe one day Claude can rewrite its interface to be more accessible to blind people like me.
crazygringo
What is inaccessible about it? It's kind of hard to discuss without any particulars.
ctoth
Curious what a11y issues you see with Claude? I use it a remarkable amount and haven't found any showstoppers. Web interface and Claude Code.
josu
>issues you see
None?
a3w
When will the blind see Reason?
I mean, Mr. Reason is standing right there!
TNDnow
upvoted my wholesome sir
visarga
Claude has no TTS, while most LLM apps have it. TTS makes the text more accessible.
bobbylarrybobby
The iOS app just gained TTS, although for some reason it doesn't use the voice-mode voice and sounds really, really bad. But it's technically there.
SAI_Peregrinus
Anthropic are looking to make money. They need to make absolutely absurd amounts of money to afford the R&D expenses they've already incurred. Features get prioritized based on how much money they might make. Unless forced to by regulation (or maybe social pressure on the executives, but that really only comes from their same class instead of the general public these days) smaller groups of customers get served last. There aren't that many blind people, so there's not very much profit incentive to serve blind people. Unless they're actually violating the ADA or another law or regulation, and can't bribe the regulators for less than the cost of fines or fixing the issue, I'd not expect any improvement.
googlryas
Their app being top of the line, because they coded their app in their app, would certainly be a nice natural endorsement of the product.
simonw
This feature is a little confusing.
It looks to me like a variant of the Code Interpreter pattern, where Claude has a (presumably sandboxed) server-side container environment in which it can run Python. When you ask it to make a spreadsheet it runs this:
pip install openpyxl pandas --break-system-packages
And then generates and runs a Python script.
What's weird is that when you enable it in https://claude.ai/settings/features it automatically disables the old Analysis tool - which used JavaScript running in your browser. For some reason you can have one of those enabled but not both.
The new feature is being described exclusively as a system for creating files though! I'm trying to figure out if that gets used for code analysis too now, in place of the analysis tool.
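For reference, the generated script is presumably something of this shape - a guess at the pattern for illustration, not the actual code Claude runs:

    # Rough shape of the script Claude appears to generate after the
    # pip install above (an illustrative guess, not Claude's actual code).
    import pandas as pd

    df = pd.DataFrame({
        "month": ["Jan", "Feb", "Mar"],
        "revenue": [1200, 1350, 1100],
    })
    df.to_excel("revenue.xlsx", index=False, engine="openpyxl")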
simonw
It works for me on the https://claude.ai web app but doesn't appear to work in the Claude iOS app.
I tried "Tell me everything you can about your shell and Python environments" and got some interesting results after it ran a bunch of commands.
Linux runsc 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 24.04.2 LTS
Python 3.12.3
/usr/bin/node is v18.19.1
Disk Space: 4.9GB total, with 4.6GB available
Memory: 9.0GB RAM
Attempts at making HTTP requests all seem to fail with a 403 error, suggesting some kind of universal proxy.
But telling it to "Run pip install sqlite-utils" worked, so apparently they have allow-listed some domains such as PyPI.
I poked around more and found these environment variables:
HTTPS_PROXY=http://21.0.0.167:15001
HTTP_PROXY=http://21.0.0.167:15001
On further poking, some of the allowed domains include github.com, pypi.org, and registry.npmjs.org - the proxy is running Envoy. Anthropic have their own self-issued certificate to intercept HTTPS.
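If you want to reproduce the probe, asking Claude to run something along these lines surfaces the same details - standard library only, with the 403-vs-200 split being how the allowlist shows up:

    import os
    import urllib.error
    import urllib.request

    # Dump the proxy-related environment variables.
    for var in ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"):
        print(var, "=", os.environ.get(var))

    # urlopen honors those env vars by default; allowlisted hosts
    # respond normally, everything else gets a 403 from the proxy.
    for url in ("https://pypi.org/", "https://example.com/"):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(url, "->", resp.status)
        except urllib.error.HTTPError as e:
            print(url, "->", e.code)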
simonw
Turns out the allowlist is fully documented here: https://support.anthropic.com/en/articles/12111783-create-an...
simonw
This is now an extensive blog post: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
amilios
Anyone else having serious reliability issues with artifact editing? I find that artifacts quite often get "stuck": the LLM tries to edit the artifact, but the state of the artifact does not change. It seems the LLM is silently failing to edit the artifact while believing it is actually making the edits. The way to resolve this is to ask Claude to make a new artifact, which then has all the changes Claude thought it was making. But you have to do this relatively often.
jononor
Yes, every 10 edits or so. Super annoying. It's limiting how often I bother using the tool.
tkgally
I have had the same problem with artifacts, and I had similar problems several months ago with Claude Desktop. I stopped using those features mostly and use Claude Code instead. I don't like CC's terminal interface, but it has been more reliable for me.
josvdwest
Anyone know if this can write scripts or any text file to your device?
wolfgangbabad
My experience is similar. At first Claude was super smart and got even very complicated things right. Now even super simple tasks are almost impossible to finish correctly, even if I really chop things into small steps. It's also much slower than a few weeks ago, even on a Pro account.
mikewarot
Not Claude specific, but related to the agent model of things...
I've been paying $10/month for GitHub Copilot, which I use via Microsoft's Visual Studio Code, and about a month ago they added GPT-5 (preview), which uses the agent model of interaction. It's a qualitative jump that I'm still learning to appreciate in full.
It seems like the worst possible thing, in terms of security, to let an LLM play with your stuff, but I really didn't understand just how much easier it could be to work with an LLM if it's an agent. Previously I'd end up with a blizzard of Python error messages and just give up on a project; now it fixes its own mess. What a relief!
hu3
Yeah in agent mode it compiles code and runs tests, if anything breaks it attempts to fix. Kinda wild to see at first.
forgotusername6
In agent mode there is a whitelist of commands in the VS Code settings that it can run without confirmation. When I went to edit that file, Copilot suggested adding "rm -rf *".
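For anyone hunting for it, the whitelist lives in settings.json as a map of commands to booleans. The exact setting key has moved between VS Code releases, so treat the one below as an assumption from around this period - and obviously don't accept Copilot's suggestion:

    {
      // Commands the Copilot agent may run without confirmation.
      // The setting key is an assumption; it has been renamed across releases.
      "github.copilot.chat.agent.terminal.allowList": {
        "npm test": true,
        "git status": true
      }
    }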
randomNumber7
This must be a mistake. It should be "rm -rf /*"
amelius
I'd probably install a snapshotting filesystem before I let it change stuff on my system (such as installing packages and such).
ffsm8
That's what devcontainers are for. You create the config and the editor effectively runs inside the Docker container. Works surprisingly well. VS Code, for example, even auto-forwards ports opened inside the container to the host.
It also makes using Linux tooling a lot easier on non-Linux hosts like Windows/macOS.
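A minimal devcontainer.json is enough to get that isolation going; the image tag and post-create command here are just examples:

    {
      "name": "agent-sandbox",
      "image": "mcr.microsoft.com/devcontainers/python:3.12",
      "forwardPorts": [8000],
      "postCreateCommand": "pip install -r requirements.txt"
    }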
divan
Oh, nice! One of my biggest issues with mainstream LLMs/apps was that working on long texts (articles, scripts, documentation, etc.) is limited to a copy-paste dance. Which is especially frustrating in comparison to the AI coding assistants that can work on code directly in the file system, using the internet and MCPs at the same time.
I just tried this new feature to work on a text document in a project, and it's a big difference. Now I really want to have this feature (for text at least) in ChatGPT to be able to work on documents through voice and without looking at the screen.
throwmeaway222
This is some kind of headline from a year ago or somethin.