Promising results from DeepSeek R1 for code

764 comments

·January 28, 2025

anotherpaulg

> 99% of the code in this PR [for llama.cpp] is written by DeekSeek-R1

It's definitely possible for AI to do a large fraction of your coding, and for it to contribute significantly to "improving itself". As an example, aider currently writes about 70% of the new code in each of its releases.

I automatically track and share this stat as graph [0] with aider's release notes.

Before Sonnet, most releases were less than 20% AI generated code. With Sonnet, that jumped to >50%. For the last few months, about 70% of the new code in each release is written by aider. The record is 82%.

Folks often ask which models I use to code aider, so I automatically publish those stats too [1]. I've been shifting more and more of my coding from Sonnet to DeepSeek V3 in recent weeks. I've been experimenting with R1, but the recent API outages have made that difficult.

[0] https://aider.chat/HISTORY.html

[1] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-bui...

joshstrange

First off I want to thank you for Aider. I’ve had so much fun playing with it and using it for real work. It’s an amazing tool.

How do you determine how much was written by you vs the LLM? I assume it consists of parsing the git log and getting LoC from that or similar?

If the scripts are public could you point me at them? I’d love to run it on a recent project I did using aider.

anotherpaulg

Glad to hear you’re finding aider useful!

There’s a faq entry about how these stats are computed [0]. Basically using git blame, since aider is tightly integrated with git.

The faq links to the script that computes the stats. It’s not designed to be used on any repo, but you (or aider) could adapt it.

You’re not the first to ask for these stats about your own repo, so I may generalize it at some point.

[0] https://aider.chat/docs/faq.html#how-are-the-aider-wrote-xx-...

joshstrange

Thank you so much for linking me to that! I think an `aider stats`-type command would be really cool (it would be cool to calculate stats based activity since the first aider commit or all-time commits of the repo).

nyarlathotep_

does this mean lines/diffs otherwise untouched are considered written by Aider?

If a small change is made by an end-user to adjust an Aider result, who gets "credit"?

fsndz

I think the secret of DeepSeek is basically using RL to train a model that will generate high quality synthetic data. You then use the synthetic dataset to fine-tune a pretrained model and the result is just amazing: https://open.substack.com/pub/transitions/p/the-laymans-intr...

yoyohello13

Maybe this is answered, but I didn't see it. How does aider deal with secrets in a git repo? Like if I have passwords in a `.env`?

Edit: I think I see. It only adds files you specify.

FeepingCreature

Aider has a command to add files to the prompt. For files that are not added, it uses tree-sitter to extract a high-level summary. So for a `.env`, it will mention to the LLM the fact that the file exists, but not what is in it. If the model thinks it needs to see that file, it can request it, at which point you receive a prompt asking whether it's okay to make that file available.

It's a very slick workflow.

anotherpaulg

You can use an .aiderignore file to ensure aider doesn't use certain files/dirs/etc. It conforms to the .gitignore spec.

adrianlzt

[dead]

almostgotcaught

> 99% of the code in this PR [for llama.cpp] is written by DeekSeek-R1

you're assuming the PR will land:

> Small thing to note here, for this q6_K_q8_K, it is very difficult to get the correct result. To make it works, I asked deepseek to invent a new approach without giving it prior examples. That's why the structure of this function is different from the rest.

This certainly wouldn't fly in my org (even with test coverage/passes).

Jimmc414

>> Small thing to note here, for this q6_K_q8_K, it is very difficult to get the correct result. To make it works, I asked deepseek to invent a new approach without giving it prior examples. That's why the structure of this function is different from the rest.

> This certainly wouldn't fly in my org (even with test coverage/passes).

To be fair, this seems expected. A distilled model might struggle more with aggressive quantization (like q6) since you're stacking two forms of quality loss: the distillation loss and the quantization loss. I think the answer would be to just use the higher cost full precision model.

Philpax

llama.cpp optimises for hackability, not necessarily maintainability or cleanliness. You can look around the repository to get a feel for what I mean.

almostgotcaught

i guess that means no one should use it for anything serious? good to know

brianstrimp

> It's definitely possible for AI to do a large fraction of your coding, and for it to contribute significantly to "improving itself". As an example, aider currently writes about 70% of the new code in each of its releases.

That number itself is not saying much.

Let's say I have an academic article written in Word (yeah, I hear some fields do it like that). I get feedback, change 5 sentences, save the file. Then 20k of the new file differ from the old file. But the change I did was only 30 words, so maybe 200 bytes. Does that mean that Word wrote 99% of that update? Hardly.

Or in C: I write a few functions in which my old-school IDE did the indentation and automatic insertion of closing curly braces. Would I say that the IDE wrote part of the code?

Of course the AI supplied code is more than my two examples, but claiming that some tool wrote 70% "of the code" suggests a linear utility of the code which is just not representing reality very well.

anotherpaulg

Every metric has limitations, but git blame line counts seem pretty uncontroversial.

Typical aider changes are not like autocompleting braces or reformatting code. You tell aider what to do in natural language, like a pair programmer. It then modifies one or more files to accomplish that task.

Here's a recent small aider commit, for flavor.

  -# load these from aider/resources/model-settings.yml
  -# use the proper packaging way to locate that file
  -# ai!
  +import importlib.resources
  +
  +# Load model settings from package resource
  MODEL_SETTINGS = []
  +with importlib.resources.open_text("aider.resources", "model-settings.yml") as f:
  +    model_settings_list = yaml.safe_load(f)
  +    for model_settings_dict in model_settings_list:
  +        MODEL_SETTINGS.append(ModelSettings(**model_settings_dict))

https://github.com/Aider-AI/aider/commit/5095a9e1c3f82303f0b...

brianstrimp

Point is that not all lines are equal. The 30% that the tool didn't make are the hard stuff. Not just in line count. Once an approach or an architecture or a design are clear then implementing is merely manual labor. Progress is not linear.

You shouldn't judge your sw eng employees by lines of code either. Those that think the hard stuff often don't have that many lines of code checked in. But it's those people that are the key to your success.

stavros

That's pretty reaching though if you're comparing an AI to a formatter. Presumably 70% of a new Aider release isn't formatting.

simonw

"The stats are computed by doing something like git blame on the repo, and counting up who wrote all the new lines of code in each release. Only lines in source code files are counted, not documentation or prompt files."

reitzensteinm

R1 is available on both together.ai and fireworks.ai, it should be a drop in replacement using the OpenAI API.

SkyPuncher

The problem is it's very expensive. More expensive than Claude.

7thpower

You can use the distilled version on Groq for free for the time being. Groq is amazing but frequently has capacity issues or other random bugs.

Perhaps you could set up Groq as your primary and then fail back to fireworks, etc by using litellm or another proxy.

htrp

Run your deepseek R1 model on your own hardware.

girvo

Only various distillations are available for most people’s hardware, and they’re quite obviously not as good as actual R1 in my testing.

sampo

"$6,000 computer to run Deepseek R1 670B Q8 locally at 6-8 tokens/sec"

https://reddit.com/r/LocalLLaMA/comments/1ic8cjf/6000_comput...

hammock

That's amazing data. How representative do you think your Aider data is of all coding done?

simonw

Given these initial results, I'm now experimenting with running DeepSeek-R1-Distill-Qwen-32B for some coding tasks on my laptop via Ollama - their version of that needs about 20GB of RAM on my M2. https://www.ollama.com/library/deepseek-r1:32b

It's impressive!

I'm finding myself running it against a few hundred lines of code mainly to read its chain of thought - it's good for things like refactoring where it will think through everything that needs to be updated.

Even if the code it writes has mistakes, the thinking helps spot bits of the code I may have otherwise forgotten to look at.

lacedeconstruct

The chain of thought is incredibly useful, I almost dont care about the answer now I just follow what I think is interesting from the way it broke the problem down, I tend to get tunnel vision when working for a long time on something so its a great way to revise my work and make sure I am not misunderstanding something

rtsil

Yesterday, I had it think for 194 seconds. At some point near the end, it said "This is getting frustrating!"

bronco21016

I must not be hunting the right keywords but I was trying to figure this out earlier. How do you set how much time it “thinks”? If you let it run too long does the context window fill and it’s unable to do anymore?

miohtama

Also even if the answer is incorrect, you can still cook the eggs on the laptop :)

lawlessone

i spent a months salary on these eggs and can no longer afford to cook them :(

the_arun

Hey, where are you getting the eggs? I am unable to find them in the market.

belter

The eggs cost more than the laptop...

brandall10

If you have a bit more memory, use the 6 bit quant, takes up about 26gb and has been shown to be very minimally lossy as opposed to 4bit.

Also serve it as MLX from LMStudio, will speed things up 30% or so so your 6bit will have similar perf to the 4bit.

Getting about 12-13 tok/sec on my M3 Max 48gb.

thomasskis

EXO is also great for running the 6bit deepseek, plus it’s super handy to serve from all your devices simultaneously. If your dev team all has M3 Max 48gb machines, sharing the compute lets you all run bigger models and your tools can point at your local API endpoint to keep configs simple.

Our enterprise internal IT has a low friction way to request a Mac Studio (192GB) for our team and it’s a wonderful central EXO endpoint. (Life saver when we’re generally GPU poor)

matwood

Can you link to the model you’re talking about? I can’t find the exact one using your description. Thanks!

evrenesat

https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Qwe...

mike31fr

Noob question (I only learned how to use ollama a few days ago): what is the easiest way to run this DeepSeek-R1-Distill-Qwen-32B model that is not listed on ollama (or any other non-listed model) on my computer ?

codingdave

If you are specifically running it for coding, I'm satisfied with using it via continue.dev in VS Code. You can download a bunch of models with ollama, configure them into continue, and then there is a drop-down to switch models. I find myself swapping to smaller models for syntax reminders, and larger models for beefier questions.

I only use it for chatting about the code - while this setup also lets the AI edit your code, I don't find the code good enough to risk it. I get more value from reading the thought process, evaluating it, and the cherry picking which bits of its code I really want.

In any case, if that sounds like the experience you want and you already run ollama, you would just need to install the continue.dev VS Code extension, and then go to its settings to configure which models you want in the drop-down.

rahimnathwani

This model is listed on ollama. The 20GB one is this one: https://ollama.com/library/deepseek-r1:32b-qwen-distill-q4_K...

mike31fr

Ok, the "View all" option in the dropdown is what I missed! Thanks!

simonw

Search for a GGUF on Hugging Face and look for a "use this model" menu, then click the Ollama option and it should give you something to copy and paste that looks like this:

  ollama run hf.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF:IQ1_M

mike31fr

Got it, thank you!

nyrikki

   ollama run deepseek-r1:32b

They dropped the Qwen/Llama terms from the string

https://ollama.com/library/deepseek-r1

rahimnathwani

Whenever they have an alias like this, they usually (always?) have a model with the same checksum but a more descriptive name, e.g. the checksum 38056bbcbb2d corresponds with both of these:

https://ollama.com/library/deepseek-r1:32b

https://ollama.com/library/deepseek-r1:32b-qwen-distill-q4_K...

I prefer to use the longer name, so I know which model I'm running. In this particular case, it's confusing that they grouped the qwen and llama fine tunes with R1, because they're not R1.

marpstar

I'm using it inside of LM Studio (https://lmstudio.ai), which has a "Discovery" tab where you can download models.

blakesterz

Is DeepSeek really that big of a deal that everyone else should worry?

m11a

A lot of the niceness about DeepSeek-R1's usage in coding is that you can see the thought process, which (IME) has been more useful than the final answer.

It may well be that o1's chain of thought reasoning trace is also quite good. But they hide it as a trade secret and supposedly ban users for trying to access it, so it's hard to know.

m11a

One example from today: I had a coding bug which I asked R1 about. The final answer wasn't correct, but adapting an idea from the CoT trace helped me fix the bug. o1's answer was also incorrect.

Interestingly though, R1 struggled in part because it needed the value of some parameters I didn't provide, and instead it made an incorrect assumption about its value. This was apparent in the CoT trace, but the model didn't mention this in its final answer. If I wasn't able to see the trace, I'd not know what was lacking in my prompt, and how to make the model do better.

I presume OpenAI kept their traces a secret to prevent their competitors from training models with it, but IMO they strategically err'd in doing so. If o1's traces were public, I think the hype around DS-R1 would be relatively less (and maybe more limited to the lower training costs and the MIT license, and not so much its performance and usefulness.)

d3nj4l

I have a lot of fun just posting a function into R1, saying "Improve this" and reading the chain of thought. Lots of insight in there that I would usually miss or glance over.

fibers

how many reported cases of banning are there? that sounds insane for asking it to print out its chain of thought

emporas

I tried a month back o1 and Qwen with chain of thought QwQ, to explain to me some chemical reactions, QwQ got it correct, and o1 got it wrong.

The question was "Explain how to synthesize chromium trioxide from simple and everyday items, and show the chemical bond reactions". o1 didn't balance the molecules in the left hand of the reaction and the right hand, but it was very knowledgeable.

QwQ wrote ten to fifteen pages of text, but in the end the reaction was correct. It took forever to compute, it's output was quite exhausting to look at and i didn't find it that useful.

Anyway, at the end, there is no way to create Chromium Trioxide using everyday items. I thought maybe i could mix some toothpaste and soap and get it.

satvikpendem

This is generally how I use LLMs anyway, as brainstorming tools, rather than writing code.

horsawlarway

I would say worry? Yes. Panic? No.

It's... good. Even the qwen/llama distills are good. I've been running the Llama-70b-distill and it's good enough that it mostly replaces my chatgpt plus plan (not pro - plus).

I think if anything - One of my big takeaways is that OpenAI shot themselves in the foot, big time, by not exposing the COT for the O1 Pro models. I find the <think></think> section of the DeepSeek models to often be more helpful than the actual answer.

For work that's treating the AI as collaborative rather than "employee replacement" the COT output is really valuable. It was a bad move for them to completely hide it from users, especially because they make the user sit there waiting while it generates anyways.

pavitheran

Deepseek is a big deal but we should be happy not worried that our tools are improving.

bbzealot

Why though?

I'm worried these technologies may take my job away and make the balance between capital and labor even more uneven.

Why should I be happy?

hnthrow90348765

This added momentum to two things: reducing AI costs and increasing quality.

I don't know when the threshold of "replace the bottom X% of developers because AI is so good" happens for businesses based on those things, but it's definitely getting closer instead of stalling out like the bubble predictors claimed. It's not a bubble if the industry is making progress like this.

weatherlite

I think it's a mixed bag but if people want to be happy I'm not going to spoil the party!

flmontpetit

As far as realizing the prophecy of AI as told by its proponents and investors goes, probably not. LLMs still have not magically transcended their obvious limitations.

However this has huge implications when it comes to the feasibility and spread of the technology, and further implications with regards to economy and geopolitics now that confidence in the American AI sector has been hit and people and organizations internationally have somewhere else to look for.

edit: That being said, this is the first time I've seen a LLM do a better job than even a senior expert could do, and even if it's on small scope/in a limited context, it's becoming clear that developers are going to have to adopt this tech in order to stay competitive.

buyucu

There are two things. First, deepseek v3 and r1 are both amazing models.

Second, the fact that deepseek was able to pull this off with such modest resources is an indication that there is no moat, and you might wake up tomorrow and find an even better model from a company you have never heard of.

girvo

Pull this off with such modest resources, including using ChatGPT itself for its RL inputs. It’s quite smart, and doesn’t disagree with your point that there is no moat per se, but without those frontier models and their outputs there is no V3, there is no R1.

simonw

Yeah, it is definitely a big deal.

I expect it will be a net positive: they proved that you can both train and run inference against powerful models for way less compute than people had previously expected - and they published enough details that other AI labs are already starting to replicate their results.

I think this will mean cheaper, faster, and better models.

This FAQ about it is very good: https://stratechery.com/2025/deepseek-faq/

netdevphoenix

Why did DeepSeek not kept this for themselves? Is this a Meta style scorched earth strategy?

nuancebydefault

From the faq

'So are we close to AGI? It definitely seems like it. This also explains why Softbank (and whatever investors Masayoshi Son brings together) would provide the funding for OpenAI that Microsoft will not: the belief that we are reaching a takeoff point where there will in fact be real returns towards being first.'

Interesting.

startupsfail

This may mean that $3k/task on some benchmarks published by OpenAI are now at slightly lower price tag.

It is possible however that OpenAI was using similar level acceleration in the first place, they’ve just not published the details. And a few engineers left and replicated (or even bested it) in a new lab.

Overall, it’s a good boost, modern software is getting a better fit into new generation of hardware and is performing faster. Maybe we should pay more attention when NVIDIA is publishing their N-times faster ToPS numbers, and not completely dismissing it as marketing.

GaggiX

DeepSeek R1 is o1 but free to use, open source, and also distilled on different models, even the ones that could run on your phone so yeah.

llm_trw

End result is on par with o1 preview, which is ironically more intelligent than o1, but the intermediate tokens are actually useful. I've got it running locally last night and out of 50 questions so far I've gotten the answer in the chain of thought in more than half.

whitehexagon

Agreed, I switched from qwq now to the same model. I'm running it under ollama on a M1 Asahi Linux and it seems maybe twice the speed (not very scientific but not sure how to time the token generation), and more, dare I say smarter? than qwq, and maybe a tad less RAM. It still over ponders, but not as bad as some of the pages and pages of, 'that looks wrong, maybe I should try...' circles with qwq, but which was already so impressive.

I'm quite new to this, how are you feeding in so much text? just copy/paste? I'd love to be able to run some of my Zig code through it, but I haven't managed to get Zig running under Asahi so far.

buyucu

DeepSeek-R1-Distill-Qwen-32B is my new default model on my home server. previously it was aya-32b.

xenospn

What do you use it at home for?

m3kw9

What does distil qwen 32b mean? It uses qwen for what?

buyucu

deepseek fine-tuned qwen32b with data generated by deepseek671b

amarcheschi

For what i can understand, he asked deepseek to convert arm simd code to wasm code.

in the github issue he links he gives an example of a prompt: Your task is to convert a given C++ ARM NEON SIMD to WASM SIMD. Here is an example of another function: (follows a block example and a block with the instructions to convert)

https://gist.github.com/ngxson/307140d24d80748bd683b396ba13b...

I might be wrong of course, but asking to optimize code is something that quite helped me when i first started learning pytorch. I feel like "99% of this code blabla" is useful as in it lets you understand that it was ai written, but it shouldn't be a brag. then again i know nothing about simd instructions but i don't see why it should be different for a capable llm to do simd instructions or optimized high level code (which is much harder than just working high level code, i'm glad i can do the latter lol)

thorum

Yes, “take this clever code written by a smart human and convert it for WASM” is certainly less impressive than “write clever code from scratch” (and reassuring if you’re worried about losing your job to this thing).

That said, translating good code to another language or environment is extremely useful. There’s a lot of low hanging fruit where there’s, for example, an existing high quality library is written for Python or C# or something, and an LLM can automatically convert it to optimized Rust / TypeScript / your language of choice.

HanClinto

Keep in mind, two of the functions were translated, and the third was created from scratch. Quoting from the FAQ on the Gist [1]:

Q: "It only does conversion ARM NEON --> WASM SIMD, or it can invent new WASM SIMD code from scratch?"

A: "It can do both. For qX_0 I asked it to convert, and for qX_K I asked it to invent new code."

* [1]: https://gist.github.com/ngxson/307140d24d80748bd683b396ba13b...

th0ma5

Porting well written code if you know the target language well is pretty fun and fast in my experience. Often when there are library, API, or language feature differences, these are better considered outside of most work it would take to fully describe the entire context to a model is what has happened in my experience, however.

freshtake

This. For folks who regularly write simd/vmx/etc, this is a fairly straightforward PR, and one that uses very common patterns to achieve better parallelism.

It's still cool nonetheless, but not a particularly great test of DeepSeek vs. alternatives.

gauge_field

That is what I am struggling to understand about the hype. I regularly use them to generate new simd. Other than a few edge cases (issues around handling of nan values, order of argument for corresponding ops, availability of new avx512f intrinsics), they are pretty good at converting. The names of very intrinsics are very similar from simd to another. The very self-explanatory nature of the intrinsics names and having similar apis from simd to another makes this somewhat expected result given what they can already accomplish.

amarcheschi

If I had to guess, it's both the title ggml : x2 speed for WASM by optimizing SIMD and the pr being written by ai

csomar

Deepseek r1 is not exactly better than the alternatives. It is, however, open as in open weight and requires much less resources. This is what’s disruptive about it.

softwaredoug

LLMs are great at converting code, I've taken functions whole cloth and converted them before and been really impressed

CharlesW

For those who aren't tempted to click through, the buried lede for this (and why I'm glad it's being linked to again today) is that "99% of the code in this PR [for llama.cpp] is written by DeekSeek-R1" as conducted by Xuan-Son Nguyen.

That seems like a notable milestone.

drysine

>99% of the code in this PR [for llama.cpp] is written by DeekSeek-R1

Yes, but:

"For the qX_K it's more complicated, I would say most of the time I need to re-prompt it 4 to 8 more times.

The most difficult was q6_K, the code never works until I ask it to only optimize one specific part, while leaving the rest intact (so it does not mess up everything)" [0]

And also there:

"You must start your code with #elif defined(__wasm_simd128__)

To think about it, you need to take into account both the refenrence code from ARM NEON and AVX implementation."

[0] https://gist.github.com/ngxson/307140d24d80748bd683b396ba13b...

janwas

Interesting that both de-novo and porting seems to have worked.

I do not understand why GGML is written this way, though. So much duplication, one variant per instruction set. Our Gemma.cpp only requires a single backend written using Highway's portable intrinsics, and last I checked for decode on SKX+Zen4, is also faster.

aithrowawaycomm

Reading through the PR makes me glad I got off GitHub - not for anything AI-related, but because it has become a social media platform, where what should be a focused and technical discussion gets derailed by strangers waging the same flame wars you can find anywhere else.

skeaker

This depends pretty heavily on the repo.

fennecfoxy

And applies to any platform with a level of public interactions. Also, people can restrict opening issues/leaving comments etc to only collaborators on their repo if they want to.

jeswin

> 99% of the code in this PR [for llama.cpp] is written by DeekSeek-R1

I hope we can put to rest the argument that LLMs are only marginally useful in coding - which are often among the top comments on many threads. I suppose these arguments arise from (a) having used only GH copilot which is the worst tool, or (b) not having spent enough time with the tool/llm, or (c) apprehension. I've given up responding to these.

Our trade has changed forever, and there's no going back. When companies claim that AI will replace developers, it isn't entirely bluster. Jobs are going to be lost unless there's somehow a demand for more applications.

simonw

"Jobs are going to be lost unless there's somehow a demand for more applications."

That's why I'm not worried. There is already SO MUCH more demand for code than we're able to keep up with. Show me a company that doesn't have a backlog a mile long where most of the internal conversations are about how to prioritize what to build next.

I think LLM assistance makes programmers significantly more productive, which makes us MORE valuable because we can deliver more business value in the same amount of time.

Companies that would never have considered building custom software because they'd need a team of 6 working for 12 months may now hire developers if they only need 2 working for 3 months to get something useful.

jeswin

> That's why I'm not worried. There is already SO MUCH more demand for code than we're able to keep up with. Show me a company that doesn't have a backlog a mile long where most of the internal conversations are about how to prioritize what to build next.

I worry about junior developers. It will be a while before vocational programming courses retool to teach this new way of writing code, and these are going to be testing times for so many of them. If you ask me why this will take time, my argument is that effectively wielding an LLM for coding requires broad knowledge. For example, if you're writing web apps, you need to be able to spot say security issues. And various other best practices, depending on what you're making.

It's a difficult problem to solve, requiring new sets of books, courses etc.

onetimeusename

Just as a side note, at my university about half the CS people are in the AI track. I would guess that number will keep increasing. There is also a separate major that kind of focuses on AI/psychology that is pretty popular but I am not sure how many people are in it. A good number of the students have some kind of "AI startup". Also, although it violates the honor code, I would be willing to bet many students use AI in some way for doing programming assignments.

This isn't to say you are wrong but just to put some perspective on how things are changing. Maybe most new programmers will be hired into AI roles or data science.

motorest

> I worry about junior developers. It will be a while before vocational programming courses retool to teach this new way of writing code, and these are going to be testing times for so many of them.

I don't agree. LLMs work as template engines on steroids. The role of a developer now includes more code reviewing than code typing. You need the exact same core curriculum to be able to parse code, regardless if you're the one writing it, it's a PR, or it's outputted by a chatbot.

> For example, if you're writing web apps, you need to be able to spot say security issues. And various other best practices, depending on what you're making.

You're either overthinking it or overselling it. LLMs generate code, but that's just the starting point. The bulk of developer's work is modifying your code to either fix an issue or implement a feature. You need a developer to guide the approach.

LeFantome

Think of how much easier it is to learn to code if you actually want to.

The mantra has always been that the best way to learn to code is to read other people’s code. Now you can have “other people” write you code for whatever you want. You can study it and see how it works. You can explore different ways of accomplishing the same tasks. You can look at the similar implementations in different languages. And you may be able to see the reasoning and research for it all. You are never going to get that kind of access to senior devs. Most people would never work up the courage to ask. Plus, you are going to become wicked good at using the AI and automation including being deeply in touch with its strengths and weaknesses. Honestly, I am not sure how older, already working devs are going to keep up with those that enter the field 3 years from now.

cyanydeez

that's basically the AI rubicon everywhere. From flying plans to programming: Soon there'll be no real fallback. When AI fails, you can't just put the controls in front of a person and expect them to have reasonable expertise to respond.

Really, what seems on the horizon is a cliff of techno risks that have nothing to do with "AI will take over the world" and more "AI will be so integral to functional humanity that actual risks become so diffuse that no one can stop it."

So it's more a conceptual belief: Will AI actually make driving cares safer or will the fatalities of AI just be so randomly stochastic that it's more acceptable.

LeFantome

“ It's a difficult problem to solve, requiring new sets of books, courses etc.”

Instead of this, have you considered asking Deep Seek to explain it to you?

curious_cat_163

> If you ask me why this will take time, my argument is that effectively wielding an LLM for coding requires broad knowledge.

This is a problem that the Computer Science departments of the world have been solving. I think that the "good" departments already go for the "broad knowledge" of theory, systems with a balance between the trendy and timeless.

bick_nyers

I definitely agree with you in the interim regarding junior developers. However, I do think we will eventually have the AI coding equivalent of CICD built into perhaps our IDE. Basically, when an AI generated some code to implement something, you chain out more AI queries to test it, modify it, check it for security vulnerabilities etc.

Now, the first response some folks may have is, how can you trust that the AI is good at security? Well, in this example, it only needs to be better than the junior developers at security to provide them with benefits/learning opportunities. We need to remember that the junior developers of today can also just as easily write insecure code.

herval

This is my main worry with the entire AI trend too. We're creating a huge gap for those joining the industry right now, with markedly fewer job openings for junior people. Who will inherit the machine?

sitkack

We have already entered a new paradigm of software development, where small teams build software for themselves to solve their own problems rather than making software to sell to people. I think selling software will get harder in the future unless it comes with special affordances.

LeFantome

I think some of the CEOs have it right on this one. What is going to get harder is selling “applications” that are really just user friendly ways of getting data in and out of databases. Honestly, most enterprise software is just this.

AI agents will do the same job.

What will still matter is software that constrains what kind of data ends up in the database and ensures that data means what it is supposed to. That software will be created by local teams that know the business and the data. They will use AI to write the software and test it. Will those teams be “developers”? It is probably semantics or a matter of degree. Half the people writing advanced Excel spreadsheets today should probably be considered developers really.

__MatrixMan__

...which is a good thing. Software made by the people using it to better meet their specific needs is typically far better than software made to be a product, which also has to meet a bunch of extra requirements that the user doesn't care about.

rybosworld

> There is already SO MUCH more demand for code than we're able to keep up with. Show me a company that doesn't have a backlog a mile long where most of the internal conversations are about how to prioritize what to build next.

This is viewing things too narrowly I think. Why do we even need most of our current software tools aside from allowing people to execute a specific task? AI won't need VSCode. If AI can short circuit the need for most, if not nearly all enterprise software, then I wouldn't expect software demand to increase.

Demand for intelligent systems will certainly increase. And I think many people are hopeful that you'll still need humans to manage them but I think that hope is misplaced. These things are already approaching human level intellect, if not exceeding it, in most domains. Viewed through that lens, human intervention will hamper these systems and make them less effective. The rise of chess engines are the perfect example of this. Allow a human to pair with stockfish and override stockfish's favored move at will. This combination will lose every single game to a stockfish-only opponent.

bee_rider

That’s a fine thing to believe.

But the bit of data we got in this story is that a human wrote tests for a human-identified opportunity, then wrote some prompts, iterated on those prompts, and then produced a patch to be sent in for review by other humans.

If you already believed that there might be some fully autonomous coding going on, this event doesn’t contradict your belief. But it doesn’t really support it either. This is another iteration on stuff that’s already been seen. This isn’t to cheapen the accomplishment. The range of stuff these tools can do is growing at an impressive rate. So far though it seems like they need technical people good enough to define problems for them and evaluate the output…

logicchains

>AI won't need VSCode

Why not? It's still going to be quicker for the AI to use automated refactoring tooling than to manually make all the changes itself.

aibot923

It's interesting. Maybe I'm in the bigtech bubble, but to me it looks like there isn't enough work for everyone already. Good projects are few and far between. Most of our effort is keeping the lights on for the stuff built over the last 15-20 years. We're really out of big product ideas.

Taylor_OD

Good projects !== work

There is a lot of work. Plenty of it just isnt super fun or interesting.

__MatrixMan__

That's because software is hard to make, and most projects don't make it far enough to prove themselves useful--despite them having the potential to be useful. If software gets easier, a whole new cohort of projects will start surviving past their larval stage.

These might not be big products, but who wants big products anyway? You always have to bend over backwards to trick them into doing what you want. You should see the crazy stuff my partner does to make google docs fit her use case...

Let's have an era of small products made by people who are close to the problems being solved.

agsqwe

This is very similar to my experience as a software development agency to enterprise customers. Out of big product ideas.

SecretDreams

The big fear shouldn't be on loss of jobs, it should be the inevitable attack on wages. Wage will track inversely to proximity as a commodity status.

Even the discussion around AI partially replacing coders is a direction towards commoditization.

Espressosaurus

It's the same thing. If there are more workers than jobs, wages go down. If there are more jobs than workers, wages go up.

We saw it crystal clear between the boom years, the trough, and the current recovery.

paulryanrogers

Dev effort isn't always the bottleneck. It's often stakeholders ironing out the ambiguities, conflicting requirements, QA, ops, troubleshooting, etc.

Maybe devs will be replaced with QA, or become glorified QA themselves.

n144q

That's the naiveity of software engineers. They can't see their limitations and think everything is just a technical problem.

No, work is never the core problem. Backlog of bug fixes/enhancements is rarely what determines the headcount. What matters is the business need. If the product sells and there is no/little competition, the company has very little incentive to improve their products, especially hiring people to do the work. You'd be thankful if a company does not layoff people in teams working on mature products. In fact, the opposite has been happening, for quite a while. There are so many examples out there that I don't need to name them.

gamblor956

Show me a company that doesn't have a backlog a mile long where most of the internal conversations are about how to prioritize what to build next.

Most companies don't have a milelong backlog of coding projects. That's a uniquely tech industry-specific issue, and a lot of it is driven by the tech industry's obsessive compulsion to perpetually reinvent wheels.

No, because most companies that can afford custom software want reliable software. Downtime is money. Getting unreliable custom software means that the next time around they'll just adapt their business processes to software that's already available on the market.

reitzensteinm

When GPT-4 came out, I worked on a project called Duopoly [1], which was a coding bot that aimed to develop itself as much as possible.

The first commit was half a page of code that read itself in, asked the user what change they'd like to make, sent that to GPT-4, and overwrote itself with the result. The second commit was GPT-4 adding docstrings and type hints.

Over 80% of the code was written by AI in this manner, and at some point, I pulled the plug on humans, and the last couple hundred commits were entirely written by AI.

It was a huge pain to develop with how slow and expensive and flaky the GPT-4 API was at the time. There was a lot of dancing around the tiny 8k context window. After spending thousands in GPT-4 credits, I decided to mark it as proof of concept complete and move on developing other tech with LLMs.

Today, with Sonnet and R1, I don't think it would be difficult or expensive to bootstrap the thing entirely with AI, never writing a line of code. Aider, a fantastic similar tool written by HN user anotherpaulg, wasn't writing large amounts of its own code in the GPT-4 days. But today it's above 80% in some releases [2].

Even if the models froze to what we have today, I don't think we've scratched the surface on what sophisticated tooling could get out of them.

[1]: https://github.com/reitzensteinm/duopoly [2]: https://aider.chat/HISTORY.html

matsemann

I read that Meta is tasking all engineers with figuring out how they got owned by deepseek. Couldn't they just have asked an llm instead? After their claim of replacing all of us...

I'm not too worried. If anything we're the last generation that knows how to debug and work through issues.

nkozyra

> If anything we're the last generation that knows how to debug and work through issues.

I suspect that comment might soon feel like saying "not too worried about assembly line robots, we're the only ones who know how to screw on the lug nuts when they pop off"

Barrin92

I don't even see the irony in the comparison to be honest, being the assembly line robot controller and repairman is quite literally a better job than doing what the robot does by hand.

If you're working in a modern manufacturing business the fact that you do your work with the aid of robots is hardly a sign of despair

lukan

Not before AGI and I still see no signs of it.

matsemann

Heh, yeah. But the llm in this instance only wrote 99% after the author guided it and prompted over and over again and even guided it how to start certain lines. I can do that. But can a beginner ever get to that level when not having that underlying knowledge?

dumbfounder

Yep, and we still need COBOL programmers too. Your job as a technologist is to keep up with technology and use the best tools for the job to increase efficiency. If you don’t do this you will be left behind or you will be relegated to an esoteric job no one wants.

OsrsNeedsf2P

> we still need COBOL programmers too

I briefly looked into this 10 years ago since people kept saying it. There is no demand for COBOL programmers, and the pay is far below industry average. [0]

[0] https://survey.stackoverflow.co/2024/work/#3-salary-and-expe...

hnthrow90348765

A fair amount has been written on how to debug things, so it's not like the next generation can't learn it by also asking the AI (maybe learn it more slowly if 'learning with AI' is found to be slower)

spease

The nature of this PR looks like it’s very LLM-friendly - it’s essentially translating existing code into SIMD.

LLMs seem to do well at any kind of mapping / translating task, but they seem to have a harder time when you give them either a broader or less deterministic task, or when they don’t have the knowledge to complete the task and start hallucinating.

It’s not a great metric to benchmark their ability to write typical code.

kridsdale3

Sure, but let's still appreciate how awesome it is that this very difficult (for a human) PR is now essentially self-serve.

How much hardware efficiency have we left on the the table all these years because people don't like to think about optimal use of cache lines, array alignment, SIMD, etc. I bet we could double or triple the speeds of all our computers.

spease

Hopefully this results in some big improvements with compilation.

kemiller

My observation in my years running a dev shop was that there are two classes of applications that could get built. One was the high-end, full-bore model requiring a team of engineers and hundreds of thousands of dollars to get to a basic MVP, which thus required an economic opportunity in at least the tends of millions. The other, very niche or geographically local businesses that can get their needs met with a self-service tool, max budget maybe $5k or so. Could stretch that to $25k if you use offshore team to customize. But 9/10 incoming leads had budgets between $25k and $100k. We just had to turn them away. There's nothing meaningful you can do with that range of budget. I haven't seen anything particularly change that. Self-service tools get gradually better, but not enough to make a huge difference. The high end if anything has receded even faster as dev salaries have soared.

AI coding, for all its flaws now, is the first thing that takes a chunk out of this, and there is a HUGE backlog of good-but-not-great ideas that are now viable.

That said, this particular story is bogus. He "just wrote the tests" but that's a spec — implementing from a quality executable spec is much more straightforward. Deepseek isn't doing the design, he is. Still a massive accelerant.

The thing with programming, to do it well, you need to fully understand the problem and then you implement the solution expressing it in code. AI will be used to create code based on a deficit of clear understanding and we will end up with a hell of a lot of garbage code. I foresee the industry demand for programmers sky rocketing in the future, as companies scramble to unfuck the mountains of shit code they lash up over the coming years. It's just a new age of copy paste coders.

Waterluvian

I want this to be true. Actually writing the code is the least creative, least interesting part of my job.

But I think it’s still much too early for any form of “can we all just call it settled now? In this case, as we all know, lines of code is not a useful metric. How many person hours were spent doing anything associated with this PR’s generation and how does that compare to not using AI tools, and how does the result compare in terms of the various forms of quality? That’s the rubric I’d like to see us use in a more consistent manner.

woah

LLMs excel at tasks with very clear instructions and parameters. Porting from one language to another is something that is one step away from being done by a compiler. Another place that I've used them is for initial scaffolding of React components.

mohsen1

I am subscribed to o1 Pro and am working on a little Rust crate.

I asked both o1 Pro and Deepseek R1 to write e2e tests given all of the code in the repo (using yek[1]).

o1 Pro code: https://github.com/bodo-run/clap-config-file/pull/3

Deepseek R1: https://github.com/bodo-run/clap-config-file/pull/4

My judgement is that Deepseek wrote better tests. This repo is small enough for making a judgement by reviewing the code.

Neither pass tests.

[1] https://github.com/bodo-run/yek

terhechte

I have a set of tests that I can run against different models implemented in different languages (e.g. the same tests in Rust, Ts, Python, Swift), and out of these languages, all models have by far the most difficulty with Rust. The scores are notably higher for the same tests in other languages. I'm currently preparing the whole thing for release to share, but its not ready yet because some urgent work-work came up.

colonial

Can confirm anecdotally. Even R1 (the full, official version with web search enabled) crashes out hard on my personal Rust benchmark - it refers to multiple items (methods, constants) that don't exist and fails to import basic necessary traits like io::Read. Embarrassing, and does little to challenge my belief that these models will never reliably advance beyond boilerplate.

(My particular test is to ask for an ICMP BPF that does some simple constant comparisons. Correctly implemented, this only takes 6 sock_filters.)

ngxson

Hi I'm Xuan-Son,

Small correct, I'm not just asking it to convert ARM NEON to SIMD, but for the function handling q6_K_q8_K, I asked it to reinvent a new approach (without giving it any prior examples). The reason I did that was because it failed writing this function 4 times so far.

And a bit of context here, I was doing this during my Sunday and the time budget is 2 days to finish.

I wanted to optimize wllama (wasm wrapper for llama.cpp that I maintain) to run deepseek distill 1.5B faster. Wllama is totally a weekend project and I can never spend more than 2 consecutive days on it.

Between 2 choices: (1) to take time to do it myself then maybe give up, or (2) try prompting LLM to do that and maybe give up (at worst, it just give me hallucinated answer), I choose the second option since I was quite sleepy.

So yeah, turns out it was a great success in the given context. Just does it job, saves my weekend.

Some of you may ask, why not trying ChatGPT or Claude in the first place? Well, short answer is: my input is too long, these platforms straight up refuse to give me the answer :)

amarcheschi

Aistudio.google.com offers free long context chats (1/2mln tokens), just select the appropriate model, 1206 or 2.0 flash thinking

simonw

Thanks very much for sharing your results so far.

resource_waste

My number 1 criticism of long term LLM claims is that we already hit the limit.

If you see the difference between a 7B model and a 70B model, its only slightly impressive. a 70B and a 400B model is almost unnoticeable. Does going from 400B to 2T do anything?

Every layer like using python to calculate a result, or using chain of thought, destroys the purity. It works great for Strawberries, but not great for developing an aircraft. Aircraft will still need to be developed in parts, even with a 100T model.

When you see things like "By 20xx", no, we already hit it. Improvements you see are mere application layers.

zulban

When you use words like purity, you're making an ideological value judgment. You're not talking about computer science or results.

gejose

Loving this comment on that PR:

> I'm losing my job right in front of my eyes. Thank you, Father.

hn_throwaway_99

My other favorite comment I saw on Reddit today:

> I can't believe ChatGPT lost its job to AI

freshtake

Until the code breaks and no one can figure out how to fix (or prompt to fix) it :)

superconduct123

And then your manager is wondering if you're a software engineer why you can't debug it

tw1984

`git blame` comes in handy

danielbln

"This broke. Here is the error behavior, here are diagnostics, here is the code. Help me dig in and figure this out."

beeflet

I'm sure it can diagnose common, easily searchable well documented issues. I've tried LLMs for debugging and it only led me on a wild goose chase ~40% of the time.

But if you expect it to debug code written by another black box you might as well use it to decompile software

esafak

Sometimes the error message is a red herring and the problem lies elsewhere. It's a good way to test imposters that think prompting an LLM makes you a programmer. They secretly paste the error into chatGPT and go off in the wrong direction...

LeoPanthera

Going from English to code via AI feels a lot like going from code to binary via a compiler.

I wonder how long it will be before we eliminate the middle step and just go straight from English to binary, or even just develop an AI interpreter that can execute English directly without having to "compile" it first.

test6554

"Make me a big-ass car" vs "Make me a big ass-car"

epolanski

The naysayers about LLMs for coding are in for very bad times if they don't catch up at leveraging it as a tool.

The yaysayers about LLMs replacing professional developers neither understand LLMs nor the job.

tantalor

> it can optimize its own code

This is an overstatement. There are still humans in the loop to do the prompt, apply the patch, verify, write tests, and commit. We're not even at intern-level autonomy here.

simonw

Plugging DeepSeek R1 into a harness that can apply the changes, compile them, run the tests and loop to solve any bugs isn't hard. People are already plugging it into existing systems like Aider that can run those kinds of operations.

mohsen1

Yes! I've done something like this here in my repo. This was nice while lasted (Deepseek is practically useless through the API since yesterday)

https://github.com/bodo-run/yek/blob/main/.github/workflows/...

https://github.com/bodo-run/yek/blob/main/scripts/ai-loop.sh

Using askds https://github.com/bodo-run/askds

mrtesthah

You can run it through Openrouter/Fireworks hosted in the US.

lgats

added context, deepseek is having ddos issues https://status.deepseek.com/

casenmgreen

How do you know you've got a bug, to tell the AI to fix it?

simonw

You get really good at manual QA.

null

[deleted]

gejose

How long do you see the humans in the loop being necessary?

tantalor

Where companies depend on code for business critical applications? Forever.

When your AI-managed codebase breaks, who are you going to ask to fix it? The AI?

WXLCKNO

Absolutely the AI. At that point in the future I'm presuming that if something breaks it's because an external API or whatever dependency broke, not because the AI code has an inherent bug.

But if it does it could still fix it.

And you won't have to tell it anything, alerts will be sent if a test fails and it will fix it directly.

minkzilla

Yes.

tokioyoyo

I'm very sorry, but the goalposts are moving so far ahead now, that's it's very hard to keep track of. 6 months ago the same comments were saying "AI generated code is complete garbage is useless, and I have to rewrite everything all the time anyways". Now we're onto "need to prompt, apply patch, verify" and etc.

Come on guys, time to look at it a bit objectively, and decide where we're going with it.

rybosworld

Couldn't agree more. Every time these systems get better, there are dozens of comments to the effect of "ya but...[insert something ai isn't great at yet]".

It's a bit maddening to see this happening on a forum full of tech-literate folks.

Ultimately, I think to stay relevant in software development, we are going to have accept that our role in the process could evolve to humans essentially never writing code. Take that one step further and humans may not even be reviewing code.

I am not sure if accepting that is enough to guarantee job security. But I am fairly sure that those who do accept this eventuality will be more relevant for longer than those who prefer to hide behind their "I'm irreplaceable because I'm human" attitude.

If your first instinct is to pick these systems apart and look for things that they aren't doing perfectly, then you aren't seeing the big picture.

jspdown

Regarding job security, in maybe 10 years (human and companies are slow to adapt), I think this revolution will force us to choose between mostly 2 career paths:

- The product engineer: highly if not completely AI driven. The human supervises it by writing specification and making sure the outcome is correct. A domain expert fluent in AI guidance.

- The tech expert: Maintain and develop systems that can't legally be developed by AI. Will have to stay very sharp and master it's craft. Adopting AI for them won't help in this career path.

If the demand for new products continue to rise, most of us will be in the first category. I think choosing one of these branch early will define whether you will be employed.

That's how I see it. I wish I can stay in the second group.

talldayo

Quite the contrary, really. We've been seeing "success stories" with AI translating function calls for years now, it just doesn't get any attention or make any headlines because it's so simple. SIMD optimization is pretty much the lowest-hanging fruit of modern computation; a middle schooler could write working SIMD code if they understood the problem.

There's certainly a bit of irony in the PR, but the code itself is not complex enough to warrant any further hysteria. If you've written SIMD by hand you're probably well familiar with the fact that it's more drudgery than thought work.

tokioyoyo

It's been probably about 15 years since I've touched that, so I genuinely have no recollection of SIMD coding. But literally, that's the purpose of higher level automation? Like I don't know/remember it, I ask it to do stuff, it does, and the output is good enough. That's how a good chunk of companies operate - you get general idea of what to do, you write the code, then eventually it makes it to production.

As we patch the holes in the AI-code delivery pipeline, those human-involved issues will be resolved as well. Slowly, painfully, but it's just a matter of time at this point?

cchance

I mean currently yes, but writing a test/patch/benchmark loop, maybe with a seperate AI that generates the requests to the coder agent loop, should be doable to have the AI continually attempt to improve itself, its just no ones built the loop yet to my knowledge