The coming knowledge-work supply-chain crisis

roughly

TFA is right to point out the bottleneck problem for reviewing content - there are a couple of things that compound to make this worse than it should be:

The first is that the LLM outputs are not consistently good or bad - the LLM can put out 9 good MRs before the 10th one has some critical bug or architecture mistake. This means you need to be hypervigilant of everything the LLM produces, and you need to review everything with the kind of care with which you review intern contributions.

The second is that the LLMs don’t learn once they’re done training, which means I could spend the rest of my life tutoring Claude and it’ll still make the exact same mistakes, which means I’ll never get a return for that time and hypervigilance like I would with an actual junior engineer.

That problem leads to the final problem, which is that you need a senior engineer to vet the LLM’s code, but you don’t get to be a senior engineer without being the kind of junior engineer that the LLMs are replacing - there’s no way up that ladder except to climb it yourself.

All of this may change in the next few years or the next iteration, but the systems as they are today are a tantalizing glimpse at an interesting future, not the actual present you can build on.

ryandrake

> The first is that the LLM outputs are not consistently good or bad - the LLM can put out 9 good MRs before the 10th one has some critical bug or architecture mistake. This means you need to be hypervigilant of everything the LLM produces

This, to me, is the critical and fatal flaw that prevents me from using or even being excited about LLMs: That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.

Traditional computer systems whose outputs relied on probability solved this by including a confidence value next to any output. Do any LLMs do this? If not, why can't they? If they could, then the user would just need to pick a threshold that suits their peace of mind and review any outputs that came back below that threshold.
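
A minimal sketch of the kind of thresholding being described, assuming a hypothetical model that reports a confidence score with each output (the `Prediction` type, its `confidence` field, and the 0.9 threshold are all illustrative, not any particular product's API):

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """Hypothetical model output with an attached confidence score (0.0-1.0)."""
    label: str
    confidence: float

def route(prediction: Prediction, threshold: float = 0.9) -> str:
    """Auto-accept outputs above the threshold; flag the rest for manual review."""
    return "auto-accept" if prediction.confidence >= threshold else "needs-human-review"

# Made-up predictions to show the routing decision:
for p in [Prediction("spam", 0.97), Prediction("not-spam", 0.62)]:
    print(p.label, "->", route(p))
```

The catch, as the replies below point out, is what such a score would actually mean for an LLM.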

bee_rider

What would those probabilities mean in the context of these modern LLMs? They are basically “try to continue the phrase like a human would” bots. I imagine the question of “how good of an approximation is this to something a human might write” could possibly be answerable. But humans often write things which are false.

The entire universe of information consists of human writing, as far as the training process is concerned. Fictional stories and historical documents are equally “true” in that sense, right?

Hmm, maybe somehow one could score outputs based on whether another, contradictory output could be written? But it will have to be a little clever. Maybe somehow rank them by how specific they are? Like, a pair of reasonable contradictory sentences that can be written about the history-book setting indicates some controversy. A pair of contradictory sentences, one about the history book and one about Narnia, each equally real to the training set - the fact that they contradict one another is not so interesting.

sepositus

> But humans often write things which are false.

Not to mention, humans say things that make sense for humans to say and not a machine. For example, one recent case I saw was where the LLM hallucinated having a Macbook available that it was using to answer a question. In the context of a human, it was a totally viable response, but was total nonsense coming from an LLM.

MyOutfitIsVague

That's not a "fatal" flaw. It just means you have to manually review every output. It can still save you time and still be useful. It's just that vibe coding is stupid for anything that might ever touch production.

wjholden

The confidence value is a good idea. I just saw a tech demo from F5 that estimated the probability that a prompt might be malicious. The administrator parameterized the tool with a probability, and the logs capture that probability. It could be useful for future generative AI products to include metadata about uncertainty in their outputs.
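
One way a product could surface that kind of metadata is to return it alongside the generated text and log it; a sketch under assumed names (the `GenerationResult` shape and the 0.8 threshold are invented for illustration and do not reflect F5's tool or any vendor API):

```python
from dataclasses import dataclass

@dataclass
class GenerationResult:
    """Illustrative response envelope that carries uncertainty metadata."""
    text: str
    malicious_prompt_prob: float  # estimated probability the prompt was malicious
    model_version: str = "unknown"

def log_and_decide(result: GenerationResult, block_threshold: float = 0.8) -> bool:
    """Log the reported probability and return True if the response may be shown."""
    print(f"malicious_prompt_prob={result.malicious_prompt_prob:.2f} "
          f"model={result.model_version}")
    return result.malicious_prompt_prob < block_threshold

allowed = log_and_decide(GenerationResult("generated answer ...", 0.12, "demo-model"))
print("allowed:", allowed)
```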

Aurornis

> This, to me, is the critical and fatal flaw that prevents me from using or even being excited about LLMs: That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.

Sounds a lot like most engineers I’ve ever worked with.

There are a lot of people utilizing LLMs wisely because they know and embrace this. Reviewing and understanding their output has always been the game. The whole “vibe coding” trend where you send the LLM off to do something and hope for the best will teach anyone this lesson very quickly if they try it.

agentultra

Most engineers you worked with probably cared about getting it right and improving their skills.


exe34

> Do any LLMs do this? If not, why can't they? If they could, then the user would just need to pick a threshold that suits their peace of mind and review any outputs that came back below that threshold.

That's not how they work - they don't have internal models where they are sort of confident that this is a good answer. They have internal models where they are sort of confident that these tokens look like they were human-generated in that order. So they can be very confident and still wrong. Knowing that confidence level (log p) would not help you assess correctness.
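
A rough illustration of that point, with invented per-token log-probabilities: the sequence-level score only says how plausible the wording is, not whether the content is correct.

```python
import math

# Invented per-token log-probabilities for a fluent but factually wrong sentence.
token_logprobs = [-0.05, -0.10, -0.02, -0.30, -0.08]

sequence_logprob = sum(token_logprobs)      # log P(token sequence)
sequence_prob = math.exp(sequence_logprob)  # about 0.58 for these numbers

# A high value here means "looks like something a human would write in this order",
# which is exactly why it can't tell you whether the claim is true.
print(f"model confidence in the wording: {sequence_prob:.2f}")
```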

There are probabilistic models where they try to model a posterior distribution for the output - but that has to be trained in, with labelled samples. It's not clear how to do that for LLMs at the kind of scale that they require and affordably.

You could consider letting it run code or try things out in simulations and using those as samples for further tuning, but at the moment this might still lead it to forget something else or just make some other arbitrary and dumb mistake that it didn't make before the fine-tuning.

nkrisc

How would a meaningful confidence value be calculated with respect to the output of an LLM? What is “correct” LLM output?

Kinrany

It can be the probability of the response being accepted by the prompter

rustcleaner

>That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.

I think I can confidently assert that this applies to you and me as well.

ryandrake

I choose a computer to do a task because I expect it to be much more accurate, precise, and deterministic than a human.

n_ary

Honestly, I am surprised by your opinion on this matter (something also echoed a few times in other comments too). Let's switch the context for a bit… human drivers kill a few thousand people, so why make so many regulations for self-driving cars… why not kick out pilots entirely, autopilot can do smooth (though damaging to tires) landings/takeoffs… how about we lay off all govt workers and regulatory auditors, LLMs are better at recall and most of those paper pushers do subpar work anyways…

My analogies may sound like apples-to-gorillas comparisons, but the point of automation is that it performs 100x better than humans with the highest safety. Just because I can DUI and get a fine does not mean a self-driving car should drive without fully operational sensors; both bear the same risk of killing people, but one has higher regulatory restrictions.

devnull3

> hypervigilant

If a tech works 80% of the time, then I know that I need to be vigilant and I will review the output. The entire team structure is aware of this. There will be processes to offset this 20%.

The problem is that when the AI becomes > 95% accurate (if at all) then humans will become complacent and the checks and balances will be ineffective.

hnthrow90348765

80% is good enough for like the bottom 1/4th-1/3rd of software projects. That is way better than an offshore parasite company throwing stuff at the wall because they don't care about consistency or quality at all. These projects will bore your average HNer to death rather quickly (if not technically, then politically).

Maybe people here are used to good code bases, so it doesn't make sense to them that 80% is good enough there, but I've seen some bad code bases (that still made money) that would be much easier to work on by not reinventing the wheel and not following patterns that are decades old and that no one uses any more.

Ferret7446

We are already there. The threshold is much closer to 80% for average folks, for whom LLMs have rapidly gone from "this is wrong and silly" to "this seems right most of the time so I just trust it when I search for info" in a few years.

FeepingCreature

> The second is that the LLMs don’t learn once they’re done training, which means I could spend the rest of my life tutoring Claude and it’ll still make the exact same mistakes, which means I’ll never get a return for that time and hypervigilance like I would with an actual junior engineer.

However, this creates a significant return on investment for open-sourcing your LLM projects. In fact, you should commit your LLM dialogs along with your code. The LLM won't learn immediately, but it will learn in a few months when the next refresh comes out.
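
A minimal sketch of that convention, assuming you keep transcripts as plain files next to the code so they land in the same commit (the `prompts/` directory and file naming are just one possible layout, not an established standard):

```python
from datetime import date
from pathlib import Path

def save_dialog(slug: str, transcript: str, root: str = "prompts") -> Path:
    """Write an LLM transcript to prompts/YYYY-MM-DD-<slug>.md and return the path."""
    out_dir = Path(root)
    out_dir.mkdir(exist_ok=True)
    path = out_dir / f"{date.today().isoformat()}-{slug}.md"
    path.write_text(transcript)
    return path

# Then add the transcript to the same commit as the code it produced, e.g.:
#   git add prompts/ src/ && git commit -m "refactor auth (LLM dialog attached)"
saved = save_dialog("refactor-auth", "User: ...\nAssistant: ...\n")
print("dialog saved to", saved)
```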

samjewell

> In fact, you should commit your LLM dialogs along with your code.

Wholeheartedly agree with this.

I think code review will evolve from "Review this code" to "Review this prompt that was used to generate some code"

Havoc

That may be true but the cost of refactoring code that is wrong also plummets.

So even if 9 out of 10 are wrong, you can just can them.

xg15

The intro sentence to this is quite funny.

> Remember the first time an autocomplete suggestion nailed exactly what you meant to type?

I actually don't, because so far that has only happened with trivial phrases or text I had already typed in the past. I do remember, however, dozens of times when autocorrect wrongly "corrected" the last word I typed, changing an easy-to-spot typo into a much more subtle semantic error.

thechao

I see these sorts of statements from coders who, you know, aren't good programmers in the first place. Here's the secret that I think LLMs are uncovering: there are a lot of really shoddy coders out there; coders who could/would never become good programmers, and they are absolutely going to be replaced with LLMs.

I don't know how I feel about that. I suspect it's not going to be great for society. Replacing blue-collar workers with robots hasn't been super duper great.

rowanajmarshall

> Replacing blue-collar workers with robots hasn't been super duper great.

That's just not true. Tractors, combine harvesters, dishwashers, washing machines, excavators: we've repeatedly revolutionised blue-collar work, made it vastly, extraordinarily more efficient.

bendigedig

Validating the outputs of a stochastic parrot sounds like a very alienating job.

darth_avocado

As a staff engineer, it upsets me if my Review to Code ratio goes above 1. On days when I am not able to focus and code because I was reviewing other people's work all day, I usually end up pretty drained but also unsatisfied. If the only job available to engineers becomes "review 50 PRs a day, every day" I'll probably quit software engineering altogether.

moosedev

Feeling this too. And AI is making it "worse".

Reviewing human code and writing thoughtful, justified, constructive feedback to help the author grow is one thing - too much of this activity gets draining, for sure, but at least I get the satisfaction of teaching/mentoring through it.

Reviewing AI-generated code, though, I'm increasingly unsure there's any real point to writing constructive feedback, and I can feel I'll burn out if I keep pushing myself to do it. AI also allows less experienced engineers to churn out code faster, so I have more and more code to review.

But right now I'm still "responsible" for "code quality" and "mentoring", even if we are going to have to figure out what those things even mean when everyone is a 10x vibecoder...

Hoping the stock market calms down and I can just decide I'm done with my tech career if/when this change becomes too painful for dinosaurs like me :)

acedTrex

I could not agree more.

> AI also allows less experienced engineers to churn out code faster, so I have more and more code to review

This to me has been the absolute hardest part of dealing with the post LLM fallout in this industry. It's been so frustrating for me personally I took to writing my thoughts down in a small blog, in fact I say nearly this exact sentiment in it.

https://jaysthoughts.com/aithoughts1

kmijyiyxfbklao

> As a staff engineer, it upsets me if my Review to Code ratio goes above 1.

How does this work? Do you allow merging without reviews? Or are other engineers reviewing code way more than you?

darth_avocado

Sorry I wrote that in haste. I meant it in terms of time spent. In absolute number of PRs, you’d probably be reviewing more PRs than you create.

PaulRobinson

Most knowledge work - perhaps all of it - is already validating the output of stochastic parrots, we just call those stochastic parrots "management".

FeepingCreature

It's actually very fun, ime.

bendigedig

I have plenty of experience doing code reviews and to do a good job is pretty hard and thankless work. If I had to do that all day every day I'd be very unhappy.

causal

A few articles like this have hit the front page, and something about them feels really superficial to me, and I'm trying to put my finger on why. Perhaps it's just that it's so myopically focused on day 2 and not on day n. They extrapolate from ways AI can replace humans right now, but lack any calculus which might integrate second or third order effects that such economic changes will incur, and so give the illusion that next year will be business as usual but with AI doing X and humans doing Y.

danielmarkbruce

Why: they assume that humans have some secret sauce. Like... judgement...we don't. Once you extrapolate, yes, many things will be very very different.

joshdavham

> What I see happening is us not being prepared for how AI transforms the nature of knowledge work and us having a very painful and slow transition into this new era.

I would've liked the author to be a bit more specific here. What exactly could this "very painful and slow transition" look like? Any commenters have any idea? I'm genuinely curious.

Animats

> This pile of tasks is how I understand what Vaughn Tan refers to as Meaningmaking: the uniquely human ability to make subjective decisions about the relative value of things.

Why is that a "uniquely human ability"? Machine learning systems are good at scoring things against some criterion. That's mostly how they work.

atomicnumber3

How are the criteria chosen, though?

Something I learned from working alongside data scientists and financial analysts doing algo trading is that you can almost always find great fits for your criteria; nobody ever worries about that. It's coming up with the criteria that everyone frets over, and even more than that, you need to beat other people at doing so - just being good or even great isn't enough. Your profit is the delta between where you are and where all the other sharks in your pool are. So LLMs are useless there: getting token-predicted answers is just going to get you the same as everyone else, which means zero alpha.

So - I dunno about uniquely human? But there's definitely something here where, short of AGI, there's always going to need to be someone sitting down and actually beating the market (whatever that metaphor means for your industry or use case).

fwip

Finance is sort of a unique beast in that the field is inherently negative-sum. The profits you take home are always going to be profits somebody else isn't getting.

If you're doing like, real work, solving problems in your domain actually adds value, and so the profits you get are from the value you provide.

kaashif

If you're algo trading then yes, which is what the person you're replying to is talking about.

But "finance" is very broad and covers very real and valuable work like making loans and insurance - be careful not to be too broad in your condemnation.

atomicnumber3

This is an overly simplistic view of algo trading. It ignores things like market services, the very real value of liquidity, and so on.

Also ignores capital gains - and small market moves are the very mechanism by which capital formation happens.

rukuu001

I think this is challenging because there’s a lot of tacit knowledge involved, and feedback loops are long and measurement of success ambiguous.

It’s a very rubbery, human oriented activity.

I’m sure this will be solved, but it won’t be solved by noodling with prompts and automation tools - the humans will have to organise themselves to externalise expert knowledge and develop an objective framework for making ‘subjective decisions about the relative value of things’.

eezurr

And once the Orient and Decide part is augmented, then we'll be limited by social networks (IRL ones). Every solo founder/small biz will have to compete more and more for marketing eyeballs, and the ones who have access to bigger engines (companies) will get the juice they need, and we come back to humans being the bottleneck again.

That is, until we mutually decide on removing our agency from the loop entirely . And then what?

jasonthorsness

The method of producing the work can be more important (and easier to review) than the work output itself. Like at the simplest level of a global search-replace of a function name that alters 5000 lines. At a complex level, you can trust a team of humans to do something without micro-managing every aspect of their work. My hope is that the current crisis of reviewing too much AI-generated output will subside into that same kind of trust, because the LLM will have reached a high level of "judgement" and competence. But we're definitely not there yet.
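
To make the "review the method, not the output" idea concrete, here is a sketch of a whole-word rename that could touch thousands of lines; the few lines of the script (plus a build/test run) can be easier to review than the diff it generates. The `src` directory and the old/new identifiers are placeholders, and a real rename would more likely go through an IDE or refactoring tool.

```python
import re
from pathlib import Path

OLD, NEW = "fetch_user", "load_user"            # placeholder identifiers
pattern = re.compile(rf"\b{re.escape(OLD)}\b")  # whole-word matches only

# Rewrite every Python file under src/, reporting how many replacements
# this small, reviewable script made in each file.
for path in Path("src").rglob("*.py"):
    text = path.read_text()
    updated, count = pattern.subn(NEW, text)
    if count:
        path.write_text(updated)
        print(f"{path}: {count} replacements")
```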

And contrary to the article, idea-generation with LLM support can be fun! They must have tested full replacement or something.

wffurr

>> At a complex level, you can trust a team of humans to do something without micro-managing every aspect of their work

I see you have never managed an outsourced project run by a body shop consultancy. They check the boxes you give them with zero thought or regard for the overall project and require significant micromanaging to produce usable code.

jdlshore

I find this sort of whataboutism in LLM discussions tiring. Yes, of course, there are teams of humans that perform worse than an LLM. But it is obvious to all but the most hype-blinded booster that it is possible for teams of humans to work autonomously to produce good results, because that is how all software has been produced to the present day, and some of it is good.

kaycebasques

This section heading from the post captures the key insight, is more focused, and is less hyperbolic:

> Redesigning for Decision Velocity

timewizard

> Remember the first time an autocomplete suggestion nailed exactly what you meant to type?

No.

> Multiply that by a thousand and aim it at every task you once called “work.”

If you mean "menial labor" then sure. The "work" I do is not at all aided by LLMs.

> but our decision-making tools and rituals remain stuck in the past.

That's because LLMs haven't eliminated or even significantly reduced risk. In fact they've created an entirely new category of risk in "hallucinations."

> we need to rethink the entire production-to-judgment pipeline.

Attempting to do this without accounting for risk or how capital is allocated into processes will lead you into folly.

> We must reimagine knowledge work as a high-velocity decision-making operation rather than a creative production process.

Then you will invent nothing new or novel and will be relegated to scraping by on the overpriced annotated databases of your direct competitors. The walled garden just raised the stakes. I can't believe people see a future in it.

exmicrosoldier

This is the same problem as outsourcing to third party programmers in another country, but worse.

stego-tech

It really, really is at present. It’s outsourcing but without the benefit of someone getting a paycheck: all exploitation.

RevEng

The article rightly points out that people don't enjoy just being reviewers: we like to take an active role in playing, learning, and creating. They point out the need to find a solution to this, but then never follow up on that idea.

This is perhaps the most fundamental problem. In the past, tools took care of the laborious and tedious work so we could focus on creativity. Now we are letting AI do the creative work and asking humans to become managers and code reviewers. Maybe that's great for some people, but it's not what most problem solvers want to be doing. The people who know how to judge such things are the same people who have years of experience doing these things. Without that experience you can't have good judgement.

Let the AI make it faster and easier for me to create; don't make it replace what I do best and leave me as a manager and code reviewer.

The parallels with grocery checkouts are worth considering. Humans are great at recognizing things, handling unexpected situations, and being friendly and personable. People working checkouts are experts at these things.

Now replace that with self-serve checkouts. Random customers are forced to do all of this themselves. They are not experts at it. The checkouts are less efficient because they have to accommodate these non-experts. People have to pack their own bags. And they do all of this while punching buttons on a soulless machine instead of getting some social interaction in.

But worse off is the employee who manages these checkouts. Now instead of being social, they are security guards and tech support. They are constantly having to troubleshoot computer issues and teach disinterested and frustrated beginners how to do something that should be so simple. The employee spends most of their time as a manager and watchdog, looking at a screen that shows the status of all the checkouts, looking for issues, like a prison security guard. This work is passive and unengaging, yet requires constant attention - something humans aren't good at. What little interaction they do have with others comes in situations where those people are upset.

We didn't automate anything here; we just changed who does what. We made customers into the people doing the checkouts, and we made store staff into managers of them, plus tech support.

This is what companies are trying to do with AI. They want to have fewer employees whose job it is to manage the AIs, directing them to produce. The human is left assigning tasks and checking the results - managers of thankless and soulless machines. The credit for the creation goes to the machines while the employees are seen as low skilled and replaceable.

And we end up back at the start: trying to find high-skilled people to perform low-skilled work, based on experience they would only have if they had been doing high-skilled work to begin with. When everyone is just managing an AI, no one will know what it is supposed to do.