
Coping with dumb LLMs using classic ML

18 comments · January 22, 2025

dailykoder

This is interesting. I don't think I entirely understood OP's problem, but we are moving more and more in a direction where we try to come up with ways to "program" LLMs, because human language is not sufficient. (At least I thought) the goal was to make things simple and "just" ask your question to an LLM and get the answer, but natural language does not work for complex tasks.

It is especially amusing in programming. People spend hours upon hours coming up with a prompt that can (kind of) reliably produce code, so they end up hacking/programming some weird black box just to get to their actual programming tasks. In some areas there might be a speed-up, but I still don't know if it's worth it. It feels like we are creating more problems than solutions.

MichaelMoser123

Classical ML always runs into the knowledge representation problem - the task is to find some general representation of knowledge suitable for computer reasoning. That's something of a philosopher's stone - people have been searching for it for seventy years already.

I think agents will run into the same problem if they try to find a classical ML solution to verify what comes out of the LLM.

blueflow

And like the philosopher's stone, it does not exist. Remember the "map vs. territory" discussion: you cannot have generic maps, only maps specialized for a purpose.

outofpaper

Yes. All too easily we forget that the maps are not the territories.

LLMs are amazing; we are creating better and better hyperdimensional maps of language. But until we have systems that are more than just crystallized maps of the language they were trained on, we will never have something that can really think, let alone AGI or whatever new term we come up with.

GardenLetter27

The example here isn't great, but the idea of using an ensemble of LLMs when compute is cheaper is cool.

The foundation models can parse super complex stuff like dense human language, music, etc. with context - like a really good pre-built auto-encoder - which would be a nightmare with classic machine learning feature selection (remember bag-of-words and word2vec?).

I wonder how such an approach would compare to just fine-tuning one model, though? And how the one-time cost of fine-tuning compares to the greater inference cost of an ensemble?

AJRF

My takeaway is that he didn’t solve anything, he just changed the shape of the problem into one that was familiar to him.

sgt101

Options:

Finetune the models to be better

Optimise the prompts to be better

Train better models

cyanydeez

Possible bug in the uber query?

---

Which of these product descriptions (if either) is more relevant to the furniture e-commerce search query:

Query: entrance table

Product LHS name: aleah coffee table
Product LHS description: You'll love this table from lazy boy. It goes in your living room. And you'll find ... ...

Or

Product LHS name: marta coffee table
Product RHS description: This coffee table is great for your entrance, use it to put in your doorway... ...

Or

Neither / Need more product attributes

Only respond 'LHS' or 'RHS' if you are confident in your decision

RESPONSE: RHS

---

Note that "Product LHS name" appears twice. Hopefully this is a bug in the blog post and not the code.

outofpaper

With or without the bug, it's a horrid prompt. Prompts work best when they resemble content LLMs have in their training data. People use "first" and "second" far more often than "LHS" and "RHS" when talking about options: first or second, 1 or 2, a or b, or neither.

LLMs are narrative machines. They make up stories which often make sense.

Vampiero

Wake me up when LLMs are good at ProbLog, because that's the day we can finally rest.

kvgr

The amount of hallucination I get when trying to write code is amazing. I mean, it gets the core concepts of the language and can create structure/algorithms, but it often makes up objects/values when I ask questions. Example: it suggested TextLayoutResult.size - which is an Int value. I asked if it has width and height, and it wrote that it has size.height and also size.width, which it does not. I am now writing production code and also evaluating the LLMs that our management thinks will save us a shitload of time. We will get there someday, but the push from management is not compatible with the current state of the LLMs. (I use Claude 3.5 Sonnet now, as it is also built into some of the "AI IDEs".)

antihipocrat

You're not alone. In my experience, senior executives are enamoured with the possibility of halving headcount. The engineers who report honestly about the limitations of connecting it to core systems (or using it to generate complex code running on core systems) are at risk of being perceived as blocking progress. So everyone keeps quiet, tries to find a quick and safe use case for the tech to present to management, and makes sure they aren't involved in the project that will be the big one to fail spectacularly and bring it all crashing down.

devvvvvvv

Using AI as a way to flag things for humans to look at and make final decisions on seems like the way to go

Joker_vD

But... wouldn't that only make things more expensive than they are now, with dubious improvements in quality?

GardenLetter27

Almost all deployed ML systems work like this.

I.e. for classification you can judge "certainty" by the softmax outputs of the classifier, then in the less certain cases refuse to classify and send the item to humans.

And also do random sampling of outputs by humans to verify accuracy over time.
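As a rough illustration of that refuse-and-escalate pattern (a minimal sketch only - the threshold value and the function names here are made up, and it covers just the thresholding part, not the periodic human audit):

    import numpy as np

    CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff; tune on a validation set

    def softmax(logits):
        exps = np.exp(logits - np.max(logits))
        return exps / exps.sum()

    def classify_or_escalate(logits, labels):
        """Return a label when the model is confident, otherwise defer to a human."""
        probs = softmax(np.asarray(logits, dtype=float))
        best = int(np.argmax(probs))
        if probs[best] >= CONFIDENCE_THRESHOLD:
            return labels[best]           # confident: accept the model's answer
        return "NEEDS_HUMAN_REVIEW"       # uncertain: route to a human reviewer

    # Example with a hypothetical 3-way relevance classifier
    print(classify_or_escalate([4.2, 0.3, -1.0], ["LHS", "RHS", "Neither"]))  # confident -> LHS
    print(classify_or_escalate([0.6, 0.5, 0.4], ["LHS", "RHS", "Neither"]))   # uncertain -> NEEDS_HUMAN_REVIEW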

It's just that humans are really expensive and slow though, so it can be hard to maintain.

But if humans have to review everything anyway (as the EU's AI Act requires for many applications), then you don't really gain much - even though the humans would likely just do a cursory rubber-stamp review anyway, as anyone who has seen pull request reviews can attest.

frankc

I have the same experience, but I am still 5 to 10 times more productive using Claude. I'll have it write a class, have it write tests for the class, and give it the output of the tests, from which it usually figures out problems like "oops, those methods don't exist". Along the way I am guiding it on the approach and architecture. Sometimes it does get stuck and needs very specific intervention; you need to be a senior engineer to do this well. Importantly, since it now has the context loaded, I can have it write nicely formatted documentation and add bells and whistles like a pretty CLI with minimal effort. In the end I usually get what I want, with better tests, docs, and polish than I would have the patience to write, in a fraction of the time - especially with Cursor, which makes the iteration process so much faster.

internet_points

I've worked on some projects that used ML and such to half-automate things, thinking that we'd get the computer to do most of the work and people would check over things and it would be quality controlled.

Three problems with this:

* salespeople constantly try to sell the automation as more complete than it is

* product owners try to push us developers into making it more fully automated

* users get lulled into thinking it's more complete than it is (and accept suggestions instead of deeply thinking through the issues like they would if they had to work things out from scratch)

hackerwr

hello, do you have a place for me? I'm hacking the school now