Coping with dumb LLMs using classic ML
80 comments
· January 22, 2025
napsternxg
PaulHoule
As I see it, you need a model you can train quickly so you can do tuning, model selection, and all that.
I have a BERT + SVM + Logistic Regression (for calibration) model that can train 20 models for automatic model selection and calibration in about 3 minutes. I feel like I understand the behavior of it really well.
I've tried fine-tuning a BERT for the same task: the shortest model builds take 30 minutes, and the training curves make no sense (back in the day I used to be able to train networks with early stopping and get a good one every time). If I look at arXiv papers, it is rare for anyone to have a model selection process with any discipline at all; mainly people use a recipe that sorta-kinda seemed to work in some other paper. People scoff at you if you ask the engineering-oriented question "What training procedure can I use to get a good model consistently?"
Because of that I like classical ML.
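For anyone curious, a minimal sketch of that kind of pipeline, assuming sentence-transformers and scikit-learn (the model name and data here are illustrative, not the exact setup described):

    from sentence_transformers import SentenceTransformer
    from sklearn.svm import LinearSVC
    from sklearn.calibration import CalibratedClassifierCV

    texts = ["great product", "does not work", "love it",
             "broke in a week", "works perfectly", "total junk"]
    labels = [1, 0, 1, 0, 1, 0]

    # Frozen BERT-family encoder: embedding is the expensive part, done once.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    X = encoder.encode(texts)

    # Linear SVM trains in seconds; sigmoid (logistic) calibration on top gives
    # usable probabilities, so trying many hyperparameter settings stays cheap.
    clf = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=2)
    clf.fit(X, labels)
    print(clf.predict_proba(encoder.encode(["arrived broken"])))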
korkybuchek
There's a reason xgboost is still king in large companies.
3eb7988a1663
That's the thing that blows my mind. Even if NNs are some percentage better, the training + deployment headaches are not worth it unless you have a billion users, where a 0.1% lift equates to millions of dollars.
abhgh
It is pleasantly surprising to see how close your pipeline is to mine. Essentially a good representation layer - usually BERT-based, like MiniLM or MPNet - followed by a calibrated linear SVM. Sometimes I replace the SVM with LightGBM if I have non-language features.
If I am building a set of models for a domain, I might fine-tune the representation layer. On a per-model basis I typically just train the SVM and calibrate it. For the amount of time this whole pipeline takes (not counting the occasions when I fine-tune), it works amazingly well.
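If it helps anyone, a rough sketch of the LightGBM variant, assuming you just concatenate the text embedding with the extra tabular features (all names and data here are illustrative):

    import numpy as np
    import lightgbm as lgb
    from sentence_transformers import SentenceTransformer

    texts = ["cheap wooden desk", "ergonomic office chair", "antique oak table",
             "kids plastic stool", "velvet lounge sofa", "folding camp chair"]
    extra = np.array([[19.99, 4], [129.0, 2], [450.0, 1],
                      [9.99, 5], [799.0, 2], [24.99, 3]])  # e.g. price, category id
    labels = [0, 1, 1, 0, 1, 0]

    encoder = SentenceTransformer("all-mpnet-base-v2")
    X_text = encoder.encode(texts)
    X = np.hstack([X_text, extra])   # language + non-language features side by side

    model = lgb.LGBMClassifier(n_estimators=50, min_child_samples=1)
    model.fit(X, labels)
    print(model.predict_proba(X[:1]))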
shortrounddev2
I spent a week learning enough ML to design a recommender system that worked well with my company's use case. I knew enough linear algebra to determine that collaborative filtering with some specifically chosen dimensionality reduction and text vectorization algorithms as well as a strategy for scaling the models across multiple databases would work well for us. The solution was tailored specifically to the type of data we were working with.
When I presented the proposal, nobody read it and the meeting immediately turned to the VP of engineering and the CEO discussing neural networks and some other ML system they had read about on HN the day before. When I tried to bring collaborative filtering up again, the VP said "I don't know what that is", so obviously he hadn't read the doc I had been assigned to write over the previous week.
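For what it's worth, the core of that kind of collaborative-filtering proposal can be sketched in a few lines, e.g. truncated SVD over a user-item interaction matrix (purely illustrative, not the actual design doc):

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.decomposition import TruncatedSVD

    # rows = users, cols = items, values = implicit feedback (views, purchases)
    interactions = csr_matrix(np.array([
        [3, 0, 1, 0],
        [0, 2, 0, 1],
        [1, 0, 0, 4],
    ], dtype=float))

    # Dimensionality reduction: learn low-rank user and item factors.
    svd = TruncatedSVD(n_components=2, random_state=0)
    user_factors = svd.fit_transform(interactions)   # (n_users, k)
    item_factors = svd.components_.T                  # (n_items, k)

    # Score items for user 0 and recommend the highest-scoring ones.
    scores = user_factors[0] @ item_factors.T
    print(np.argsort(-scores))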
sieabahlpark
[dead]
lewisl9029
I had a somewhat similar experience trying to use LLMs to do OCR.
All the models I've tried (Sonnet 3.5, GPT-4o, Llama 3.2, Qwen2-VL) have been pretty good at extracting text, but they failed miserably at finding bounding boxes, usually just making up random coordinates. I thought this might have been due to internal resizing of images, so I tried to get them to use relative %-based coordinates, but no luck there either.
Eventually gave up and went back to good old PP-OCR models (are these still state of the art? would love to try out some better ones). The actual extraction feels a bit less accurate than the best LLMs, but bounding box detection is pretty much spot on all the time, and it's literally several orders of magnitude more efficient in terms of memory and overall energy use.
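In case it's useful, the PP-OCR path is something like this via PaddleOCR (a rough sketch; the exact output format can vary by version):

    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en")   # loads PP-OCR det + rec models
    result = ocr.ocr("receipt.png", cls=True)

    for line in result[0]:
        box, (text, confidence) = line               # box = four corner points in pixels
        print(box, text, confidence)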
My conclusion was that current gen models still just aren't capable enough yet, but I can't help but feel like I might be missing something. How the heck did Anthropic and OpenAI manage to build computer use if their models can't give them accurate coordinates of objects in screenshots?
ahzhou
LLMs are inherently bad at this due to tokenization, scaling, and lack of training on the task. Anthropic's computer use feature has a specialized model for pixel-counting:

> Training Claude to count pixels accurately was critical. Without this skill, the model finds it difficult to give mouse commands. [1]

For a VLM trained on identifying bounding boxes, check out PaliGemma [2]
You may also be able to get the computer use API to draw bounding boxes if the costs make sense.
That said, I think the correct solution is likely to use a non-VLM to draw bounding boxes. Depends on the dataset and problem.
1. https://www.anthropic.com/news/developing-computer-use 2. https://huggingface.co/blog/paligemma
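For reference, the PaliGemma detection flow looks roughly like this with transformers (a sketch following the HF blog; box coordinates come back as <loc####> tokens on a 1024-step grid):

    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = "google/paligemma-3b-mix-224"
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open("screenshot.png")
    prompt = "detect button"    # detection is triggered by the "detect" prefix

    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=64)
    new_tokens = generated[0][inputs["input_ids"].shape[-1]:]
    print(processor.decode(new_tokens, skip_special_tokens=True))
    # e.g. "<loc0123><loc0456><loc0789><loc1011> button" -> y1, x1, y2, x2 out of 1024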
nostrebored
PaliGemma on computer use data is absolutely not good. The difference between a FT YOLO model and a FT PaliGemma model is huge if generic bboxes are what you need. Microsoft's OmniParser also winds up using a YOLO backbone [1]. All of the browser use tools (like our friends at browser-use [2]) wind up trying to get a generic set of bboxes using the DOM and then applying generative models.
PaliGemma seems to fit into a completely different niche right now (VQA and Segmentation) that I don't really see having practical applications for computer use.
[1] https://huggingface.co/microsoft/OmniParser?language=python [2] https://github.com/browser-use/browser-use
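The fine-tuned YOLO route is also only a few lines with ultralytics (a sketch; the dataset yaml and checkpoint name are placeholders):

    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")                                     # pretrained checkpoint
    model.train(data="ui_elements.yaml", epochs=50, imgsz=1280)    # your labeled screenshots

    results = model("screenshot.png")
    for box in results[0].boxes:
        print(box.xyxy, box.conf, box.cls)                         # pixel coords, confidence, class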
HanClinto
Maybe still worth it to separate the tasks, and use a traditional text detection model to find bounding boxes, then crop the images. In a second stage, send those cropped samples to the higher-power LLMs to do the actual text extraction, and don't worry about them for bounding boxes at all.
There are some VLLMs that seem to be specifically trained to do bounding box detection (Moondream comes to mind as one that advertises this?), but in general I wouldn't be surprised if none of them work as well as traditional methods.
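A rough sketch of that split, using PaddleOCR's detector for the boxes and leaving the LLM call as a placeholder (send_to_llm here is hypothetical):

    from PIL import Image
    from paddleocr import PaddleOCR

    detector = PaddleOCR(use_angle_cls=False, lang="en")
    image = Image.open("page.png")

    # rec=False: detection only, each box is four [x, y] corner points
    for box in detector.ocr("page.png", rec=False)[0]:
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        crop = image.crop((int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))))
        # text = send_to_llm(crop)   # hypothetical: the stronger LLM only transcribes the crop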
parsakhaz
We've run a couple of experiments and found that our open vision language model Moondream works better than YOLOv11 in general cases. If accuracy matters most, it's worth trying our vision language model. If you need real-time results, you can train YOLO models using data from our model. We have a space for video redaction (which is just object detection) on our Hugging Face, and a playground online to try it out.
DougBTX
AFAIK none of those models have been trained to produce bounding boxes. On the other hand Gemini Pro has, so it may be worth looking at for your use case:
https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
jonnycoder
I am doing OCR on hundreds of PDFs using AWS Textract. It requires me to convert each page of the PDF to an image and then analyze the image, and it works well for converting to markdown format (which requires custom code). I want to try some vision models and compare how they do, for example Phi-3.5-vision-instruct.
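In case it helps with the comparison, the Textract page-image loop is roughly this (boto3 + pdf2image; the markdown conversion itself stays custom code, and file names are illustrative):

    import boto3
    from pdf2image import convert_from_path

    textract = boto3.client("textract")

    for i, page in enumerate(convert_from_path("report.pdf", dpi=300)):
        page.save(f"page_{i}.png")
        with open(f"page_{i}.png", "rb") as f:
            resp = textract.detect_document_text(Document={"Bytes": f.read()})
        lines = [b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE"]
        # custom code turns these blocks (plus geometry) into markdown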
whiplash451
1. You need to look into the OCR-specific DL literature (e.g. UDOP) or segmentation-based models (e.g. Segment Anything)
2. BigTech and SmallTech train their fancy bounding box / detection models on large datasets that have been built using classical detectors and a ton of manual curation
bob1029
> they failed miserably at finding bounding boxes, usually just making up random coordinates.
This makes sense to me. These LLMs likely have no statistics about the spatial relationships of tokens in a 2D raster space.
nostrebored
The spatial awareness is what grounding models try to achieve, e.g. UGround [1]
[1] https://huggingface.co/osunlp/UGround-V1-7B?language=python
KTibow
Gemini 2 can purportedly do this, you can test it with the Spatial Understanding Starter App inside AI Studio. Only caveat is that it's not production ready yet.
owkman
I think people have had success using PaliGemma for this. The computer-use products probably rely on fine-tuned versions of LLMs rather than the base ones.
dailykoder
This is interesting. I don't think I entirely understood OP's problem, but we are moving more and more in a direction where we try to come up with ways to "program" LLMs, because human language is not sufficient. (At least I thought) the goal was to make things simple and "just" ask your question to an LLM and get the answer, but natural language does not work for complex tasks.
Programming is especially funny here. People spend hours upon hours coming up with a prompt that can (kind of) reliably produce code, so they end up hacking/programming some weird black box just so they can do their actual programming tasks. In some areas there might be a speed-up, but I still don't know if it's worth it. It feels like we are creating more problems than solutions.
flessner
I feel the same way about programming, but there are plenty of people that don't enjoy it.
I was recently chatting with a friend who wanted to automate one of his tasks by writing a Python script with AI, because all the influencers said it was "so easy" and "no programming knowledge required".
That might have been the single funniest piece of code I have seen in a long time. It didn't install the dependencies, didn't fill in the Twitter API key, and instead of searching for a keyword on Twitter it just looked up 3 random accounts - 25 functions in like 120 lines of code?
Also, the line numbers in the errors weren't helpful because the whole thing lived in Windows Notepad. That was a flagship AI and a (in my opinion) capable human failing to assemble a simple script.
PaulHoule
If you have some idea of what good code looks like you can sometimes give feedback to something like Cursor or Windsurf. For small greenfield projects (that kind of downloader script) they succeed maybe 50% of the time.
If you had no idea of what code looks like and poor critical thinking abilities God help you.
Matthyze
So, if I understand the approach correctly: we're essentially doing very advanced feature engineering with LLMs. We find that direct classification by LLMs performs worse than LLM feature engineering followed by decision trees. Am I right?
The finding surprises me. I would expect modern LLMs to be powerful enough to do well at the task. Given how much the data is processed before the decision trees, I wouldn't expect decision trees to add much. I can see value in this approach if you're unable to optimize the LLM. But, if you can, I think end-to-end training with a pre-trained LLM is likely to work better.
softwaredoug
TBH I'm not sure it's better, but the decision tree structure is pretty handy for problem exploration
(However 'better' might be defined, I care more about the precision / recall tradeoff)
ellisv
This resonates with my experience. Use LLMs for feature engineering, then use traditional ML for your inference models.
Matthyze
Perhaps the reason that this approach works well is that, while the LLM gives you good general-purpose language processing, the decision tree learns about the specific dataset. And that combination is more powerful than either component.
ellisv
It’s the same reason LLMs don’t perform well on tabular data. (They can do fine, but usually not as well as other models.)
Performing feature engineering with LLMs and then storing the embeddings in a vector database also allows you to reuse the embeddings for multiple tasks (eg clustering, nearest neighbor).
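A small sketch of that reuse, with the vector database swapped for an in-memory array for brevity (names and data illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import NearestNeighbors
    from sentence_transformers import SentenceTransformer

    docs = ["oak dining table", "velvet sofa", "walnut coffee table", "leather couch"]

    # Embed once (the LLM/encoder feature-engineering step)...
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

    # ...then reuse the same vectors for multiple downstream tasks.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
    neighbors = NearestNeighbors(n_neighbors=2).fit(embeddings)
    print(clusters, neighbors.kneighbors(embeddings[:1]))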
Generally no one uses plain decision trees since random forest or gradient boosted trees perform better and are more robust.
gerad
It seems like a really easy way to overfit your model to your data, even while using LLMs.
GardenLetter27
The example here isn't great, but the idea of using an ensemble of LLMs when compute is cheaper is cool.
The foundational models can parse super complex stuff like dense human language, music, etc. with context, like a really good pre-built auto-encoder, which would be a nightmare with classic machine learning feature selection (remember bag of words? and word2vec?).
I wonder how such an approach would compare to just fine-tuning one model though? And how the cost of fine-tuning vs. greater inference cost for an ensemble compares?
AJRF
My takeaway is that he didn’t solve anything, he just changed the shape of the problem into one that was familiar to him.
ebiester
That's how we all solve problems. If this was novel, it would be a paper rather than a blog post.
The meta-strategy of combining LLM and non-LLM techniques is going to be key for getting good results for some time.
AJRF
No, I don’t think I agree. There is a lot of effort wasted shuffling problems around laterally without solving for the actual goal; that's what I am saying.
ccortes
> he just changed the shape of the problem into one that was familiar to him
that's a classic strategy to solve problems
jncfhnb
If you’re going to use classic ML why not just train a model based on the vector representations of the product descriptions?
softwaredoug
Yes that's a great idea, and maybe something I would try next in this series.
cyanydeez
Possible bug on uber query?

---
Which of these product descriptions (if either) is more relevant to the furniture e-commerce search query:

Query: entrance table

Product LHS name: aleah coffee table
Product LHS description: You'll love this table from lazy boy. It goes in your living room. And you'll find ...
...

Or

Product LHS name: marta coffee table
Product RHS description: This coffee table is great for your entrance, use it to put in your doorway...
...

Or Neither / Need more product attributes

Only respond 'LHS' or 'RHS' if you are confident in your decision

RESPONSE: RHS
---

LHS is included twice. Hopefully this is a bug in the blog and not the code
outofpaper
With or without the bug, it's a horrid prompt. Prompts work best when they resemble content LLMs have in their training data. People use "first" and "second" far more often than "LHS" and "RHS" when talking about options: first or second, 1 or 2, A or B, or neither.
LLMs are narrative machines. They make up stories which often make sense.
cyanydeez
LHS might trigger a better parsing window and that window would be model dependent.
softwaredoug
This is a copy/pasted typo, the real prompt begins
> Which of these furniture products is more relevant to the furniture e-commerce search query:
Fixed in the post. Thanks
MichaelMoser123
Classical ML always runs into the knowledge representation problem - the task of finding some general representation of knowledge suitable for computer reasoning. That's something of a philosopher's stone: people have been searching for it for seventy years already.
I think agents will run into the same problem if they try to use a classical ML solution to verify what comes out of the LLM.
blueflow
And like the philosopher's stone, it does not exist. Remember the "Map vs Territory" discussion: you cannot have generic maps, only maps specialized for a purpose.
Matthyze
That's essentially the No Free Lunch (NFL) theorem, right?
The thing about the NFL theorem is that it assumes an equal weight or probability over each problem/task. It's impossible to find a search/learning algorithm that performs better than another, 'averaged' over all tasks. But—and this is purely my intuition—the problems that humans want to solve are a very small subset of all possible search/learning problems. And this imbalance allows us to find algorithms that work particularly well on the subset of problems we want to solve.
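For reference, the usual statement (Wolpert & Macready) makes that uniform weighting explicit: summed over all objective functions f, any two algorithms a_1 and a_2 perform identically,

    \sum_{f} P(d^y_m \mid f, m, a_1) = \sum_{f} P(d^y_m \mid f, m, a_2)

where d^y_m is the sequence of m cost values the algorithm has observed. The "free lunch" argument above amounts to replacing that uniform sum over all f with a weighting concentrated on the problems humans actually care about.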
Coming back to representation and maps. Human understanding/worldview is a good example. Human understanding and worldview is itself a map of reality. This map models certain facts of the world well and other facts poorly. It is optimized for human cognition. But it's still broad enough to be useful for a variety of problems. If this map wasn't useful, we probably wouldn't have evolved it.
The point is, I do think there's a philosopher's pebble, and I do think there's a few free bites of lunch. These can be found in the discrepancy between all theoretically possible tasks and the tasks that we actually want to do.
MichaelMoser123
I don't know. Maps can vary in quality and expressiveness.
Language itself is a kind of map, and it has pretty universal reach.
"No Free Lunch (NFL) theorem" isn't quite mathematics, it is more in the domain of philosophy.
outofpaper
Yes. All too easily we forget that the maps are not the territories.
LLMs are amazing; we are creating better and better hyperdimensional maps of language. But until we have systems that are not just crystallized maps of the language they were trained on, we will never have something that can really think, let alone AGI or whatever new term we come up with.
MichaelMoser123
But language itself is a kind of map, and it has pretty universal reach.
raghavbali
Maybe I missed something, but this is a roundabout way of doing things where an embedding + ML classifier would have done the job. We don't have to use an LLM just because it can be used, IMO.
sgt101
Options:
Finetune the models to be better
Optimise the prompts to be better
Train better models
Vampiero
Wake me up when LLMs are good at Problog because it's the day we can finally rest
kvgr
The amount of hallucination I get when trying to write code is amazing. I mean, it gets the core concepts of the language and can create structure/algorithms, but it often makes up objects/values when I ask questions. Example: it suggested TextLayoutResult.size, which is an Int value. I asked if it has width and height, and it wrote that it has size.height and size.width, which it does not. I am now writing production code and also evaluating the LLMs that our management thinks will save us a shitload of time. We will get there sometime, but the push from management is not compatible with the current state of the LLMs. (I use Claude 3.5 Sonnet now, as it is also built into some of the "AI IDEs".)
antihipocrat
You're not alone. In my experience, senior executives are enamoured with the possibility of halving headcount. Engineers reporting honestly about the limitations of connecting it to core systems (or using it to generate complex code running on core systems) are at risk of being perceived as blocking progress. So everyone keeps quiet, tries to find a quick and safe use case for the tech to present to management, and makes sure they aren't involved in any project that will be the big one to fail spectacularly and bring it all crashing down.
ZaoLahma
What irks me is how LLMs won't just say "no, it won't work" or "it's beyond my capabilities" and instead just give you "solutions" that are wrong.
Codeium for example will absolutely bend over backwards to provide you with solutions to requests that can't be satisfied, producing more and more garbage for every attempt. I don't think I've ever seen it just say no.
ChatGPT is marginally better and will sometimes tell you straight up that an algorithm can't be rewritten as you suggest, because of ... But sometimes it too will produce garbage in its attempts at doing something impossible that you ask it to do.
genewitch
Two notes: I've never had one say no for code-related stuff, but I have them deny that something exists all the time. In fact, I just had one deny that a Subaru Brat exists, twice.
Secondly, if an LLM is giving you the runaround, it does not have a solution for the prompt you asked; you need either another prompt, another model, or another approach to using the model (for vendor lock-in like OpenAI).
dingnuts
>What irks me is how LLMs won't just say "no, it won't work" or "it's beyond my capabilities" and instead just give you "solutions" that are wrong.
This is one of the clearest ways to demonstrate that an LLM doesn't "know" anything, and isn't "intelligence." Until an LLM can determine whether its own output is based on something or completely made up, it's not intelligent. I find them downright infuriating to use because of this property.
I'm glad to see other people are waking up
epcoa
> ChatGPT is marginally better and will sometimes tell you straight up that an algorithm can't be rewritten as you suggest
Unfortunately it very often gets this wrong, especially if it involves some multi-step process.
swells34
This is a good representation of my experience as well.
At the end of the day, this is because it isn't "writing code" in the sense that you or I do. It is a fancy regurgitation engine that will output bits of stuff it's seen before that seem related to your question. LLMs are incredibly good at this, but that is also why you can never trust their output.
kvgr
Yes, I told Windsurf to copy some code to another folder. And what did it do? It "regenerated" the files, in the right folders, but the content was different. Great chaos agent :D
Vampiero
... I just realized that I would be waking up just to go back to resting.
We often ignore the importance of using good baseline systems and jump to the latest shiny thing.
I had a similar experience a few years back when participating in ML competitions [1,2] for detecting and typing phrases in text. I submitted an approach based on Named Entity Recognition using Conditional Random Fields (CRF), which is quite robust and well known in the community, and my solution beat most of the tuned deep learning solutions by quite a large margin [1].
I think a lot of folks underestimate the complexity of using some of these models (DL, LLM) and just throw them at the problem, or don't compare them against well-established baselines.
[1] https://scholar.google.com/citations?view_op=view_citation&h... [2] https://scholar.google.com/citations?view_op=view_citation&h...
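For anyone who wants to try the baseline, a toy version of the CRF tagger with sklearn-crfsuite (features and data here are deliberately minimal, not the competition system):

    import sklearn_crfsuite

    def featurize(tokens):
        # real systems add context windows, word shapes, gazetteers, etc.
        return [{"word.lower": t.lower(), "is_title": t.istitle()} for t in tokens]

    X_train = [featurize("Paris is in France".split())]
    y_train = [["B-LOC", "O", "O", "B-LOC"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
    crf.fit(X_train, y_train)
    print(crf.predict([featurize("Berlin is in Germany".split())]))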