My trick for getting consistent classification from LLMs
6 comments
October 13, 2025

sethkim
An under-discussed superpower of LLMs is open-set labeling, which I sort of consider inverse classification. Instead of using a static set of pre-determined labels, you're using the LLM to find the semantic clusters within a corpus of unstructured data. It feels like "data mining" in the truest sense.
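(A minimal sketch of what "open-set labeling" could look like in practice, assuming the OpenAI Python SDK; the model name, prompt, and corpus are placeholders I've made up, not anything from the thread.)

```python
# Open-set labeling sketch: ask the model for free-form tags instead of
# picking from a fixed list, then aggregate to see which clusters emerge.
from collections import Counter
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat API would do

client = OpenAI()

def open_set_labels(text: str) -> list[str]:
    """Ask the model to propose its own labels for a document."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Return 1-3 short topical tags for the text, comma-separated. "
                        "Invent whatever tags fit best."},
            {"role": "user", "content": text},
        ],
    )
    return [t.strip().lower() for t in resp.choices[0].message.content.split(",")]

# Tag a corpus and count the emergent "clusters".
corpus = ["example unstructured document 1", "example unstructured document 2"]
tag_counts = Counter(tag for doc in corpus for tag in open_set_labels(doc))
print(tag_counts.most_common(20))
```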
jawns
If you already have your categories defined, you might even be able to skip a step and just compare embeddings.
I wrote a categorization script that sorts customer-service calls into one of 10 categories. I wrote a description of each category, then turned each description into an embedding.
Then I created embeddings for the call notes and matched each one to the closest category using cosine similarity.
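(A rough sketch of that embedding-match approach, assuming sentence-transformers for the embeddings and scikit-learn for cosine similarity; the model name and category descriptions are placeholders, not the commenter's actual setup.)

```python
# Embed category descriptions once, then classify each call note by
# nearest cosine similarity to a category embedding.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# One short written description per category (10 in the comment above).
category_descriptions = {
    "billing": "Questions about invoices, charges, refunds, or payment methods.",
    "shipping": "Where an order is, delivery delays, or tracking numbers.",
    "returns": "Requests to return or exchange a purchased item.",
    # ...remaining categories...
}

labels = list(category_descriptions)
category_vecs = model.encode(list(category_descriptions.values()))

def classify(call_note: str) -> str:
    """Embed the call note and return the category with the highest cosine similarity."""
    note_vec = model.encode([call_note])
    sims = cosine_similarity(note_vec, category_vecs)[0]
    return labels[int(np.argmax(sims))]

print(classify("Customer says the package never arrived and wants tracking info."))
```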
nerdponx
How did you construct the embedding? Sum of individual token vectors, or something more sophisticated?
svachalek
That was my first thought: why even generate tags? Curious whether anyone has shown empirically that it's worse, though.
kurttheviking
Out of curiosity, what embedding model did you use for this?
Arthur's classifier will only be as accurate as their retrieval. The approach depends on the retrieved candidates actually containing the correct label for the classification to work.