My trick for getting consistent classification from LLMs
6 comments
October 13, 2025

sethkim
An under-discussed superpower of LLMs is open-set labeling, which I sort of consider inverse classification. Instead of using a static set of pre-determined labels, you're using the LLM to find the semantic clusters within a corpus of unstructured data. It feels like "data mining" in the truest sense.
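(A minimal sketch of what "open-set labeling" could look like in practice, assuming the OpenAI Python SDK; the model name, prompt, and corpus are placeholders I've made up, not anything from the thread.)

```python
# Open-set labeling sketch: ask the model for free-form tags instead of
# picking from a fixed list, then aggregate to see which clusters emerge.
from collections import Counter
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat API would do

client = OpenAI()

def open_set_labels(text: str) -> list[str]:
    """Ask the model to propose its own labels for a document."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Return 1-3 short topical tags for the text, comma-separated. "
                        "Invent whatever tags fit best."},
            {"role": "user", "content": text},
        ],
    )
    return [t.strip().lower() for t in resp.choices[0].message.content.split(",")]

# Tag a corpus and count the emergent "clusters".
corpus = ["example unstructured document 1", "example unstructured document 2"]
tag_counts = Counter(tag for doc in corpus for tag in open_set_labels(doc))
print(tag_counts.most_common(20))
```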
jawns
If you already have your categories defined, you might even be able to skip a step and just compare embeddings.
I wrote a categorization script that sorts customer-service calls into one of 10 categories. I wrote a description of each category, then turned each description into an embedding.
Then I created embeddings for the call notes and matched each one to the closest category using cosine similarity.
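(A rough sketch of that embedding-match approach, assuming sentence-transformers for the embeddings and scikit-learn for cosine similarity; the model name and category descriptions are placeholders, not the commenter's actual setup.)

```python
# Embed category descriptions once, then classify each call note by
# nearest cosine similarity to a category embedding.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# One short written description per category (10 in the comment above).
category_descriptions = {
    "billing": "Questions about invoices, charges, refunds, or payment methods.",
    "shipping": "Where an order is, delivery delays, or tracking numbers.",
    "returns": "Requests to return or exchange a purchased item.",
    # ...remaining categories...
}

labels = list(category_descriptions)
category_vecs = model.encode(list(category_descriptions.values()))

def classify(call_note: str) -> str:
    """Embed the call note and return the category with the highest cosine similarity."""
    note_vec = model.encode([call_note])
    sims = cosine_similarity(note_vec, category_vecs)[0]
    return labels[int(np.argmax(sims))]

print(classify("Customer says the package never arrived and wants tracking info."))
```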
nerdponx
How did you construct the embedding? Sum of individual token vectors, or something more sophisticated?
svachalek
That was my first thought: why even generate tags? Curious whether anyone has shown empirically that it's worse, though.
kurttheviking
Out of curiosity, what embedding model did you use for this?
Arthur's classifier will only be as accurate as their retrieval. The approach depends on the retrieved candidates actually containing the correct label for the classification to work.