
Show HN: Sort lines semantically using llm-sort

13 comments · February 11, 2025

This is a small plugin I made for Simon Willison's llm utility.

You can do things like:

cat names.txt | llm sort -q "Which one of these names is best for a pet seagull?"

cat books.txt | llm sort -q "Which book is more related to basic vs. advanced CS topics?"

I see a lot of potential in marrying LLMs with classic UNIX interfaces.

noperator

I published a nearly identical tool, referencing the same paper, a few weeks ago :) Although I implemented a listwise algorithm instead of the pairwise one described in the paper; it ends up being a lot faster.

https://github.com/BishopFox/raink

https://bishopfox.com/blog/raink-llms-document-ranking

    raink \
        -f testdata/sentences.txt \
        -r 10 \
        -s 10 \
        -p 'Rank each of these items according to their relevancy to the concept of "time".' |
        jq -r '.[:10] | map(.value)[]' |
        nl
    
       1  The train arrived exactly on time.
       2  The old clock chimed twelve times.
       3  The clock ticked steadily on the wall.
       4  The bell rang, signaling the end of class.
       5  The rooster crowed at the break of dawn.
       6  She climbed to the top of the hill to watch the sunset.
       7  He watched as the leaves fell one by one.
       8  The stars twinkled brightly in the clear night sky.
       9  He spotted a shooting star while stargazing.
      10  She opened the curtains to let in the morning light.

Ey7NFZ3P0nzAe

I recently built a semantic batching function for my RAG system [wdoc](https://github.com/thiswillbeyourgithub/wdoc/) that might be interesting to others. The system splits a corpus into chunks, finds relevant ones via embeddings, and answers questions for each chunk in parallel before aggregating the answers.

To optimize performance and reduce LLM distraction, instead of aggregating answers two by two, it does batched aggregation. The key innovation is in the batching order - I implemented a [semantic_batching function](https://github.com/thiswillbeyourgithub/wdoc/blob/18bc52128f...) that uses hierarchical clustering on the embeddings and orders texts by leaf order.

The implementation was straightforward, runs very fast and produces great results. The function is designed to be usable as a standalone tool for others to experiment with.
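A minimal sketch of the leaf-order idea, using SciPy's hierarchical clustering (this is an illustration, not the wdoc code itself; `embed` stands in for whatever embedding call you already have):

    # Order chunks by dendrogram leaf order so semantically similar chunks
    # end up adjacent, then batch by slicing the reordered list.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, leaves_list

    def semantic_order(chunks, embed):
        vectors = np.array([embed(c) for c in chunks])  # one embedding per chunk
        tree = linkage(vectors, method="ward")          # hierarchical clustering
        return [chunks[i] for i in leaves_list(tree)]   # dendrogram leaf order

    # ordered = semantic_order(chunks, embed)
    # batches = [ordered[i:i+5] for i in range(0, len(ordered), 5)]

Adjacent batches then contain related chunks, which keeps each aggregation call focused.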

ghoul2

Given that the comparison function doesn't have the transitive property (line a > line b and line b > line c implies line a > line c), does this sort really work?

The 'allpair' variant, which aggregates the score across all n*(n-1) comparisons, will work, but the other two couldn't possibly have stable results - the outcome would depend on the order of the items in the input.
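For concreteness, here is a minimal sketch of what allpair-style scoring amounts to: query the (noisy, non-transitive) comparator for every ordered pair and sort by total wins. `llm_prefers` is just a placeholder for the LLM pairwise query, not the plugin's actual API.

    # Allpair sketch: score each item by how many pairwise comparisons it wins.
    from itertools import permutations

    def allpair_rank(items, llm_prefers):
        wins = {item: 0 for item in items}
        for a, b in permutations(items, 2):   # all n*(n-1) ordered pairs
            if llm_prefers(a, b):             # "does a beat b for this question?"
                wins[a] += 1
        return sorted(items, key=lambda x: wins[x], reverse=True)

Because every item is scored against every other item, a single non-transitive answer only nudges the totals instead of flipping the whole order, which is why this variant is more stable.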

vagozino

Yes, that's correct! All three techniques have their trade-offs. I like the term "stability" you are using for this. There are definitely interesting avenues to explore here, trading the number of comparisons against stability.

eterps

So this is actual sorting (i.e. the same number of lines appears in the output)?

Otherwise it might be more or less the same as using https://github.com/TheR1D/shell_gpt ?

vagozino

Yeah, that's right, the same number of lines appears in the output. The `shell_gpt` project does look very similar to the `llm` system, in any case.

adamgordonbell

Have you tried reading the probability of X being the returned token? Then you would probably get better answers and not need to compare every pair.

Log probabilities of output tokens indicate the likelihood of each token occurring given the context.
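A rough sketch of that idea, assuming the OpenAI Python client and a prompt that labels each item with a letter and asks for a single-letter answer (the model name and helper here are illustrative, not part of llm-sort):

    # Logprob sketch: ask for the single best item label and read the
    # top_logprobs of the first output token as relative scores.
    from openai import OpenAI

    client = OpenAI()

    def logprob_scores(items, question, model="gpt-4o-mini"):
        labels = [chr(ord("A") + i) for i in range(len(items))]
        listing = "\n".join(f"{l}. {item}" for l, item in zip(labels, items))
        prompt = f"{question}\n{listing}\nAnswer with one letter only."
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1,
            logprobs=True,
            top_logprobs=min(len(items), 20),  # the API caps top_logprobs at 20
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        return {t.token.strip(): t.logprob for t in top}

One call scores every label at once; sorting the labels by logprob gives a ranking without any pairwise comparisons.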

vagozino

This technique might be more efficient, but the results can be highly sensitive to the order of the input text. The paper [1] I mention in the repo touches upon such methods briefly.

[1]: https://arxiv.org/abs/2306.17563

adamgordonbell

Interesting. I'll check out the paper.

It's astoundingly less efficient, right? How many comparisons (and LLM calls) does it take to rank 10 items in order? And is it actually stable? You could get a ranking with logprobs in one LLM call for 10 items, or do it n=3 times with a shuffled order and average the results. I'm not sure how that scales to larger numbers of items, though.

I guess it depends on how many items you are sorting, but when I think about sorting I think about putting 100+ items in order.
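A minimal sketch of that shuffle-and-average approach (here `rank_once` is a placeholder for a single listwise LLM ranking call that returns the items in ranked order):

    # Rank the same items k times with shuffled input order, then average
    # each item's rank position to wash out order sensitivity.
    import random

    def averaged_rank(items, rank_once, k=3):
        totals = {item: 0.0 for item in items}
        for _ in range(k):
            shuffled = random.sample(items, len(items))  # shuffled copy
            for position, item in enumerate(rank_once(shuffled)):
                totals[item] += position
        return sorted(items, key=lambda x: totals[x] / k)

For 10 items this is k LLM calls instead of the n*(n-1) calls a full pairwise pass would need.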

cratermoon

    cat spelling.txt | llm sort -q "how many times does the letter 'r' appear in the word 'strawberry'?"

    cat geology.txt | llm sort -q "which kind of rock is the best to eat, according to Berkeley geologists?"

    cat law-library.txt | llm sort -q "what legal precedents are best for my lawsuit vs Colombian airline Avianca?"

    cat adhesives.txt | llm sort -q "what glue is best for holding the cheese on my pizza?"

scarface_74

Exactly, this entire idea is based on random probability.

I know that’s always the complaint with LLMs. But at least with the paid ChatGPT (the product, not the API), it has a Python runtime and web search built in, so it could at least attempt to get some of the answers right.

MPSimmons

O(n^f***)

gntousakis

Great work!