Yakread's Ranking Algorithm
10 comments
·April 8, 2025vitus
Hm, I was curious about the biased shuffle, since I was not expecting that particular shape.
It looks like we basically build up the list of bookmarks as follows [0]:
x = rand()
if x < p:
element = first element
else:
element = random element
return [element] + shuffle(rest of list, according to the original order)
In some sense, it's reminiscent of selection sort. Any particular reason for choosing this approach? One "obvious" downside is that this ends up being an O(n*k) algorithm because you end up building a new list each iteration. It's also harder for me to intuitively understand how shuffled the list is based on the parameter, other than at the two extremes.[0] https://github.com/jacobobryant/yakread/blob/bff756c68d86a07...
I'm also curious about the interleaving approach described in the last section:
> If you’ve already scrolled past all your unread bookmarked items several times but you have a bunch of new subscription items, we should probably lean towards recommending the subscription items.
> I do this by comparing the two lists pairwise and selecting an item via weighted random choice based on how many times they’ve been previously skipped (i.e. scrolled past in the For You feed). e.g. if the first bookmark item has been skipped twice and the first subscription item has been skipped once, then there’ll be a 40% chance we select the subscription item and a 60% chance we select the bookmark item.
I thought we wanted to prefer the item that hadn't been skipped as many times (so, prefer the subscription item)?
jacobobryant
For the interleaving, yes we want to prefer the item that's been skipped fewer times. I got the wording backwards in the article; I'll fix that.
For shuffling, I was trying to come up with an approach that would recommend the top k items roughly the same amount regardless of how many total items are in the list. E.g. say you have 10 subscriptions that you really like--I want to have those be a reasonable portion of your recommendations whether you've subscribed to 100 other subs or 1000 other subs.
Contrast that to a weighted random shuffle where each subscription's weight is its affinity score and we sample them based on weight without regard to their order in the original list. That approach is much more influenced by the size of the total list, and my experience is the handful of subscriptions that I really liked were always drowned out by all the other "speculative" subscriptions I had accumulated in my account.
The computational complexity ends up being OK because we generally don't actually need to shuffle the whole list. I recommend items in batches of 30, so we just need to get that many items and then we can abort the shuffle. There probably is some more efficient way to implement this though.
During implementation I was mostly thinking of this as "sampling" rather than "shuffling" actually, and just ended up describing it as the latter when I wrote the post.
jacobobryant
Also--I think the pseudo code you have isn't /quite/ correct. If x is greater than p, we don't immediately take a random element from the list; rather we go to the next element and generate a new x and repeat. I.e. with p=0.1, there's a 10% we immediately take the first item, and if we don't do that, then there's a 10% chance we immediately take the second item, etc. we only pick a completely random item as a fallback if we get to the end of the list without picking anything.
andersmurphy
Thank you for sharing. I was hoping you'd eventually publish an article on Yakread's ranking model, I remember you mentioning it tangentially in one of your talks.
Love how concise the code is.
jacobobryant
It's a fun part of the code to work on for sure :).
I started trying out datastar in another part of the app by the way, mostly for the frontend signals so far e.g. https://github.com/jacobobryant/yakread/blob/8a61443235d1f6f...
Since this app isn't collaborative it's not clear to me if there'd be any benefit of doing any long-lived SSE vs. just sticking with the request-response model I'm already doing with htmx. Also since various parts of the app do take 100 - 1000ms to render/query, so rerendering the whole page on any change I don't think would be a good fit. But even just having the signals part baked in is pretty nice.
andersmurphy
Yeah seems super fun. I've added it to my source reading list.
100% you can just do request response. What's nice is it can still be used to notify a user. Say they make a request that kicks off a background job, you can return UI that says the job has started, keep the connection open (cheap with v threads) and then when the job completes return the notification.
I honestly, wouldn't go down the CQRS road unless you are planning on adding multiplayer/coop/global notifications.
For the slow queries in a CQRS context I mostly solve that with caching. You can of course get fancy with datomiclike transaction queue/notification/listen fine grained pub/sub if that's your thing.
jacobobryant
Yeah that makes sense! Once I'm done with the rewrite I'm planning to go through the app and convert the remaining htmx portions to datastar--should be a decent learning exercise.
Fun to see this on the front page! I worked on Yakread full time for about 8 months as an attempted startup, after a few years of other recommender system startup ideas. Now it's a side project that I develop on the weekends after my kids fall asleep, aided by caffeine (me, not the kids). I'm in the middle of open-sourcing/rewriting it. Hopefully will be done in a couple months? Then I can finally get back to adding new features. I talked about some potential ones in my previous post: https://obryant.dev/p/rewriting-yakread/
also I guess a link to the actual app wouldn't hurt: https://yakread.com