Yakread's Ranking Algorithm

jacobobryant

Fun to see this on the front page! I worked on Yakread full time for about 8 months as an attempted startup, after a few years of other recommender system startup ideas. Now it's a side project that I develop on the weekends after my kids fall asleep, aided by caffeine (me, not the kids). I'm in the middle of open-sourcing/rewriting it. Hopefully will be done in a couple months? Then I can finally get back to adding new features. I talked about some potential ones in my previous post: https://obryant.dev/p/rewriting-yakread/

also I guess a link to the actual app wouldn't hurt: https://yakread.com

vitus

Hm, I was curious about the biased shuffle, since I was not expecting that particular shape.

It looks like we basically build up the list of bookmarks as follows [0]:

    x = rand()
    if x < p:
      element = first element
    else:
      element = random element
    return [element] + shuffle(rest of list, according to the original order)

In some sense, it's reminiscent of selection sort. Any particular reason for choosing this approach? One "obvious" downside is that this ends up being an O(n*k) algorithm because you end up building a new list each iteration. It's also harder for me to intuitively understand how shuffled the list is based on the parameter, other than at the two extremes.

[0] https://github.com/jacobobryant/yakread/blob/bff756c68d86a07...

I'm also curious about the interleaving approach described in the last section:

> If you’ve already scrolled past all your unread bookmarked items several times but you have a bunch of new subscription items, we should probably lean towards recommending the subscription items.

> I do this by comparing the two lists pairwise and selecting an item via weighted random choice based on how many times they’ve been previously skipped (i.e. scrolled past in the For You feed). e.g. if the first bookmark item has been skipped twice and the first subscription item has been skipped once, then there’ll be a 40% chance we select the subscription item and a 60% chance we select the bookmark item.

I thought we wanted to prefer the item that hadn't been skipped as many times (so, prefer the subscription item)?

jacobobryant

For the interleaving, yes we want to prefer the item that's been skipped fewer times. I got the wording backwards in the article; I'll fix that.

For shuffling, I was trying to come up with an approach that would recommend the top k items roughly the same amount regardless of how many total items are in the list. E.g. say you have 10 subscriptions that you really like--I want to have those be a reasonable portion of your recommendations whether you've subscribed to 100 other subs or 1000 other subs.

Contrast that to a weighted random shuffle where each subscription's weight is its affinity score and we sample them based on weight without regard to their order in the original list. That approach is much more influenced by the size of the total list, and my experience is the handful of subscriptions that I really liked were always drowned out by all the other "speculative" subscriptions I had accumulated in my account.

The computational complexity ends up being OK because we generally don't actually need to shuffle the whole list. I recommend items in batches of 30, so we just need to get that many items and then we can abort the shuffle. There probably is some more efficient way to implement this though.

During implementation I was mostly thinking of this as "sampling" rather than "shuffling" actually, and just ended up describing it as the latter when I wrote the post.

jacobobryant

Also--I think the pseudo code you have isn't /quite/ correct. If x is greater than p, we don't immediately take a random element from the list; rather we go to the next element and generate a new x and repeat. I.e. with p=0.1, there's a 10% we immediately take the first item, and if we don't do that, then there's a 10% chance we immediately take the second item, etc. we only pick a completely random item as a fallback if we get to the end of the list without picking anything.

vitus

Ah, I overlooked the `some`. Interesting, that's even more skewed than I initially thought.

vitus

Thanks for providing more insight into the development process that went into designing this algorithm! :)

> Contrast that to a weighted random shuffle where each subscription's weight is its affinity score and we sample them based on weight without regard to their order in the original list. That approach is much more influenced by the size of the total list, and my experience is the handful of subscriptions that I really liked were always drowned out by all the other "speculative" subscriptions I had accumulated in my account.

I personally see that as an argument for having the original position affect the weight.

I experimented with modifying Fisher-Yates to generate a slightly less left-skewed distribution, and decided upon inverse transform sampling as a fairly cheap solution -- basically, with some function f(X): [0,1) -> [0,1), you can transform a uniformly-distributed random variable X ~ Unif[0,1), which then allows you to bias the shuffle without affecting its asymptotic performance.

> There probably is some more efficient way to implement this though.

One potential optimization is to use rejection sampling: keep track of the elements you've seen, then reject any duplicates when pulling out a sample (instead of manually constructing a new list every time). Now, this interacts poorly with your biasing -- in the extreme case where p=1, you'd keep trying to select the first element, then rejecting it. I might then suggest tracking the first element you haven't selected yet, but at some point this ends up being a lot of complexity for asymptotic behavior when what you have is more than good enough for the scale that you're considering.

This also only addresses the "copy the list every iteration of the sampling" aspect of your algorithm's performance, not the fact that with p=0 (or arbitrarily low) you may end up walking the entire list.

(Rejection sampling is more generally using some criterion to reject samples, e.g. if you were trying to select points within a circle by checking x^2 + y^2 < r^2.)

jacobobryant

It is pretty subjective--I design the algorithm with myself in mind first, but I can definitely see a heavier bias toward exploration/variety being better for some others. Something to experiment with.

Thanks for the sampling ideas! Interesting stuff.

andersmurphy

Thank you for sharing. I was hoping you'd eventually publish an article on Yakread's ranking model, I remember you mentioning it tangentially in one of your talks.

Love how concise the code is.

jacobobryant

It's a fun part of the code to work on for sure :).

I started trying out datastar in another part of the app by the way, mostly for the frontend signals so far e.g. https://github.com/jacobobryant/yakread/blob/8a61443235d1f6f...

Since this app isn't collaborative it's not clear to me if there'd be any benefit of doing any long-lived SSE vs. just sticking with the request-response model I'm already doing with htmx. Also since various parts of the app do take 100 - 1000ms to render/query, so rerendering the whole page on any change I don't think would be a good fit. But even just having the signals part baked in is pretty nice.

andersmurphy

Yeah seems super fun. I've added it to my source reading list.

100% you can just do request response. What's nice is it can still be used to notify a user. Say they make a request that kicks off a background job, you can return UI that says the job has started, keep the connection open (cheap with v threads) and then when the job completes return the notification.

I honestly, wouldn't go down the CQRS road unless you are planning on adding multiplayer/coop/global notifications.

For the slow queries in a CQRS context I mostly solve that with caching. You can of course get fancy with datomiclike transaction queue/notification/listen fine grained pub/sub if that's your thing.

jacobobryant

Yeah that makes sense! Once I'm done with the rewrite I'm planning to go through the app and convert the remaining htmx portions to datastar--should be a decent learning exercise.

alexdobrenko

yeaaaaa go jacob go!!! (also hi!)

jacobobryant

ha, hi to you too!

HN

Yakread's Ranking Algorithm

Yakread's Ranking Algorithm