20 lines of code that beat A/B testing every time (2012)
164 comments
· January 9, 2025
xp84
> politically derisks the change, by tying its deployment to rigorous testing that proves it at least does no harm to the existing process before applying it to all users.
I just want to drop here the anecdata that I've worked for a total of about 10 years in startups that proudly call themselves "data-driven" and which worshipped "A/B testing." One of them hired a data science team which actually did some decently rigorous analysis on our tests and advised things like when we had achieved statistical significance, how many impressions we needed to have, etc. The other did not and just had someone looking at very simple comparisons in Optimizely.
In both cases, the influential management people who ultimately owned the decisions would simply rig every "test" to fit the story they already believed, by doing things like running the test only until the results looked "positive," not until they were statistically significant. Or, by measuring several metrics and deciding later on to make the decision based on whichever one was positive [at the time]. Or, by skipping testing entirely and saying we'd just "used a pre/post comparison" to prove it out. Or even by just dismissing a 'failure,' saying we would do it anyway because it's foundational to X, Y, and Z which really will improve (insert metric). The funny part is that none of these people thought they were playing dirty; they believed that they were making their decisions scientifically!
Basically, I suspect a lot of small and medium companies say they do "A/B testing" and are "data-driven" when really they're just using slightly fancy feature flags and relying on some director's gut feelings.
mikepurvis
At a small enough scale, gut feelings can be totally reasonable; taste is important and I'd rather follow an opinionated leader with good taste than someone who sits on their hands waiting for "the data". Anyway, your investors want you to move quickly because they're A/B testing you for survivability against everything else in their portfolio.
The worst is surely when management makes the investments in rigor but then still ignores the guidance and goes with the gut feelings that were available all along.
legendofbrando
Huge plus one to this. We undervalue knowing when to bet on data and when to be comfortable with gut.
weitendorf
I think your management was acting more competently than you are giving them credit for.
If A/B testing data is weak or inconclusive, and you’re at a startup with time/financial pressure, I’m sure it’s almost always better to just make a decision and move on than to spend even more time on analysis and waiting to achieve some fixed level of statistical power. It would be a complete waste of time for a company with limited manpower that needs to grow 30% per year to chase after marginal improvements.
zelphirkalt
One shouldn't claim to be "data-driven" when one doesn't have a clue what that means. Just admit that you will follow the leader's gut feeling at this company then.
DanielHB
I worked at an almost-medium-sized company and we did quite a lot of A/B testing. In most cases the data people would be like "no meaningful difference in user behaviour". Going by gut feeling and overall product metrics (like user churn) turns out to be pretty okay most of the time.
The one place that A/B testing seem to have a huge impact was on the acquisition flow and onboarding, but not in the actual product per se.
petesergeant
> Basically, I suspect a lot of small and medium companies say they do "A/B testing" and are "data-driven" when really they're just using slightly fancy feature flags and relying on some director's gut feelings.
see also Scrum and Agile. Or continuous deployment. Or anything else that's hard to do well, and easier to just cargo-cult some results on and call it done.
throwup238
> In short, not many people want to funnel users through N code paths with slightly different behaviors, because not many people have a ton of users, a ton of engineering capacity, and a ton of potential upside from marginal improvements.
I’ve been in companies that have tried dozens if not hundreds of A/B tests with zero statistically significant results. I figure by the laws of probability they would have gotten at least a single significant experiment, but most products have such small user bases and make such large changes at a time that it’s completely pointless.
All my complaints fell on deaf ears until the PM in charge would get on someone’s bad side and then that metric would be used to push them out. I think they’re largely a political tool like all those management consultants that only come in to justify an executive’s predetermined goals.
dkarl
> I’ve been in companies that have tried dozens if not hundreds of A/B tests with zero statistically significant results.
What I've seen in practice is that some places trust their designers' decisions and only deploy A/B tests when competent people disagree, or there's no clear, sound reason to choose one design over another. Surprise surprise, those alternatives almost always test very close to each other!
Other places remove virtually all friction from A/B testing and then use it religiously for every pixel in their product, and they get results, but often it's things like "we discovered that pink doesn't work as well as red for a warning button," stuff they never would have tried if they didn't have to feed the A/B machine.
From all the evidence I've seen in places I've worked, the motivating stories of "we increased revenue 10% by a random change nobody thought would help" may only exist in blog posts.
DanielHB
In paid B2B SaaS, A/B testing is usually a very good idea for the user acquisition flow and onboarding, but not in the actual product per se.
Once the user has committed to paying, they will probably put up with whatever annoyance you put in their way; and if something is _really_ annoying, paying users often contact the SaaS people directly.
Most SaaS don't really care that much about "engagement" metrics (i.e. keeping users IN the product). These are the kinds of metrics that are the easiest to see move.
In fact most people want a product they can get in and out ASAP and move on with their lives.
zeroCalories
I think trusting your designers is probably the way to go for most teams. Good designers have solid intuitions and design principles for what will increase conversion rates. Many designers will still want a/b tests because they want to be able to justify their impact, but they should probably be denied. For really important projects designers should do small sample size research to validate their designs like we would do in the past.
I think a/b tests are still good for measuring stuff like system performance, which can be really hard to predict. Flipping a switch to completely change how you do caching can be scary.
eru
> I’ve been in companies that have tried dozens if not hundreds of A/B tests with zero statistically significant results.
Well, at least it looks like they avoided p-hacking to show more significance than they had! That's ahead of much of science, alas.
rco8786
> I’ve been in companies that have tried dozens if not hundreds of A/B tests with zero statistically significant results.
Yea, I've been here too. And in every analytics meeting everyone went "well, we know it's not statistically significant but we'll call it the winner anyway". Every. Single. Time.
Such a waste of resources.
hamandcheese
Is it a waste? You proved the change wasn't harmful.
ljm
Tracks that I’ve primarily seen A/B tests used as a mechanism for gradual rollout rather than pure data-driven experimentation. Basically expose functionality to internal users by default then slowly expand it outwards to early adopters and then increment it to 100% for GA.
It’s helpful in continuous delivery setups since you can test and deploy the functionality and move the bottleneck for releasing beyond that.
baxtr
I wouldn’t call that A/B testing but rather a gradual roll-out.
Someone
I think gradual rollout can use the same mechanism, but for a different reason: avoiding pushing out a potentially buggy product to all users in one sweep.
It becomes an A/B test when you measure user activity to decide whether to roll out to more users.
alex7o
I think parent is confusing A/B testing with feature flags, which can be used for A/B tests but also for roll-outs.
cornel_io
If you roll it back upon seeing problems, then you're doing something meaningful, at least. IMO 90+% of the value of A/B testing comes from two things, a) forcing engineers to build everything behind flags, and b) making sure features don't crater your metrics before freezing them in and making them much more difficult to remove (both politically and technically).
Re: b), if you've ever gotten into a screaming match with a game designer angry over the removal of their pet feature, you will really appreciate the political cover that having numbers provides...
hamandcheese
I feel like you are trying to say "sometimes people just need a feature flag". Which is of course true.
necovek
A feature flag can still be fully on or fully off.
The reason they might conflate A/B testing with gradual rollout is that both control who gets the feature flag turned on and who doesn't.
In a sense, A/B testing is a variant of gradual rollout, where you've done it so you can see differences in feature "performance" (eg. funnel dashboards) vs just regular observability (app is not crashing yet).
Basically, a gradual rollout for the purposes of an A/B test.
kavenkanum
Derisking changes may not work sometimes. For example, I don't use Spotify anymore because of their ridiculous A/B tests. In one month I saw 3 totally different designs of the home and my favorite playlists pages on my Android phone. That's it. When you open Spotify only when you start your car, it's ridiculous that you can't find anything while you are in a hurry. That was it. I am no longer a subscriber or a user of this shit service. Sometimes these tests are actually harmful. Maybe others are also driving and trying to manage Spotify at the same time, and then we have actual people killed because of this. Harmless indeed.
Moru
Ever tried a support call helping someone using a website that has A/B testing on? It's a very frustrating experience where you start to think the user on the other side must have mistyped the url. A lot of time wasted on such calls. And yes, the worst is when these things only happen in things you use only when in a hurry.
mewpmewp2
Why do you consider it political? Isn't it just a wise thing to do?
sweezyjeezy
One of the assumptions of vanilla multi-armed bandits is that the underlying reward rates are fixed. It's not valid to assume that in a lot of cases, including e-commerce. The author is dismissive and hand-wavy about this, and having worked in e-commerce SaaS I'd be a bit more cautious.
Imagine that you are running MAB on an website with a control/treatment variant. After a bit you end up sampling the treatment a little more, say 60/40. You now start running a sale - and the conversion rate for both sides goes up equally. But since you are now sampling more from the treatment variant, its aggregate conversion rate goes up faster than the control - you start weighting even more towards that variant.
Fluctuating reward rates are everywhere in e-commerce, and tend to destabilise MAB proportions, even on two identical variants, they can even cause it to lean towards the wrong one. There are more sophisticated MAB approaches that try to remove the identical reward-rate assumption - they have to model a lot more uncertainty, and so optimise more conservatively.
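As a rough illustration of that destabilisation, here is a small simulation sketch (illustrative only, with made-up rates): an epsilon-greedy bandit over two identical variants whose shared conversion rate doubles halfway through. The final traffic split and per-arm aggregates can drift far from 50/50 even though the arms are the same.

    import random

    def simulate(eps=0.1, n=20000, seed=1):
        random.seed(seed)
        trials = [0, 0]   # times each arm was shown
        wins = [0, 0]     # conversions per arm
        for t in range(n):
            # shared, time-varying conversion rate: a "sale" doubles it halfway through
            base_rate = 0.04 if t < n // 2 else 0.08
            # epsilon-greedy: usually exploit the arm with the best observed rate
            if 0 in trials or random.random() < eps:
                arm = random.randrange(2)
            else:
                arm = max(range(2), key=lambda a: wins[a] / trials[a])
            trials[arm] += 1
            # both arms are identical: same conversion probability
            wins[arm] += random.random() < base_rate
        return trials, [w / t for w, t in zip(wins, trials)]

    print(simulate())  # traffic split and observed rates can end up far from 50/50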
necovek
> ...the conversion rate for both sides goes up equally.
If the conversion rate "goes up equally", why did you not measure this and use that as a basis for your decisions?
> its aggregate conversion rate goes up faster than the control - you start weighting even more towards that variant.
This sounds simply like using bad math. Wouldn't this kill most experiments that start with 10% for the variant that do not provide 10x the improvement?
ted_dunning
No. This isn't just bad math.
The problem here is that the weighting of the alternatives changes over time and the thing you are measuring may also change. If you start by measuring the better option, but then bring in the worse option in a better general climate, you could easily conclude the worse option is better.
To give a concrete example, suppose you have two versions of your website, one in English and one in Japanese. Worldwide, Japanese speakers tend to be awake at different hours than English speakers. If you don't run your tests over full days, you may bias the results to one audience or the other. Even worse, weekend visitors may be much different than weekday visitors so you may need to slow down to full weeks for your tests.
Changing tests slowly may mean that you can only run a few tests unless you are looking at large effects which will show through the confounding effects.
And that leads back to the most prominent normal use which is progressive deployments. The goal there is to test whether the new version is catastrophically worse than the old one so that as soon as you have error bars that bound the new performance away from catastrophe, you are good to go.
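As a sketch of that "bounded away from catastrophe" check (a normal-approximation confidence interval on the difference in conversion rates, with made-up counts and a made-up tolerance):

    import math

    def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
        # 95% normal-approximation CI for rate_b - rate_a
        p_a, p_b = conv_a / n_a, conv_b / n_b
        se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
        diff = p_b - p_a
        return diff - z * se, diff + z * se

    # roll forward only once the new version is provably not much worse,
    # e.g. lower bound of the difference above a tolerated -0.5 percentage points
    lo, hi = diff_confidence_interval(conv_a=8000, n_a=200000, conv_b=7900, n_b=200000)
    print(lo > -0.005)  # True here: bounded away from "catastrophe"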
necovek
I mean, sure you could test over only part of the day, but if you do, that is, imho, bad math.
E.g. I could sum up 10 (decimal) and 010 (octal) as 20, but because these are the same digits in different numbering systems, you need to normalize the values to the same base first.
Or I could add up 5 GBP, 5 USD, 5 EUR and 5 JPY and claim I got 20 of "currency", but it doesn't really mean anything.
Otherwise, we are comparing incomparable values, and that's bad math.
Sure, percentages are what everybody gets wrong (hey, percentage points vs percentages), but that does not make them any less wrong. And knowing what is actually comparable when you simply talk in percentages is even harder (as per your examples).
hinkley
It is a universal truth that people fuck up statistical math.
There are three kinds of lies: lies, damn lies, and statistics.
If you aren’t testing at exactly 50/50 (and you can’t, because my plan for visiting a site and for how long will never be equivalent to your plan), then any other factors that can affect conversion rate will cause one partition to go up faster than the other. You have to test at the level of Amazon to get statistical significance anyway. And as many of us have told people until they’re blue in the face: we (you) are not a FAANG company and pretending to be one won’t work.
abetusk
One of the other comment threads has a link to a James LeDoux post about MAB with EG, UCB1, BUCB and EXP3, with EXP3, from what I've seen, marketed as an "adversarial" MAB method [0] [1].
I found a post [2] of doing some very rudimentary testing on EXP3 against UCB to see if it performs better in what could be considered an adversarial environment. From what I can tell, it didn't perform all that well.
Do you, or anyone else, have an actual use case for when EXP3 performs better than any of the standard alternatives (UCB, TS, EG)? Do you have experience with running MAB in adversarial environments? Have you found EXP3 performs well?
[0] https://news.ycombinator.com/item?id=42650954#42686404
[1] https://jamesrledoux.com/algorithms/bandit-algorithms-epsilo...
[2] https://www.jeremykun.com/2013/11/08/adversarial-bandits-and...
hinkley
Motivations can vary on a diurnal basis too. Or based on location. It means something different if I’m using homedepot.com at home or standing in an aisle at the store.
And with physical retailers with online catalogs, an online sale of one item may cannibalize an in-store purchase of not only that item but three other incidental purchases.
But at the end of the day your 60/40 example is just another way of saying: you don’t try to compare two fractions with a different denominator. It’s a rookie mistake.
alex5207
Good point about fluctuating rates during, e.g., a sale period. But couldn't you then pick a metric that doesn't fluctuate?
Out of curiosity, where did you work? I'm in the same space as you.
ertdfgcvb
I don't follow. In this case would sampling 50/50 always give better/unbiased results on the experiment?
sweezyjeezy
Sampling 50/50 will always give you the best chance of picking the best ultimate 'winner' in a fixed time horizon, at the cost of only sampling the winning variant 50% of the time. That's true whether the reward rates are fixed or not. But some changes in reward rates will also cause MAB aggregate statistics to skew in a way that they wouldn't for a 50/50 split, yeah.
zeroCalories
What do you think of using the epsilon-first approach then? We could explore for that fixed time horizon, then start choosing greedy after that. I feel like the only downside is that adding new arms becomes more complicated.
lern_too_spel
Yes.
tjbai
I agree that there's an exploration-exploitation tradeoff, but for what you specifically suggest wouldn't you presumably just normalize by sample size? You wouldn't allocate based off total conversions, but rather a percentage.
jvans
Imagine a scenario where option B does 10x better than option A during the morning hours but 2x worse the rest of the day. If you start the multi-armed bandit in the morning it could converge to option B quickly and dominate the rest of the day even though it performs worse then.
Or in the above scenario option B performs a lot better than option A but only with the sale going, otherwise option B performs worse.
hinkley
One of the problems we caught only once or twice: mobile versus desktop shifting with time of day, and what works on mobile may work worse than on desktop.
We weren’t at the level of hacking our users, just looking at changes that affect response time and resource utilization, and figuring out why a change actually seems to have made things worse instead of better. It’s easy for people to misread graphs. Especially if the graphs are using Lying with Statistics anti-patterns.
sweezyjeezy
Yes, but here's an exaggerated version: say we were to sample for a week at 50/50 when the base conversion rate was at 4%, then we sample at 25/75 for a week with the base conversion rate bumped up to 8% due to a sale.
The average base rate for the first variant is 5.3%, the second is 6.4%. Generally the favoured variant's average will shift faster because we are sampling it more.
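Spelling out that arithmetic (assuming equal traffic per week and two identical variants):

    # week 1: 50/50 split at a 4% base rate; week 2: 25/75 split at 8% (sale on)
    # 10,000 visitors per week, two *identical* variants
    visits_a, visits_b = 5000 + 2500, 5000 + 7500
    conv_a = 5000 * 0.04 + 2500 * 0.08   # 200 + 200 = 400
    conv_b = 5000 * 0.04 + 7500 * 0.08   # 200 + 600 = 800
    print(conv_a / visits_a, conv_b / visits_b)   # ~0.053 vs 0.064, on identical variants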
necovek
Uhm, this still sounds like just bad math.
While it's non-obvious this is the effect, anyone analyzing the results should be aware of it and should only compare weighted averages, or per distinct time periods.
And therein is the largest problem with A/B testing: it's mostly done by people not understanding the math subtleties, thus they will misinterpret results in either direction.
taion
The problem with this approach is that it requires the system doing randomization to be aware of the rewards. That doesn't make a lot of sense architecturally – the rewards you care about often relate to how the user engages with your product, and you would generally expect those to be collected via some offline analytics system that is disjoint from your online serving system.
Additionally, doing randomization on a per-request basis heavily limits the kinds of user behaviors you can observe. Often you want to consistently assign the same user to the same condition to observe long-term changes in user behavior.
This approach is pretty clever on paper but it's a poor fit for how experimentation works in practice and from a system design POV.
hruk
I don't know, all of these are pretty surmountable. We've done dynamic pricing with contextual multi-armed bandits, in which each context gets a single decision per time block and gross profit is summed up at the end of each block and used to reward the agent.
That being said, I agree that MABs are poor for experimentation (they produce biased estimates that depend on somewhat hard-to-quantify properties of your policy). But they're not for experimentation! They're for optimizing a target metric.
hinkley
You do know Amazon got sued and lost for showing different prices to different users? That kind of price discrimination is illegal in the US. Related to actual discrimination.
I think Uber gets away with it because it’s time and location based, not person based. Of course if someone starts pointing out that segregation by neighborhoods is still a thing, they might lose their shiny toys.
empiko
Surmountable, yes, but in practice it is often just too much hassle. If you are doing tons of these tests you can probably afford to invest in the infrastructure for this, but otherwise AB is just so much easier to deploy that it does not really matter to you that you will have a slightly ineffective algo out there for a few days. The interpretation of the results is also easier as you don't have to worry about time sensitivity of the collected data.
jacob019
Hey, I'd love to hear more about dynamic pricing with contextual multi-armed bandits. If you're willing to share your experience, you can find my email on my profile.
ivalm
You can assign multiarm bandit trials on a lazy per user basis.
So first time user touches feature A they are assigned to some trial arm T_A and then all subsequent interactions keep them in that trial arm until the trial finishes.
kridsdale1
The systems I’ve used pre-allocate users effectively randomly to an arm by hashing their user id or equivalent.
hinkley
Just make sure you do the hash right so you don’t end up with cursed user IDs like EverQuest.
ivalm
To make sure user id U doesn’t always end up in e.g. the control group, it’s useful to concatenate the id with the experiment uuid.
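A minimal sketch of that kind of deterministic assignment, assuming a stable string user id (names and weights illustrative):

    import hashlib

    def assign(user_id, experiment_id, arms=("control", "treatment"), weights=(0.5, 0.5)):
        # hash user id concatenated with the experiment id so the same user
        # lands in different buckets across different experiments
        digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform-ish float in [0, 1)
        cumulative = 0.0
        for arm, w in zip(arms, weights):
            cumulative += w
            if bucket < cumulative:
                return arm
        return arms[-1]

    print(assign("user-123", "new-checkout-flow"))  # stable across calls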
ryan-duve
How do you handle different users having different numbers of trials when calculating the "click through rate" described in the article?
s1mplicissimus
Careful when doing that though! I've seen some big eyes when people assumed IDs to be uniformly randomly distributed and suddenly their "test group" was 15% instead of the intended 1%. Better to generate a truly random value using your language's favorite crypto functions and be able to work with it without fear of busting production.
isoprophlex
As one of the comments below the article states, the probabilistic alternative to epsilon-greedy is worth exploring as well. Take the "bayesian bandit", which is not much more complex but a lot more powerful.
If you crave more bandits: https://jamesrledoux.com/algorithms/bandit-algorithms-epsilo...
timr
Just a warning to those people who are potentially implementing it: it doesn't really matter. The blog author addresses this, obliquely (says that the simplest thing is best most of the time), but doesn't make it explicit.
In my experience, obsessing on the best decision strategy is the biggest honeypot for engineers implementing MAB. Epsilon-greedy is very easy to implement and you probably don't need anything more. Thompson sampling is a pain in the butt, for not much gain.
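For reference, a minimal epsilon-greedy sketch along the lines the article describes (a paraphrase with illustrative arm names, not the article's exact code):

    import random

    class EpsilonGreedy:
        def __init__(self, arms, epsilon=0.1):
            self.epsilon = epsilon
            self.trials = {arm: 0 for arm in arms}
            self.rewards = {arm: 0.0 for arm in arms}

        def choose(self):
            # explore with probability epsilon (or if an arm is untried), else exploit
            untried = [a for a, n in self.trials.items() if n == 0]
            if untried:
                return random.choice(untried)
            if random.random() < self.epsilon:
                return random.choice(list(self.trials))
            return max(self.trials, key=lambda a: self.rewards[a] / self.trials[a])

        def update(self, arm, reward):
            self.trials[arm] += 1
            self.rewards[arm] += reward

    bandit = EpsilonGreedy(["orange_button", "green_button"])
    arm = bandit.choose()
    bandit.update(arm, reward=1)  # 1 on conversion, 0 otherwise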
blagie
"Easy to implement" is a good reason to use bubble sort too.
In a normal universe, you just import a different library, so both are the same amount of work to implement.
Multiarmed bandit seems theoretically pretty, but it's rarely worth it. The complexity isn't the numerical algorithm but state management.
* Most AB tests can be as simple as a client-side random() and a log file.
* Multiarmed bandit means you need an immediate feedback loop, which involves things like adding database columns, worrying about performance (since each render requires another database read), etc. Keep in mind the database needs to now store AB test outcomes and use those for decision-making, and computing those is sometimes nontrivial (if it's anything beyond a click-through).
* Long-term outcomes matter more than short-term. "Did we retain a customer" is more important than "did we close one sale."
In most systems, the benefits aren't worth the complexity. Multiple AB tests also add testing complexity. You want to test three layouts? And three user flows? Now, you have nine cases which need to be tested. Add two color schemes? 18 cases. Add 3 font options? 54 cases. The exponential growth in testing is not fun. Fire-and-forget seems great, but in practice, it's fire-and-maintain-exponential complexity.
And those conversion differences are usually small enough that being on the wrong side of a single AB test isn't expensive.
Run the test. Analyze the data. Pick the outcome. Kill the other code path. Perhaps re-analyze the data a year later with different, longer-term metrics. Repeat. That's the right level of complexity most of the time.
If you step up to multiarm, importing a different library ain't bad.
ted_dunning
Multi-armed bandit approaches do not imply an immediate feedback loop. They do the best you can do with delayed feedback or with episodic adjustment as well.
So if you are doing A/B tests, it is quite reasonable to use Thompson sampling at fixed intervals to adjust the proportions. If your response variable is not time invariant, this is actually best practice.
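A minimal sketch of that batched use, assuming binary conversions: keep a Beta posterior per variant and, at each interval, recompute traffic proportions from posterior draws instead of re-randomizing every request (counts illustrative):

    import random

    def thompson_proportions(successes, failures, draws=10000):
        # estimate what fraction of traffic each variant should get next interval
        arms = list(successes)
        wins = {arm: 0 for arm in arms}
        for _ in range(draws):
            # one sample from each posterior Beta(successes + 1, failures + 1)
            sampled = {a: random.betavariate(successes[a] + 1, failures[a] + 1) for a in arms}
            wins[max(sampled, key=sampled.get)] += 1
        return {a: wins[a] / draws for a in arms}

    # counts accumulated from the analytics system since the last adjustment
    print(thompson_proportions(successes={"A": 120, "B": 145}, failures={"A": 2880, "B": 2855}))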
bartread
Sorry but bubble sort is a terrible example here. You implement a more difficult sorting algorithm, like quicksort, because the benefits of doing so, versus using bubble sort, are in many cases huge. I.e., the juice is worth the squeeze.
Whereas the comment you’re responding to is rightly pointing out that for most orgs, the marginal gains of using an approach more complex than Epsilon greedy probably aren’t worth it. I.e., the juice isn’t worth the squeeze.
timr
You've either missed the point of what I wrote, or you're arguing with someone else.
I'm talking about the difference between epsilon-greedy vs. a more complex optimization scheme within the context of implementing MAB. You're making arguments about A/B testing vs MAB.
krackers
There's a good derivation of EXP3 algorithm from standard multiplicative weights which is fairly intuitive. The transformation between the two is explained a bit in https://nerva.cs.uni-bonn.de/lib/exe/fetch.php/teaching/ws18.... Once you have the intuition, then the actual choice of parameters is just cranking out the math
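For the concrete recipe, here is a minimal EXP3 sketch in its standard formulation (parameter values illustrative):

    import math
    import random

    class Exp3:
        def __init__(self, n_arms, gamma=0.1):
            self.gamma = gamma
            self.weights = [1.0] * n_arms

        def probabilities(self):
            total = sum(self.weights)
            k = len(self.weights)
            return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

        def choose(self):
            probs = self.probabilities()
            return random.choices(range(len(probs)), weights=probs)[0]

        def update(self, arm, reward):  # reward assumed in [0, 1]
            probs = self.probabilities()
            estimated = reward / probs[arm]  # importance-weighted reward estimate
            self.weights[arm] *= math.exp(self.gamma * estimated / len(self.weights))

    bandit = Exp3(n_arms=3)
    arm = bandit.choose()
    bandit.update(arm, reward=1.0)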
hruk
We've been happy using Thompson sampling in production with this library https://github.com/bayesianbandits/bayesianbandits
tracerbulletx
A lot of sites don't have enough traffic to get statistical significance with this in a reasonable amount of time and it's almost always testing a feature more complicated than button color where you aren't going to have more than the control and variant.
kridsdale1
I’ve only implemented A/B/C tests at Facebook and Google, with hundreds of millions of DAU on the surfaces in question, and three groups is still often enough to dilute the measurement in question below stat-sig.
ryan-duve
> A lot of sites don't have enough traffic to get statistical significance with this in a reasonable amount of time
What's nice about AB testing is the decision can be made on point estimates, provided the two choices don't have different operational "costs". You don't need to know that A is better than B, you just need to pick one and the point estimate gives the best answer with the available data.
I don't know of a way to determine whether A is better than B with statistical significance without letting the experiment run, in practice, for way too long.
wiml
If the effect size x site traffic is so small it's statistically insignificant, why are you doing all this work in the first place? Just choose the option that makes the PHB happy and move on.
(But, it's more likely that you don't know if there's a significant effect size)
douglee650
Yes wondering what the confidence intervals are.
munro
Here's an interesting write-up on various algorithms & different epsilon-greedy % values.
https://github.com/raffg/multi_armed_bandit
It shows 10% exploration performs the best; a very simple, effective algorithm.
It also shows the Thompson sampling algorithm converges a bit faster: the best arm is chosen by sampling from the beta distribution, which eliminates the explore phase. And you can use the builtin random.betavariate!
https://github.com/raffg/multi_armed_bandit/blob/42b7377541c...
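A per-decision version of that is only a few lines (a sketch; conversion counts assumed to be tracked elsewhere):

    import random

    def pick_arm(stats):
        # stats: {arm: (conversions, impressions)}; returns the arm to show next
        draws = {
            arm: random.betavariate(conv + 1, imp - conv + 1)  # Beta posterior sample
            for arm, (conv, imp) in stats.items()
        }
        return max(draws, key=draws.get)

    print(pick_arm({"A": (10, 300), "B": (16, 310)}))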
SideQuark
What an odd, self-contradictory post. "In recent years, hundreds of the brightest minds of modern civilization have been hard at work not curing cancer.", along with phrases like "defective by design" and the implied loss from not giving all people in a drug trial the new medicine, imply he thinks all of a resource needs to be allocated to what he perceives (incorrectly, as trivially shown below) as the best use. Then, in the multi-armed bandit, he advocates wasting (by his own framing in those other statements) 10% of what is the best use on random other uses for exploration.
However, all this fails. For optimal output (be it drug research, allocation of brains, or how to run a life), putting all resources on the problem perceived as "most important" is a sub-optimal use of resources. It always gives a better expected return to allocate resources where they have the best marginal return. If that place is apps, not cancer, then wishing for brains to work on cancer because some would view that as a more important problem may simply be a waste of brains.
So if cancer is going to be incredibly hard to solve, and mankind empirically gets utility from better apps, then a better use is to put those brains on apps; then they're not wasted on a probably unsolvable problem and are put to use making things that do increase value.
He also ignores that in real life the cost of having a zillion running experiments constantly flipping alternatives does not scale, so in no way can a company at scale replace A/B with multi-armed bandits. One reason is simple: at any time a large company is running thousands to maybe 100k A/B tests, each running maybe 6 months, at which point a code path is selected, dead paths are removed, and this repeats continually. If that old code is not killed, and every feature from all time needs to be randomly on/off, then there is no way over time to move much of the app forward. It's not effective or feasible to build many new features if you must also allow interacting with those from 5-10 years ago.
A simple Google search shows tons more reasons, from math to practical, that this post is bad advice.
crazygringo
No, multi-armed bandit doesn't "beat" A/B testing, nor does it beat it "every time".
Statistical significance is statistical significance, end of story. If you want to show that option B is better than A, then you need to test B enough times.
It doesn't matter if you test it half the time (in the simplest A/B) or 10% of the time (as suggested in the article). If you do it 10% of the time, it's just going to take you five times longer.
And A/B testing can handle multiple options just fine, contrary to the post. The name "A/B" suggests two, but you're free to use more, and this is extremely common. It's still called "A/B testing".
Generally speaking, you want to find the best option and then remove the other ones because they're suboptimal and code cruft. The author suggests always keeping 10% exploring other options. But if you already know they're worse, that's just making your product worse for those 10% of users.
LPisGood
Multi-arm bandit does beat A/B testing in the sense that standard A/B testing does not seek to maximize reward during the testing period, while MAB does. MAB also generalizes better to testing many things than A/B testing does.
cle
This is a double-edged sword. There are often cases in real-world systems where the "reward" the MAB maximizes is biased by eligibility issues, system caching, bugs, etc. If this happens, your MAB has the potential to converge on the worst possible experience for your users, something a static treatment allocation won't do.
LPisGood
I haven’t seen these particular shortcomings before, but I certainly agree that if your data is bad, this ML approach will also be bad.
Can you share some more details about your experiences with those particular types of failures?
crazygringo
No -- you can't have your cake and eat it too.
You get zero benefits from MAB over A/B if you simply end your A/B test once you've achieved statistical significance and pick the best option. Which is what any efficient A/B test does -- there is no reason to have any fixed "testing period" beyond what is needed to achieve statistical significance.
While, to the contrary, the MAB described in the article does not maximize reward -- as I explained in my previous comment. Because the post's version runs indefinitely, it has worse long-term reward because it continues to test inferior options long after they've been proven worse. If you leave it running, you're harming yourself.
And I have no idea what you mean by MAB "generalizing" more. But it doesn't matter if it's worse to begin with.
(Also, it's a huge red flag that the post doesn't even mention statistical significance.)
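For reference, the kind of significance check being referred to is usually a two-proportion z-test; a minimal sketch with made-up counts:

    import math

    def two_proportion_z(conv_a, n_a, conv_b, n_b):
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        return (p_b - p_a) / se  # |z| > 1.96 ~ significant at the 5% level (two-sided)

    print(two_proportion_z(conv_a=200, n_a=5000, conv_b=260, n_b=5000))  # ~2.9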
LPisGood
> you can't have your cake and eat it too
I disagree. There is a vast array of literature on solving the MAB problem that may as well be grouped into a bin called “how to optimally strike a balance between having one’s cake and eating it too.”
The optimization techniques to solve MAB problem seek to optimize reward by giving the right balance of exploration and exploitation. In other words, these techniques attempt to determine the optimal way to strike a balance between exploring if another option is better and exploiting the option currently predicted to be best.
There is a strong reason this literature doesn’t start and end with: “just do A/B testing, there is no better approach”
MichaelDickens
    # for each lever,
    # calculate the expectation of reward.
    # This is the number of trials of the lever divided by the total reward
    # given by that lever.
    # choose the lever with the greatest expectation of reward.
If I'm not mistaken, this pseudocode has a bug that will result in choosing the expected worst option rather than the expected best option. I believe it should read "total reward given by the lever divided by the number of trials of that lever".
reader9274
Correct, that's why I don't trust reading code comments
jbentley1
Multi-armed bandits make a big assumption that effectiveness is static over time. What can happen is that if they tip traffic slightly towards option B at a time when effectiveness is higher (maybe a sale just started) B will start to overwhelmingly look like a winner and get locked in that state.
You can solve this with propensity scores, but it is more complicated to implement and you need to log every interaction.
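A minimal sketch of what that correction can look like, assuming every interaction is logged with the probability the served arm had of being chosen at that moment (a self-normalized inverse-propensity estimate; numbers illustrative):

    def ipw_rate(logs, arm):
        # logs: list of (arm_served, reward, propensity) tuples, one per interaction.
        # Self-normalized inverse-propensity estimate of the arm's conversion rate:
        # interactions collected while the arm was heavily served count less each.
        num = sum(r / p for a, r, p in logs if a == arm)
        den = sum(1 / p for a, r, p in logs if a == arm)
        return num / den if den else float("nan")

    logs = [("B", 1, 0.75), ("A", 0, 0.25), ("B", 0, 0.75), ("A", 1, 0.5), ("B", 1, 0.5)]
    print(ipw_rate(logs, "A"), ipw_rate(logs, "B"))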
LPisGood
This objection is mentioned specifically in the post.
You can add a forgetting factor for older results.
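One common way to do that is to exponentially discount the running counts before each update so older observations fade; a minimal sketch (discount value illustrative):

    def discounted_update(trials, rewards, arm, reward, discount=0.999):
        # decay every arm's history a little, then record the new observation
        for a in trials:
            trials[a] *= discount
            rewards[a] *= discount
        trials[arm] += 1
        rewards[arm] += reward
        return {a: rewards[a] / trials[a] if trials[a] else 0.0 for a in trials}

    trials = {"A": 500.0, "B": 500.0}
    rewards = {"A": 20.0, "B": 25.0}
    print(discounted_update(trials, rewards, "A", reward=1))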
randomcatuser
This seems like a fudge factor though. Some things are changed because you act on them! (e.g. recommendation systems that are biased towards more popular content). So having dynamic groups makes the data harder to analyze.
LPisGood
A standard formulation of MAB problem assumes that acting will impact the rewards, and this forgetting factor approach is one which allows for that and still attempts to find the currently most exploitable lever.
lern_too_spel
That's a different problem. In jbentley1's scenario, A could be better, but this algorithm will choose B.
iforgot22
I don't like how this dismisses the old approach as "statistics are hard for most people to understand." This algo beats A/B testing in terms of maximizing how many visitors get the best feature. But is that really a big enough concern IRL that people are interested in optimizing it every time? Every little dynamic lever adds complexity to a system.
recursivecaveat
Indeed perhaps we should applaud people for choosing statistical tools that are relatively easy to use and interpret, rather than deride them for not stepping up to the lathe that they didn't really need and we admit has lots of sharp edges.
rerdavies
I think you missed the point. It's not about which visitors get the best feature. It's about how to get people to PUSH THE BUTTON!!!!! Which is kind of the opposite of the best feature. The goal is to make people do something they don't want to do.
Figuring out best features is a completely different problem.
iforgot22
I didn't say it was the best for the user. Really the article misses this by comparing a new UI feature to a life-saving drug, but it doesn't matter. The point is, whatever metric you're targeting, do you use this algo or fixed group sizes?
randomcatuser
Yeah basically. The idea is that somehow this is the data-optimal way of determining which one is the best (rather than splitting your data 50/50 and wasting a lot of samples when you already know)
The caveats (perhaps not mentioned in the article) are:
- Perhaps you have many metrics you need to track/analyze (CTR, conversion, rates on different metrics), so you can't strictly do bandit!
- As someone mentioned below, sometimes the situation is dynamic (so having evenly sized groups helps with capturing this effect)
- Maybe some other ones I can't think of?
But you can imagine this kind of auto-testing being useful... imagine AI continually pushes new variants, and it just continually learns which one is the best
cle
It still misses the biggest challenge though--defining "best", and ensuring you're actually measuring it and not something else.
It's useful as long as your definition is good enough and your measurements and randomizations aren't biased. Are you monitoring this over time to ensure that it continues to hold? If you don't, you risk your MAB converging on something very different from what you would consider "the best".
When it converges on the right thing, it's better. When it converges on the wrong thing, it's worse. Which will it do? What's the magnitude of the upside vs downside?
desert_rue
Are you saying that it may do something like improve click-the-button conversion but lead to less sales overall?
iforgot22
Facebook or YouTube might already be using an algo like this or AI to push variants, but for each billion user product, there are probably thousands of smaller products that don't need something this automated.
royal-fig
If multi-armed bandits have piqued your curiosity, we recently added support for them to our feature flagging and experimentation platform, GrowthBook.
We talk about it here: https://blog.growthbook.io/introducing-multi-armed-bandits-i...
awkward
Pure, disinterested A/B testing, where the goal is just to find the good way to do it and there's enough leverage and traffic that funding the A/B test is worthwhile, is rare.
More frequently, A/B testing is a political technology that allows teams to move forward with changes to core, vital services of a site or app. By putting a new change behind an A/B test, the team technically derisks the change, by allowing it to be undone rapidly, and politically derisks the change, by tying its deployment to rigorous testing that proves it at least does no harm to the existing process before applying it to all users. The change was judged to be valuable when development effort went into it, whether for technical, branding or other reasons.
In short, not many people want to funnel users through N code paths with slightly different behaviors, because not many people have a ton of users, a ton of engineering capacity, and a ton of potential upside from marginal improvements. Two path tests solve the more common problem of wanting to make major changes to critical workflows without killing the platform.