MAB vs. A/B Testing: Choosing the Right Algorithm for Growth

TL;DR: In product optimization, multi-armed bandit algorithms (MAB) outperform serial A/B testing (SAB) by requiring significantly less traffic to identify winning variants and delivering a higher overall conversion rate (CVR) uplift. Hence, Levered uses specialized, contextual bandit algorithms to efficiently optimize apps and websites. In this article, we explore why MAB algorithms outperform traditional A/B testing in identifying variants that drive conversion improvements. We demonstrate this through a simulation of a typical website optimization scenario.

[This is the short version of an article published here.]

Growth optimization vs. product testing

Incremental product growth optimization is inherently different from traditional product work. That's also why companies such as Meta have dedicated growth orgs that are separate from the core product organization.

In “growth optimization”, testing velocity is key. It differs from classic feature testing in several ways:

As a result, success in growth optimization hinges on a team's ability to identify and accumulate many small wins, whereas core product work focuses more on de-risking bigger bets.

Limitations of classic A/B testing

In the context of growth optimization, A/B testing often falls short due to the large sample size required and the rigidity of the statistical approach:

Multi-Armed Bandits

Multi-armed bandits are not new. They have been used successfully in areas such as search or ads optimization for quite some time. However, they are much less commonly used in product optimization compared to A/B testing.

MAB algorithms balance two competing goals: exploration and exploitation. Exploration aims to test as many ideas as possible, while exploitation aims to maximize the overall conversion rate by showing the ideas that have worked so far to as many users as possible.

An effective technique to balance exploitation and exploration is called “Thompson Sampling.” Simply put, it works like this:
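As a minimal sketch, here is one common formulation, Beta-Bernoulli Thompson Sampling. Choosing this formulation is an assumption for illustration; it is not necessarily the exact variant Levered runs. Each variant keeps a Beta posterior over its conversion rate. For every user, we draw one sample per variant, show the variant with the highest draw, and update that variant with the outcome. The names `BetaArm` and `run_thompson`, the Beta(1, 1) prior, and the example rates are all hypothetical.

```python
import random


class BetaArm:
    """Beta posterior over one variant's conversion rate (assumed Beta(1, 1) prior)."""

    def __init__(self):
        self.conversions = 0
        self.misses = 0

    def sample(self, rng):
        # Draw one plausible conversion rate from the current posterior.
        return rng.betavariate(1 + self.conversions, 1 + self.misses)

    def update(self, converted):
        if converted:
            self.conversions += 1
        else:
            self.misses += 1


def run_thompson(true_rates, n_users, seed=0):
    """Simulate n_users visits; return how often each variant was shown, plus the arms."""
    rng = random.Random(seed)
    arms = [BetaArm() for _ in true_rates]
    shown = [0] * len(true_rates)
    for _ in range(n_users):
        draws = [arm.sample(rng) for arm in arms]
        best = draws.index(max(draws))  # variant with the highest sampled rate
        shown[best] += 1
        # Simulated user: converts with the variant's (hidden) true rate.
        arms[best].update(rng.random() < true_rates[best])
    return shown, arms


# Hypothetical true conversion rates, unknown to the algorithm.
shown, arms = run_thompson([0.02, 0.03, 0.04], n_users=5000)
print(shown)
```

Uncertain variants produce widely scattered draws, so they still get shown occasionally (exploration). Variants with strong evidence produce consistently high draws and receive most of the traffic (exploitation).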

MAB algorithms are particularly data-efficient when paired with so-called “hierarchical Bayesian” models. These models recognize that a product design is a function of multiple variables (so-called “factors”) that may vary in importance.

For example, on a product page, the product image may be more important than the product description and have more influence when it comes to optimizing conversion. The algorithm learns this from user interactions and allocates traffic accordingly—i.e., it prioritizes finding the best image over the best product description.

The hierarchical approach offers a significant advantage over A/B testing, since it helps avoid wasting traffic on finding the best levels of unimportant factors.
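The pooling intuition behind this can be illustrated with a short sketch. It is not a hierarchical Bayesian model and not Levered's algorithm; it only shows that every observation of a variant also informs each factor level that variant contains. The function names and the "spread" importance proxy are assumptions made for illustration.

```python
from collections import defaultdict


def factor_level_rates(observations):
    """Pool (variant, converted) observations into a conversion rate per (factor, level)."""
    shown = defaultdict(int)
    converted_count = defaultdict(int)
    for variant, converted in observations:
        # A variant is a tuple of levels, one entry per factor.
        for factor, level in enumerate(variant):
            shown[(factor, level)] += 1
            converted_count[(factor, level)] += int(converted)
    return {key: converted_count[key] / shown[key] for key in shown}


def factor_spread(rates):
    """Crude importance proxy: best minus worst pooled rate within each factor."""
    by_factor = defaultdict(list)
    for (factor, _level), rate in rates.items():
        by_factor[factor].append(rate)
    return {factor: max(values) - min(values) for factor, values in by_factor.items()}


# Toy data with two factors and two levels each: only factor 0 matters.
observations = [
    ((0, 0), True),
    ((0, 1), True),
    ((1, 0), False),
    ((1, 1), False),
]
rates = factor_level_rates(observations)
print(factor_spread(rates))
```

With four levels per factor, each level appears in a quarter of all impressions, whereas a single one of the 64 variants from the scenario below would see only 1/64 of them under uniform traffic. That gap is why learning at the factor level needs far less data.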

Benchmarking algorithms

At Levered, we use custom hierarchical MAB algorithms for automated product growth optimization. But how do we know this works better than running a series of A/B tests? And how can we quantify “better”?

The best way to directly compare the two approaches is by running a simulation. In such a simulation, we define a typical product optimization scenario and then observe how effective each algorithm is in finding “winners” and improving the overall conversion rate over time.

Figure 1: Algorithm comparison, sequential A/B testing vs. multi-armed bandit

Defining the scenario

We need to define a scenario that is a fair representation of the problem that a growth or CRO team faces when optimizing a user experience, e.g. a landing page, Shopify store, or the product onboarding journey of a SaaS platform.

Let's define the scenario as follows:

We are optimizing a UX across three variables (“factors”), e.g., the headline, hero image, and CTA copy of a landing page. For each factor, we want to explore four different levels (e.g., four different headlines). This makes 4×4×4 = 64 possible variants.

The three different factors vary in importance and each variant has a “true” conversion rate between 2% and 4% (both ex-ante unknown). The team now aims to find the best-converting variant by either running a series of A/B tests or an MAB optimization.
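One way to generate such a scenario is sketched below. The article does not specify how the true conversion rates are constructed, so the additive model, the importance weights (0.6, 0.3, 0.1), and the function name `make_scenario` are assumptions for illustration only.

```python
import itertools
import random


def make_scenario(n_factors=3, n_levels=4, importance=(0.6, 0.3, 0.1),
                  low=0.02, high=0.04, seed=0):
    """Return {variant: true_rate} for every combination of factor levels."""
    if len(importance) != n_factors:
        raise ValueError("need one importance weight per factor")
    rng = random.Random(seed)
    # Hypothetical hidden effect of each level of each factor.
    effects = [[rng.random() for _ in range(n_levels)] for _ in range(n_factors)]
    variants = list(itertools.product(range(n_levels), repeat=n_factors))
    # Additive score: important factors move the score more than unimportant ones.
    scores = [
        sum(weight * effects[factor][level]
            for factor, (weight, level) in enumerate(zip(importance, variant)))
        for variant in variants
    ]
    lo, hi = min(scores), max(scores)
    # Rescale so the worst variant converts at `low` and the best at `high`.
    return {
        variant: low + (high - low) * (score - lo) / (hi - lo)
        for variant, score in zip(variants, scores)
    }


rates = make_scenario()
print(len(rates))  # 64
```

Each key is a tuple such as `(2, 0, 3)`, meaning headline 2, hero image 0, and CTA copy 3 in the landing page example.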

Figure 2: Optimization scenario with three factors and four levels each

Protocol for A/B testing

Starting from a baseline variant and an alternative variant, we assign incoming users to one or the other and compute empirical conversion rates for both variants. When we hit a given sample size, we carry out a pairwise hypothesis test. The null hypothesis of this test is that the alternative is no better than the baseline. The test then calculates how likely a result at least as extreme as the observed data would be under the null hypothesis.

If we fail to reject the null hypothesis, we assume that the effects we've seen in the data so far are from pure chance and stick with the baseline variant. Otherwise, we adopt the alternative as the new baseline.

This process is repeated until all variants have been tested, or the available sample size is exhausted. The baseline that remains after the last test is kept as the winning variant.
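The protocol above can be sketched as follows. The article does not name the test or its settings, so the one-sided pooled two-proportion z-test, the equal traffic split, the 5% significance level, and the function names are assumptions.

```python
import math
import random
from statistics import NormalDist


def one_sided_p_value(conv_base, n_base, conv_alt, n_alt):
    """p-value for H0 'alternative is no better than baseline' (pooled z-test)."""
    pooled = (conv_base + conv_alt) / (n_base + n_alt)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_alt))
    if std_err == 0:
        return 1.0  # no conversions (or only conversions): no evidence either way
    z = (conv_alt / n_alt - conv_base / n_base) / std_err
    return 1 - NormalDist().cdf(z)


def sequential_ab(true_rates, users_per_arm, total_users, alpha=0.05, seed=0):
    """Test challengers one by one against the current baseline."""
    rng = random.Random(seed)
    baseline, used = 0, 0
    for challenger in range(1, len(true_rates)):
        if used + 2 * users_per_arm > total_users:
            break  # available sample size exhausted
        conv_base = sum(rng.random() < true_rates[baseline]
                        for _ in range(users_per_arm))
        conv_alt = sum(rng.random() < true_rates[challenger]
                       for _ in range(users_per_arm))
        used += 2 * users_per_arm
        if one_sided_p_value(conv_base, users_per_arm,
                             conv_alt, users_per_arm) < alpha:
            baseline = challenger  # null rejected: adopt the alternative
    return baseline, used


# Hypothetical rates; the index of the final baseline is the declared winner.
winner, used = sequential_ab([0.02, 0.03, 0.04, 0.025],
                             users_per_arm=5000, total_users=30000)
print(winner, used)
```

Note that each pairwise test consumes its full sample before the next one can start. With small true differences such as 2% vs. 3%, the traffic budget can run out long before all 64 variants have been compared.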

Protocol for Multi-Armed-Bandit Optimization (MAB)

For MAB, the protocol is even more straightforward:

All 64 variants are tested at the same time. Initially, all variants have the same weight, i.e. have the same chance of being shown to a user. We then use Thompson Sampling to gradually shift more traffic to the best performers. Ultimately, all traffic is allocated to the winning variant.
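One way to make the notion of a variant's weight concrete is the posterior probability that the variant is the best one, estimated by repeated sampling. This is a sketch under assumptions: it uses Beta posteriors with a Beta(1, 1) prior and a hypothetical function `optimality_probabilities`, neither of which is confirmed by the article.

```python
import random


def optimality_probabilities(conversions, misses, n_draws=2000, seed=0):
    """Estimate, per variant, the probability that it has the highest true rate."""
    rng = random.Random(seed)
    wins = [0] * len(conversions)
    for _ in range(n_draws):
        # One joint draw from all posteriors; the highest draw "wins" this round.
        draws = [rng.betavariate(1 + c, 1 + m)
                 for c, m in zip(conversions, misses)]
        wins[draws.index(max(draws))] += 1
    return [w / n_draws for w in wins]


# Before any data, all 64 variants are equally likely to be the best.
start = optimality_probabilities([0] * 64, [0] * 64)
print(max(start), min(start))

# After some (hypothetical) data, weight concentrates on the stronger variant.
later = optimality_probabilities([20, 45], [980, 955])
print(later)
```

Showing each variant in proportion to these probabilities is what Thompson Sampling does implicitly when it serves the variant with the highest draw for each user.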

[Find a more detailed explanation with code examples in the long version of this article here.]

Quantifying the performance of both methods

With two competing alternatives formally defined, we can now turn to our experimental protocol. We empirically test MAB and A/B testing on the design seen in Figure 2. Performance of the algorithms is quantified statistically, that is, in terms of average performance over a large number of experiments. We focus on three characteristics:

In the case of MAB, the conversion rate of the system corresponds to the average over all variants, weighted by the optimality probabilities. In the case of A/B testing, it is the average of the conversion rate of the baseline and the alternative.
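Written out as code, the two definitions look like this. The function names are hypothetical, and the optimality probabilities are taken as given inputs.

```python
def mab_system_cvr(true_rates, optimality_probs):
    """Average true conversion rate, weighted by each variant's optimality probability."""
    if len(true_rates) != len(optimality_probs):
        raise ValueError("need one probability per variant")
    return sum(p * r for p, r in zip(optimality_probs, true_rates))


def ab_system_cvr(baseline_rate, alternative_rate):
    """Average of the two variants currently under test."""
    return (baseline_rate + alternative_rate) / 2


# Hypothetical example: a bandit that already favors the better variant.
print(mab_system_cvr([0.02, 0.04], [0.25, 0.75]))
print(ab_system_cvr(0.02, 0.04))
```

While a test is running, the A/B system conversion rate sits at the midpoint of the two variants in this definition. The MAB rate moves toward the best variant as the optimality probabilities concentrate.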

Figure 3: Conversion uplift of MAB (orange) and A/B testing (blue)

We visualize the outcome of the experiment in Figure 3, where we show the mean conversion rate uplift of both MAB and A/B testing as solid lines, as well as the interval which contains 50% of simulations. These are shown as functions of the number of user interactions that the system has carried out. We observe that on average, MAB:

In conjunction, these observations support our conceptual reasoning about the advantages of MAB, and indicate that substantial conversion uplift can be achieved even with moderate traffic volumes.
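The summary curves described for Figure 3 can be computed from repeated simulation runs along these lines. This is a sketch that assumes the 50% band is the range between the 25th and 75th percentiles, which the article does not state explicitly. The function name and the toy data are hypothetical.

```python
from statistics import mean, quantiles


def summarize_runs(runs):
    """runs: equally long uplift curves, one per simulation (at least two runs).

    Returns one (mean, 25th percentile, 75th percentile) tuple per time step.
    """
    summary = []
    for values_at_step in zip(*runs):
        q25, _median, q75 = quantiles(values_at_step, n=4, method="inclusive")
        summary.append((mean(values_at_step), q25, q75))
    return summary


# Toy data: five simulated runs, two time steps each.
runs = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
for step, (avg, lower, upper) in enumerate(summarize_runs(runs)):
    print(step, avg, lower, upper)
```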

Discussion

So what does this mean for businesses in need of data-driven product optimization? Should they quit A/B testing and go all in on MAB?

When it comes to testing expensive changes, such as new feature launches, A/B testing is still a valid approach. That's because both the risk and the expected change in CVR are relatively high. The larger the company and the more traffic the product gets, the more likely it is that A/B testing will work out just fine.

However, when the space of options is large, traffic is sparse, and the potential cost of experimentation is moderate, MAB is the superior alternative, as shown in the simulation above. This applies particularly in the context of CRO in small and medium enterprises.