
Smarter A/B Testing

powered by Reinforcement Learning

Prerequisites

  • A/B testing: A method for comparing two variations to determine which is better
  • Different from gradual rollout in canary releases
  • In this session, we consider an example of comparing N design variations on an e-commerce site

Standard A/B Testing

  1. Randomly assign users to one of A, B, ... N
  2. Display a different design to each group
  3. Compare cart-in rates after a fixed period
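The three steps above can be sketched as a small simulation. The true cart-in rates below are illustrative assumptions, not measured values:

```python
import random

# Minimal sketch of a standard A/B test: random assignment, a fixed
# test period, then comparison of cart-in rates. TRUE_RATES is assumed.
TRUE_RATES = {"A": 0.05, "B": 0.06}

def run_ab_test(n_visitors, seed=0):
    rng = random.Random(seed)
    shown = {v: 0 for v in TRUE_RATES}
    carts = {v: 0 for v in TRUE_RATES}
    for _ in range(n_visitors):
        variant = rng.choice(list(TRUE_RATES))   # 1. random assignment
        shown[variant] += 1                      # 2. display the design
        if rng.random() < TRUE_RATES[variant]:   # simulated cart-in
            carts[variant] += 1
    # 3. compare cart-in rates after the fixed period
    return {v: carts[v] / shown[v] for v in TRUE_RATES}

print(run_ab_test(100_000))
```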

Problems

  1. How to determine the test duration (1 week? 1 month?)
  2. Potential for losses during the test period -> Displaying a poor design worsens conversions

Example

When redesigning a product detail page:

  • A: Existing design (cart-in rate: 5%)
  • B: New design (cart-in rate: ??)

Example

Potential losses from A/B testing in this case:

  1. If B is better: A longer A/B test delays migration to B
  2. If A is better: A higher proportion of B increases losses

Solutions

  1. How to determine the test duration
  • -> End when the required number of samples has been collected
  2. Potential for losses during the test period
  • -> Display the better design at a higher rate
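"End when the required number of samples has been collected" can be made concrete with the standard sample-size formula for a two-proportion test. The significance level and power below are common defaults, used here as assumptions:

```python
import math
from statistics import NormalDist

def samples_per_variant(p_base, p_new, alpha=0.05, power=0.8):
    """Approximate sample size per variant for detecting the difference
    between two conversion rates with a two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 at alpha=0.05
    z_b = NormalDist().inv_cdf(power)           # e.g. 0.84 at power=0.8
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_a + z_b) ** 2 * var / (p_base - p_new) ** 2)

# e.g. detecting a lift in cart-in rate from 5% to 6%
print(samples_per_variant(0.05, 0.06))  # roughly 8,000+ visitors per variant
```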

Reinforcement Learning

The Exploration-Exploitation Dilemma

  • Exploration: Try new designs
  • Exploitation: Display the current best design

-> Exploration carries risk, but exploitation alone brings no improvement

Multi-Armed Bandit Problem

  • N slot machines
  • Each has a fixed winning probability p_n
  • Want to find the machine with the highest winning probability in the fewest trials
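The setup above can be modeled as a tiny environment. The winning probabilities are illustrative assumptions:

```python
import random

# Minimal multi-armed bandit environment: N "machines", each with a
# fixed (hidden) winning probability.
class Bandit:
    def __init__(self, probs, seed=0):
        self.probs = probs
        self.rng = random.Random(seed)

    def pull(self, arm):
        """Return 1 on a win, 0 otherwise."""
        return 1 if self.rng.random() < self.probs[arm] else 0

bandit = Bandit([0.04, 0.05, 0.07])
print(sum(bandit.pull(2) for _ in range(10_000)) / 10_000)  # ~0.07
```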


Multi-Armed Bandit Problem

In terms of A/B testing...

  • N designs A, B, C, ...
  • Each has a fixed cart-in rate p_A, p_B, ...
  • Want to find the design with the highest cart-in rate in the fewest impressions (shortest test period)

Bandit Algorithms

  • The details are complex, so here is a quick overview
  • Consider an A/B test for N designs A, B, C, ...
  • CVR = cart-in rate

Bandit Algorithms

  1. Epsilon-greedy
  2. Softmax
  3. Thompson Sampling
  4. UCB1

How to Read the Graphs

  • x-axis: Number of impressions (new visitors)
  • y-axis: Proportion of time the best design was displayed


1. Epsilon-greedy

  • With probability epsilon, explore
    • Display a completely random design
  • With probability 1-epsilon, exploit
    • Display the current best design
  • The best design is determined by the interim CVR of each design
  • Epsilon is typically between 0.1 and 0.3


Epsilon-greedy

  • Highly random behavior
  • Performance depends on the choice of epsilon


2. Softmax

  • Selects designs probabilistically based on each design's interim CVR
  • Designs with higher CVR have a higher probability of being selected
  • Reduces wasteful exploration

2. Softmax

  • A function that converts any collection of numbers into probabilities

\text{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
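A minimal sketch of softmax selection over interim CVRs. The temperature parameter tau is not part of the formula above; it is a commonly added tuning knob (an assumption here), where smaller tau makes selection greedier:

```python
import math
import random

# Softmax selection: each design's interim CVR x_i maps to probability
# exp(x_i / tau) / sum_j exp(x_j / tau). tau is a tuning assumption.
def softmax_select(cvrs, tau=0.02, rng=random):
    designs = list(cvrs)
    weights = [math.exp(cvrs[d] / tau) for d in designs]
    r = rng.random() * sum(weights)
    for d, w in zip(designs, weights):          # weighted random pick
        r -= w
        if r <= 0:
            return d
    return designs[-1]

rng = random.Random(0)
picks = sum(softmax_select({"A": 0.05, "B": 0.07}, rng=rng) == "B"
            for _ in range(1000))
# at tau = 0.02, B (7% CVR) is selected roughly 73% of the time
```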


2. Softmax

  • Epsilon-greedy tests all designs equally
  • Softmax preferentially tests designs with higher expected values


Softmax

  • Fast convergence


3. Thompson Sampling

  • Estimates the CVR for each design using Bayesian estimation to create distributions, then samples to select a design
  • Overcomes the weaknesses of Softmax
  • The production-grade approach used in practice by SEO companies and the like

3. Thompson Sampling

  • Create a loaded die for each design based on its CVR
  • Designs with higher CVR get better dice
  • Roll the dice and select the design with the highest roll
  • Update the CVR and dice after each design display
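The "loaded dice" above can be sketched with Beta distributions: each design keeps counts of carts and misses, and its dice roll is a sample from Beta(carts+1, misses+1). The true rates are illustrative assumptions:

```python
import random

# Thompson Sampling: sample each design's CVR from its Beta "dice",
# show the design with the highest roll, then update its counts.
def thompson_select(carts, misses, rng=random):
    samples = {d: rng.betavariate(carts[d] + 1, misses[d] + 1)
               for d in carts}
    return max(samples, key=samples.get)

rng = random.Random(0)
true_cvr = {"A": 0.05, "B": 0.07}          # assumed true rates
carts = {d: 0 for d in true_cvr}
misses = {d: 0 for d in true_cvr}
for _ in range(20_000):
    d = thompson_select(carts, misses, rng)
    if rng.random() < true_cvr[d]:          # simulated cart-in
        carts[d] += 1                       # update the dice
    else:
        misses[d] += 1
```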


3. Thompson Sampling

  • Bayesian estimation will be covered another time
  • Softmax treats 1/2 and 50/100 as equal CVRs, but the latter is less likely to deviate from 1/2
  • Thompson Sampling produces different dice behavior even for the same CVR when the number of trials differs (Beta distribution)
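The 1/2 vs 50/100 point can be checked numerically: the Beta dice built from more trials is far more concentrated around 0.5:

```python
import random

# Same interim CVR, different certainty: compare the empirical spread
# of Beta(wins+1, losses+1) samples for 1/2 vs 50/100.
rng = random.Random(0)

def spread(wins, losses, n=10_000):
    xs = [rng.betavariate(wins + 1, losses + 1) for _ in range(n)]
    mean = sum(xs) / n
    return (sum((x - mean) ** 2 for x in xs) / n) ** 0.5

print(spread(1, 1))    # Beta(2, 2): wide, ~0.22
print(spread(50, 50))  # Beta(51, 51): narrow, ~0.05
```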

Thompson Sampling

  • Slower convergence
  • Higher stability


4. UCB1

  • Selects designs using confidence intervals
  • No randomness
  • Overcomes the weaknesses of Softmax

4. UCB1

  • Confidence interval: The range in which the true value of a prediction is likely to fall
  • Compute the confidence interval of the CVR for each design and display the design with the highest upper bound
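A minimal sketch of UCB1 using the standard upper-bound term sqrt(2 ln t / n_i). The true rates are illustrative assumptions; note the selection itself has no randomness, only the simulated visitors do:

```python
import math
import random

# UCB1: display the design whose interim CVR has the highest upper
# confidence bound; untried designs are shown first.
def ucb1_select(shown, carts):
    t = sum(shown.values())
    for d in shown:
        if shown[d] == 0:
            return d                      # show every design once first
    def ucb(d):
        mean = carts[d] / shown[d]        # interim CVR
        return mean + math.sqrt(2 * math.log(t) / shown[d])
    return max(shown, key=ucb)

rng = random.Random(0)
true_cvr = {"A": 0.05, "B": 0.07}         # assumed true rates
shown = {d: 0 for d in true_cvr}
carts = {d: 0 for d in true_cvr}
for _ in range(20_000):
    d = ucb1_select(shown, carts)
    shown[d] += 1
    carts[d] += rng.random() < true_cvr[d]
```

With small CVR differences the bound shrinks slowly, so the inferior design keeps a sizable share of displays for a long time, which matches the "more wasteful exploration" note below.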


UCB1

  • More wasteful exploration
  • No tuning required


What About Losses?

How much cumulative conversion is gained compared to displaying A and B at 50% each, as in standard A/B testing?
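A back-of-the-envelope comparison: expected cumulative cart-ins for a 50/50 split versus a policy that shows the better design most of the time. The rates and the 90% share are illustrative assumptions, not measured values:

```python
# Expected cumulative cart-ins over n impressions.
p_a, p_b, n = 0.05, 0.07, 20_000          # assumed rates and traffic

even_split = n * (0.5 * p_a + 0.5 * p_b)   # standard A/B test
bandit_like = n * (0.1 * p_a + 0.9 * p_b)  # better design shown 90%

print(even_split, bandit_like, bandit_like - even_split)
# 1200 vs 1360 cart-ins: 160 extra conversions over the test period
```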


Advantages of Bandit

  • Converges to the optimal design fastest
  • Can be simulated in advance
    • Required sample size (duration) is known
    • Expected losses are known

Disadvantages of Bandit

  • Requires integration with data sources
  • Does not converge when there is no difference in CVR (becomes standard A/B testing)
  • Requires parameter tuning

Summary

  • Problems with conventional A/B testing
    1. How to determine the test duration
    2. Potential for losses during the test period
  • A/B testing with reinforcement learning
    • Shortest test period
    • Continuously displays the optimal design

Next time, we will introduce

Component-level CVR Impact Analysis

using logistic regression