Smarter A/B Testing

Prerequisites

A/B testing: A method for comparing two variations to determine which is better
Different from gradual rollout in canary releases
In this session, we consider an example of comparing N design variations on an e-commerce site

Standard A/B Testing

Randomly assign users to one of A, B, ... N
Display a different design to each group
Compare cart-in rates after a fixed period

Problems

How to determine the test duration (1 week? 1 month?)
Potential for losses during the test period -> Displaying a poor design worsens conversions

Example

When redesigning a product detail page:

A: Existing design (cart-in rate: 5%)
B: New design (cart-in rate: ??)

Example

Potential losses from A/B testing in this case:

If B is better: A longer A/B test delays migration to B
If A is better: A higher proportion of B increases losses

Solutions

How to determine the test duration

-> End when the required number of samples has been collected

Potential for losses during the test period

-> Display the better design at a higher rate

Reinforcement Learning

The Exploration-Exploitation Dilemma

Exploration: Try new designs
Exploitation: Display the current best design

-> Exploration carries risk, but exploitation alone brings no improvement

Multi-Armed Bandit Problem

N slot machines
Each has a fixed winning probability $p_n$
Want to find the machine with the highest winning probability in the fewest trials

bg right contain

Multi-Armed Bandit Problem

In terms of A/B testing...

N designs $A, B, C...$
Each has a fixed cart-in rate $p_A, p_B ...$
Want to find the design with the highest cart-in rate in the fewest impressions (shortest test period)

Bandit Algorithms

The details are complex, so here is a quick overview
Consider an A/B test for N designs $A, B, C...$
CVR = cart-in rate

Bandit Algorithms

Epsilon-greedy
Softmax
Thompson Sampling
UCB1

How to Read the Graphs

x-axis: Number of impressions (new visitors)
y-axis: Proportion of time the best design was displayed

bg contain right

1. Epsilon-greedy

With probability epsilon, explore
- Display a completely random design
With probability 1-epsilon, exploit
- Display the current best design
The best design is determined by the interim CVR of each design
Epsilon is typically between 0.1 and 0.3

1. Epsilon-greedy

bg 80%

Epsilon-greedy

Strong randomness
Epsilon-dependent

bg right:67%

2. Softmax

Selects designs probabilistically based on each design's interim CVR
Designs with higher CVR have a higher probability of being selected
Reduces wasteful exploration

2. Softmax

A function that converts any collection of numbers into probabilities

$\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$

h:300

2. Softmax

Epsilon-greedy tests all designs equally
Softmax preferentially tests designs with higher expected values

h:300

Softmax

Fast convergence

bg right:70%

3. Thompson Sampling

Estimates the CVR for each design using Bayesian estimation to create distributions, then samples to select a design
Overcomes the weaknesses of Softmax
The serious approach used by SEO companies and the like

3. Thompson Sampling

Create a loaded die for each design based on its CVR
Designs with higher CVR get better dice
Roll the dice and select the design with the highest roll
Update the CVR and dice after each design display

h:300

3. Thompson Sampling

Bayesian estimation will be covered another time
Softmax treats $\frac{1}{2}$ and $\frac{50}{100}$ as equal CVRs, but the latter is less likely to deviate from $\frac{1}{2}$
Thompson Sampling produces different dice behavior even for the same CVR when the number of trials differs (Beta distribution)

Thompson Sampling

Slower convergence
Higher stability

bg right:65%

4. UCB1

Selects designs using confidence intervals
No randomness
Overcomes the weaknesses of Softmax

4. UCB1

Confidence interval: The range in which the true value of a prediction is likely to fall
Compute the confidence interval of the CVR for each design and display the design with the highest upper bound

h:400

UCB1

More wasteful exploration
No tuning required

bg right:64%

What About Losses?

How much cumulative conversion was gained compared to displaying A/B at 50% each as in standard A/B testing?

bg right:65%

Advantages of Bandit

Converges to the optimal design fastest
Can be simulated in advance
- Required sample size (duration) is known
- Expected losses are known

Disadvantages of Bandit

Requires integration with data sources
Does not converge when there is no difference in CVR (becomes standard A/B testing)
Requires parameter tuning

Summary

Problems with conventional A/B testing
1. How to determine the test duration
2. Potential for losses during the test period
A/B testing with reinforcement learning
- Shortest test period
- Continuously displays the optimal design

Next time, we will introduce

Component-level CVR Impact Analysis

using logistic regression

Prerequisites​

Standard A/B Testing​

Problems​

Example​

Example​

Solutions​

Reinforcement Learning​

The Exploration-Exploitation Dilemma​

Multi-Armed Bandit Problem​

Multi-Armed Bandit Problem​

Bandit Algorithms​

Bandit Algorithms​

How to Read the Graphs​

1. Epsilon-greedy​

1. Epsilon-greedy​

Epsilon-greedy​

2. Softmax​

2. Softmax​

2. Softmax​

Softmax​

3. Thompson Sampling​

3. Thompson Sampling​

3. Thompson Sampling​

Thompson Sampling​

4. UCB1​

4. UCB1​

UCB1​

What About Losses?​

Advantages of Bandit​

Disadvantages of Bandit​

Summary​

Component-level CVR Impact Analysis​

Prerequisites

Standard A/B Testing

Problems

Example

Example

Solutions

Reinforcement Learning

The Exploration-Exploitation Dilemma

Multi-Armed Bandit Problem

Multi-Armed Bandit Problem

Bandit Algorithms

Bandit Algorithms

How to Read the Graphs

1. Epsilon-greedy

1. Epsilon-greedy

Epsilon-greedy

2. Softmax

2. Softmax

2. Softmax

Softmax

3. Thompson Sampling

3. Thompson Sampling

3. Thompson Sampling

Thompson Sampling

4. UCB1

4. UCB1

UCB1

What About Losses?

Advantages of Bandit

Disadvantages of Bandit

Summary

Component-level CVR Impact Analysis