A/B Testing and Hypothesis Testing

AB Test with Hypotest

Agenda

A/B Test Refresher
Hypothesis Testing
Common Hypothesis Tests

AB Test

Compares conversion rates between 2 designs
The most basic method is to compare sample means

$\text{Conversion Rate} = \frac{\text{Number of Conversions}}{\text{Number of Visitors}}$

Example

Compare click-through rates of two designs A and B
- A: $1000$ visitors, $100$ clicks
- B: $1000$ visitors, $120$ clicks
Compare sample means
- A: $0.1 = 10\%$
- B: $0.12 = 12\%$
Therefore, design B has a higher conversion rate (CVR)

Question

Is that conclusion correct?
With what degree of confidence was that conclusion reached?
Statistically, the above conclusion cannot be drawn: a difference this size could plausibly arise by chance

Deep Dive Supplement

Assuming $\text{CVR}_A = \text{CVR}_B$ :

\begin{align*} \bar{p} &= \frac{100 + 120}{1000 + 1000} = 0.11 \\ Z &= \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\left(\frac{1}{n_A} + \frac{1}{n_B}\right) \bar{p} (1 - \bar{p})}} \\ &= \frac{0.1 - 0.12}{\sqrt{\left(\frac{1}{1000} + \frac{1}{1000}\right) \times 0.11 \times 0.89}} \\ &\approx -1.43 \;\Rightarrow\; p \approx 0.153 \ \text{(two-tailed)} \end{align*}
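The calculation above can be reproduced with a minimal sketch like the following. The function name `two_proportion_z_test` is an illustrative choice, not something from the slides, and the two-tailed $p$ -value is taken from the standard normal distribution via `math.erfc`.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-tailed z-test for equal conversion rates, using the pooled rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under H0: CVR_A = CVR_B
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt((1 / n_a + 1 / n_b) * pooled * (1 - pooled))
    z = (rate_a - rate_b) / std_err
    # P(|Z| >= |z|) for a standard normal Z equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_proportion_z_test(100, 1000, 120, 1000)
print(f"Z = {z:.2f}, p = {p_value:.3f}")
```

With the numbers from the example, this should give the same $Z \approx -1.43$ and $p \approx 0.153$ as the hand calculation.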

Hypothesis Testing

A method for statistically verifying whether a hypothesis is correct based on observed values
Example: To verify whether a coin is fair, flip it several times and observe the results

How to

State the hypothesis you want to test ( $H_1$ )
Assume the opposite of that hypothesis ( $H_0$ )
Calculate the probability of observing the result under that assumption ( $p$ -value) *1
If the $p$ -value is very low, conclude that $H_0$ is wrong *2
In that case, conclude that $H_1$ is correct

Notes

*1 Precisely, the $p$ -value is "the probability of observing the observed value or a more extreme value under $H_0$ "
*2 "Very low $p$ -value" means lower than the significance level $\alpha$ ; commonly $\alpha=0.05$ or $0.01$ is used.

Example 1

We want to test whether a coin is rigged.

$H_0$ : This coin is fair -> probability of heads is 50%
$H_1$ : This coin is rigged -> probability of heads is not 50%
Observation: Flipping the coin $20$ times resulted in $15$ heads

Example 2

Under $H_0$ , the probability of getting $x$ heads is as follows:

[Figure: probability of each number of heads $x$ in $20$ flips of a fair coin]
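As a rough stand-in for the chart, the distribution under $H_0$ can be tabulated with a short sketch. The names `N_FLIPS`, `prob_heads`, and `distribution` are illustrative, and the text bars are only a crude visual aid.

```python
from math import comb

N_FLIPS = 20


def prob_heads(x, n=N_FLIPS):
    """P(X = x) for n flips of a fair coin: C(n, x) / 2**n."""
    return comb(n, x) / 2**n


distribution = [prob_heads(x) for x in range(N_FLIPS + 1)]

# Print each probability with a bar whose length is proportional to it
for x, prob in enumerate(distribution):
    bar = "#" * round(prob * 200)
    print(f"{x:2d}  {prob:.4f}  {bar}")
```

The distribution is symmetric around $10$ heads, which is why $15$ heads and $5$ heads are equally extreme outcomes.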

Example 3

Calculate the probability of getting $15$ or more heads, or $5$ or fewer heads
This is calculated by summing the probabilities of $15$ through $20$ heads and of $0$ through $5$ heads (the red regions in the graph)

[Figure: the same distribution with the regions of $5$ or fewer heads and $15$ or more heads shaded red]

Example 4

Can also be calculated directly

\begin{align*} P(\text{15 or more heads, or 5 or fewer}) &= P(\text{15 heads}) + P(\text{16 heads}) + \cdots + P(\text{20 heads}) + P(\text{5 heads}) + \cdots + P(\text{0 heads}) \\ &= \binom{20}{15} \left(\frac{1}{2}\right)^{15} \left(\frac{1}{2}\right)^{5} + \cdots + \binom{20}{20} \left(\frac{1}{2}\right)^{20} + \binom{20}{5} \left(\frac{1}{2}\right)^{5} \left(\frac{1}{2}\right)^{15} + \cdots + \binom{20}{0} \left(\frac{1}{2}\right)^{0} \left(\frac{1}{2}\right)^{20} \\ &\approx 0.041 \end{align*}
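The direct sum can be checked with exact rational arithmetic. This is a small sketch; `extreme_probability` and its parameter names are illustrative.

```python
from fractions import Fraction
from math import comb


def extreme_probability(n=20, low=5, high=15):
    """Exact P(X <= low or X >= high) for n flips of a fair coin."""
    outcomes = [x for x in range(n + 1) if x <= low or x >= high]
    # Each outcome x has probability C(n, x) / 2**n under H0
    return sum(Fraction(comb(n, x), 2**n) for x in outcomes)


p_exact = extreme_probability()
print(p_exact, float(p_exact))
```

Using `Fraction` avoids any floating-point rounding, so the result is the exact value of the sum before it is rounded to $0.041$.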

Example 5

The $p$ -value is $0.041$
Since it is below the significance level of $5\%$ , we conclude that $H_0$ is wrong
Therefore, we reject $H_0$ and accept $H_1$

-> This coin is rigged

Deep Dive Supplement

Two-tailed Test vs One-tailed Test

A two-tailed test verifies whether the CVR differs from a specific value, in either direction
A one-tailed test verifies whether the CVR is higher (or lower) than a specific value
Here $p$ denotes the true rate, not the $p$ -value: for $p\leq0.5$ , calculate the probability in the left figure; for $p\geq0.5$ , calculate the probability in the right figure
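To make the difference concrete, here is a sketch that computes one-tailed and two-tailed $p$ -values for the coin example. The function `binomial_tail_p_value` and its `alternative` argument are illustrative names. The two-tailed value doubles the smaller tail, which matches the "15 or more, or 5 or fewer" sum only because a fair coin's distribution is symmetric.

```python
from math import comb


def binomial_tail_p_value(heads, flips, alternative="two-sided"):
    """p-value for H0: P(heads) = 0.5, for the chosen alternative."""
    pmf = [comb(flips, x) / 2**flips for x in range(flips + 1)]
    upper = sum(pmf[heads:])      # P(X >= heads)
    lower = sum(pmf[:heads + 1])  # P(X <= heads)
    if alternative == "greater":
        return upper
    if alternative == "less":
        return lower
    if alternative == "two-sided":
        # Valid for the fair coin because the distribution is symmetric
        return min(1.0, 2 * min(upper, lower))
    raise ValueError(f"unknown alternative: {alternative!r}")


one_tailed = binomial_tail_p_value(15, 20, alternative="greater")
two_tailed = binomial_tail_p_value(15, 20, alternative="two-sided")
print(f"one-tailed: {one_tailed:.4f}, two-tailed: {two_tailed:.4f}")
```

For the same observation, the one-tailed $p$ -value is half the two-tailed one, so the choice of test should be fixed before looking at the data.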

Hypothesis Testing for AB Test

Just like coin flips, hypothesis testing can be applied to A/B tests.

$H_0$ : The CVR of the new design is $x$ %
$H_1$ : The CVR of the new design is higher than $x$ %
$p$ -value: The probability of observing the observed number of clicks or more under $H_0$
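A sketch of this one-tailed test follows. The slides leave $x$ unspecified, so the baseline of $10\%$ is an assumption made purely for illustration, and the $120$ clicks out of $1000$ visitors are borrowed from design B in the earlier example. The helper names are hypothetical. Probabilities are computed in log space with `math.lgamma` so that large visitor counts do not overflow.

```python
import math


def log_binom_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)


def upper_tail_p_value(clicks, visitors, baseline_cvr):
    """P(X >= clicks) under H0: the CVR equals baseline_cvr."""
    return sum(
        math.exp(log_binom_pmf(k, visitors, baseline_cvr))
        for k in range(clicks, visitors + 1)
    )


# Assumed baseline of 10%; observed data taken from design B above
p_value = upper_tail_p_value(clicks=120, visitors=1000, baseline_cvr=0.10)
print(f"p-value: {p_value:.4f}")
```

If the printed value is below the chosen $\alpha$ , $H_0$ is rejected in favour of "the CVR is higher than the baseline".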

Hypothesis Testing for AB Test + 1

Hypothesis testing can also be performed between designs.

$H_0$ : The CVR of design A and design B are equal
$H_1$ : The CVR of design A and design B are different
$p$ -value: Calculate the difference between $\text{CVR}_A$ and $\text{CVR}_B$ , then the probability of observing a difference at least that large under $H_0$ (using the normal approximation)

Common Hypothesis Tests

Binomial test (the one we just did)
Fisher's exact test
$\chi^2$ test
$t$ -test
Wilcoxon rank-sum test
$F$ -test

We will introduce a few of these.

Binomial Test

Used for binary "success" or "failure" data such as CVR
Example: Test whether a design's CVR is $0.1$
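One way to sketch a two-sided binomial test for this example is shown below. It sums the probability of every outcome that is no more likely than the observed one, which is one common convention for two-sided exact tests (others exist and can give slightly different values). The function names and the observed counts of $120$ conversions in $1000$ visitors are assumptions for illustration.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), with 0 < p < 1, via log space."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1 - p))


def binomial_test_two_sided(successes, trials, p0):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    observed = binom_pmf(successes, trials, p0)
    # Small relative tolerance so floating-point ties count as equal
    threshold = observed * (1 + 1e-7)
    total = 0.0
    for k in range(trials + 1):
        prob = binom_pmf(k, trials, p0)
        if prob <= threshold:
            total += prob
    return min(1.0, total)


# Hypothetical observation: 120 conversions out of 1000 visitors, H0: CVR = 0.1
p_value = binomial_test_two_sided(120, 1000, 0.1)
print(f"p-value: {p_value:.4f}")
```

For the fair-coin case ($15$ heads in $20$ flips, $p_0 = 0.5$) this convention reduces to the same sum as in the coin example.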

$\chi^2$ Test

Used for the same purpose as the binomial test
Used when the sample size is sufficiently large (roughly $n \geq 30$ )
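As an illustration, Pearson's $\chi^2$ test can be applied to the A/B example as a $2 \times 2$ table (converted vs not converted, per design). This sketch uses no continuity correction, and the function name `chi_square_2x2` is an illustrative choice. With one degree of freedom, the tail probability can be written with `math.erfc`.

```python
import math


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square test of independence on a 2x2 table (no continuity correction)."""
    observed = [
        [conv_a, n_a - conv_a],
        [conv_b, n_b - conv_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if conversion were independent of the design
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom: P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


statistic, p_value = chi_square_2x2(100, 1000, 120, 1000)
print(f"chi2 = {statistic:.3f}, p = {p_value:.3f}")
```

For a $2 \times 2$ table the statistic equals the square of the pooled two-proportion $Z$ , so the $p$ -value should agree with the earlier $p \approx 0.153$ .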

$t$ -test

Can also be used for continuous data
Tests differences in means
Example: Test whether the average time spent on a design is $10$ seconds
When data is paired, use the paired $t$ -test
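The one-sample case from the example can be sketched as follows. The session times are made-up numbers used only to show the arithmetic, and `one_sample_t_statistic` is a hypothetical helper name. The sketch computes the $t$ statistic and degrees of freedom only.

```python
import math
import statistics


def one_sample_t_statistic(samples, mu0):
    """t statistic and degrees of freedom for H0: the population mean equals mu0."""
    n = len(samples)
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)  # sample standard deviation (n - 1 denominator)
    t = (mean - mu0) / (std / math.sqrt(n))
    return t, n - 1


# Hypothetical time-on-page measurements in seconds (not from the slides)
times = [12, 9, 11, 13, 10, 14, 8, 11]
t_stat, dof = one_sample_t_statistic(times, mu0=10)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

Turning $t$ into a $p$ -value requires Student's $t$ distribution with $n - 1$ degrees of freedom, which the Python standard library does not provide; a statistics package such as SciPy is typically used for that step.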

Summary

Simple analysis can have unexpected pitfalls
Support decision-making with statistical evidence
Hypothesis testing is a powerful tool for that purpose

Bonus

Why Do Engineers Need This Knowledge?

Eliminate subjectivity
Accelerate the PDCA cycle
Design systems and databases optimized for user data utilization
(More details next time) Perform analyses that incorporate domain expertise

Next time, we will present "Thinking About A/B Testing with Bayesian Statistics" (it is going to be challenging)
