
A/B Testing and Hypothesis Testing

AB Test with Hypotest

Agenda

  1. A/B Test Refresher
  2. Hypothesis Testing
  3. Common Hypothesis Tests

AB Test

  • Compares conversion rates between 2 designs
  • The most basic method is to compare sample means

$$\text{Conversion Rate} = \frac{\text{Number of Conversions}}{\text{Number of Visitors}}$$

Example

  1. Compare click-through rates of two designs A and B

    • A: 1000 visitors, 100 clicks
    • B: 1000 visitors, 120 clicks
  2. Compare sample means

    • A: 0.1 = 10%
    • B: 0.12 = 12%
  3. Therefore, design B has a higher CVR

Question

  • Is that conclusion correct?
  • With what degree of confidence was that conclusion reached?
  • Statistically, the conclusion above cannot be drawn from these numbers alone

Deep Dive Supplement

  • Assuming $\text{CVR}_A = \text{CVR}_B$:

$$
\begin{align*}
\bar{p} &= \frac{100 + 120}{1000 + 1000} = 0.11 \\
Z &= \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\left(\frac{1}{n_A} + \frac{1}{n_B}\right) \bar{p} (1 - \bar{p})}} \\
&= \frac{0.1 - 0.12}{\sqrt{\left(\frac{1}{1000} + \frac{1}{1000}\right) \times 0.11 \times 0.89}} \\
&\approx -1.43 \rightarrow p = 0.153
\end{align*}
$$
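
The calculation above can be reproduced with a short Python sketch. It assumes the pooled two-proportion z-test shown in the formula and a two-sided p-value; the function name `two_proportion_z_test` is made up for this illustration, and only the standard library is used.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    # Pooled conversion rate under H0: CVR_A == CVR_B
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt((1 / n_a + 1 / n_b) * pooled * (1 - pooled))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail area of the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = two_proportion_z_test(100, 1000, 120, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")  # z = -1.43, p = 0.153
```

With p around 0.15, the observed 10% vs 12% gap is well within what chance alone could produce at the usual 5% level.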

Hypothesis Testing

  • A method for statistically verifying whether a hypothesis is correct based on observed values
  • Example: To verify whether a coin is fair, flip it several times and observe the results

How to

  1. State the hypothesis you want to test ($H_1$)
  2. Assume the opposite of that hypothesis ($H_0$)
  3. Calculate the probability of observing the result under that assumption ($p$-value) *1
  4. If the $p$-value is very low, conclude that $H_0$ is wrong *2
  5. In that case, conclude that $H_1$ is correct

Notes

  1. Precisely, the $p$-value is "the probability, under $H_0$, of observing the observed value or a more extreme one"
  2. A "very low $p$-value" means one below the significance level $\alpha$; $\alpha = 0.05$ or $0.01$ is commonly used.

Example 1

We want to test whether a coin is rigged.

  • $H_0$: This coin is fair -> probability of heads is 50%
  • $H_1$: This coin is rigged -> probability of heads is not 50%
  • Observation: Flipping the coin 20 times resulted in 15 heads

Example 2

  • Under $H_0$, the probability of getting $x$ heads is as follows:

[Figure: probability of getting $x$ heads in 20 flips under $H_0$]
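
As a rough stand-in for the plot, the distribution of the number of heads under a fair coin can be tabulated directly. This is a minimal sketch using the binomial probability mass function; the helper name `binomial_pmf` is an assumption of this example.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n_flips = 20
# Probability of each possible number of heads when the coin is fair
pmf = [binomial_pmf(x, n_flips, 0.5) for x in range(n_flips + 1)]

for x, prob in enumerate(pmf):
    print(f"{x:2d} heads: {prob:.4f}")
```

The probabilities peak at 10 heads and fall off symmetrically toward 0 and 20.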

Example 3

  • Calculate the probability of getting 15 or more heads, or 5 or fewer heads
  • This is the sum of the probabilities of 15 through 20 heads and of 0 through 5 heads (the red regions in the graph)

[Figure: the same distribution, with the regions for 15 or more heads and 5 or fewer heads shaded red]

Example 4

  • Can also be calculated directly

$$
\begin{align*}
P(\text{15 or more heads, or 5 or fewer}) &= P(\text{15 heads}) + P(\text{16 heads}) + \cdots + P(\text{20 heads}) + P(\text{5 heads}) + \cdots + P(\text{0 heads}) \\
&= \binom{20}{15} \left(\frac{1}{2}\right)^{15} \left(\frac{1}{2}\right)^{5} + \cdots + \binom{20}{20} \left(\frac{1}{2}\right)^{20} + \binom{20}{5} \left(\frac{1}{2}\right)^{5} \left(\frac{1}{2}\right)^{15} + \cdots + \binom{20}{0} \left(\frac{1}{2}\right)^{0} \left(\frac{1}{2}\right)^{20} \\
&\approx 0.041
\end{align*}
$$
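
The same sum can be checked in a few lines of Python. This sketch assumes a fair coin under $H_0$, so every sequence of 20 flips has probability $(1/2)^{20}$ and only the binomial coefficients need to be added up; the function name is hypothetical.

```python
from math import comb

def two_sided_coin_p_value(heads, flips):
    """Two-sided p-value for a fair coin (p = 0.5), using symmetry."""
    # How far the observation is from the expected count flips / 2
    deviation = abs(heads - flips / 2)
    # All outcomes at least as far from the expected count as the observation
    extreme = [k for k in range(flips + 1) if abs(k - flips / 2) >= deviation]
    return sum(comb(flips, k) for k in extreme) / 2**flips

p_value = two_sided_coin_p_value(15, 20)
print(f"p = {p_value:.3f}")  # p = 0.041
```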

Example 5

  • The $p$-value is $0.041$
  • Since it is below $5\%$, we conclude that the assumption $H_0$ is wrong
  • Therefore, we reject $H_0$ and accept $H_1$

-> This coin is rigged

Deep Dive Supplement

Two-tailed Test vs One-tailed Test
  • A two-tailed test checks whether the CVR is equal to a specific value (deviations in either direction count against it)
  • A one-tailed test checks whether the CVR is higher or lower than a specific value
  • For $p \leq 0.5$, calculate the probability shown in the left figure; for $p \geq 0.5$, the one shown in the right figure
[Figures: left, right]
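
To make the difference concrete, here is a small sketch that reuses the coin example (15 heads in 20 flips) and compares a one-tailed with a two-tailed p-value. The helper names `upper_tail` and `lower_tail` are assumptions of this illustration.

```python
from math import comb

def upper_tail(heads, flips):
    """P(X >= heads) for a fair coin."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2**flips

def lower_tail(heads, flips):
    """P(X <= heads) for a fair coin."""
    return sum(comb(flips, k) for k in range(0, heads + 1)) / 2**flips

# One-tailed: only "too many heads" counts as evidence against H0
one_tailed = upper_tail(15, 20)
# Two-tailed: "too many heads" and "too few heads" both count
two_tailed = upper_tail(15, 20) + lower_tail(5, 20)

print(f"one-tailed p = {one_tailed:.3f}")  # one-tailed p = 0.021
print(f"two-tailed p = {two_tailed:.3f}")  # two-tailed p = 0.041
```

Because the fair-coin distribution is symmetric, the two-tailed value here is exactly twice the one-tailed value.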

Hypothesis Testing for AB Test

Just like coin flips, hypothesis testing can be applied to A/B tests.

  • $H_0$: The CVR of the new design is $x$%
  • $H_1$: The CVR of the new design is higher than $x$%
  • $p$-value: The probability of observing the observed number of clicks or more under $H_0$
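
A minimal sketch of this one-sided test follows. The numbers are hypothetical and not from the slides: a baseline CVR of 10% and a new design that received 120 clicks from 1000 visitors. The probabilities are computed in log space with `math.lgamma` so that large visitor counts do not overflow.

```python
import math

def binomial_log_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)

def upper_tail_p_value(clicks, visitors, baseline_cvr):
    """P(X >= clicks) under H0: the CVR equals baseline_cvr."""
    return sum(
        math.exp(binomial_log_pmf(k, visitors, baseline_cvr))
        for k in range(clicks, visitors + 1)
    )

# Hypothetical data: baseline CVR 10%, new design with 120 clicks / 1000 visitors
alpha = 0.05
p_value = upper_tail_p_value(120, 1000, 0.10)
print(f"p = {p_value:.4f}, reject H0 at alpha = {alpha}: {p_value < alpha}")
```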

Hypothesis Testing for AB Test + 1

Hypothesis testing can also be performed between designs.

  • $H_0$: The CVRs of design A and design B are equal
  • $H_1$: The CVRs of design A and design B are different
  • $p$-value: Calculate the difference between $\text{CVR}_A$ and $\text{CVR}_B$, then the probability under $H_0$ of observing a difference at least that large (using a normal approximation)

Common Hypothesis Tests

  • Binomial test (the one we just did)
  • Fisher's exact test
  • $\chi^2$ test
  • $t$-test
  • Wilcoxon rank-sum test
  • $F$-test

We will introduce a few of these.

Binomial Test

  • Used for binary "success" or "failure" data such as CVR
  • Example: Test whether a design's CVR is $0.1$


$\chi^2$ Test

  • Used for the same purpose as the binomial test
  • Used when the sample size is sufficiently large, roughly $n \geq 30$
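
As an illustration, a $\chi^2$ goodness-of-fit test for a single design can be sketched with the standard library alone. The data are hypothetical (120 clicks from 1000 visitors, tested against a CVR of 0.1), and the sketch relies on the fact that with one degree of freedom the tail probability reduces to the complementary error function.

```python
import math

def chi_square_gof_two_categories(successes, n, expected_rate):
    """Pearson chi-square statistic and p-value (1 degree of freedom)."""
    observed = [successes, n - successes]
    expected = [n * expected_rate, n * (1 - expected_rate)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(chi2_1 >= x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data: 120 clicks out of 1000 visitors, H0: CVR = 0.1
chi2, p_value = chi_square_gof_two_categories(120, 1000, 0.1)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

In practice a statistics library would usually be used instead; this version only shows what the test computes.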


$t$-test

  • Can also be used for continuous data
  • Tests differences in means
  • Example: Test whether the average time spent on a design is 10 seconds
  • When data is paired, use the paired $t$-test
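
The one-sample case from the example can be sketched as follows. The ten durations are made-up values for illustration only. To stay within the standard library, the sketch compares the $t$ statistic with 2.262, the two-sided 5% critical value of the $t$ distribution with 9 degrees of freedom, instead of computing an exact $p$-value.

```python
import math
import statistics

def one_sample_t_statistic(samples, hypothesized_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(samples)
    mean = statistics.mean(samples)
    std_dev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    t = (mean - hypothesized_mean) / (std_dev / math.sqrt(n))
    return t, n - 1

# Hypothetical time-on-page measurements in seconds
times = [12.1, 9.8, 11.4, 10.9, 13.2, 8.7, 11.9, 10.4, 12.6, 11.0]
t_stat, dof = one_sample_t_statistic(times, 10.0)

# Two-sided 5% critical value for 9 degrees of freedom (from a t table)
T_CRITICAL_DF9 = 2.262
reject_h0 = abs(t_stat) > T_CRITICAL_DF9
print(f"t = {t_stat:.2f}, dof = {dof}, reject H0: {reject_h0}")
```

The hard-coded critical value only applies to samples of size 10; other sample sizes need the value for their own degrees of freedom.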


Summary

  • Simple analysis can have unexpected pitfalls
  • Support decision-making with statistical evidence
  • Hypothesis testing is a powerful tool for that purpose

Bonus

Why Do Engineers Need This Knowledge?

  • Eliminate subjectivity
  • Accelerate the PDCA cycle
  • Design systems and databases optimized for user data utilization
  • (More details next time) Perform analyses that incorporate domain expertise

Next time, we will present "Thinking About A/B Testing with Bayesian Statistics" (it is going to be challenging).