A Summary of the Book Trustworthy Online Controlled Experiments
I recently finished reading the book Trustworthy Online Controlled Experiments, and here are my reading notes.
The book has five parts. The first two parts are a high-level introduction to online controlled experiments. Why do we need to conduct controlled experiments? What are the evaluation metrics for the experiments? The last three parts are more technically focused. The third part introduces alternative methods for when controlled experiments are not feasible. The fourth part focuses on building the experimentation platform. The last part mainly focuses on how to analyze experiment results, the potential pitfalls, and methods to improve the analysis.
1. Why should we use A/B Testing?
Imagine we introduce a new feature into our online service and see an increase in traffic. Can we claim that rolling out the feature to all customers will increase user engagement? Not necessarily. The feature can be positively or negatively correlated with traffic, or have nothing to do with it. A/B testing helps establish causality with high confidence and gives us more power to detect small or unexpected changes.
2. How to design the experiment?
2.1 Create metric based on objective
In reality, we usually have more than one business metric to use in experimental design. These metrics must be measurable, attributable, sensitive, and timely. One way to combine multiple metrics is to normalize each metric and then create a weighted combination of them, much like a credit score. To decide how many metrics to use, a rough rule of thumb is to limit the number to five: because of the multiple testing problem, if we have too many metrics, the probability of seeing a spuriously significant result is high.
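As a minimal sketch of such a weighted combination (the book calls this an overall evaluation criterion, or OEC), here is what it might look like in Python; the metric names, normalized values, and weights below are made up for illustration:

```python
# Hypothetical, already-normalized metric values on a [0, 1] scale.
metrics = {"engagement": 0.7, "retention": 0.4, "revenue": 0.9}

# Business-chosen weights for each metric (must sum to 1).
weights = {"engagement": 0.5, "retention": 0.3, "revenue": 0.2}

# The combined score is a simple weighted sum, like a credit score.
oec = sum(weights[k] * metrics[k] for k in metrics)
print(round(oec, 2))  # 0.65
```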
2.2 A/A test
Why do we need an A/A test?
An A/A test is almost the same as an A/B test, except that the treatment and control groups receive the same treatment. Some benefits of running an A/A test:
- Ensure there is no bias between the treatment and control groups. For example, if we reuse the users from a previous experiment in the current one, there might be residual effects: the treatment from the previous experiment can influence the current one.
- Assess metric variability. As we accumulate more data over time, we want to see how the data distribution changes over time.
How to run an A/A test?
Simulate thousands of A/A experiments and check whether the distribution of p-values is uniform. Running thousands of experiments can be expensive; one workaround is to reuse data from a previous experiment. For example, we take last week's stored experiment data, repeatedly reassign users to the treatment and control groups, and calculate a p-value each time. To check whether the p-values follow a uniform distribution, we can run a goodness-of-fit test such as Anderson-Darling or Kolmogorov-Smirnov (KS).
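A minimal sketch of this procedure, assuming a normally distributed metric (the distribution parameters are made up), using a t-test per simulated experiment and a KS goodness-of-fit test on the resulting p-values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate 1000 A/A experiments: both groups are drawn from the SAME
# distribution, so the t-test p-values should be uniform on [0, 1].
p_values = []
for _ in range(1000):
    control = rng.normal(loc=10.0, scale=2.0, size=500)
    treatment = rng.normal(loc=10.0, scale=2.0, size=500)
    _, p = stats.ttest_ind(control, treatment)
    p_values.append(p)

# Goodness-of-fit check: does the p-value distribution look uniform?
# A small ks_p would suggest something is wrong with the setup.
ks_stat, ks_p = stats.kstest(p_values, "uniform")
print(f"KS statistic={ks_stat:.3f}, p-value={ks_p:.3f}")
```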
What if the A/A test fails?
The reason could be outliers in the data or a metric with a highly skewed distribution. In these cases, we can cap the data.
2.3 Choose significance level, power
The significance level is the threshold we compare the p-value with: if the p-value is less than the significance level, the result is significant and we reject the null hypothesis.
Power is the probability that the test detects a significant difference when a true difference actually exists.
2.4 Calculate sample size
Based on the power and significance level, we can calculate the sample size we need for the experiment.
Besides adjusting the significance level and power, we can also transform or cap the metric to change the required sample size. For highly skewed metrics, transforming or capping reduces the variance, which in turn reduces the required sample size.
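As a rough sketch, the standard per-group sample size for a two-sample test is n = 2(z_{1-α/2} + z_{1-β})² σ² / δ², where σ is the metric's standard deviation and δ is the minimum detectable difference; the σ and δ values below are made up for illustration:

```python
import math

from scipy import stats

def sample_size_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Approximate per-group n for a two-sample test:
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value, two-sided test
    z_beta = stats.norm.ppf(power)           # quantile for the desired power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)

# Detecting a 0.5-unit lift on a metric with standard deviation 5
# needs roughly 1570 users per group at alpha=0.05 and 80% power.
print(sample_size_per_group(sigma=5.0, delta=0.5))  # 1570
```

Note how reducing σ (for example by capping a skewed metric) shrinks n quadratically.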
2.5 Decide time/population/unit to run the experiments
Since some users visit the website only once during an online experiment, duration also matters when considering sample size.
Usually the randomization unit and the analysis unit are the same. For example, if the randomization unit is the user and the analysis unit is clicks-per-user, the calculation is straightforward. But when the randomization unit differs from the analysis unit, say the randomization unit is the user while the analysis unit is click-through-rate-per-page, the pages of one user are correlated, so standard variance formulas no longer apply and we need the bootstrap or the delta method. In this setting a bot can also generate thousands of page views from one user ID; we can limit the number of page views per user to avoid such outliers.
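A sketch of the bootstrap approach for this case: resample at the randomization unit (the user), not the page, to get a valid confidence interval for a ratio metric. The click and page-view distributions below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical user-level data: total clicks and page views per user.
clicks = rng.poisson(2, size=1000)
views = rng.poisson(10, size=1000) + 1  # ensure at least one view

# Ratio metric: overall click-through rate per page.
ctr = clicks.sum() / views.sum()

# Bootstrap at the randomization unit (user), because pages within
# one user are correlated and cannot be resampled independently.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(clicks), size=len(clicks))
    boot.append(clicks[idx].sum() / views[idx].sum())

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"CTR={ctr:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```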
The most common choice is user-based randomization, tracked by user login ID or cookie ID.
2.6 Analyze results
Irrelevant metrics turn significant: the multiple testing problem
When we run thousands of tests, or test many metrics, we will likely get significant results even when they make no sense. This is known as the multiple testing problem. One solution is to separate metrics into tiers and give each tier a different significance level. Another common solution is the Bonferroni correction.
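A minimal sketch of the Bonferroni correction, which compares each p-value against α divided by the number of tests (the p-values below are made up):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni correction: each p-value must beat alpha / m,
    where m is the number of tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Five metrics at overall alpha = 0.05 -> per-metric threshold 0.01,
# so 0.02 and 0.04 are no longer considered significant.
print(bonferroni_significant([0.003, 0.02, 0.04, 0.0005, 0.6]))
# [True, False, False, True, False]
```

The correction is conservative: it controls the family-wise error rate at the cost of power, which is one reason the tiered-significance-level approach is also used.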
Improve power/sensitivity
- Choose the metric that has a smaller variance
- Transform the metric through capping, binarization, or log transformation
- Triggered analysis
- Stratification
- Randomization at a more granular level
- Paired experiment
- Pool control groups
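As a sketch of why the capping item above improves sensitivity, here is a simulated heavily skewed metric whose variance drops after capping at the 99th percentile (the distribution parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical skewed metric, e.g. revenue per user: a lognormal
# with a heavy right tail that inflates the variance.
revenue = rng.lognormal(mean=1.0, sigma=1.5, size=10_000)

# Cap at the 99th percentile: the tail shrinks, the variance drops,
# and so does the sample size required to detect a given lift.
capped = np.minimum(revenue, np.percentile(revenue, 99))

print(f"raw var={revenue.var():.1f}, capped var={capped.var():.1f}")
```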
Sample Ratio Mismatch (SRM)
SRM is a guardrail check that ensures the validity of the experiment results. When we set up the experiment, we design a ratio of users between the treatment and control groups, and the observed ratio should be close to the designed one. If the p-value from a t-test or chi-square test is low, there is an SRM, and the metrics from the experiment are likely invalid.
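A minimal SRM check with a chi-square goodness-of-fit test, using made-up assignment counts for a designed 50/50 split:

```python
from scipy import stats

# Observed assignment counts per group; the design was a 50/50 split.
observed = [50_500, 49_500]
expected = [sum(observed) / 2] * 2

chi2, p = stats.chisquare(observed, expected)
print(f"chi2={chi2:.2f}, p={p:.4f}")

# A very small p-value means the observed split deviates from the
# design more than chance allows: a sample ratio mismatch.
if p < 0.001:
    print("SRM detected: do not trust the experiment's metrics")
```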
Novelty effects
When a new feature is introduced, users might use it a lot simply because it is new, and as time goes by they might use it much less. One way to detect this is to plot usage over time.
3. When is A/B testing not a good idea?
There are cases where A/B testing does not work.
1. We cannot control user behavior.
For example, we cannot force certain user behavior, such as asking users to switch their phones.
2. High opportunity cost
If users do not receive the treatment, we might lose money. For example, we may want to run an ad during an event that only happens once a year.
3. Leakage and Interference between variants
If users interact with each other, also known as a network effect, then the control and treatment groups are not independent. In this case, we need to create isolation to make sure the units in the treatment and control groups are independent. For example, we can use geography-based isolation when designing an experiment on a social network.
4. Experiments require a long time to take effects
Some experiments require a longer time to run, and the long-term and short-term effects can differ. Below are several reasons.
User-learned effect: The user might need a longer time to learn the new feature.
Delayed effect: There is a large time gap between the feature launch and the time the treatment takes effect. For example, months can pass between a customer booking a hotel and actually staying there.
Network effect: For example, in a two-sided marketplace, introducing a new feature can increase demand, but supply needs time to catch up, so the treatment effect takes longer to measure.
Ecosystem change: Policy changes, seasonality, competitor’s similar features.
Keeping the experiment running for a long time can introduce survivorship bias, and the feature can also interact with other new features as time goes on.
Alternative methods such as cohort analysis and reverse experiments can be used to measure long-term effects.
4. Alternative methods when A/B testing is expensive or not feasible
An observational causal study is one option when a controlled experiment is not feasible.
4.1 Observational causal study
Outcome for treated − Outcome for untreated
= (Outcome for treated − Outcome for treated if not treated) + (Outcome for treated if not treated − Outcome for untreated)
= Impact of treatment on the treated + Selection bias
In a randomized controlled experiment, the expected value of the selection bias is zero. But in the cases mentioned in part 3, it is not, and that is where causal studies come into play. In contrast to A/B testing, a causal study has no randomized assignment of units; it looks at historical data. Although both causal studies and retrospective data analyses use historical data, their goals differ: the goal of a causal study is to find a causal relationship.
Methods
1. Interrupted time series (ITS)
ITS is a quasi-experimental design in which we can control the change but cannot randomize the units. We alternate treatment and control on the same population over time. The main confounder is time-based effects such as seasonality.
2. Interleaved experiment
An interleaved experiment is commonly used to evaluate ranking algorithms. One example is to mix results from two algorithms on the same results page and then compare the click-through rates of the two algorithms.
3. Regression Discontinuity Design (RDD)
RDD is commonly used when a clear threshold identifies the treatment group. We select units just above the threshold as the treatment group and those just below it as the control group, which reduces selection bias. One main issue with RDD is, again, confounding: the results can be contaminated if another important factor shares the same threshold. For example, if we want to study alcohol consumption at the legal drinking age of 21, the legal gambling age is also 21.
4. Instrumental variable(IV) and Natural Experiment
An IV approximates random assignment. For example, to compare earnings between veterans and non-veterans, the Vietnam War draft lottery can serve as an instrument. A natural experiment works similarly, such as twin studies in medicine.
5. Propensity score matching
Similar to stratified sampling, PSM segments users into groups by matching on a constructed propensity score, the estimated probability of receiving the treatment given observed covariates.
6. Difference in difference (DD)
We apply the treatment to the treatment group at a certain time T, measure the difference in the treatment group before and after T, and compare it with the difference in the control group before and after T. The change in the control group over time captures external factors such as seasonality and inflation.
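The DD estimate itself is simple arithmetic; a toy sketch with made-up before/after group means:

```python
# Hypothetical before/after means for the two groups.
treat_before, treat_after = 10.0, 14.0
ctrl_before, ctrl_after = 9.0, 11.0

# The control group's change (+2.0) captures external factors such as
# seasonality; subtracting it from the treatment group's change (+4.0)
# leaves the treatment effect.
dd = (treat_after - treat_before) - (ctrl_after - ctrl_before)
print(dd)  # 2.0
```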
Pitfalls
Confounding effects and deceptive correlations are common pitfalls in observational causal studies.
4.2 More methods
- User experience research
- Focus group
- Survey
- Log-based analysis
The methods above can also be used when A/B testing is not feasible or expensive to run.
References:
Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge: Cambridge University Press. doi:10.1017/9781108653985
