Articles on Shuyan Mei

A Summary of Book Trustworthy Online Controlled Experiments

Mon, 18 Jan 2021 14:48:25 -0500

Recently I finished reading the book Trustworthy Online Controlled Experiments, and here I put together my reading notes.

The book has five parts. The first two parts are a high-level introduction to online controlled experiments: why do we need controlled experiments, and what metrics should we use to evaluate them? The last three parts are more technical. The third part introduces alternative methods for when controlled experiments are not feasible. The fourth part focuses on building the experimentation platform. The last part focuses on how to analyze experiment results, the potential pitfalls, and methods to improve the analysis.

1. Why should we use A/B Testing?

Imagine we introduce a new feature into our online service and see an increase in traffic. Can we claim that rolling out the feature to all customers will increase user engagement? Not necessarily. The feature can be positively or negatively correlated with traffic, or have nothing to do with it. A/B testing helps establish causality with high confidence and has more power to detect small or unexpected changes.

2. How to design the experiment?

2.1 Create metric based on objective

In reality, we usually have more than one business metric to use in experimental design. These metrics must be measurable, attributable, sensitive, and timely. One way to combine multiple metrics is to normalize each metric and then create a weighted combination of them, similar to how a credit score is built. To decide how many metrics to use, one rough rule of thumb is to limit the number of key metrics to five. Because of the multiple testing problem, if we have too many metrics, the probability that at least one of them looks significant purely by chance is high. For example, with five independent metrics each tested at a 0.05 significance level, the chance of at least one false positive is 1 - 0.95^5, or about 23%.

2.2 A/A test

Why do we need an A/A test?

An A/A test is almost the same as an A/B test, except that the treatment and control groups receive the same experience. Some benefits of running an A/A test:

Ensure there is no bias between the treatment and control groups. For example, if we reuse the users from the last experiment in the current one, there might be residual effects: the treatment from the last experiment can influence the current experiment.
Assess metric variability. As we collect more data over time, we want to see how the distribution of the metric changes.

How do we run an A/A test?

Simulate thousands of A/A experiments and check whether the distribution of p-values is uniform. Running thousands of real tests can be expensive, so one workaround is to reuse data from a previous experiment. For example, we take the stored results from last week, randomly reassign the users into treatment and control groups, and calculate the p-value for each reassignment. To check whether the p-values follow a uniform distribution, we can run a goodness-of-fit test such as Anderson-Darling or Kolmogorov–Smirnov (KS).
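The replay idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-user metric is synthetic (a hypothetical "sessions per user"), the p-value comes from a simple large-sample z-test, and the KS distance to the uniform distribution is computed by hand so that the sketch needs nothing beyond the standard library.

```python
import math
import random
from statistics import NormalDist


def two_sample_p_value(a, b):
    """Two-sided p-value from a large-sample z-test on the difference in means."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_p_values(metric_values, n_simulations=500, seed=0):
    """Repeatedly re-split the same users at random into two groups and test them."""
    rng = random.Random(seed)
    values = list(metric_values)
    half = len(values) // 2
    p_values = []
    for _ in range(n_simulations):
        rng.shuffle(values)
        p_values.append(two_sample_p_value(values[:half], values[half:]))
    return p_values


def ks_distance_to_uniform(p_values):
    """Kolmogorov-Smirnov distance between the p-values and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


# Hypothetical stored metric (e.g. sessions per user) from a past experiment.
data_rng = random.Random(42)
metric = [data_rng.gauss(10, 3) for _ in range(1000)]

p_values = simulate_aa_p_values(metric)
ks_distance = ks_distance_to_uniform(p_values)
critical = 1.36 / math.sqrt(len(p_values))  # approximate 5% critical value for large n
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"KS distance={ks_distance:.3f} (critical ~{critical:.3f}), "
      f"share of p<0.05={share_significant:.3f}")
```

If the setup is healthy, the KS distance should stay below the critical value and roughly 5% of the simulated p-values should fall under 0.05.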

What if the A/A test fails?

The reason could be outliers in the data or a metric with a highly skewed distribution. In this case, we can cap the metric values.

2.3 Choose significance level, power

The significance level is the threshold we compare the p-value against. If the p-value is less than the significance level, the result is statistically significant, which means we reject the null hypothesis.

Power is the probability that the test detects a difference (a significant result) when a difference actually exists.

2.4 Calculate sample size

Based on the power and significance level, we can calculate the sample size we need for the experiment.
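As a rough sketch of that calculation, assuming a two-sided z-test on a difference in means with equal-sized groups, the per-group sample size is n = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2, where sigma is the metric's standard deviation and delta is the smallest difference we want to detect. The values of `sigma` and `delta` below are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Users needed in EACH group for a two-sided, two-sample z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical metric: standard deviation 3.0, minimum detectable difference 0.1.
n = sample_size_per_group(sigma=3.0, delta=0.1)

# Common rule of thumb for alpha = 0.05 and 80% power: n ~ 16 * sigma^2 / delta^2.
rule_of_thumb = 16 * 3.0 ** 2 / 0.1 ** 2

# Capping the metric lowers sigma, which lowers the required sample size.
n_capped = sample_size_per_group(sigma=2.0, delta=0.1)
print(n, round(rule_of_thumb), n_capped)
```

The exact formula and the rule of thumb land close to each other, and the capped metric (smaller `sigma`) needs noticeably fewer users.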

Besides adjusting the significance level and power, we can also transform or cap the metric to change the required sample size.

For metrics with high skewness, transforming or capping the metric reduces its variance, and therefore the required sample size.

2.5 Decide time/population/unit to run the experiments

Since some users visit the website only once during an online experiment, the duration also matters when considering the sample size.

The usual approach is to make the randomization unit and the analysis unit the same. For example, if the randomization unit is the user and the metric is clicks per user, the variance is easy to compute. If the randomization unit differs from the analysis unit, for example the randomization unit is the user but the metric is click-through rate per page view, then page views from the same user are not independent, and we need the bootstrap or the delta method to estimate the variance. Outliers also matter more in this case: a bot can generate thousands of page views under one user ID, so we can limit the number of page views per user to avoid such outliers.

The most common choice is user-based randomization. We can identify users by login ID or cookie ID.
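To make the delta method concrete, here is a minimal sketch for a click-through rate computed as total clicks over total page views when users are the randomization unit. The approximation used is Var(x̄/ȳ) ≈ (1/n) [Var(x)/ȳ² + x̄² Var(y)/ȳ⁴ - 2 x̄ Cov(x, y)/ȳ³]. The user data is synthetic and the function name is my own.

```python
import math
import random


def delta_method_ratio_variance(clicks, views):
    """Ratio of means and its delta-method variance, with users as i.i.d. units."""
    n = len(clicks)
    mean_x = sum(clicks) / n
    mean_y = sum(views) / n
    var_x = sum((x - mean_x) ** 2 for x in clicks) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in views) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(clicks, views)) / (n - 1)
    ratio = mean_x / mean_y
    variance = (var_x / mean_y ** 2
                + mean_x ** 2 * var_y / mean_y ** 4
                - 2 * mean_x * cov_xy / mean_y ** 3) / n
    return ratio, variance


# Hypothetical users: each has 1-20 page views and clicks on about 10% of them.
rng = random.Random(7)
views = [rng.randint(1, 20) for _ in range(2000)]
clicks = [sum(rng.random() < 0.1 for _ in range(v)) for v in views]

ctr, ctr_variance = delta_method_ratio_variance(clicks, views)
print(f"CTR={ctr:.4f}, standard error={math.sqrt(ctr_variance):.5f}")
```

The resulting standard error can then be plugged into the usual z-test when comparing the click-through rate of treatment and control.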

2.6 Analyze results

Irrelevant metric significant: multiple testing problem

When we run thousands of tests, or test many different metrics, we are likely to get some significant results even when they do not make sense. This is known as the multiple testing problem. One solution is to separate metrics into different tiers and give each tier a different significance level. Another common solution is the Bonferroni correction.
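The Bonferroni correction simply divides the significance level by the number of tests. A minimal sketch, with hypothetical metric names and p-values:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each metric as significant only if p < alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values from one experiment with five metrics.
p_values = {
    "revenue_per_user": 0.004,
    "sessions_per_user": 0.030,
    "clicks_per_user": 0.200,
    "page_load_time": 0.049,
    "bounce_rate": 0.012,
}
significant = bonferroni_significant(p_values)
for name, flag in significant.items():
    print(f"{name}: {'significant' if flag else 'not significant'}")
```

With five metrics the per-metric threshold drops from 0.05 to 0.01, so several results that look significant on their own no longer pass.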

Improve power/sensitivity

Choose a metric that has a smaller variance
Transform the metric through capping, binarization, or log transformation
Triggered analysis
Stratification
Randomization at a more granular level
Paired experiment
Pool control groups

Sample Ratio Mismatch (SRM)

SRM is a guardrail metric that protects the validity of the experiment results. When we set up the experiment, we choose a ratio of users between the treatment and control groups, and the observed ratio should be close to the designed ratio. When the p-value from a t-test or chi-square test on the observed ratio is low, there is a sample ratio mismatch, and the other metrics in the experiment are likely to be invalid.
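A minimal sketch of the chi-square version of this check, using only the standard library. For two groups the test has one degree of freedom, so the p-value can be written with the complementary error function. The user counts below are hypothetical.

```python
import math


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the user split."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi-square > c) = erfc(sqrt(c / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical experiment designed as a 50/50 split.
p_mismatch = srm_p_value(821_588, 815_482)
p_healthy = srm_p_value(5_010, 4_990)
print(f"suspicious split: p={p_mismatch:.2e}, healthy split: p={p_healthy:.2f}")
```

Even though the first split looks close to 50/50, the sample is so large that the p-value is tiny, which signals a mismatch worth investigating before trusting any other metric.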

Novelty effects

When a new feature is introduced, users might use it a lot because it is new, and use it much less as time goes by. One way to detect this is to plot usage over time.

3. When is A/B testing not a good idea?

There are cases where A/B testing does not work.

1. We cannot control user behavior.

For example, we cannot ask users to switch their phones.

2. High opportunity cost

If users do not receive the treatment, we might lose money. For example, we want to run an ad for an event that only happens once a year.

3. Leakage and Interference between variants

If users interact with each other, also known as the network effect, then the control and treatment groups are not independent. In this case, we need to create isolation to make sure that the units in the treatment and control groups are independent. For example, we can use geography-based isolation when designing an experiment on a social network.

4. Experiments require a long time to take effect

Some experiments need to run for a longer time, because the long-term and short-term effects can be different. Below are several reasons.

User-learned effect: Users might need a longer time to learn the new feature.

Delayed effect: There is a large time gap between the feature launch and the time the treatment takes effect. For example, there could be months between a customer booking a hotel and actually staying there.

Network effect: For example, in a two-sided marketplace, introducing a new feature can increase demand, but the supply needs time to catch up, so the treatment effect takes longer to measure.

Ecosystem change: Policy changes, seasonality, competitors' similar features.

Keeping the experiment running for a long time can introduce survivorship bias, and the feature can also interact with other new features as time goes on.

Alternative methods such as cohort analysis and reverse experiments can be used to measure the long-term effect.

4. Alternative methods when A/B testing is expensive or not feasible

An observational causal study is one option when a controlled experiment is not feasible.

4.1 Observational causal study

Outcome for treated - Outcome for untreated
= [Outcome for treated - Outcome for treated if not treated] + [Outcome for treated if not treated - Outcome for untreated]
= Impact of treatment on treated + Selection bias

In a randomized controlled experiment, the expected value of the selection bias is zero. But in the cases mentioned in part 3, it is not. That is where the causal study comes into play. In contrast to A/B testing, a causal study has no randomized assignment of units; it looks at historical data. Both causal studies and retrospective data analysis use historical data, but their goals are different: the goal of a causal study is to find a causal relationship.

Methods

1. Interrupted time series (ITS)

ITS is a quasi-experimental design in which we can control the change but cannot randomize the units. The same population serves as both control and treatment at different points in time. The main confounders are time-based effects such as seasonality.

2. Interleaved experiment

Interleaved experiments are commonly used to evaluate ranking algorithms. One example is to mix the results from two algorithms into one list and then compare the click-through rate on the results from each algorithm.

3. Regression Discontinuity Design (RDD)

RDD is commonly used when there is a clear threshold that identifies the treatment group. We select the units just above the threshold as the treatment group, and the units just below the threshold as the control group. Because units on either side of the threshold are very similar, this reduces selection bias. One main issue of RDD is again confounding effects. The results can be contaminated if there is another important factor that shares the same threshold. For example, we might want to study the effect of alcohol consumption at the legal drinking age of 21, but the legal gambling age is also 21.

4. Instrumental variable (IV) and natural experiment

An IV approximates random assignment. For example, to compare earnings between veterans and non-veterans, the Vietnam War draft lottery can serve as an instrument. A natural experiment is a situation where something close to random assignment happens on its own, such as twins in medical twin studies.

5. Propensity score matching

Similar to stratified sampling, PSM segments users into groups by matching them on a constructed propensity score, which is the estimated probability of receiving the treatment given the observed covariates.

6. Difference in difference (DD)

We apply a treatment to the treatment group at a certain time T, then compare the change in the treatment group before and after T with the change in the control group over the same period. The change in the control group captures external factors such as seasonality and inflation.
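The arithmetic is short enough to show directly. The numbers below are made up for illustration, and the sketch assumes the two groups would have followed parallel trends without the treatment.

```python
def difference_in_differences(treat_before, treat_after, control_before, control_after):
    """(change in the treatment group) minus (change in the control group)."""
    def average(values):
        return sum(values) / len(values)

    treatment_change = average(treat_after) - average(treat_before)
    control_change = average(control_after) - average(control_before)
    return treatment_change - control_change


# Hypothetical weekly revenue per store, before and after time T.
effect = difference_in_differences(
    treat_before=[10, 12, 11],
    treat_after=[15, 17, 16],
    control_before=[9, 11, 10],
    control_after=[11, 13, 12],
)
print(effect)  # treatment changed by 5, control by 2, so the estimate is 3
```

Subtracting the control group's change removes whatever happened to everyone over the period, leaving only the part attributable to the treatment.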

Pitfalls

Confounding effects and deceptive correlations are common pitfalls in observational causal studies.

4.2 More methods

User experience research
Focus group
Survey
Log-based analysis

The methods above can also be used when A/B testing is not feasible or expensive to run.

References:

Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge: Cambridge University Press. doi:10.1017/9781108653985

Optimization Learning Notes

Mon, 28 Dec 2020 16:28:27 -0500

In the past year, I started to pick up some optimization algorithms at work to solve problems like finding the optimal prices that maximize a business's profit under constraints. While my memory is still fresh, I decided to write down my learning notes here. This is not an exhaustive survey of optimization algorithms; it only covers the algorithms I have been exposed to so far.

Optimization Overview

There are different ways to categorize optimization algorithms. Depending on the objective function, we can have linear or non-linear optimization. Based on the input type, we can have numeric or discrete optimization. There are optimizations with and without constraints. Depending on the number of objective functions, we can have single-objective or multi-objective optimization.

1. No constraints and differentiable objective function

The first scenario that comes to my mind is when we have a differentiable objective function without any constraints.

1.1 Gradient Descent

Gradient descent searches for the minimum by moving in a direction such that the value of the cost function at the next step, \( f(x+\delta x) \), is smaller than the current value \( f(x) \). To find the direction, we take the derivative of the function at each step, assuming the function is differentiable, and move against it. Depending on how far we move at each step (the step size), the algorithm can take a long time to converge, or may not converge at all.
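A minimal sketch on a toy function shows both behaviours. The function \( f(x) = (x-3)^2 \), the starting point and the two step sizes are arbitrary choices for illustration.

```python
def gradient_descent(gradient, x0, learning_rate, steps):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x


def gradient_of_f(x):
    """Derivative of f(x) = (x - 3)^2, whose minimum is at x = 3."""
    return 2 * (x - 3)


x_small_step = gradient_descent(gradient_of_f, x0=0.0, learning_rate=0.1, steps=200)
x_large_step = gradient_descent(gradient_of_f, x0=0.0, learning_rate=1.1, steps=50)
print(x_small_step)  # settles very close to 3
print(x_large_step)  # overshoots further on every step and diverges
```

With a step size of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; with 1.1 it grows by a factor of 1.2 per step, so the iterates move away from the minimum.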

1.2 Newton Method

If the cost function is also twice differentiable, we can use the Newton method and quasi-Newton methods, which are based on the Taylor expansion.

Taylor Expansion

Given a real or complex twice differentiable function f, its value at a point \(x\) near \(x_0\) can be approximated as \( f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2 \)

The Newton method takes into account not only the direction of the movement (first derivative) but also the curvature (second derivative). Therefore, each update step of the Newton method is usually more efficient. But sometimes we don't have the second derivative.
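Minimizing the quadratic approximation above gives the update \( x \leftarrow x - f'(x)/f''(x) \). Here is a minimal one-dimensional sketch; the test function \( f(x) = e^x - 2x \), whose minimum is at \( x = \ln 2 \), is my own choice for illustration.

```python
import math


def newton_minimize(first_derivative, second_derivative, x0, tolerance=1e-12, max_steps=50):
    """Minimise a 1-D function by iterating x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(max_steps):
        step = first_derivative(x) / second_derivative(x)
        x -= step
        if abs(step) < tolerance:
            break
    return x


# f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x).
x_min = newton_minimize(lambda x: math.exp(x) - 2, lambda x: math.exp(x), x0=0.0)
print(x_min, math.log(2))
```

On this function the iterates reach the minimum within a handful of steps, without any step size to tune.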

1.3 Quasi-Newton Method

To handle the case where we don't have the second derivative, a quasi-Newton method can be used. The main difference is that a quasi-Newton method replaces the second derivative with an approximation of it in the computation.

1.4 Why not use an analytical solution?

Since we can take derivatives, why not just set the derivative of the objective function to zero and solve analytically? One main reason is that with a huge dataset and many variables, the computation can take a long time if we need to do matrix operations, while gradient descent and the Newton method are iterative, so they can be less expensive.

2. Not differentiable?

In reality, we are not always this lucky. Not every objective function is differentiable. Consider the discrete case below.

Example: the traveling salesman

The traveling salesman problem is a classical discrete optimization problem. A salesman starts from city A, visits N cities exactly once each, and eventually comes back to city A. What is the shortest route?

In this case, we cannot find an analytical solution. The brute force solution is to iterate over all permutations, which has a time complexity of O(N!). There are algorithms we can use here instead, such as simulated annealing, genetic algorithms, and random hill climbing.

I summarize the algorithms below. These algorithms can be effective in discrete cases.

2.1 Genetic Algorithm

A genetic algorithm is one type of evolutionary algorithm. It borrows the idea of natural selection from biology. Take the traveling salesman as an example. The genetic algorithm first randomly generates a population (a set of routes) and ranks the routes by fitness, which in this case means a shorter total distance. The next step is to randomly select two routes as the 'parent routes' and combine elements of each parent to make a 'child'. This process is known as crossover. To explore more possibilities, the final step is mutation, which randomly swaps two cities in a route with a predefined probability (say 3%). The children form the next generation, and we repeat the process until the new generation is full. Over the generations, the population improves (the routes get shorter).

Because of the randomness in mutation and crossover, we do not always reach the global optimum, but we can reach a good local optimum fairly quickly.
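A compact sketch of these steps for the traveling salesman is below. Several details are my own assumptions rather than part of the description above: parents are drawn from the fitter half of the population, the 10 best routes are carried over unchanged (elitism), crossover is an ordered crossover, and the cities are 12 hypothetical points on a circle.

```python
import math
import random


def route_length(route, cities):
    """Total length of the closed tour that visits cities in the given order."""
    return sum(math.dist(cities[route[i]], cities[route[(i + 1) % len(route)]])
               for i in range(len(route)))


def ordered_crossover(parent_a, parent_b, rng):
    """Keep a slice of parent_a in place, fill the other positions in parent_b's order."""
    i, j = sorted(rng.sample(range(len(parent_a)), 2))
    middle = parent_a[i:j]
    rest = [city for city in parent_b if city not in middle]
    return rest[:i] + middle + rest[i:]


def mutate(route, rng, rate=0.03):
    """With probability `rate`, swap two randomly chosen cities."""
    route = route[:]
    if rng.random() < rate:
        i, j = rng.sample(range(len(route)), 2)
        route[i], route[j] = route[j], route[i]
    return route


def genetic_algorithm(cities, population_size=60, generations=200, elite=10, seed=0):
    rng = random.Random(seed)
    n = len(cities)
    population = [rng.sample(range(n), n) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=lambda route: route_length(route, cities))
        parents = population[:population_size // 2]   # fitter half
        next_generation = population[:elite]          # elitism
        while len(next_generation) < population_size:
            parent_a, parent_b = rng.sample(parents, 2)
            next_generation.append(mutate(ordered_crossover(parent_a, parent_b, rng), rng))
        population = next_generation
    return min(population, key=lambda route: route_length(route, cities))


# Hypothetical cities: 12 points evenly spaced on the unit circle.
cities = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12)) for k in range(12)]
best_route = genetic_algorithm(cities)
print(best_route, round(route_length(best_route, cities), 3))
```

For points on a circle the shortest tour simply walks around the circle, which makes it easy to judge how close the returned route is to the optimum.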

2.2 Simulated Annealing

This algorithm's idea comes from annealing metal. If we cool the metal fast, the atoms in the metal end up randomly arranged, but if we cool it slowly, the structure becomes more ordered and more stable. The algorithm works in the following way. We start with an initial temperature. At each step, we evaluate the fitness of a neighboring route and decide whether to switch to it: a better route is always accepted, and a worse route is accepted with a probability that depends on the temperature. We decrease the temperature over time, so we become less and less likely to accept a worse route. By occasionally accepting worse routes early on, we are less likely to get stuck at a local minimum and more likely to reach the global optimum.
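A minimal sketch for the traveling salesman follows. The acceptance rule exp(-delta / temperature), the geometric cooling schedule, the segment-reversal neighbor move and the 12 hypothetical cities on a circle are all illustrative choices, not the only way to set this up.

```python
import math
import random


def route_length(route, cities):
    """Total length of the closed tour that visits cities in the given order."""
    return sum(math.dist(cities[route[i]], cities[route[(i + 1) % len(route)]])
               for i in range(len(route)))


def simulated_annealing(cities, initial_temp=10.0, cooling=0.995, steps=5000, seed=0):
    rng = random.Random(seed)
    n = len(cities)
    current = rng.sample(range(n), n)
    current_len = route_length(current, cities)
    best, best_len = current[:], current_len
    temp = initial_temp
    for _ in range(steps):
        # Neighbor: reverse a randomly chosen segment of the current route.
        i, j = sorted(rng.sample(range(n), 2))
        candidate = current[:i] + current[i:j + 1][::-1] + current[j + 1:]
        candidate_len = route_length(candidate, cities)
        delta = candidate_len - current_len
        # Always accept improvements; accept worse routes with probability exp(-delta / temp).
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current, current_len = candidate, candidate_len
            if current_len < best_len:
                best, best_len = current[:], current_len
        temp *= cooling  # cool down, so worse routes are accepted less and less often
    return best, best_len


# Hypothetical cities: 12 points evenly spaced on the unit circle.
cities = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12)) for k in range(12)]
best_route, best_len = simulated_annealing(cities)
print(best_route, round(best_len, 3))
```

Early on the temperature is high and almost any move is accepted; by the end the search behaves like plain hill climbing.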

2.3 Hill-Climbing with Random Restart

Hill climbing is as straightforward as its name suggests. We start with a random route, look at a neighboring route, and compare it with the current one. If the neighbor is better, we move to it. The problem is that the search can get stuck at a local minimum. Random restarts address this: we rerun the search from several random starting routes and keep the best result, which makes it less likely that we end up in a poor local optimum.
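A small sketch of hill climbing with random restarts on the traveling salesman. The neighborhood (swapping two cities), the number of restarts and the 12 hypothetical cities on a circle are assumptions made for illustration.

```python
import math
import random


def route_length(route, cities):
    """Total length of the closed tour that visits cities in the given order."""
    return sum(math.dist(cities[route[i]], cities[route[(i + 1) % len(route)]])
               for i in range(len(route)))


def hill_climb(route, cities):
    """Keep moving to an improving two-city swap until no swap improves the route."""
    best_len = route_length(route, cities)
    improved = True
    while improved:
        improved = False
        for i in range(len(route) - 1):
            for j in range(i + 1, len(route)):
                neighbor = route[:]
                neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
                neighbor_len = route_length(neighbor, cities)
                if neighbor_len < best_len - 1e-12:
                    route, best_len, improved = neighbor, neighbor_len, True
    return route, best_len


def random_restart_hill_climb(cities, restarts=20, seed=0):
    """Run hill climbing from several random routes and keep the shortest result."""
    rng = random.Random(seed)
    n = len(cities)
    results = [hill_climb(rng.sample(range(n), n), cities) for _ in range(restarts)]
    return min(results, key=lambda result: result[1])


# Hypothetical cities: 12 points evenly spaced on the unit circle.
cities = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12)) for k in range(12)]
best_route, best_len = random_restart_hill_climb(cities)
print(best_route, round(best_len, 3))
```

Each individual run stops at the first route that no single swap can improve, so the restarts are what give the search a chance to escape a poor starting point.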

3. Optimization with constraints

In reality, we usually have constraints when doing optimization. Based on the constraint type, there are different methods to optimize.

3.1 Lagrange multiplier for Equality constraint only

If the constraint can be expressed as an equality, we can use a Lagrange multiplier to solve the problem. For example, a retail business wants to maximize its profits given a budget constraint. The costs are labor and raw material, and revenue is a function of labor and raw material. In this scenario, we want to maximize the revenue function f. Let x and y denote the labor and raw material. Then both f and the cost function g are functions of x and y. We want to use up the whole budget, so g(x,y) should ideally equal the budget c.

The optimization problem can be formulated as the following.

\[ \max f(x,y) \]

given the constraint that \( g(x,y)= c \)

where c is a constant.

We want a contour of f to barely touch the constraint curve. At that point of tangency, the gradient of f, which is perpendicular to the contour, must point along the same line as the gradient of the constraint function g.

That is to say,

\[ \nabla f = \lambda \nabla g \]

which is equivalent to

\[ \frac{\partial f}{\partial x} = \lambda \frac{\partial g}{\partial x} \]

\[ \frac{\partial f}{\partial y} = \lambda \frac{\partial g}{\partial y} \]

where \( \lambda \) is a constant (the Lagrange multiplier). Solving the equations above together with the constraint \( g(x,y) = c \), we get the values of x and y.
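As a small worked example (a hypothetical illustration, not taken from the references), take \( f(x,y) = xy \) and \( g(x,y) = x + y \) with budget c. The two gradient equations become

\[ \frac{\partial f}{\partial x} = y = \lambda \cdot 1 \]

\[ \frac{\partial f}{\partial y} = x = \lambda \cdot 1 \]

so \( x = y = \lambda \). Substituting into the constraint \( x + y = c \) gives

\[ x = y = \frac{c}{2}, \qquad f\left(\frac{c}{2}, \frac{c}{2}\right) = \frac{c^2}{4} \]

In words, under this toy revenue function the budget should be split evenly between the two inputs.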

3.2 Interior point method for inequality constraints

However, an equality is a very strict constraint. There are times when we face an inequality constraint. In this case, we can use an interior point method, which uses a barrier function to convert the problem into an unconstrained one and then solves it.
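To illustrate the barrier idea, here is a toy sketch: minimize \( (x-3)^2 \) subject to \( x \le 2 \). The constraint is replaced by a log-barrier term \( -\mu \log(2 - x) \), and the unconstrained problem is solved for smaller and smaller \( \mu \). The problem, the schedule for \( \mu \), and the use of bisection for each subproblem are assumptions chosen to keep the example short; real interior point solvers typically use Newton steps.

```python
def solve_barrier_subproblem(mu, lo=-10.0, hi=2.0 - 1e-12):
    """Minimise (x - 3)^2 - mu * log(2 - x) by bisecting on its derivative."""
    def derivative(x):
        return 2 * (x - 3) + mu / (2 - x)

    # The barrier objective is convex on x < 2, so its derivative is increasing
    # and the minimiser is the point where the derivative changes sign.
    for _ in range(200):
        mid = (lo + hi) / 2
        if derivative(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def barrier_method(mu_exponents=range(8)):
    """Solve a sequence of barrier subproblems with mu = 1, 0.1, ..., 1e-7."""
    path = []
    for k in mu_exponents:
        mu = 10.0 ** -k
        path.append((mu, solve_barrier_subproblem(mu)))
    return path


path = barrier_method()
for mu, x in path:
    print(f"mu={mu:.0e}  x={x:.8f}")
x_final = path[-1][1]
```

The unconstrained minimum of \( (x-3)^2 \) is at 3, which violates the constraint, so the constrained optimum sits on the boundary at 2. The iterates approach 2 from the inside as \( \mu \) shrinks, which is where the name interior point comes from.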

References

[1] https://en.wikipedia.org/wiki/Quasi-Newton_method

[2] https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/lagrange-multipliers-and-constrained-optimization/v/lagrange-multiplier-example-part-1

[3] https://en.wikipedia.org/wiki/Interior-point_method

Are you satisfied with your job as a developer?

Wed, 05 Sep 2018 00:00:00 +0000

Introduction

The tech industry has been booming for years. We hear a lot of stories about the perks one can get working at a tech company, such as competitive salaries, work-life balance, and unlimited vacation. But are the employees really happy with their jobs? What drives their satisfaction, and what makes them leave?

I used data from Stack Overflow's 2017 Annual Developer Survey to investigate this question.

This survey has around 64000 responses from 213 countries. The responses are mostly collected from developers, and the questions cover many aspects of a developer's job and career. Some of the aspects covered:

How did they break into this field in the first place?
The developers' education, especially coding background
The developers' job responsibilities and satisfaction
What makes them look for new opportunities? What do they value most when they look for their next position?
The developers' interaction with Stack Overflow.

Here I am interested in digging into the data to answer three questions.

Part I How satisfied are you as a developer?

There is one rating question in the survey asking about job satisfaction. The answer is rated from 0 to 10, where 10 represents highly satisfied and 0 represents highly dissatisfied. I first filtered out responses with NA values. Below is a table showing the response count and percentage for each rating.

Job Satisfaction	Rating Counts	Percentage
8.0	8983	22.25%
7.0	7969	19.74%
9.0	5573	13.8%
6.0	4726	11.7%
10.0	4148	10.27%
5.0	3749	9.29%
4.0	1865	4.62%
3.0	1635	4.05%
2.0	888	2.2%
0.0	467	1.16%
1.0	373	0.92%

Here, I used a metric called ‘top 3 box’ to measure satisfaction. A Top 3 Box score summarizes the positive responses to a scale survey question. It combines the three highest responses on the scale (here 8, 9 and 10) into one single number: the share of respondents who chose one of them.
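As a quick check, the Top 3 Box score for the whole sample can be computed from the rating counts in the table above. The function name is my own; the counts are copied from the table.

```python
# Rating counts copied from the job satisfaction table.
rating_counts = {
    8: 8983, 7: 7969, 9: 5573, 6: 4726, 10: 4148, 5: 3749,
    4: 1865, 3: 1635, 2: 888, 0: 467, 1: 373,
}


def top_box_score(counts, k=3):
    """Share of responses that fall in the k highest ratings of the scale."""
    top_ratings = sorted(counts, reverse=True)[:k]
    return sum(counts[rating] for rating in top_ratings) / sum(counts.values())


overall_score = top_box_score(rating_counts)
print(f"{overall_score:.2%}")
```

Across all respondents, about 46% gave a rating of 8, 9 or 10. The same function can be applied per country to compare countries.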

The plot below shows job satisfaction by country using the top 3 box metric. I only included countries with at least 500 responses.

The top countries by satisfaction score are the Netherlands, Canada, Sweden and the United States. All four countries have a top 3 box score over 50%.

Part II Does salary drive satisfaction? Is there anything else?

There are many factors that can drive job satisfaction, such as salary, health benefits and vacation. To figure out whether salary drives job satisfaction, I checked the average salary of the top five countries with the highest average salary.

The table below shows the average salary of these countries.

Country	Average Salary
United States	86862.40
Canada	60821.54
United Kingdom	56086.99
Germany	44121.32
India	11603.47

The top countries with high average salaries also have high job satisfaction (except India). Salary does have some impact on job satisfaction. In addition to salary, do benefits also influence employees' satisfaction?

One of the survey's questions is:

When it comes to compensation and benefits, other than base salary, which of the following are most important to you?

The following table shows how many people selected each factor as most important to them.

Important Benefits	Counts
Vacation/days off	5757
Health benefits	4455
Expected work hours	4288
Remote options	5008
Retirement	2658
Annual bonus	2983
Equipment	4002
Professional development sponsorship	3615
Stock options	1300
Child/elder care	694
Long-term leave	1240
Meals	1258
Other	247
Private office	872
Education sponsorship	1287
Charitable match	199
None of these	82

The top three factors by count are vacation/days off, remote options and health benefits.

Part III Why did people leave their jobs?

To figure out why people are leaving their jobs, I took a closer look at the question below in the survey.

You said before that you used to code as part of your job, but no longer do. To what extent do you agree or disagree with the following statements?

The top three reasons for people to quit coding are:

I don’t think my coding skills are up to date
If money weren’t an issue, I would take a coding job again
My career is going the way I thought it would 10 years ago

and they account for 17%, 17% and 15% of the total, respectively.

Technical skills are essential for a developer, and one needs to keep their coding skills up to date. Just as important as coding skills, money also factors into developers' career decisions. At the same time, some developers are looking for a career change and want to try out different things. That is also one reason they left their jobs.

Hello World!

Sat, 25 Aug 2018 00:00:00 +0000

Hello, I finally got my first personal webpage set up!

I will use this website to share some of my projects in statistics, machine learning or programming.

This website is hosted on GitHub, and I used Hugo with the Ananke theme to set it up.