What is the problem? The Urban Outfitters platform is seeking to update their subscription service. The Business Lead for the Subscription workstream believes that increasing user sign-ups for purchases could enable personalized content delivery and yield deeper insights into user preferences based on location and interests. However, the Product Management and UX/UI teams have identified the current sign-up process as lengthy and complex. They suggest redesigning it to streamline the experience. This redesign would require front-end and back-end changes by the data engineering team, including updates to the data schema for form data collection, login and authentication processes, as well as user experience and UI design.
These modifications entail considerable effort. Therefore, the business seeks assurance that the proposed update will indeed boost subscription rates. Will the sign-up redesign impact subscription numbers? What would be the extent of this impact? Is the effort justified by the expected outcome? These are critical questions that need to be addressed.
What is the solution? By analyzing data and conducting A/B testing, we can determine the potential impact of the sign-up redesign on subscription rates. This analytical approach ensures that decisions are informed and aligned with the overarching goal of enhancing user engagement and driving business growth.
We have outlined seven key steps:
Understanding the Business
Setting up the Hypothesis
Designing the Experiment
Running the Experiment
Identifying Faulty Experiments
Statistical Inference
Launch Decision
Understanding the Business
This step involves two critical aspects:
Firstly, we examine the business, product, application, and experimentation goals, along with potential metrics and key performance indicators (KPIs). Ultimately, our aim is to enhance the application’s quality, thereby improving both the product and business. For instance, the business goal may revolve around providing an exceptional purchasing experience across physical and online retail stores, while the product goal could be centered on delivering a top-notch, personalized ecommerce platform. As for experimentation, our goal might involve launching a redesigned sign-up process to enhance subscriber growth.
Secondly, to achieve our business goal, we must define a North Star Metric (NSM) that aligns with our overarching objectives. This metric should deliver value to users while contributing to overall business profitability and long-term growth. In our scenario, we designate the NSM as the “# of subscribers.”
In addition to the NSM, we identify a driver metric, which serves as our primary KPI at the application level. Considering our product is the ecommerce platform and the application is the sign-up form app, our primary KPI could be “Average sign-ups per day.”
However, relying solely on the driver metric may not provide a comprehensive understanding. Therefore, we define secondary metrics to further elucidate the driver metric, along with guardrail metrics essential for validating the experiment’s integrity. Secondary metrics might include “Average time spent on the sign-up form” and “Average time spent in the overall application,” while guardrail metrics could involve Sample Ratio Mismatch and AA tests.
Furthermore, segmenting the metrics by factors such as age, location, browser, and device allows for a more granular analysis.
We’ll begin by loading libraries, establishing a connection to Google BigQuery (GBQ), creating a data frame, and examining its columns.
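As an illustration, here is a minimal sketch of that loading step using the google-cloud-bigquery client; the project, dataset, and table names are placeholders rather than the actual source.

```python
# A minimal sketch of the loading step, using the google-cloud-bigquery client.
# The project, dataset, and table names are placeholders, not the real dataset.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")      # hypothetical project ID

query = """
    SELECT visitor_id, date, experiment, `group`, email
    FROM `my-gcp-project.ecommerce.signup_form_visits`   -- hypothetical table
"""
df = client.query(query).to_dataframe()

print(df.shape)
print(df.columns.tolist())
print(df.head())
```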
Visitor_id: Unique identifier for each user.
Date: Date of the user’s visit to the sign-up form.
Experiment: Indicates whether the user was exposed to an experiment.
Group: Assigns users to experimental groups if they were exposed to an experiment.
Email: User’s email address if they signed up.
Our analysis starts with general statistics, encompassing measures of central tendency, dispersion, distribution, and more for the submitted data.
First, let’s address missing values. Notably, approximately 90% of the experiment values are missing, reflecting that the majority of users haven’t participated in experiments. However, the remaining 10% were part of the AA Test, which we’ll delve into later.
With a total visitor count of 309,903 and 31,295 sign-ups, we calculate an overall sign-up rate of approximately 0.10.
Next, we examine trends in visitor count and sign-up rate per day. The mean sign-up rate per day hovers around 0.1005, while the mean visitor count per day stands at approximately 10,000.
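A short sketch of that daily aggregation, reusing the DataFrame loaded above and treating a non-null email as a sign-up (column names are assumptions):

```python
# Daily visitor counts and sign-up rates, treating a non-null email as a sign-up.
# Reuses the DataFrame `df` loaded above; column names are assumptions.
df["signed_up"] = df["email"].notna().astype(int)

daily = df.groupby("date").agg(visitors=("visitor_id", "nunique"),
                               signups=("signed_up", "sum"))
daily["signup_rate"] = daily["signups"] / daily["visitors"]

print(daily["visitors"].mean())      # ~10,000 visitors per day
print(daily["signup_rate"].mean())   # ~0.10 mean daily sign-up rate
```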
Setting up the Hypothesis
To establish a hypothesis, four key elements must be defined:
Hypothesis Statement
Statistical Significance Level
Statistical Power
Minimum Detectable Effect (MDE)
Hypothesis Statement
Business hypotheses originate from the Product Manager, asserting that a change in application features will improve functionality, substantiated by justification. For instance, the Product Manager posits that simplifying the sign-up steps will increase registration rates due to enhanced user ease.
Statistical hypotheses are formulated by the Data Scientist, delineating the null hypothesis (H(o)) and alternative hypothesis (H(A)). In our case:
H(o): The average sign-up rate per day is the same between the old and the new sign-up design.
H(A): The average sign-up rate per day is different between the new and the old sign-up design.
Statistical Significance
Significance level denotes the strength of evidence required to reject the null hypothesis (H(o)): we reject H(o) only when the p-value from the statistical test (a z- or t-test) falls below this threshold. For our ecommerce platform sign-up design A/B testing, a significance level of 0.05 is set. Adjustments can be made based on risk; for instance, higher-risk experiments may warrant a significance level of 0.01.
Statistical Power
Suppose we’ve conducted our t-test and obtained a p-value below the threshold for statistical significance. How confident can we be that an effect truly exists? This is where statistical power comes in: it is the probability of rejecting the null hypothesis (H(o)) when an effect is indeed present. It’s crucial to establish a predefined threshold for statistical power to guide our analysis effectively.
For instance, consider a scenario where we’ve set our significance level at 0.05 and our predefined statistical power at only 0.10. Upon conducting the test, if the p-value exceeds the threshold for statistical significance, we fail to reject the null hypothesis (H(o)). However, because we set such a low probability of detecting the effect, this conclusion carries considerable uncertainty due to low statistical power.
Here’s another scenario to illustrate: suppose our significance level is set at 0.05 and our predefined statistical power is 0.80, indicating a high probability of detecting an effect. After conducting the test, if the p-value falls below the threshold for statistical significance, we confidently reject the null hypothesis (H(o)) and conclude that there is indeed an effect present.
In practical application, we typically aim for a high probability of detecting an effect, often setting the statistical power at 0.80. However, if the p-value exceeds the threshold for statistical significance yet we still suspect the presence of an effect, we have the option to increase the statistical power to 0.90 (for example, by collecting a larger sample) and rerun the test.
Minimum Detectable Effect (MDE)
MDE defines the minimum effect size deemed meaningful for practical application and provides context for interpreting experimental outcomes. An observed effect at or above the MDE is practically significant; an effect below it may still be statistically significant yet be too small to justify acting on. Practical significance is therefore related to, but distinct from, statistical significance, which is judged by comparing p-values against the significance level.
Designing Experiment
Three facets guide experiment design:
Randomization Unit: Users are randomly allocated to control (A) and treatment (B) groups to mitigate bias and ensure equitable baseline conditions.
Sample Size Calculation: Sample size is calculated based on predefined thresholds for significance level, power, and effect size, ensuring experiment validity.
Experimentation Duration: The total duration for data collection is determined, considering factors like day of the week effects. Experiments typically span less than 14 days, balancing observation period and practical considerations.
A common misconception is equating experimentation duration with user observation time; it’s crucial to delineate a fixed observation horizon for accurate analysis, ideally within 1-2 weeks to minimize external factors.
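A sketch of this calculation using statsmodels' power analysis is shown below. The baseline rate of 0.10 comes from the data above, while the absolute MDE of 0.01 is an illustrative assumption, so its output will differ from the per-day figures quoted next unless the same inputs are used.

```python
# A sketch of the sample-size calculation using statsmodels' power analysis.
# Baseline rate comes from the data above; the MDE of +0.01 absolute is an
# illustrative assumption for this sketch.
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10
mde_absolute = 0.01                       # assumed minimum detectable effect
alpha, power = 0.05, 0.80

effect_size = proportion_effectsize(baseline_rate + mde_absolute, baseline_rate)
n_per_group = NormalIndPower().solve_power(effect_size=effect_size,
                                           alpha=alpha, power=power,
                                           ratio=1.0, alternative="two-sided")
total_n = 2 * np.ceil(n_per_group)        # both groups combined

for days in (21, 14, 7):
    print(f"For a {days}-day experiment, "
          f"{np.ceil(total_n / days):.0f} users are required per day")
```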
For a 21-day experiment, about 1,429 users are required per day.
For a 14-day experiment, about 2,143 users are required per day.
For a 7-day experiment, about 4,286 users are required per day.
Running the Experiment
Let’s run the experiment for Urban Outfitters.
We can see that the control group (A) sign-up rate is 0.0956 and the treatment group (B) sign-up rate is 0.1078.
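A sketch of that per-group comparison, assuming the experiment results sit in a DataFrame with a group label and an email column that is non-null when the visitor signed up (the file and column names are placeholders):

```python
# Per-group sign-up rates for the redesign experiment.
# File and column names below are placeholders for the real experiment export.
import pandas as pd

ab_df = pd.read_csv("signup_experiment_results.csv")   # hypothetical export
ab_df["signed_up"] = ab_df["email"].notna().astype(int)

rates = ab_df.groupby("group")["signed_up"].agg(["count", "sum", "mean"])
rates.columns = ["visitors", "signups", "signup_rate"]
print(rates)   # control ~0.096, treatment ~0.108 in this experiment
```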
Before we dive into conducting statistical inference, we need to make sure the experiment was not faulty.
Identifying Faulty Experiments
Various factors influence the validity of experiments, including:
Novelty Effect: Users may resist change initially, affecting their response to new features or designs.
Interference between Groups: Features like recommenders can inadvertently impact different experimental groups.
Holiday Effect: Seasonal variations may confound experimental results.
In our scenario, we focus on two validity checks: the AA test and Sample Ratio Mismatch (SRM).
AA Test
The AA Test plays a crucial role in ensuring the comparability of experimental groups. It employs a Chi-Squared Test to assess whether the groups exhibit significant differences, with the null hypothesis (H(o)) stating that the groups are identical, and the alternative hypothesis (H(a)) positing differences between the groups. A significant outcome in the AA Test alerts to possible flaws in the experimental setup, signaling the need for further investigation.
Let’s run an AA test on the Urban Outfitters data.
The control sign-up rate is 0.101 and the treatment sign-up rate is 0.0988. We also examine the sign-up rates per day for each group.
Let’s run the Chi-Square test.
Ho: The sign-up rates between blue and green are the same.
Ha: The sign-up rates between blue and green are different.
Significance level: 0.05
Chi-Square = 0.577 | P-value = 0.448
Conclusion:
Fail to reject Ho. Therefore, proceed with the AB test.
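A minimal sketch of this check, reusing the DataFrame loaded earlier and treating rows with a non-missing experiment value as the AA-test participants (column names are assumptions):

```python
# AA-test check: a chi-squared test on the 2x2 table of sign-ups vs. non-sign-ups
# for the two randomly assigned groups. Reuses the BigQuery DataFrame `df` from
# earlier; column names are assumptions.
import pandas as pd
from scipy.stats import chi2_contingency

aa = df[df["experiment"].notna()].copy()          # the ~10% in the AA test
aa["signed_up"] = aa["email"].notna().astype(int)

contingency = pd.crosstab(aa["group"], aa["signed_up"])  # rows: groups, cols: 0/1
chi2, p_value, dof, expected = chi2_contingency(contingency, correction=False)
print(f"Chi-Square = {chi2:.3f} | P-value = {p_value:.3f}")
# A p-value above 0.05 means we fail to reject Ho: the groups are comparable.
```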
Sample Ratio Mismatch (SRM)
Let’s further check if the sample ratio is 1:1 or not.
Ho: The ratio of samples is 1:1.
Ha: The ratio of samples is not 1:1.
Significance level: 0.05
Chi-Square = 1.290 | P-value = 0.256
Conclusion:
Fail to reject Ho. Therefore, there is no SRM.
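A sketch of this check as a chi-squared goodness-of-fit test against a 1:1 split, plugging in the group sizes reported in the summary further below:

```python
# Sample Ratio Mismatch check: chi-squared goodness-of-fit of the observed group
# sizes against an expected 1:1 split, using the group sizes reported in the
# summary below (14,942 control and 15,139 treatment).
import numpy as np
from scipy.stats import chisquare

observed = np.array([14942, 15139])               # control, treatment
expected = np.full(2, observed.sum() / 2)         # a 1:1 split
chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"Chi-Square = {chi2:.3f} | P-value = {p_value:.3f}")   # ~1.290, ~0.256
# A p-value above 0.05 means we fail to reject Ho: no evidence of SRM.
```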
Statistical Inference
Upon completion of the experiment, conducting statistical inference is essential to decide whether to reject or fail to reject the null hypothesis (H(o)). Typically, this involves conducting either a Z-test or a T-test and examining the confidence interval.
The Z-test is a statistical procedure utilized to assess whether two samples originate from distinct populations in terms of proportion, whereas the T-test evaluates differences in means between two samples drawn from separate populations. The Z-distribution closely mirrors the Normal Distribution, which serves as the basis for many statistical inferences. On the other hand, the T-distribution exhibits greater variability, particularly evident with smaller sample sizes. However, as sample sizes increase, the T-test effectively converges towards a Z-test.
In our context, we will execute a Chi-Square and T-test, and scrutinize the resulting confidence interval to inform our decision-making process.
Chi-Square Test
Ho: The sign-up rates between blue and green are the same.
Ha: The sign-up rates between blue and green are different.
Significance level: 0.05
Chi-Square = 12.312 | P-value = 0.000
Conclusion:
Reject Ho and conclude that there is statistical significance in the difference of sign-up rates between blue and green buttons.
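For reference, a sketch showing how this chi-squared statistic can be reproduced with scipy, building the 2x2 contingency table from the sign-up counts reported in the summary below:

```python
# Chi-squared test on the A/B results, using the sign-up counts and sample sizes
# reported in the summary below.
import numpy as np
from scipy.stats import chi2_contingency

#                    signed up   did not sign up
table = np.array([[1428, 14942 - 1428],      # control (blue)
                  [1632, 15139 - 1632]])     # treatment (green)

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"Chi-Square = {chi2:.3f} | P-value = {p_value:.3f}")   # ~12.31, <0.001
```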
T Test
Ho: The sign-up rates between blue and green are the same.
Ha: The sign-up rates between blue and green are different.
Significance level: 0.05
T-Statistic = 3.509 | P-value = 0.000
Conclusion:
Reject Ho and conclude that there is statistical significance in the difference of sign-up rates between blue and green buttons.
Let’s look at the Confidence Interval of the test.
Sample Sizes:
Control: 14942
Treatment: 15139
Sign-up Counts (Rates):
Control: 1428 (9.6%)
Treatment: 1632 (10.8%)
Differences:
Absolute: 0.0122
Relative (lift): 12.8%
T Stats
Test Statistic: 3.509475
P-Value: 0.00045
Confidence Intervals:
Absolute Difference CI: (0.005, 0.019)
Relative Difference (lift) CI: (5.7%, 19.9%)
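A sketch of how these figures could be reproduced, reconstructing per-user 0/1 outcomes from the counts above, running Welch's t-test, and building a normal-approximation confidence interval; the interval construction here is one reasonable choice and may differ from the exact method used above by rounding:

```python
# Reproduce the t-test and confidence intervals from the reported counts.
import numpy as np
from scipy import stats

# Per-user 0/1 sign-up outcomes reconstructed from the counts above.
control = np.r_[np.ones(1428), np.zeros(14942 - 1428)]
treatment = np.r_[np.ones(1632), np.zeros(15139 - 1632)]

# Welch's t-test on the binary outcomes (comparing the two sign-up rates).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"T-Statistic = {t_stat:.3f} | P-value = {p_value:.5f}")

# Absolute difference and a 95% normal-approximation confidence interval.
p_c, p_t = control.mean(), treatment.mean()
diff = p_t - p_c
se = np.sqrt(p_c * (1 - p_c) / len(control) + p_t * (1 - p_t) / len(treatment))
z = stats.norm.ppf(0.975)
ci_abs = (diff - z * se, diff + z * se)
print(f"Absolute: {diff:.4f}, CI: ({ci_abs[0]:.3f}, {ci_abs[1]:.3f})")

# Relative lift over the control rate.
print(f"Relative (lift): {diff / p_c:.1%}, "
      f"CI: ({ci_abs[0] / p_c:.1%}, {ci_abs[1] / p_c:.1%})")
```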
Launch Decision
To determine whether to proceed with the new sign-up design, we assess the absolute difference between the sign-up rates of the two samples. In practice, we often also consider the relative difference (lift), which is the absolute difference divided by the control rate, expressed as a percentage.
For clarity, the decision rule based on Metric Outcome and P-Value is outlined below:
1- Positive lift with neither statistical nor practical significance => Do not launch
2- Positive lift with statistical and practical significance => Launch
3- Negative lift with statistical significance => Do not launch
4- Positive lift with no statistical significance, but the upper bound of the confidence interval is positive => Rerun with higher power
5- Positive lift with statistical significance, but the lower bound of the confidence interval is below the defined minimum detectable effect => Rerun the experiment with increased power
Looking at our Urban Outfitters A/B test, we observed a 12.8% lift over the benchmark (blue) rate of 9.6%. The result was statistically significant, with a 95% confidence interval on the lift of 5.7% to 19.9%.
Given these results, we recommend launching the updated sign-up form.
A/B testing is pivotal for data-driven decision-making across products. By comparing variants, it optimizes user experiences and enhances business outcomes. Ultimately, embracing A/B testing drives agility, innovation, and sustainable growth.
LinkedIn Article: https://www.linkedin.com/pulse/ab-testing-neal-akyildirim-ar1oe/