When A/B Testing gets hairy: Multiple Metrics
A/B Testing is a statistical method of comparing two variants of a product to determine which one performs better. It has become extremely popular in recent years, and if you'd like to learn more you can read Optimizely's introduction. It's dead simple to do, and extremely effective when min/max-ing a single metric.
The most common ways of performing an A/B test are with third-party tools and services such as Optimizely or Google Analytics Content Experiments, or by rolling your own.
1. Simple: Single action or metric page, e.g. Landing pages
Extremely focused pages that drive one narrow action, such as simple landing pages, benefit enormously from A/B testing. In the above example, our null hypothesis is that Variants A and B perform equally well. It is easy to optimize for a single metric such as conversion rate or revenue. This is the primary use case for most A/B testing tools, and you should have little trouble analysing test results. After testing you'll get data similar to the picture below; it's relatively easy to make a decision on this data once you hit a certain significance level. Tools like Google's Content Experiments will even adjust the traffic split to maximise conversions during the test.
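For a sense of what "hitting a certain significance" means in the single-metric case, here is a minimal sketch of a two-proportion z-test on conversion rates. The conversion counts are made up for illustration; in practice a third-party tool runs this kind of test for you.

```python
# Two-proportion z-test for a single-metric A/B test (illustrative numbers).
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: variants A and B convert equally."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

# Variant A: 120/2400 conversions (5.0%); Variant B: 156/2400 (6.5%).
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up numbers the p-value lands below the conventional 0.05 threshold, so you would reject the null hypothesis and ship Variant B.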
2. Hairy: Multi-metric pages.
When we get to changes or websites that don’t have one clear metric to optimize against, it begins to get a bit hairy. E.g. Let’s say we wanted to A/B test the effect of adding advertisements into Gmail.
There are too many different use cases for Gmail to optimize for one metric. Revenue is a poor metric to optimize for as well, as we really want to see if user engagement drops after seeing ads.
As a Product Manager, you could feasibly want the following A/B test metrics:
User Engagement
- Email actions taken: Creates, Sends, Opens, Deletes
- Non-email actions taken: Favourite, Add Contact, Account Deletes, etc.
- Technical: Average Load Time
- Behavioural: Time on Page, % Bounces

As well as one-sided metrics from the variant with ads, such as clicks, impressions, etc.
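Running one experiment through every metric like this can be sketched as a simple scorecard. All names and numbers below are illustrative stand-ins, not Gmail data; the one real wrinkle it shows is that "lower is better" metrics (load time, bounce rate) need their lift sign flipped before you call them an improvement.

```python
# A per-metric scorecard for a two-variant test (illustrative data only).
def scorecard(metrics, lower_is_better):
    """Return {metric: (% lift of B over A, verdict)} across all metrics."""
    results = {}
    for name, v in metrics.items():
        lift = (v["B"] - v["A"]) / v["A"] * 100          # % change, B vs A
        signed = -lift if name in lower_is_better else lift
        results[name] = (round(lift, 1), "improved" if signed > 0 else "regressed")
    return results

metrics = {
    "emails_sent":     {"A": 10500, "B": 10120},  # engagement: higher is better
    "avg_load_ms":     {"A": 310,   "B": 395},    # technical: lower is better
    "bounce_rate_pct": {"A": 22.0,  "B": 24.5},   # behavioural: lower is better
}

for name, (lift, verdict) in scorecard(metrics, {"avg_load_ms", "bounce_rate_pct"}).items():
    print(f"{name:16s} {lift:+6.1f}%  {verdict}")
```

In this made-up run, every metric regresses under the ads variant, which is exactly the kind of mixed-to-bad picture the next section is about.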
From a technical perspective, tools like Optimizely and Google Content Experiments do allow for multiple metrics. However, you need to add the individual events into the third-party tools, which can be a hassle, and Content Experiments still requires you to optimize (auto-adjust splits) against one metric. When rolling your own A/B test framework, it's ideal to have every test run through every relevant metric.
The main complexity comes in the analysis stage. In an ideal world, every A/B test would improve every metric across the board, but sometimes the results are mixed. If you were the PM for Gmail, what % loss of user engagement would you tolerate in adding adverts? 1%? 2%? 5%?
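One way to make that trade-off explicit is to encode the tolerated loss as a guardrail in the decision itself. A rough sketch, with made-up numbers and a hypothetical 2% tolerance (the threshold is the PM's call, not a recommendation):

```python
# Encode the tolerated engagement loss as an explicit guardrail (illustrative).
def accept_ads_variant(engagement_a, engagement_b, revenue_lift_pct,
                       max_engagement_loss_pct=2.0):
    """Ship the ads variant only if the engagement loss stays within the
    guardrail and the ads actually generate revenue."""
    loss_pct = (engagement_a - engagement_b) / engagement_a * 100
    return loss_pct <= max_engagement_loss_pct and revenue_lift_pct > 0

# 1.5% engagement loss with positive revenue lift: within the guardrail.
print(accept_ads_variant(1000, 985, revenue_lift_pct=8.0))
```

Writing the threshold down forces the conversation to happen before the test, rather than being rationalised after you see the numbers.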