If you regularly use Optimizely for your split testing, you’ll be aware of the recent changes to their Stats Engine. They apply to all tests started since 21st January, and the implications are big.
Split testing tools have been abused by many practitioners, often unaware that they are declaring wins with “95% statistical significance” that in reality are much less certain. If you’ve experienced the excitement of implementing winning tests followed by disappointment when they don’t appear to impact your bottom line, then you may have fallen into this trap.
There’s nothing wrong with traditional statistical significance calculations, but there are many pitfalls if you don’t know what you’re doing. Optimizely have concluded “it’s time for statistics to change – not the customers”.
Optimizely’s New Stats Engine aims to be more foolproof. We welcome this effort, but it comes with some serious caveats:
In the rest of this blog, I look under the bonnet at Optimizely’s new statistics engine to explain why. To do this I spent some time getting stuck into their technical papers and working out exactly what this change means to Optimizely’s users. This mathematical digest aims to summarise the key changes and the implications for running and reporting on A/B tests.
Traditional statistics required you to decide on your testing period before you started testing. If you’ve continually peeked at your results and stopped as soon as you had a winner, you’ve probably been misled.
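To see why peeking is a problem, here is a minimal simulation sketch. It is not Optimizely's method: it assumes a simple two-proportion z-test, a 10% conversion rate and ten evenly spaced checks, and every name in it (`simulate`, `z_score` and the parameters) is made up for illustration. Both versions convert at the same rate, so any "win" is pure luck.

```python
import math
import random


def z_score(conv_a, conv_b, n):
    """Two-proportion z statistic for two arms with n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def simulate(n_tests=300, looks=10, visitors_per_look=100, rate=0.10, seed=1):
    """Run A/A tests (no real difference) and count lucky 'wins'.

    Returns (share of tests that looked like a win at ANY check,
             share of tests that looked like a win at the FINAL check only).
    """
    rng = random.Random(seed)
    z_crit = 1.645  # 95% confidence, 1-tailed (illustrative assumption)
    peeking_wins = 0
    fixed_wins = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        won_at_any_look = False
        z = 0.0
        for _ in range(looks):
            # Both arms share the same true conversion rate.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            z = z_score(conv_a, conv_b, n)
            if z > z_crit:
                won_at_any_look = True
        peeking_wins += won_at_any_look
        fixed_wins += z > z_crit  # z from the last check only
    return peeking_wins / n_tests, fixed_wins / n_tests


peeking_rate, fixed_rate = simulate()
print(f"False wins if you stop at the first 'winner': {peeking_rate:.1%}")
print(f"False wins if you only check at the end:      {fixed_rate:.1%}")
```

The first figure can never be lower than the second, because any test that wins at the final check also counts as a win "at any check". In runs like this the peeking figure typically comes out well above the 5% you thought you had signed up for.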
With the new stats engine, the days of needing to set a defined testing period are gone. The stats now hold up however many chances you give yourself to declare a winner. However, that doesn’t account for the need to accurately represent business cycles.
The flipside is that, to allow this, the stats now make it harder to get a win. Optimizely estimate that there will be 39% fewer wins.
It’s also a lot harder to get an early win. Optimizely used to (usually misleadingly) declare huge victories after very little traffic. Thankfully this will be much rarer.
The days of Optimizely declaring ludicrously large wins based on hardly any data should be over
Testing length will also change. Optimizely paint a very positive picture about the speed of tests using the new system, but I’m waiting to see how this works out in practice. For example, Optimizely’s old test duration calculator recommended a total sample size of 40,629 per test segment on a site with a conversion rate of 3% to detect a lift to 3.3%. Under the new system their calculator gives an average sample size of 51,141, which is roughly 26% more.
Using Optimizely’s new system you are free to check your results as often as you like. However, this doesn’t take into account the need to fairly represent business cycles. We continue to recommend that if you use Optimizely you set 14 days as your absolute minimum test duration (this may need to be longer depending on the length of your user journeys) and then continue to check every subsequent 7 days to see if you have a winner.
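For a feel of where fixed-horizon sample sizes like these come from, here is a rough sketch of the textbook calculation for comparing two conversion rates. It is not the formula behind either of Optimizely's calculators, which isn't given here. The 80% power and the 1-tailed 95% confidence level are my assumptions, and `sample_size_per_arm` is a made-up name.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(base_rate, target_rate, alpha=0.05, power=0.80,
                        two_tailed=False):
    """Classical fixed-horizon sample size per variation (normal approximation)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_tailed else z(1 - alpha)
    z_beta = z(power)
    p_bar = (base_rate + target_rate) / 2
    # Spread assumed when there is no difference, and when the lift is real.
    spread_null = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    spread_alt = z_beta * math.sqrt(
        base_rate * (1 - base_rate) + target_rate * (1 - target_rate)
    )
    return math.ceil((spread_null + spread_alt) ** 2
                     / (target_rate - base_rate) ** 2)


# The example from the text: 3% conversion rate, hoping to detect a lift to 3.3%.
print(sample_size_per_arm(0.03, 0.033))
# The same lift tested in both directions needs more data.
print(sample_size_per_arm(0.03, 0.033, two_tailed=True))
```

Under these assumptions the first figure lands in the low 40,000s, the same ballpark as the old calculator's 40,629, though I wouldn't expect an exact match. The point to take away is how sensitive the answer is: halving the lift you want to detect roughly quadruples the traffic you need.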
This is perhaps the most difficult part to understand, but I’ve tried to make it as comprehensible as possible.
Previously Optimizely only asked the question “Does the test produce an uplift?” in order for you to be able to say “We’re 95% confident that this change produces an uplift.” This is called a 1-tailed test because it’s only looking at one end of the probability distribution: the positive end, as shown on the graph below.
Though Optimizely used to report statistical confidence on ‘negative uplifts’, it was only ever confirming that the result wasn’t a positive uplift.
The 2-tailed test asks two questions: “How confident are we that the test produces an uplift?” and, if the result is lower, “How confident are we that it’s a definite fall?”. This allows you to say either “We’re 95% confident that this change produces an uplift” or “We’re 95% confident that this change produces a fall.”
This allows you to be more confident when you’ve definitely had a bad result so you’re more equipped to learn. But the downside is that accurately checking for statistical significance in both directions requires more data.
Getting 90% confidence in a 2-tailed test is the equivalent of getting 95% confidence in a 1-tailed test (it takes the same amount of data). So Optimizely have now set the required confidence level to declare victory to 90%.
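The equivalence is easy to check for yourself. The sketch below assumes a normal (z) approximation, which is the usual textbook simplification rather than anything specific to Optimizely, and `critical_z` is a made-up helper name. It works out how far from zero a result has to be, in standard errors, before each kind of test calls it significant.

```python
from statistics import NormalDist


def critical_z(confidence, tails):
    """Threshold (in standard errors) a result must pass to be significant."""
    alpha = 1 - confidence
    # A 2-tailed test splits its error allowance between the two ends.
    return NormalDist().inv_cdf(1 - alpha / tails)


one_tailed_95 = critical_z(0.95, tails=1)
two_tailed_90 = critical_z(0.90, tails=2)
two_tailed_95 = critical_z(0.95, tails=2)

print(f"1-tailed, 95% confidence: {one_tailed_95:.3f}")
print(f"2-tailed, 90% confidence: {two_tailed_90:.3f}")
print(f"2-tailed, 95% confidence: {two_tailed_95:.3f}")
```

The first two thresholds come out the same (about 1.645), which is why they need the same amount of data. The third is higher (about 1.960), which is the extra data you would need if you insisted on 95% confidence in both directions.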
As a result, you have two options:
As a team, we are discussing the pros and cons of reporting 1-tailed vs 2-tailed figures. In the long run, the option that we choose may depend on individual traffic levels and priorities, and therefore our recommendation may vary by website.
Using 95% confidence, every variation that makes no real difference still has a 5% chance of a lucky win. So if you create 20 test variations, you would expect about one of them to win by chance alone. Optimizely’s New Stats Engine now raises the statistical bar for each variation that you add to cancel this out.
This means that it gets progressively harder to achieve statistical confidence for each test version that you include. So think carefully before adding more than one variation.
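The arithmetic behind this is short enough to sketch. The code below assumes that every variation is independent and makes no real difference. The stricter threshold it prints is the simple textbook correction of dividing the allowance by the number of variations, shown only to illustrate the idea of "raising the bar". It is not a description of how Optimizely's engine does it, and both function names are made up.

```python
def chance_of_false_win(n_variations, alpha=0.05):
    """Chance that at least one no-effect variation looks like a winner."""
    return 1 - (1 - alpha) ** n_variations


def stricter_alpha(n_variations, alpha=0.05):
    """Textbook per-variation threshold that keeps the overall risk near alpha."""
    return alpha / n_variations


for n in (1, 2, 5, 20):
    print(
        f"{n:>2} variations: "
        f"{chance_of_false_win(n):.1%} chance of a lucky win, "
        f"per-variation allowance would drop to {stricter_alpha(n):.2%}"
    )
```

With one variation the risk is the 5% you expect. With 20 it climbs to roughly 64%, and keeping the overall risk at 5% would mean each variation has to clear a much higher bar. That higher bar is why each extra variation costs you traffic.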
Again to stop you from just getting a lucky win, Optimizely takes into account the number of goals tracked. You can, however, bypass this by setting one goal as your Primary Goal in the results interface. This gives you the freedom to observe wider user behaviour without altering the statistical significance of your primary goal.
Setting a Primary Goal will tell Optimizely that this is the only goal which will allow you to declare victory. Optimizely will then run calculations for this goal separately from your other goals.
This high profile drive for accuracy is a great step forward. I hope it leads to better reporting industry-wide and a fuller realisation of the power of testing to genuinely increase profits.
If you’ve been testing properly, with a solid grasp of how to use statistical significance calculations, then this is another equally valid approach that you can choose. We suspect that the length of testing time will be a significant factor in choosing which system is most appropriate for different businesses.
If this is the first time you’ve heard that traditional statistical significance reports can easily be misleading, or if it’s not something that you’ve fully grasped before, then this is a wake-up call. You’ll find that it’s harder to report a win, but you’ll more consistently see a lasting impact on your profits.
Do you have additional questions about what Optimizely’s New Stats Engine means for you? About to start a test and want to understand more about how these changes apply? Comment below and I’ll be happy to answer any questions you may have. Alternatively, take a look at the additional resources listed below.
To find out more about this update, we recommend the following articles:
About the Author
Dave is a specialist in using data-driven insight to increase profits. Optimisation projects led by Dave have added over £41m to clients’ turnovers. Follow Dave on Twitter @daveanalyst.