

Optimizely’s New Stats Engine – What it means for you

If you regularly use Optimizely for your split testing, you’ll be aware of the new changes to their Stats Engine. It applies to all tests started since 21st January, and the implications are big.

Split testing tools have been abused by many practitioners, often unaware that they are declaring wins with “95% statistical significance” that in reality are much less certain. If you’ve experienced the excitement of implementing winning tests followed by disappointment when they don’t appear to impact your bottom line, then you may have fallen into this trap.

There’s nothing wrong with traditional statistical significance calculations, but there are many pitfalls if you don’t know what you’re doing. Optimizely have concluded “it’s time for statistics to change – not the customers”.

Optimizely’s New Stats Engine aims to be more foolproof. We welcome this effort, but it comes with some serious caveats:

  • You’re less likely to get a win – Optimizely expect 39% fewer wins
  • A winning test that would have been described with 95% confidence under the old system is now only described as having 90% confidence
  • Test speed will change, although it’s not yet completely clear how. Optimizely make positive claims: in some tests you may be able to declare a winner more quickly, but it’s quite possible your tests will take a lot longer

In the rest of this blog, I look under the bonnet at Optimizely’s new statistics engine to explain why. To do this I spent some time getting stuck into their technical papers and working out exactly what this change means to Optimizely’s users. This mathematical digest aims to summarise the key changes and the implications for running and reporting on A/B tests.

Change #1 – Optimizely now allows you to regularly check to see if you have a winner

What does this mean?

Traditional statistics required you to decide on your testing period before you started testing. If you’ve continually peeked at your results and stopped as soon as you saw a winner, you’ve probably been misled.
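To see why peeking misleads, here is a small simulation sketch (my own illustration, not Optimizely’s method): we run an A/A test where both variants share the same true conversion rate, check a classical one-tailed significance test after every batch of visitors, and stop at the first “significant” result. Even with no real difference, frequent peeking pushes the false-win rate well above the nominal 5%.

```python
import math
import random

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test, one-tailed p-value for B beating A."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # normal upper-tail probability

def peeking_trial(rate=0.03, batch=300, peeks=10, alpha=0.05, rng=None):
    """One A/A test: peek after every batch; True if we falsely 'win'."""
    rng = rng or random
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(peeks):
        conv_a += sum(rng.random() < rate for _ in range(batch))
        conv_b += sum(rng.random() < rate for _ in range(batch))
        n_a += batch
        n_b += batch
        if z_test_p_value(conv_a, n_a, conv_b, n_b) < alpha:
            return True  # stopped early on a false positive
    return False

random.seed(42)
trials = 300
false_wins = sum(peeking_trial() for _ in range(trials))
print(f"False-positive rate with peeking: {false_wins / trials:.1%}")
```

The rate printed comfortably exceeds the 5% you would expect from a single, pre-planned look at the data.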

With the new Stats Engine, the days of needing to set a defined testing period are gone. The stats now hold up however many chances you give yourself to declare a winner. Note, though, that this does nothing to account for the need to represent business cycles accurately.

The flipside is that, to allow this, the stats now make it harder to get a win. Optimizely estimate that there will be 39% fewer wins.

It’s also a lot harder to get an early win. Optimizely used to (usually misleadingly) declare huge victories after very little traffic. Thankfully this will be much rarer.


The days of Optimizely declaring ludicrously large wins based on hardly any data should be over

Testing length will also change. Optimizely paint a very positive picture about the speed of tests using the new system, but I’m waiting to see how this works out in practice. For example Optimizely’s old test duration calculator recommended a total sample size of 40,629 per test segment on a site with a conversion rate of 3% to detect a lift to 3.3%. Under the new system their calculator gives an average sample size of 51,141.
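For comparison, the figure from the old calculator can be roughly reconstructed with a textbook fixed-horizon sample-size formula (my assumption, not Optimizely’s exact calculation): a two-proportion test at a 3% baseline, a target of 3.3%, 80% power and a one-tailed 5% significance level lands in the same ballpark.

```python
import math
from statistics import NormalDist

def sample_size(p1, p2, alpha=0.05, power=0.80, one_tailed=True):
    """Visitors needed per variation for a two-proportion z-test
    (classical fixed-horizon formula, not Optimizely's calculator)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha) if one_tailed else nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size(0.03, 0.033))                    # one-tailed: ~42,000
print(sample_size(0.03, 0.033, one_tailed=False))  # two-tailed: ~53,000
```

The one-tailed answer is close to the old calculator’s 40,629; requiring a two-tailed test at the same settings pushes the number noticeably higher, which is consistent with the new system’s larger average sample size.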

What are the implications?

Using Optimizely’s new system you are free to check your results as often as you like, however this doesn’t take into account the need to fairly represent business cycles. We continue to recommend that if you use Optimizely you set 14 days as your absolute minimum test duration (this may need to be longer depending on the length of your user journeys) and then continue to check every subsequent 7 days to see if you have a winner.

Change #2 – Statistical significance is now a 2-tailed t-test

What does this mean?

This is perhaps the most difficult part to understand, but I’ve tried to make it as comprehensible as possible.

Previously Optimizely only asked the question “Does the test produce an uplift?” in order for you to be able to say “We’re 95% confident that this change produces an uplift.” This is called a 1-tailed test because it’s only looking at one end of the probability distribution: the positive end, as shown on the graph below.


Though Optimizely used to report statistical confidence on ‘negative uplifts’, it was only ever confirming that the change wasn’t a positive uplift.

The 2-tailed test asks two questions: “How confident are we that the test produces an uplift?” and “How confident are we that it produces a fall?”. This allows you to say either “We’re 95% confident that this change produces an uplift” or “We’re 95% confident that this change produces a fall.”


This allows you to be more confident when you’ve definitely had a bad result so you’re more equipped to learn. But the downside is that accurately checking for statistical significance in both directions requires more data.

What are the implications?

Getting 90% confidence in a 2-tailed test is the equivalent of getting 95% confidence in a 1-tailed test (it takes the same amount of data), so Optimizely have set the required confidence level for declaring victory at 90%.

As a result, you have two options:

  1. You can report that you have 95% confidence of a victory when Optimizely gets to 90%
  2. You can continue to require 95% confidence from the 2-tailed test, which would be the equivalent of shifting the normal 1-tailed bar to 97.5%
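The equivalence behind both options is standard normal-theory arithmetic (my illustration, not Optimizely’s internal code): for a symmetric distribution a two-tailed p-value is simply double the one-tailed p-value, so 90% two-tailed confidence and 95% one-tailed confidence demand exactly the same z-score, and 95% two-tailed matches 97.5% one-tailed.

```python
from statistics import NormalDist

nd = NormalDist()
z_one_tailed_95 = nd.inv_cdf(0.95)         # z needed for 95% 1-tailed
z_two_tailed_90 = nd.inv_cdf(1 - 0.10 / 2) # z needed for 90% 2-tailed
z_two_tailed_95 = nd.inv_cdf(1 - 0.05 / 2) # z needed for 95% 2-tailed

print(round(z_one_tailed_95, 4), round(z_two_tailed_90, 4))  # identical: 1.6449
print(round(z_two_tailed_95, 4))  # 1.96 -- the same bar as 97.5% 1-tailed
```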

As a team, we are discussing the pros and cons of reporting 1-tailed vs 2-tailed figures. In the long run, the option that we choose may depend on individual traffic levels and priorities, and therefore our recommendation may vary by website.

Change #3 – Optimizely now takes into account the number of variations

What does this mean?

At 95% confidence, each variation has a 1-in-20 chance of a false win, so if you create 20 test variations one of them would usually give you a lucky win. Optimizely’s New Stats Engine now raises the statistical bar for each variation that you add to cancel this out.

What are the implications?

This means that it gets progressively harder to achieve statistical confidence for each test version that you include. So think carefully before adding more than one variation.
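The arithmetic behind this is easy to check. The sketch below is an illustration, not Optimizely’s exact procedure (their technical papers describe false-discovery-rate control); it shows how quickly the chance of at least one lucky winner grows with the number of variations, and the classic Bonferroni fix of dividing the significance level by the variation count.

```python
alpha = 0.05  # 5% false-win chance per variation at 95% confidence

for k in (1, 5, 10, 20):
    p_lucky_win = 1 - (1 - alpha) ** k  # chance of >= 1 false winner among k
    bonferroni_alpha = alpha / k        # stricter per-variation bar
    print(f"{k:>2} variations: P(lucky win) = {p_lucky_win:.1%}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")
```

With 20 variations the chance of at least one purely lucky “winner” is around 64%, which is why the bar has to rise as variations are added.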

Change #4 – Optimizely now takes into account the number of goals that you track

What does this mean?

Again to stop you from just getting a lucky win, Optimizely takes into account the number of goals tracked. You can, however, bypass this by setting one goal as your Primary Goal in the results interface. This gives you the freedom to observe wider user behaviour without altering the statistical significance of your primary goal.

What are the implications?

Setting a Primary Goal will tell Optimizely that this is the only goal which will allow you to declare victory. Optimizely will then run calculations for this goal separately from your other goals.


This high profile drive for accuracy is a great step forward. I hope it leads to better reporting industry-wide and a fuller realisation of the power of testing to genuinely increase profits.

If you’ve been testing properly, with a solid grasp of how to use statistical significance calculations, then this is another equally valid approach that you can choose. We suspect that the length of testing time will be a significant factor in choosing which system is most appropriate for different businesses.

If this is the first time you’ve heard that traditional statistical significance reports can easily be misleading, or if it’s not something that you’ve fully grasped before, then this is a wake-up call. You’ll find that it’s harder to report a win, but you’ll more consistently see a lasting impact on your profits.

Do you have additional questions about what Optimizely’s New Stats Engine means for you? About to start a test and want to understand more about how these changes apply? Comment below and I’ll be happy to answer any additional questions you may have. Alternatively, take a look at the additional resources listed below.

Additional Resources:

To find out more about this update, we recommend the following articles:

About the Author

Dave is a specialist in using data-driven insight to increase profits. Optimisation projects led by Dave have added over £41m to clients’ turnovers. Follow Dave on Twitter @daveanalyst.

