Do you have any tips on dealing with the fundamental flaw in all estimates of statistical significance when split testing?
Let's say you aim for a 95% confidence level. By definition, that means roughly 1 out of every 20 tests where the variation makes no real difference will still come back looking "significant", purely by chance.
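To make that concrete, here is a quick simulation sketch (Python with numpy and scipy; the 17% signup rate, 900 visitors per group, and 2000 repeats are made-up numbers that only roughly mirror my traffic). It runs a pile of fake A/A tests where both groups see the same page and counts how many still cross the 95% line:

Code:
# Rough A/A simulation: both "variants" get the same true signup rate,
# yet a chunk of runs still look "significant" at the 95% level.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
true_rate, n_per_group, n_tests = 0.17, 900, 2000

false_positives = 0
for _ in range(n_tests):
    a = rng.binomial(n_per_group, true_rate)   # signups in "group A"
    b = rng.binomial(n_per_group, true_rate)   # signups in "group B" (same page)
    table = [[a, n_per_group - a], [b, n_per_group - b]]
    _, p, _, _ = chi2_contingency(table, correction=False)
    if p < 0.05:
        false_positives += 1

print(f"A/A tests flagged as significant: {false_positives}/{n_tests} "
      f"({false_positives / n_tests:.1%})")    # expect roughly 1 in 20

It should land somewhere around 5% of runs, i.e. about 1 in 20.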
Here is an example of one test I'm running right now:
Code:
         signup            confirmation
K34.A    .17660  154/872   .13188  115/872
K34.B    .16590  146/880   .13522  119/880
CC.A     .15597  141/904   .11836  107/904
CC.B     .18909  156/825   .15151  125/825
K34 is a real test. CC is a fake: both groups CC.A and CC.B see an identical page.
I usually stick such an element in all my tests just as a sanity check.
The columns in the data above represent two consecutive actions by visitors: signup and confirmation.
As you can see, the real test (K34) doesn't seem to make much difference. Both groups sign up and confirm at about the same rate (signup around 17%, confirmation around 13%).
But groups CC.A and CC.B, which see absolutely identical content, seem to show different results.
Think of it as a multivariate test with one of the tested variables being invisible.
In this case, K34 represents the intro sentence on the landing page, and this time it's clear that both of those sentences are equally effective, so I should stick with the control.
But CC shows statistical significance: both Fisher's exact test and a chi-square test would place it above the 95% confidence level.
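If you want to reproduce that, something along these lines (scipy, using the CC confirmation counts from the table above) shows what I mean; the exact p-values can land slightly differently depending on which variant of the test you run (e.g. with or without the Yates correction):

Code:
# Checking the CC "A/A" split on the confirmation counts from the table above:
# CC.A confirmed 107 of 904 visitors, CC.B confirmed 125 of 825.
from scipy.stats import fisher_exact, chi2_contingency

table = [[107, 904 - 107],   # CC.A: confirmed / did not confirm
         [125, 825 - 125]]   # CC.B: confirmed / did not confirm

_, fisher_p = fisher_exact(table)                              # Fisher's exact
_, chi2_p, _, _ = chi2_contingency(table, correction=False)    # plain chi-square

print(f"Fisher's exact p = {fisher_p:.4f}")
print(f"chi-square p     = {chi2_p:.4f}")
# Anything under 0.05 means the two identical CC pages "differ" at the
# 95% confidence level -- a false positive by construction.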
So what do you guys do to avoid shit like that?
If you pick 95%, that means you can throw one out of every 20 tests out the window, since those results are worthless.
Testing at 99% takes too many visitors, and even then there is no guarantee the results mean shit.
In this particular example, it could very well be true that K34 actually makes a difference, but my test results simply aren't showing it.
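To show what I mean about 99% needing too many visitors, here is a rough back-of-the-envelope sketch using the standard two-proportion sample-size formula. The 17% baseline, the 2-point lift, the 80% power figure, and the visitors_per_variant helper are all just illustrative assumptions, not numbers from my test:

Code:
# Visitors needed per variant to detect a given lift in signup rate,
# at 95% vs 99% confidence, keeping 80% power in both cases.
from math import sqrt, ceil
from scipy.stats import norm

def visitors_per_variant(p_base, p_test, alpha, power=0.80):
    """Standard two-proportion sample-size formula (two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p_base + p_test) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_base * (1 - p_base) + p_test * (1 - p_test))) ** 2
    return ceil(numerator / (p_base - p_test) ** 2)

base_rate, lift = 0.17, 0.02          # e.g. hoping to move signups from 17% to 19%
for alpha in (0.05, 0.01):            # 95% and 99% confidence levels
    n = visitors_per_variant(base_rate, base_rate + lift, alpha)
    print(f"alpha = {alpha}: ~{n} visitors per variant")

With these made-up numbers, going from 95% to 99% confidence costs roughly 50% more visitors per variant for the same power.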
Any tips based on experience?