Any practical tips on statistical significance?

Do you have any tips on dealing with the fundamental flaw in all estimates of statistical significance when split testing?

Let's say you aim for a 95% confidence level. By definition, that means 1 out of 20 tests will yield pretty much a random result.

Here is an example of one test I'm running right now:

Code:
         signup             confirmation
K34.A    .17660 154/872     .13188 115/872
K34.B    .16590 146/880     .13522 119/880

CC.A     .15597 141/904     .11836 107/904
CC.B     .18909 156/825     .15151 125/825

K34 is a real test. CC is a fake. Both groups CC.A and CC.B see an identical page.

I usually stick such an element into all my tests just as a sanity check.

The columns in the data above represent two consecutive actions by visitors: signup and confirmation.

As you can see, the real test (K34) doesn't seem to make much difference. Both groups sign up and confirm at about the same rate (signup 16%, confirmation 13%).

But groups CC.A and CC.B, which see absolutely identical content, seem to show different results.

Think of it as a multivariate test with one of the tested variables being invisible.

In this case, K34 represents the intro sentence on the landing page, and this time it's clear that both of those sentences are equally effective, so I should stick with the control.

But, CC shows statistical significance. Both Fisher's test and chi square would place it above 95% confidence level.
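
For reference, here is a minimal sketch of how you could check that (assuming Python with scipy installed), using the CC confirmation numbers from the table above:

Code:
# confirmation column: CC.A confirmed 107 of 904, CC.B confirmed 125 of 825
from scipy.stats import fisher_exact, chi2_contingency

table = [[107, 904 - 107],   # CC.A: confirmed, did not confirm
         [125, 825 - 125]]   # CC.B: confirmed, did not confirm

_, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)

# two-sided p-values; anything under 0.05 crosses the 95% confidence line
print(p_fisher, p_chi2)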

So what do you guys do to avoid shit like that?

If you pick 95% that means you can throw one out of 20 tests out the window, since the results would be worthless.

Testing for 99% takes too many visitors, and even then, there is no guarantee that results mean shit.
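
Just to put rough numbers on that, here's a back-of-the-envelope sketch (normal approximation; the 13% vs. 15% rates and 80% power are made-up placeholders, not from my data):

Code:
# rough per-arm sample size needed for a two-proportion split test
import math
from scipy.stats import norm

def n_per_arm(p1, p2, alpha, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)              # the confidence part
    z_b = norm.ppf(power)                      # the power part
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

print(n_per_arm(0.13, 0.15, alpha=0.05))       # 95% confidence
print(n_per_arm(0.13, 0.15, alpha=0.01))       # 99% confidence -- roughly 1.5x the visitors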

In this particular example, it could very well be true that K34 actually makes a difference but my test results simply aren't showing it.

Any tips based on experience?
 


That's my point.
I just plugged my numbers in there and it gave me 97.8% confidence, which is horse shit.
 
As I said, it's the fundamental flaw with statistics.

95% confidence by definition means that 5 out of 100 times you will get the wrong result.

So 5 out of 100 tests you perform will turn out to be incorrect.
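
If you want to see it rather than take the definition on faith, here's a quick A/A simulation sketch (assuming numpy and scipy; the 13% rate and 900 visitors per arm are made up):

Code:
# simulate a pile of A/A tests: both arms share the same true rate,
# yet roughly 5% of runs still cross the 95% "significance" line
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
true_rate, n, runs = 0.13, 900, 2000
false_positives = 0

for _ in range(runs):
    a = rng.binomial(n, true_rate)             # conversions in arm A
    b = rng.binomial(n, true_rate)             # conversions in arm B
    _, p, _, _ = chi2_contingency([[a, n - a], [b, n - b]], correction=False)
    if p < 0.05:
        false_positives += 1

print(false_positives / runs)                  # should hover around 0.05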
 
95% is the level scientists look for in experiments; if it's good for them, it's good enough for me. Trying to be 99% sure is just going to waste a lot of your time and money testing ideas that suck.
 
I'd need to sit down and calculate and test it myself, but I want to say you could plot the data in Excel, find a trendline, and look for an R^2 value as close to 1.00 as possible. This is just a super simplistic method, though, without any calculation on your part. There's some modeling you could do to really determine effectiveness.
 
95% is the level scientists look for in experiments; if it's good for them, it's good enough for me.

Well, aside from the fact that most scientists are more interested in advancing their publishing careers than in actually finding out the truth, your logic has one major flaw.

Let's assume scientists really care about their work and don't do it just to get published so they can get more grants in the future. Let's say they don't pull data out of their asses just to be able to publish a paper.

How many tests does an average scientist run in a year?

Imagine a study of the long-term relationship between "regularly scratching your left ear with your right hand" and developing Alzheimer's.

The group that prepares the study spends a couple of years collecting that data, running surveys, getting intermediate funding to pay for the research, attending conferences, and doing all other kinds of shit.

The result is they run one fucking test after performing several years of work.
And if they can establish statistical significance, then during the next several years they'll be busy riding on their "success".

That's one test in 5 years or more.

When you test so rarely, you can accept the 5% possibility of being wrong. Simply because 5% is a pretty small possibility.

But if you are a marketer and you test constantly, then you can't be wrong 5% of the time. If you run 10 tests per week, that means you'll get wrong results once every other week.
 
find a trendline

How do you find a trend in this context?

The only thing I can think of is getting a set of "how many visitors it took for the next conversion".

Plotting that pretty much gives you a wiggling line.

Over the longer term it would converge, but the trend will change a bunch of times before that happens.
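
Here's roughly what I mean, as a sketch (assuming numpy, with a made-up 13% true rate): the running conversion rate wiggles around early on and only settles down after thousands of visitors.

Code:
# running conversion rate of a simulated visitor stream with a fixed true rate
import numpy as np

rng = np.random.default_rng(2)
conversions = rng.binomial(1, 0.13, size=10_000)              # 1 = converted
running_rate = np.cumsum(conversions) / np.arange(1, 10_001)

for checkpoint in (100, 500, 1_000, 5_000, 10_000):
    print(checkpoint, round(running_rate[checkpoint - 1], 4))

Plot running_rate and you get exactly that wiggling line.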
 
If you pick 95% that means you can throw one out of 20 tests out the window, since the results would be worthless.

Actually - only kind of.

It's true that 1 in 20 tests will fall outside the parameters - but it doesn't mean it's worthless. It means that's what is actually happening in real life.

Since you wrote this I can tell you have a very good grasp on stats - but you are getting too hung up on the math and not enough on the reality.

If the real test did reveal a slight difference where the control was better, I'd just move on and test something else. You are dealing with such minor variations overall that there are probably better things to split test - just shelve this one for a while, or leave it in place until you get enough visitors to reach the higher confidence level.
 
If the real test did reveal a slight difference where the control was better, I'd just move on and test something else. You are dealing with such minor variations overall that there are probably better things to split test - just shelve this one for a while, or leave it in place until you get enough visitors to reach the higher confidence level.
The situation I'm describing is the opposite. The "fake" test shows statistical significance. The two versions of the page are identical.

It just illustrates in practice the whole theory behind 95% confidence.

Right now, I usually rely on my gut instinct when looking at the results.
If something doesn't look right, I scrap the test and run it again. (or decide not to run it and move on).

But the reason I posted it is to collect some ideas on how people deal with it.
 
How many times have you run the CC test, and is it usually (or often) showing statistically sig. results? I have never done that but I'd be interested in knowing. Have you done so in a single variable test (ie not included in a larger multivariate test, but in a test completely by itself)?

If you are doing just multivariate, is this a full factorial or fractional (Taguchi, etc.) setup? Maybe the recipes that came out of the fractional test are skewed towards showing certain levels over others.

At the end of the day, as you said, even at 95% confidence these kinds of (type I) errors are gonna happen. If it's happening EVERY time, then maybe it's some kind of reporting or tracking error?

Have you looked at power, effect size? Remember that just because a result is statistically significant, that doesn't mean it's necessarily meaningful.
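
As a rough illustration of the power side, here's a sketch (assuming statsmodels, with a made-up 13% vs. 15% lift at roughly your per-group traffic):

Code:
# power of a two-proportion z-test at ~870 visitors per arm,
# for a hypothetical lift from 13% to 15%
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

h = proportion_effectsize(0.15, 0.13)          # Cohen's h for the two rates
power = NormalIndPower().solve_power(effect_size=h, nobs1=870, alpha=0.05)

print(h, power)                                # small h, power well under the usual 80% target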
 
How many times have you run the CC test, and is it usually (or often) showing statistically sig. results? I have never done that but I'd be interested in knowing. Have you done so in a single variable test (ie not included in a larger multivariate test, but in a test completely by itself)?
I always run CC side by side with other tests.
I don't see such a difference every time. If I did, I would be sure that something is wrong with my tracking system. I verified the system many times over the years. Pretty much every time I see something weird, I grow paranoid that my tracking is messed up and I start running controlled tests checking http headers, cookie values, etc.

If you are doing just multivariate, is this a full factorial or fractional (Taguchi, etc.) setup?
I don't believe in Taguchi in general. Or at least I don't believe in using it for split testing.
And separately, I don't practice the "guess which combinations of which elements we should test since we can't test them all" approach :) Not a big fan of the f-word :)

So that leaves just basic a/b testing.

My tracking is set up for completely independent multivariate testing with any kind of setup you can imagine. (I kept it generic on purpose.) I just don't use it for anything but simple a/b any more.

With the data I posted above, I can get the following groups:

K34.A,CC.A
K34.A,CC.B
K34.B,CC.A
K34.B,CC.B

But since CC doesn't control any visible elements, I just omit it.

At the end of the day, as you said, even at 95% confidence these kinds of (type I) errors are gonna happen. If it's happening EVERY time, then maybe it's some kind of reporting or tracking error?
I see in practice what I expect to see according to theory. One out of 20 (well, I didn't count the exact number) really does show a difference when it shouldn't.
Extending that, we can conclude that the same happens in the opposite direction: when there really is a difference, we don't see it.

Have you looked at power, effect size? Remember that just because a result is statistically significant, that doesn't mean it's necessarily meaningful.
Well, I posted raw numbers, not just percentages. So have a go at it with any formulas you can imagine.

As for "statistical significance" vs. "real-world significance" -- that's something inapplicable in our case.

Since any difference in effectiveness of the sales page has "real-world" significance for us.

In the non-copywriting world, there are plenty of dependencies that are statistically significant but completely useless. And I think that's what you are referring to.

But sales page split-testing, there is no such thing. If something _really_ increases your conversion rate, it can't possibly be useless.

The question is what to do about the fact that we can never be sure if what we see in the results represents the true state of things.
 
Here is more weirdness for you:

Code:
              signup           confirmation
K34.A,CC.A    .17977 80/445    .13033 58/445
K34.A,CC.B    .17349 72/415    .13493 56/415

K34.B,CC.A    .13289 61/459    .10675 49/459
K34.B,CC.B    .20487 84/410    .16829 69/410
CC is a fake, K34 is real.

Yet, according to the data, people who've seen the intro sentence of K34.A weren't affected by the invisible variable CC. But people who've seen the intro sentence K34.B were greatly affected.

If CC were a real variable (let's say the headline), I would conclude that:
When the intro sentence is K34.A, it draws people's attention away from the headline.
But when the intro sentence is K34.B, people pay attention to the headline, and version CC.B of the headline is much better than version CC.A.

But since I know that CC.A and CC.B are completely identical, I can only sit here and scratch my head knowing that there are some things I don't know :)
 
Aim for 98% and you're only 1/50 wrong. If you accidentally toss away one good landing page out of 50, who cares.
 
And to boot, take what I posted above:

Code:
K34.B,CC.A    .13289 61/459    .10675 49/459
K34.B,CC.B    .20487 84/410    .16829 69/410

And add them up.

For the right-hand column (confirmations), you'll get:

49+69=118
459+410=869

118/869=0.135788262

As expected. The rate is 13.5 percent.
It just shows how the "dice rolled". It simply happened that, of all the people who confirmed, more ended up in group CC.B by random chance.
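
You can actually watch that dice roll with a little shuffle simulation (a sketch assuming numpy): hold the 118 confirmations and 869 visitors fixed and just re-deal the group labels.

Code:
# fix 118 confirmations among 869 visitors, shuffle who lands in which group,
# and count how often the 410-visitor group grabs 69 or more of them
import numpy as np

rng = np.random.default_rng(0)
outcomes = np.zeros(869, dtype=int)
outcomes[:118] = 1                             # 118 confirmed, 751 didn't

trials, hits = 100_000, 0
for _ in range(trials):
    rng.shuffle(outcomes)
    if outcomes[:410].sum() >= 69:             # first 410 stand in for K34.B,CC.B
        hits += 1

print(hits / trials)                           # roughly a one-sided permutation p-value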
 
Aim for 98% and you're only 1/50 wrong. If you accidentally toss away one good landing page out of 50, who cares.

I just plugged the data for K34.B,CC.A vs K34.B,CC.B into the "Split Test Accelerator" that was linked above.

I got 99.6%.
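
A plain two-proportion z-test on the signup column (a sketch assuming statsmodels) lands in the same ballpark:

Code:
# K34.B,CC.A vs K34.B,CC.B, signup column: 61/459 vs 84/410
from statsmodels.stats.proportion import proportions_ztest

z, p = proportions_ztest(count=[61, 84], nobs=[459, 410])
print(z, p, 1 - p)                             # 1 - p is the two-sided "confidence"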
 
The point is we don't know shit.

Think about it: were you ever able to continuously increase the conversion rate of any sales page? Or did you get stuck at some point?

And when you did get stuck, what did it look like?
Were you constantly finding better versions and constantly updating your control, yet for some reason the overall (historic) conversion rate didn't continue to increase?

Sound familiar?

It's as if you are constantly improving (according to each test you run), yet that overall improvement doesn't materialize on the greater scale. Why is that?
 
Well, I posted raw numbers, not just percentages. So have a go at it with any formulas you can imagine.

As for "statistical significance" vs. "real-world significance" -- that's something inapplicable in our case.

Since any difference in effectiveness of the sales page has "real-world" significance for us.

In the non-copywriting world, there are plenty of dependencies that are statistically significant but completely useless. And I think that's what you are referring to.

But sales page split-testing, there is no such thing. If something _really_ increases your conversion rate, it can't possibly be useless.


That's true, but that doesn't mean there isn't some difference in the populations that is giving you this error - like you said, the more responsive people were randomly assigned into CC.B. That's why effect size is used, which I would wager for your test is very small (can't compute it without the SD). Any test with a large enough n will give you a significant result, although not necessarily a meaningful one.

But you're right in that we don't know shit, and these kinds of tests will never tell us with absolute certainty version X beat control. But a 5% error rate, in the grand scheme of things, isn't a huge deal :) That time is better spent getting new revenue sources coming in.
 
Are you talking about signal vs. noise, as in confidence = (signal / noise) * sqrt(size)?

I believe that's the formula. Or something close to that.

Or is it something else?
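
What I have in mind is the usual pooled two-proportion z-statistic, which has that signal-over-noise-times-root-n shape; here's a sketch (assuming scipy) run on the CC confirmation numbers:

Code:
# pooled two-proportion z-statistic: (difference in rates) / (standard error)
import math
from scipy.stats import norm

def z_confidence(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                          # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))    # the "noise" term
    z = (p1 - p2) / se                                 # signal / noise
    return z, 2 * norm.cdf(abs(z)) - 1                 # z and two-sided "confidence"

print(z_confidence(125, 825, 107, 904))                # CC.B vs CC.A confirmations

If you read "noise" as the per-visitor standard deviation sqrt(p*(1-p)), that's exactly (signal / noise) * sqrt(size), with "size" being the effective sample size n1*n2/(n1+n2).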
 
Just for the record, I agree that time is better spent increasing the volume. But still, this makes me wonder.