Any practical tips on statistical significance?

Do you have any tips on dealing with the fundamental flaw in all estimates of statistical significance when split testing?

Let's say you aim for a 95% confidence level. By definition, that means 1 out of 20 tests will yield pretty much a random result.

Here is an example of one test I'm running right now:

Code:
         signup             confirmation
K34.A    .17660 154/872     .13188 115/872
K34.B    .16590 146/880     .13522 119/880

CC.A     .15597 141/904     .11836 107/904
CC.B     .18909 156/825     .15151 125/825

K34 is a real test. CC is a fake. Both groups CC.A and CC.B see an identical page.

I usually stick such an element into all my tests just as a sanity check.

The columns in the data above represent two consecutive actions by visitors: signup and confirmation.

As you can see, the real test (K34) doesn't seem to make much difference. Both groups sign up and confirm at about the same rate (signup 16%, confirmation 13%).

But groups CC.A and CC.B, which see absolutely identical content, seem to show different results.

Think of it as a multivariate test with one of the tested variables being invisible.

In this case, K34 represents the intro sentence on the landing page, and this time it's clear that both of those sentences are equally effective, so I should stick with the control.

But, CC shows statistical significance. Both Fisher's test and chi square would place it above 95% confidence level.
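
For reference, here is a minimal sketch of how you could check that (assuming Python with scipy installed), using the CC confirmation numbers from the table above:

Code:
# confirmation column: CC.A confirmed 107 of 904, CC.B confirmed 125 of 825
from scipy.stats import fisher_exact, chi2_contingency

table = [[107, 904 - 107],   # CC.A: confirmed, did not confirm
         [125, 825 - 125]]   # CC.B: confirmed, did not confirm

_, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)

# two-sided p-values; anything under 0.05 crosses the 95% confidence line
print(p_fisher, p_chi2)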

So what do you guys do to avoid shit like that?

If you pick 95% that means you can throw one out of 20 tests out the window, since the results would be worthless.

Testing for 99% takes too many visitors, and even then, there is no guarantee that results mean shit.
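
Just to put rough numbers on that, here's a back-of-the-envelope sketch (normal approximation; the 13% vs. 15% rates and 80% power are made-up placeholders, not from my data):

Code:
# rough per-arm sample size needed for a two-proportion split test
import math
from scipy.stats import norm

def n_per_arm(p1, p2, alpha, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)              # the confidence part
    z_b = norm.ppf(power)                      # the power part
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

print(n_per_arm(0.13, 0.15, alpha=0.05))       # 95% confidence
print(n_per_arm(0.13, 0.15, alpha=0.01))       # 99% confidence -- roughly 1.5x the visitors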

In this particular example, it could very well be true that K34 actually makes a difference but my test results simply aren't showing it.

Any tips based on experience?
 


That's my point.
I just plugged my numbers in there and it gave me 97.8% confidence, which is horse shit.
 
As I said, it's the fundamental flaw with statistics.

95% confidence by definition means that 5 out of 100 times you will get the wrong result.

So 5 out of 100 tests you perform will turn out to be incorrect.
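
If you want to see it rather than take the definition on faith, here's a quick A/A simulation sketch (assuming numpy and scipy; the 13% rate and 900 visitors per arm are made up):

Code:
# simulate a pile of A/A tests: both arms share the same true rate,
# yet roughly 5% of runs still cross the 95% "significance" line
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
true_rate, n, runs = 0.13, 900, 2000
false_positives = 0

for _ in range(runs):
    a = rng.binomial(n, true_rate)             # conversions in arm A
    b = rng.binomial(n, true_rate)             # conversions in arm B
    _, p, _, _ = chi2_contingency([[a, n - a], [b, n - b]], correction=False)
    if p < 0.05:
        false_positives += 1

print(false_positives / runs)                  # should hover around 0.05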
 
95% is the level scientists look for in experiments; if it's good for them, it's good enough for me. Trying to be 99% sure is just going to waste a lot of your time and money testing ideas that suck.
 
I'd need to sit down and calculate and test it myself, but I want to say you could plot the data in Excel, find a trendline, and look for an R^2 value as close to 1.00 as possible. This is just a super simplistic method, though, without any calculation on your part. There's some modeling you could do to really determine effectiveness.
 
95% is the level scientists look for in experiments; if it's good for them, it's good enough for me.

Well, aside from the fact that most scientists are more interested in advancing their publishing careers than in actually finding out the truth, your logic has one major flaw.

Let's assume scientists really care about their work and don't do it just to get published so they can get more grants in the future. Let's say they don't pull data out of their asses just to be able to publish a paper.

How many tests does an average scientist run in a year?

Imagine a study of the long-term relationship between "regularly scratching your left ear with your right hand" and developing Alzheimer's.

The group that prepares the study spends a couple of years collecting that data, running surveys, getting intermediate funding to pay for the research, attending conferences, and doing all other kinds of shit.

The result is they run one fucking test after performing several years of work.
And if they can establish statistical significance, then during the next several years they'll be busy riding on their "success".

That's one test in 5 years or more.

When you test so rarely, you can accept the 5% possibility of being wrong. Simply because 5% is a pretty small possibility.

But if you are a marketer and you test constantly, then you can't be wrong 5% of the time. If you run 10 tests per week, that means you'll get wrong results once every other week.
 
find a trendline

How do you find a trend in this context?

The only thing I can think of is getting a set of "how many visitors it took for the next conversion".

Plotting that pretty much gives you a wiggling line.

Over the longer term it would converge, but the trend will change a bunch of times before that happens.
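
Here's roughly what I mean, as a sketch (assuming numpy, with a made-up 13% true rate): the running conversion rate wiggles around early on and only settles down after thousands of visitors.

Code:
# running conversion rate of a simulated visitor stream with a fixed true rate
import numpy as np

rng = np.random.default_rng(2)
conversions = rng.binomial(1, 0.13, size=10_000)              # 1 = converted
running_rate = np.cumsum(conversions) / np.arange(1, 10_001)

for checkpoint in (100, 500, 1_000, 5_000, 10_000):
    print(checkpoint, round(running_rate[checkpoint - 1], 4))

Plot running_rate and you get exactly that wiggling line.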
 
If you pick 95% that means you can throw one out of 20 tests out the window, since the results would be worthless.

Actually - only kind of.

It's true that 1 in 20 tests will fall outside the parameters - but it doesn't mean it's worthless. It means that's what is actually happening in real life.

Since you wrote this I can tell you have a very good grasp on stats - but you are getting too hung up on the math and not enough on the reality.

If the real test did reveal a slight difference where the control was better, I'd just move on and test something else. You are dealing with such minor variations overall that there are probably better things to split test - just shelve this one for a while, or leave it in place until you get enough visitors to reach the higher confidence level.
 
If the real test did reveal a slight difference where the control was better, I'd just move on and test something else. You are dealing with such minor variations overall that there are probably better things to split test - just shelve this one for a while, or leave it in place until you get enough visitors to reach the higher confidence level.
The situation I'm describing is the opposite. The "fake" test shows statistical significance. The two versions of the page are identical.

It just illustrates in practice the whole theory behind 95% confidence.

Right now, I usually rely on my gut instinct when looking at the results.
If something doesn't look right, I scrap the test and run it again. (or decide not to run it and move on).

But the reason I posted it is to collect some ideas on how people deal with it.
 
How many times have you run the CC test, and is it usually (or often) showing statistically sig. results? I have never done that but I'd be interested in knowing. Have you done so in a single variable test (ie not included in a larger multivariate test, but in a test completely by itself)?

If you are doing just multivariate, is this a full factorial or fractional (Taguchi, etc.) setup? Maybe the recipes that came out of the fractional test are skewed towards showing certain levels over others.

At the end of the day, as you said, even at 95% confidence these kinds of (type I) errors are gonna happen. If it's happening EVERY time, then maybe it's some kind of reporting or tracking error?

Have you looked at power, effect size? Remember that just because a result is statistically significant, that doesn't mean it's necessarily meaningful.
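
As a rough illustration of the power side, here's a sketch (assuming statsmodels, with a made-up 13% vs. 15% lift at roughly your per-group traffic):

Code:
# power of a two-proportion z-test at ~870 visitors per arm,
# for a hypothetical lift from 13% to 15%
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

h = proportion_effectsize(0.15, 0.13)          # Cohen's h for the two rates
power = NormalIndPower().solve_power(effect_size=h, nobs1=870, alpha=0.05)

print(h, power)                                # small h, power well under the usual 80% target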
 
How many times have you run the CC test, and is it usually (or often) showing statistically sig. results? I have never done that but I'd be interested in knowing. Have you done so in a single variable test (ie not included in a larger multivariate test, but in a test completely by itself)?
I always run CC side by side with other tests.
I don't see such a difference every time. If I did, I would be sure that something is wrong with my tracking system. I verified the system many times over the years. Pretty much every time I see something weird, I grow paranoid that my tracking is messed up and I start running controlled tests checking http headers, cookie values, etc.

If you are doing just multivariate, is this a full factorial or fractional (Taguchi, etc.) setup?
I don't believe in Taguchi in general. Or at least I don't believe in using it for split testing.
And separately, I don't practice the "guess which combinations of which elements we should test since we can't test them all" approach :) Not a big fan of the f-word :)

So that leaves just basic a/b testing.

My tracking is set up for completely independent multivariate testing with any kind of setup you can imagine. (I kept it generic on purpose.) I just don't use it for anything but simple a/b any more.

With the data I posted above, I can get the following groups:

K34.A,CC.A
K34.A,CC.B
K34.B,CC.A
K34.B,CC.B

But since CC doesn't control any visible elements, I just omit it.

At the end of the day, as you said, even at 95% confidence these kinds of (type I) errors are gonna happen. If it's happening EVERY time, then maybe it's some kind of reporting or tracking error?
I see in practice what I expect to see according to theory. One out of 20 (well, I didn't count the exact number) really does show a difference when it shouldn't.
Extending that, we can conclude that the same happens in the opposite direction: when there really is a difference, we don't see it.

Have you looked at power, effect size? Remember that just because a result is statistically significant, that doesn't mean it's necessarily meaningful.
Well, I posted raw numbers, not just percentages. So have a go at it with any formulas you can imagine.

As for "statistical significance" vs. "real-world significance" -- that's something inapplicable in our case.

Since any difference in effectiveness of the sales page has "real-world" significance for us.

In the non-copywriting world, there are plenty of dependencies that are statistically significant but completely useless. And I think that's what you are referring to.

But sales page split-testing, there is no such thing. If something _really_ increases your conversion rate, it can't possibly be useless.

The question is what to do about the fact that we can never be sure if what we see in the results represents the true state of things.
 
Here is more weirdness for you:

Code:
              signup           confirmation
K34.A,CC.A    .17977 80/445    .13033 58/445
K34.A,CC.B    .17349 72/415    .13493 56/415

K34.B,CC.A    .13289 61/459    .10675 49/459
K34.B,CC.B    .20487 84/410    .16829 69/410
CC is a fake, K34 is real.

Yet, according to the data, people who've seen the intro sentence of K34.A weren't affected by the invisible variable CC. But people who've seen the intro sentence K34.B were greatly affected.

If CC were a real variable (let's say the headline), I would conclude that:
When the intro sentence is K34.A, it draws people's attention away from the headline.
But when the intro sentence is K34.B, people pay attention to the headline, and version CC.B of the headline is much better than version CC.A.

But since I know that CC.A and CC.B are completely identical, I can only sit here and scratch my head knowing that there are some things I don't know :)
 
Aim for 98% and you're only 1/50 wrong. If you accidentally toss away one good landing page out of 50, who cares.
 
And to boot, take what I posted above:

Code:
K34.B,CC.A    .13289 61/459    .10675 49/459
K34.B,CC.B    .20487 84/410    .16829 69/410

And add them up.

For the right-hand column (confirmations), you'll get:

49+69=118
459+410=869

118/869=0.135788262

As expected. The rate is 13.5 percent.
It just shows how the "dice rolled". It simply happened that, of all the people who confirmed, more ended up in group CC.B by random chance.
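
You can actually watch that dice roll with a little shuffle simulation (a sketch assuming numpy): hold the 118 confirmations and 869 visitors fixed and just re-deal the group labels.

Code:
# fix 118 confirmations among 869 visitors, shuffle who lands in which group,
# and count how often the 410-visitor group grabs 69 or more of them
import numpy as np

rng = np.random.default_rng(0)
outcomes = np.zeros(869, dtype=int)
outcomes[:118] = 1                             # 118 confirmed, 751 didn't

trials, hits = 100_000, 0
for _ in range(trials):
    rng.shuffle(outcomes)
    if outcomes[:410].sum() >= 69:             # first 410 stand in for K34.B,CC.B
        hits += 1

print(hits / trials)                           # roughly a one-sided permutation p-value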
 
Aim for 98% and you're only 1/50 wrong. If you accidentally toss away one good landing page out of 50, who cares.

I just plugged the data for K34.B,CC.A vs K34.B,CC.B into the "Split Test Accelerator" that was linked above.

I got 99.6%.
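
A plain two-proportion z-test on the signup column (a sketch assuming statsmodels) lands in the same ballpark:

Code:
# K34.B,CC.A vs K34.B,CC.B, signup column: 61/459 vs 84/410
from statsmodels.stats.proportion import proportions_ztest

z, p = proportions_ztest(count=[61, 84], nobs=[459, 410])
print(z, p, 1 - p)                             # 1 - p is the two-sided "confidence"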
 
The point is we don't know shit.

Think about it: were you ever able to continuously increase the conversion rate of any sales page? Or did you get stuck at some point?

And when you did get stuck, what did it look like?
Were you constantly finding better versions and constantly updating your control, yet for some reason the overall (historic) conversion rate didn't continue to increase?

Sound familiar?

It's as if you are constantly improving (according to each test you run), yet that overall improvement doesn't materialize on the greater scale. Why is that?
 
Well, I posted raw numbers, not just percentages. So have a go at it with any formulas you can imagine.

As for "statistical significance" vs. "real-world significance" -- that's something inapplicable in our case.

Since any difference in effectiveness of the sales page has "real-world" significance for us.

In the non-copywriting world, there are plenty of dependencies that are statistically significant but completely useless. And I think that's what you are referring to.

But sales page split-testing, there is no such thing. If something _really_ increases your conversion rate, it can't possibly be useless.


That's true, but that doesn't mean there isn't some difference in the populations that is giving you this error - like you said, the more responsive people were randomly assigned into CC.B. That's why effect size is used, which I would wager for your test is very small (can't compute it without the SD). Any test with a large enough n will give you a significant result, although not necessarily a meaningful one.

But you're right in that we don't know shit, and these kinds of tests will never tell us with absolute certainty version X beat control. But a 5% error rate, in the grand scheme of things, isn't a huge deal :) That time is better spent getting new revenue sources coming in.
 
Are you talking about signal vs. noise, as in confidence = (signal / noise) * sqrt(size)?

I believe that's the formula. Or something close to that.

Or is it something else?
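
What I have in mind is the usual pooled two-proportion z-statistic, which has that signal-over-noise-times-root-n shape; here's a sketch (assuming scipy) run on the CC confirmation numbers:

Code:
# pooled two-proportion z-statistic: (difference in rates) / (standard error)
import math
from scipy.stats import norm

def z_confidence(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                          # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))    # the "noise" term
    z = (p1 - p2) / se                                 # signal / noise
    return z, 2 * norm.cdf(abs(z)) - 1                 # z and two-sided "confidence"

print(z_confidence(125, 825, 107, 904))                # CC.B vs CC.A confirmations

If you read "noise" as the per-visitor standard deviation sqrt(p*(1-p)), that's exactly (signal / noise) * sqrt(size), with "size" being the effective sample size n1*n2/(n1+n2).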
 
Just for the record, I agree that time is better spent increasing the volume. But still, this makes me wonder.