A/B testing still works. [Sarcastic *PHEW*].

After releasing GAE/Bingo, we received a number of worried correspondences from various very worried correspondents. It seems that GAE/Bingo, along with practically every other A/B testing framework out there, violates some purist principles of how to do significance testing.

The crux of the argument, reworded so simply that I’m pretty sure all statisticians (I admittedly know nothing about stats) would string me up:

If you repeatedly check the results of an experiment, sometimes you’ll see statistically significant results that aren’t actually significant.

So if you’re constantly checking your A/B dashboard and making decisions based on what it tells you, you’re often screwing up.
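If you'd rather watch the problem happen than take anybody's word for it, here is a minimal sketch of the argument as a simulation. It is not anything from GAE/Bingo: every name and number in it (`z_score`, `run_aa_experiment`, the 10% conversion rate, 2,000 users per alternative, peeking every 100 users) is made up for illustration. It runs a pile of fake experiments where both alternatives are identical, so any "winner" is a false positive, and compares stopping at the first sign of stat sig against looking only once at the end. It assumes a plain two-proportion z-test at the usual 95% level, which may not be the test your framework uses.

```python
import math
import random


def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using a pooled conversion rate."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No conversions (or all conversions) so far: nothing to compare.
        return 0.0
    return (conv_a / n_a - conv_b / n_b) / se


def run_aa_experiment(rng, rate=0.1, users_per_arm=2000,
                      check_every=100, z_crit=1.96):
    """Simulate one A/A test where both alternatives convert at `rate`.

    Returns (ever_significant, significant_at_end). All defaults are
    hypothetical values chosen for illustration.
    """
    conv_a = conv_b = 0
    ever_significant = False
    z = 0.0
    for n in range(1, users_per_arm + 1):
        # A bool adds as 0 or 1, so this counts conversions.
        conv_a += rng.random() < rate
        conv_b += rng.random() < rate
        if n % check_every == 0:
            # This is the "peek" at the dashboard.
            z = z_score(conv_a, n, conv_b, n)
            if abs(z) > z_crit:
                ever_significant = True
    # users_per_arm is a multiple of check_every here, so the last
    # peek is also the final result.
    return ever_significant, abs(z) > z_crit


def false_positive_rates(trials=500, seed=42):
    """Fraction of A/A tests declared significant under each habit."""
    rng = random.Random(seed)
    peeking = final_only = 0
    for _ in range(trials):
        ever, at_end = run_aa_experiment(rng)
        peeking += ever
        final_only += at_end
    return peeking / trials, final_only / trials


if __name__ == "__main__":
    peeking_rate, final_rate = false_positive_rates()
    print(f"stopped at first sign of stat sig: {peeking_rate:.1%} false positives")
    print(f"checked once at the end:           {final_rate:.1%} false positives")
```

The peeking rate can never come out lower than the check-once rate, because the final check is itself one of the peeks. How much higher it is depends on how often you peek.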

It’s a mathematically sound argument, as explained to me by my much smarter teammates. And it must be absolutely devastating for all the programmers who went out and bought the Razer Mamba Elite Wireless Gaming Mouse just to increase their click speed so they could mash the refresh button on their A/B dashboards as fast as possible.

Here’s the thing. I know of absolutely nobody who runs an A/B test like a crazed puppy who keeps sprinting loops around your legs hoping that…..ohboyohboyohboyohboy…..after the next 360° you’ll have lowered the puppy treat to the floor.

That doesn’t mean the argument isn’t valid. If you do check your dashboard every 5 seconds like a crazed puppy and immediately end experiments at the first sign of stat sig, then you probably should read the article and…..ummmmm…..find better uses for your time.

Luckily for us, one of my much smarter teammates with much more experience analyzing numbers’n’stuff landed an early modification to GAE/Bingo that should pacify all worried correspondents:


A historical graph of one of our A/B tests and each alternative’s performance. On our dashboard it’s interactive, weeeeeeeee.

By showing this graph everywhere our dashboard shows A/B results and waiting for the results to stabilize, we can be confident that we’re not making a snap judgment in the zone of idiotic decisions.


Danga zone

Ok, good. We’re safe. But what about everybody else? Did Fog Creek and 37signals and everybody including Google immediately start hemorrhaging money due to their reliance on faulty A/B tests when this truth came to light???

My guess is no, because A) they aren’t making snap judgments at the first sign of stat sig and B) with significant traffic, many of our A/B test experiments don’t even have a zone of idiotic decisions. Lots of ‘em look something like this:

 

…and it’s pretty clear in which cases a difference has been made.

11/14/11, 11:51pm