Almost a year ago we released a version of Patrick McKenzie’s A/Bingo, built for Khan Academy and others in need of an App Engine-optimized split testing tool. It’s been in constant production use at Khan since then. We’ve made mistakes, learned lessons, and spent plenty of time scratching our heads pondering results.
Thanks to those lessons, the current version of GAE/Bingo is a lot more powerful than it used to be. And I’m in a sharing mood.
We use GAE/Bingo to keep an eye on tons of different metrics whenever we’re testing a change. For KA that means tracking things like exercise problems completed, proficiencies earned in math topics, videos watched, return visits, registrations, and anything else that Jace and his team decide belong in our collection of core metrics. So it’s not rare to spot us yelling “Bingo!” (A/Bingo’s cute lingo for an A/B conversion) at all sorts of interesting points throughout our code:
def attempt_problem():
    # Yelling "bingo!" is A/Bingo's cute lingo for a conversion
    gae_bingo.bingo("problem attempt")
    ...
    if earned_proficiency:
        gae_bingo.bingo("new proficiency")
bingo()s stick around; we rarely have to change them. That means whenever we run a new experiment, all interesting metrics are tracked independently and automatically with one line of code:
if gae_bingo.ab_test("new homepage layout"):
    # ...try new homepage layout
…so we can ship that line of code, put on our lab coats, open up GAE/Bingo’s dashboard, and start poking around the incoming data. We don’t have to think about how to track all these conversion metrics every single time we add a new test.
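Under the hood, split-testing frameworks like GAE/Bingo typically bucket each user deterministically so the same person always sees the same side of a test. Here's a minimal sketch of that idea, assuming a simple hash-the-identity scheme; `choose_alternative` and its details are illustrative, not GAE/Bingo's actual code:

```python
import hashlib

def choose_alternative(identity, test_name, alternatives):
    """Deterministically bucket a user into one alternative.

    Illustrative sketch only -- not GAE/Bingo's implementation. Hashing
    (identity, test_name) means a user's bucket never changes between
    requests, which keeps the experiment's data consistent.
    """
    digest = hashlib.md5(
        ("%s|%s" % (identity, test_name)).encode("utf-8")).hexdigest()
    index = int(digest, 16) % len(alternatives)
    return alternatives[index]

# The same (user, test) pair always maps to the same alternative:
first = choose_alternative("user-42", "new homepage layout", [True, False])
second = choose_alternative("user-42", "new homepage layout", [True, False])
assert first == second
```

Because assignment is a pure function of the identity and test name, there's no per-user state to store just to remember who saw what.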
Look’a all those metrics just begging to be clicked on.
If you can’t tell, this screenshot is our new homepage layout experiment (which happens to toy with an emphasis on search).
Of course, we can also choose to specifically track, say, “proficiencies earned in the Solid Geometry exercise” when tweaking hints for solid geometry — we just add a new bingo() at the right spot.
I’ve covered this before: there are lots of dangers in trusting a single signal of statistical significance sent off from any A/B test, but that doesn’t mean they don’t work. We’ve learned a buttload (shut up, it’s a real measurement) more from our data ever since we started snapshotting A/B test results hourly and graphing them next to our framework’s automatic stat sig analysis. And I’m sure we’ve made fewer mistaken snap judgments.
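To make that concrete: the stat-sig number a dashboard snapshots each hour is often just a two-proportion z-score. Here's a sketch, assuming a simple pooled z-test (not necessarily the exact test GAE/Bingo's dashboard runs), showing how the same experiment can look very different at two snapshots:

```python
import math

def z_score(conversions_a, participants_a, conversions_b, participants_b):
    """Two-proportion pooled z-score -- the kind of single number an A/B
    dashboard reports. Illustrative sketch, not GAE/Bingo's code."""
    p_a = conversions_a / float(participants_a)
    p_b = conversions_b / float(participants_b)
    p_pool = (conversions_a + conversions_b) / float(
        participants_a + participants_b)
    se = math.sqrt(p_pool * (1 - p_pool) *
                   (1.0 / participants_a + 1.0 / participants_b))
    return (p_b - p_a) / se

# An early snapshot can cross the 95% threshold (|z| > 1.96)...
early = z_score(40, 1000, 62, 1000)
# ...while a later snapshot of the same test may not, which is exactly
# why graphing the number over time beats reading it once.
late = z_score(400, 10000, 430, 10000)
```

Watching that value wander in and out of “significance” as data accumulates is a good cure for trusting any single reading of it.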
A single “99% significant” number doesn’t really capture the full story of an A/B test.
This particular test is keeping an eye on the effects of asking users an interactive question in the middle of watching a video.
We’re currently running 9 different A/B tests, each tracking somewhere between 10 and 35 different metrics. That’s been pretty common for the past few months. A large body of knowledge about our past experiments is accumulating. As our team has grown, it’s become more and more important to figure out how to easily save and share past test results.
The latest version of GAE/Bingo lets you archive your experiment with notes and, because we like a little spice in our software, a choice of emotional reactions to the outcome. Take a look:
We like keeping stories of old experiments around, even if the data was confusing.
This particular test was our attempt at encouraging students to use interactive hints when trying to solve difficult problems.
Our list of archived experiments tells a story about what we’ve been trying to improve.
This post about puzzling A/B test results at Microsoft came at just the right time for us. Anybody running A/B tests should sit down and read it immediately (anybody not running A/B tests has bigger fish to fry).
Even with all our data, a healthy number of users, and some fancy tools, it can be really hard to understand what is happening when looking at the results of an A/B test. The linked paper puts it best:
…the devil is in the details and the difference between theory and practice is greater in practice than in theory…It’s easy to generate p-values and beautiful 3D graphs of trends over time…real challenge is in understanding when the results are invalid, not at the sixth decimal place, but before the decimal point…Generating numbers is easy; generating numbers you should trust is hard!
I’m ok raising my hand and admitting that we’ve seen some really confusing numbers from our experiments. We’ve seen an experiment involving interactive physics videos increase the number of math skill proficiencies earned…on a completely separate part of our site. We’ve had a homepage experiment drastically impact user registration numbers…but the registration effect started 4 days after the experiment began. We tested a “Send Sal thanks!” button designed to distract users from adding “thank you!!!!” comments under our videos…and users started leaving more comments. I could go on and on.
I’m even ok raising my hand and admitting that these discrepancies caused our team to run one of those (brilliantly named) A/A tests. It has been very humbling and healthy to compare two identical versions of Khan Academy, watch the results come in, and wonder what stories you’d be explaining to yourself if these numbers were tied to a real experiment. If you’ve ever spent time looking at A/B test results, I cannot recommend running an A/A test highly enough. It is a good gut check reminding you what your mind does when interpreting results.
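You can even simulate the A/A experience offline to see how often pure noise crosses a significance threshold. This is an illustrative sketch (the `simulate_aa_tests` helper is made up for this post, not part of GAE/Bingo), assuming the same simple pooled z-test as above:

```python
import math
import random

def simulate_aa_tests(num_tests=500, participants=2000,
                      conversion_rate=0.05, z_threshold=1.96, seed=0):
    """Run many simulated A/A tests. Both arms are identical by
    construction, so every 'statistically significant' result is a
    false positive. Illustrative only -- not GAE/Bingo code."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(num_tests):
        # Identical arms: same conversion rate on both sides.
        a = sum(rng.random() < conversion_rate for _ in range(participants))
        b = sum(rng.random() < conversion_rate for _ in range(participants))
        p_pool = (a + b) / (2.0 * participants)
        se = math.sqrt(2 * p_pool * (1 - p_pool) / participants)
        diff = abs(a - b) / float(participants)
        if se > 0 and diff / se > z_threshold:
            false_positives += 1
    return false_positives / float(num_tests)
```

At a 1.96 threshold you'd expect roughly 5% of these identical-vs-identical tests to come up “significant” by chance alone, which is exactly the gut check the real A/A test gives you.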
Sure, some of these discrepancies are almost certainly caused by bugs in our code. Some are caused by the fact that GAE/Bingo doesn’t do a good job handling outliers yet, so a single participant can act all crazypants and make our graphs go wonk. But sometimes it just takes us a long time to figure out what the heck’s going on. I’m glad Microsoft published their paper admitting the same.
Please let me know if you’ve learned any lessons using GAE/Bingo elsewhere, or if you have tips for improving it further. I’d love to share.