


After releasing GAE/Bingo, we received a number of worried correspondences from various very worried correspondents. It seems that GAE/Bingo, along with practically every other A/B testing framework out there, violates some purist principles of how to do significance testing.
The crux of the argument, reworded so simply that I’m pretty sure all statisticians (I admittedly know nothing about stats) would string me up:
If you repeatedly check the results of an experiment, sometimes you’ll see statistically significant results that aren’t actually significant.
So if you’re constantly checking your A/B dashboard and making decisions based on what it tells you, you’re often screwing up.
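For the curious, here is a minimal simulation of that argument. It is only a sketch under my own assumptions (a 10% conversion rate, 2,000 users per arm, a peek every 100 users, a plain two-proportion z-test at the 95% level), and none of it comes from GAE/Bingo. Both alternatives are identical, so every "significant" result is a false alarm.

```python
import math
import random


def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using a pooled standard error."""
    pooled = (conv_a + conv_b) / float(n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0:
        return 0.0
    return (conv_a / float(n_a) - conv_b / float(n_b)) / se


def run_aa_test(rng, users_per_arm=2000, peek_every=100, rate=0.1, z_crit=1.96):
    """Simulate one A/A test where both arms convert at the same rate.

    Returns (significant_at_any_peek, significant_at_the_end).
    """
    conv_a = conv_b = 0
    significant_at_any_peek = False
    for n in range(1, users_per_arm + 1):
        conv_a += rng.random() < rate
        conv_b += rng.random() < rate
        if n % peek_every == 0 and abs(z_score(conv_a, n, conv_b, n)) > z_crit:
            significant_at_any_peek = True
    final = abs(z_score(conv_a, users_per_arm, conv_b, users_per_arm)) > z_crit
    return significant_at_any_peek, final


def false_positive_rates(trials=400, seed=42):
    """Fraction of A/A tests flagged by the peeker vs. by a single final check."""
    rng = random.Random(seed)
    results = [run_aa_test(rng) for _ in range(trials)]
    peek_rate = sum(p for p, _ in results) / float(trials)
    final_rate = sum(f for _, f in results) / float(trials)
    return peek_rate, final_rate


if __name__ == "__main__":
    peek_rate, final_rate = false_positive_rates()
    print("stopped at first stat sig: %.1f%% false positives" % (100 * peek_rate))
    print("checked once at the end:   %.1f%% false positives" % (100 * final_rate))
```

The peeker sees every result the patient checker sees plus 19 earlier chances to be fooled, so its false positive rate can only be equal or higher.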
It’s a mathematically sound argument, as explained to me by my much smarter teammates. And it must be absolutely devastating for all the programmers who went out and bought the Razer Mamba Elite Wireless Gaming Mouse just to increase their click speed so they could mash the refresh button on their A/B dashboards as fast as possible.
Here’s the thing. I know of absolutely nobody who runs an A/B test like a crazed puppy who keeps sprinting loops around your legs hoping that…..ohboyohboyohboyohboy…..after the next 360° you’ll have lowered the puppy treat to the floor.
That doesn’t mean the argument isn’t valid. If you do check your dashboard every 5 seconds like a crazed puppy and immediately end experiments at the first sign of stat sig, then you probably should read the article and…..ummmmm…..find better uses for your time.
Luckily for us, one of my much smarter teammates with much more experience analyzing numbers’n’stuff landed an early modification to GAE/Bingo that should pacify all worried correspondents:

A historical graph of one of our A/B tests and each alternative’s performance. On our dashboard it’s interactive, weeeeeeeee.
By showing this graph everywhere our dashboard shows A/B results and waiting for the results to stabilize, we can be confident that we’re not making a snap judgment in the zone of idiotic decisions.
Ok, good. We’re safe. But what about everybody else? Did Fog Creek and 37signals and everybody including Google immediately start hemorrhaging money due to their reliance on faulty A/B tests when this truth came to light???
My guess is no, because A) they aren’t making snap judgments at the first sign of stat sig and B) with significant traffic, many of our A/B test experiments don’t even have a zone of idiotic decisions. Lots of ‘em look something like this:


…and it’s pretty clear in which cases a difference has been made.
Building GAE/Bingo required reaching into the bag of performance tricks a couple times. We needed long-term persistence of the data behind many A/B experiments, with stats accumulating at 500 reqs/sec, without slowing down pageloads.
Some of the wabbits we pulled out of the hat are pretty cool. Some are probably really stupid but get the job done. I’ll throw a couple your way and let you choose which is which.
GAE/Bingo makes it easy to run A/B experiments in one line of code: ab_test("monkeys"). This means during any given request, an individual user might interact with a number of different experiments depending on various code paths. Without knowing ahead of time which ab_test("monkeys") or ab_test("gorillas") or ab_test("chimpanzees")’s are going to run, we need to minimize the number of roundtrips spent talking to memcache or the datastore.
At first it makes sense to put a collection of all Experiment models in a single memcache slot. This is kinda helpful because at the beginning of the request you can:
experiments = memcache.get("experiments")
…and then each time you need to work with an experiment you can:
experiment = experiments[experiment_name]
You’ve only got one memcache call regardless of how many experiments you’ll be interacting with on each request.
There’s a big problem. Getting a bunch of objects out of memcache also involves deserializing all of those objects, not just transferring them over the wire. And at least until the 2.7 release ships, deserializing objects innnn ppuuurreeee Pyyyyttthhhooonnnnn iiiiiiiiiisssssssssssssss rrrreaaaaaaaaaaaaalllllllllllllyyyyyyyyyyyyy ssssssssssssssssssslllllllllllllllllllllloooooooooooooooooooowwwwwwwwwwwwwwwwwwwwww. If you’re running, like, 50 live A/B experiments and your user needs to interact with, like, three of them when requesting /profile, you’ve got, like, 47 experiment-deserializations worth of performance waste.
Here’s where we pulled out the same trick we first used when building a fast real-time badging system for Khan Academy. Since you’ll almost never be interacting with all of the experiments in an individual request, you really really Really don’t want to spend time deserializing them. So what happens if you serialize the objects once (in this case using protocol buffers) before putting them in the memcache collection?
Things are a little bit slower when creating the collection:
memcache.set(
    "experiments",
    {
        "monkey": db.model_to_protobuf(experiment_monkey).Encode(),
        "gorilla": db.model_to_protobuf(experiment_gorilla).Encode(),
        ...
    }
)
Getting the collection from memcache is MUCH faster:
experiments = memcache.get("experiments")
…because instead of deserializing 50 experiment models, you’re unpacking 50 pre-serialized protocol buffers. Fast. Many much fast.
Now, when you need an actual experiment model, you just:
experiment = db.model_from_protobuf(entity_pb.EntityProto(experiments[experiment_name]))
…and only pay the deserialization penalty when you actually need to use the object.
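If you want to poke at the idea outside of App Engine, here is a rough stand-in. Everything in it is an assumption for illustration: a plain dict plays the role of memcache, json plays the role of protocol buffers, and `pack_experiments` and `LazyExperiments` are hypothetical names, not part of GAE/Bingo. It only shows the shape of the trick: serialize each item on its own, fetch the whole collection in one call, and decode an item the first time somebody asks for it.

```python
import json


def pack_experiments(experiments):
    """Serialize each experiment separately, keyed by experiment name."""
    return dict((name, json.dumps(data)) for name, data in experiments.items())


class LazyExperiments(object):
    """Wraps a collection of pre-serialized experiments and decodes on demand."""

    def __init__(self, packed):
        self._packed = packed      # name -> serialized string
        self._decoded = {}         # name -> decoded experiment
        self.decode_count = 0      # how many deserializations we actually paid for

    def get(self, name):
        if name not in self._decoded:
            self._decoded[name] = json.loads(self._packed[name])
            self.decode_count += 1
        return self._decoded[name]


# A dict standing in for memcache, holding 50 live experiments in one slot.
fake_memcache = {}
all_experiments = dict(
    ("experiment_%d" % i, {"name": "experiment_%d" % i, "alternatives": ["a", "b"]})
    for i in range(50)
)
fake_memcache["experiments"] = pack_experiments(all_experiments)

# One "memcache call" per request, then decode only what the request touches.
experiments = LazyExperiments(fake_memcache["experiments"])
for name in ("experiment_3", "experiment_7", "experiment_3"):
    experiment = experiments.get(name)
```

After that loop `experiments.decode_count` is 2, not 50, because the repeat lookup of `experiment_3` reuses the already-decoded copy.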

This screenshot from my recent Khan Academy Friday tech talk should make everything totally clear. Especially without any context, I’m pretty sure seeing a picture of a gnome riding a platypus while carrying a monkey explains everything.
Bottom line: we make one single memcache call coupled with the minimum amount of experiment deserialization necessary, regardless of how many A/B experiments are running or which experiments are used by each request. There are lots of arguments over the fastest way to (de)serialize objects in App Engine — the fastest is to avoid the issue as much as possible.
Where we use this:
Split testing experiments and our real-time badge framework.
I’ll get in trouble for this or maybe even ruin the fun for everyone. This involves starting an asynchronous datastore put via db.put_async and then walking away with your hands in your pockets acting like nothing happened.
There are very clear instructions by the talented App Engine team that you should find some time to wait for the get_result() of any call to db.put_async(monkey). It’s also very clear that if you don’t wait for put_async to finish, App Engine is going to wait for you. In other words, you can’t magically send off a bunch of put_async’s and then send your response to the user without waiting for the put to complete.
db.put_async(experiment_monkey)
return render_template("mwuuahaha_im_not_doing_anything_else.html")
You can, however, send off a put_async and then do everything else your request could possibly think of doing (including rendering templates and such) without waiting for the put to complete. App Engine will make sure the put finishes before the response goes out, but if you just kick off the put_async and then walk away and handle the rest of your request, you can maximize concurrency of your request’s work w/ the asynchronous put.
There are *lots* of other ways to get a very similar effect. All of them are probably more kosher. I won’t list them here. This just happens to be a neat little trick that you can trigger with one line of code without worrying about any other boilerplate.
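To see the overlap without an App Engine account, here is a loose analogy using only the standard library. It is not the datastore API: `slow_put` is a made-up stand-in for the write, a thread pool stands in for the async RPC, and the final `shutdown(wait=True)` plays the part of App Engine refusing to finish the request until the outstanding put is done.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_put(store, key, value, delay=0.05):
    """Pretend datastore write: sleeps, then stores the value."""
    time.sleep(delay)
    store[key] = value
    return key


def handle_request(store):
    executor = ThreadPoolExecutor(max_workers=1)

    # Kick off the write and walk away with our hands in our pockets.
    future = executor.submit(slow_put, store, "monkey", {"participants": 1})

    # The rest of the request runs while the write is in flight.
    html = "<h1>rendered while the put was in flight</h1>"

    # The platform-style backstop: nothing is returned until the write lands.
    executor.shutdown(wait=True)
    return html, future.result()
```

The point is only the ordering: the write starts first, the rendering happens in the middle, and the wait happens once at the very end instead of right after the put.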
Where we use this:
Find the spots yourself and make fun of me. I have a strong feeling this’ll be replaced in the future.
This one doesn’t belong in a “you’ll probably never need” post. It’s extremely common and handy: throw data in memcache (fast) and then run a background task or cron job that persists the data from memcache to the datastore (slow).
We actually get a little trickier because we need to persist lots of data that’s coming in quite often: each and every user’s participation and conversions in each and every A/B test. These events could be triggered multiple times per request for each user. It’s not quite clear how we’d put this data in memcache and what scheme would be running in the background to send it all to the datastore.
We opted for a bucketing system. Every time a user participates or converts in an A/B test, we randomly choose one of 50 memcache buckets and throw their user id in the bucket. When any of those buckets begins to overflow, we fire off a deferred task queue task to poke through the overflowing bucket, pull each user’s data out of memcache, and whisk it into the datastore.
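# randint is inclusive on both ends, so 0..49 gives exactly 50 buckets
bucket = random.randint(0, 49)
key = "_gae_bingo_identity_bucket:%s" % bucket
list_identities = memcache.get(key) or []
list_identities.append(ident)
if len(list_identities) > 50:
    # The bucket is overflowing: persist its users in a background task
    deferred.defer(persist_gae_bingo_identity_records, list_identities)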
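Here is a self-contained sketch of the same bucketing idea that you can run anywhere. The assumptions are mine: a dict stands in for memcache, a `persist` callback stands in for the deferred task, and the write-back and bucket reset are my guess at the bookkeeping the snippet above leaves out. Real memcache would also need to deal with two requests updating the same bucket at once, which this ignores.

```python
import random

BUCKET_COUNT = 50   # number of buckets
BUCKET_LIMIT = 50   # flush once a bucket holds more than this many identities


def bucket_key(index):
    return "_gae_bingo_identity_bucket:%s" % index


def record_identity(cache, ident, persist, rng=random):
    """Drop ident into a random bucket, flushing the bucket if it overflows."""
    key = bucket_key(rng.randrange(BUCKET_COUNT))
    identities = cache.get(key) or []
    identities.append(ident)
    if len(identities) > BUCKET_LIMIT:
        # Hand the whole bucket to the background persister and start over.
        persist(list(identities))
        identities = []
    cache[key] = identities


if __name__ == "__main__":
    cache, persisted = {}, []
    for i in range(5000):
        record_identity(cache, "user%d" % i, persisted.extend)
    waiting = sum(len(v) for v in cache.values())
    print("persisted: %d, still waiting in buckets: %d" % (len(persisted), waiting))
```

Spreading writes across buckets keeps any single cache entry small, and flushing a whole bucket at a time means one background task covers roughly 50 users instead of one.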
Where we use this:
GAE/Bingo and, in simpler fashions, pretty much all over the place.
If any of these hacks help someone else out, please let me know ASAP so I can win the bet against myself.
Regardless, expect more of these posts in the future. Tricks like these have been critical to keeping Khan Academy fast while adding new tools.
Continuing my trend of straight-up copying the work of the smartest people I know, I recently decided to tackle Khan Academy’s A/B testing problem (we didn’t have any A/B testing) by bringing Patrick McKenzie’s A/Bingo into App Engine land.
So here you go: GAE/Bingo is released and should get anyone on App Engine up and A/B testing in minutes. It’s currently in production on Khan Academy and performing well with hundreds of requests per second.
A/Bingo is an A/B testing framework for Ruby on Rails. It’s specifically designed to make the creation of split test experiments as quick and painless as possible.
GAE/Bingo is a re-imagining of this framework’s core design principles inside of Google App Engine. GAE/Bingo was specifically built for use at Khan Academy, which means:
We’ve got a million and one ideas to try out at Khan Academy. What tweak to our game mechanics will best motivate students to challenge themselves? What message makes it most likely for a student to sit back and watch a video when they really need to take time, slow down, and re-learn a core concept?
An A/B testing framework gives us the tools necessary to start answering these questions with experiments and hard(er) data. With ~1.5MM practice exercises answered per school day by Khan Academy students, we have a treasure trove of student activity from which to learn.
We also wanted to spread the love. Patrick helped out the Rails community by open sourcing A/Bingo, and we wanted to do the same for App Engine. I also couldn’t find any Python split testing framework that satisfied our needs and stayed true to the design principles of A/Bingo.
Plus, why not take advantage of the fact that App Engine’s vertical stack empowers framework creators to go pretty far when it comes to creating a drop-in, It Just Works experience? We hope GAE/Bingo accomplishes this and helps out some others in the community.
Start an A/B test in one line:
# Returns "chimpanzee" to half your users and "zorilla" to the other half
animal = ab_test("cute_logo_animal", ["chimpanzee", "zorilla"])
…and when something good happens, score a conversion in one line:
bingo("cute_logo_animal")
These two lines will automatically take care of experiment creation, user tracking, consistent A/B results for each individual user, and statistical analysis. You can do a lot more, of course, when it comes to specifying alternatives and tracking conversions (check out the docs). There are some pretty simple (optional) hooks that make it very easy to get consistent A/B results even when your users transition from anonymous to logged-in.

Trivial example: an A/B test proving that messaging a student with “You’re ready to move on!” is statistically more likely to encourage a student to move on to new content than “Nice work!”
Once at least one user causes the above lines of code to execute, you’ll get statistical analysis and be able to control your experiment from the dashboard.
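If "consistent A/B results for each individual user" sounds like magic, one common way to get that property is to hash the experiment name together with the user's identity. To be loud about it: this is a generic technique sketched under my own assumptions, `choose_alternative` is a hypothetical helper, and it is not a description of how GAE/Bingo assigns or stores alternatives.

```python
import hashlib


def choose_alternative(experiment_name, identity, alternatives):
    """Deterministically map (experiment, user) to one of the alternatives."""
    seed = ("%s:%s" % (experiment_name, identity)).encode("utf-8")
    digest = hashlib.md5(seed).hexdigest()
    return alternatives[int(digest, 16) % len(alternatives)]


if __name__ == "__main__":
    animals = ["chimpanzee", "zorilla"]
    first = choose_alternative("cute_logo_animal", "user_42", animals)
    again = choose_alternative("cute_logo_animal", "user_42", animals)
    print(first, again)  # the same animal twice for the same user
```

Including the experiment name in the hash keeps a user's bucket in one experiment from lining up with their bucket in every other experiment.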
Please do. Let me know how it goes via Twitter or email. Patch up all the inevitable bugs and fill in all the major holes left by our desire to ship v1. We’ll be continuing to improve the framework, and all help is welcome.
Enormous thanks go out to Patrick McKenzie for his framework’s inspiration and the encouragement to follow his lead. I’ll be blogging more in the future about how we’re using GAE/Bingo and how we keep track of hundreds of requests per second w/ persistent storage and minimal impact on page load times.
Want to be handed a major portion of Khan Academy ownership, ridiculously high expectations, and a bunch of mentorship from our full-time devs? Sign up now. We believe anybody can help the world get a great education, and we accept interns year-round.
I can’t remember a time in my history of small company software development that hasn’t felt like sitting in a rickety donkey kong cart with jet afterburners attached and blazing, and everybody inside is just trying to keep the staples (why did they use staples?) to hold long enough for us to make it around the next bend in the tracks.
That being said, at least in my limited donkey kong cart experience, summer internship seasons always stand out in my mind as new high watermarks for shipping speed and dev intensity.

Just like Sal’s videos, our practice exercises are at the very heart of Khan Academy.
This summer’s class of Khan Academy interns has been no different. Our interns come in with promises of being handed ownership and control over major, user-facing features, and in return we demand excellence; it’s pretty similar, actually, to Khan Academy’s educational belief of encouraging experimentation but expecting mastery. They were shipping features on day one.
Any dev team out there not acknowledging the fact that high school and college students are capable of showing up on your doorstep and almost immediately redefining major portions of your product for the better is either failing to recruit well or is plain old missing out. Big time.
If that’s you, I hope to change your mind with this post.
Each of the following improvements to Khan Academy was contributed either largely or entirely by our interns this summer. Four of ‘em are in college, one just graduated high school, and one hasn’t even started applying to colleges yet.
This is a major body of work.
We learned about all types of weaknesses in our old exercises after last year’s pilots, and we’ve tackled them head on by improving our hints, removing multiple choice answers, focusing on the user’s exercise experience, and building new ways of asking old questions.

Almost all multiple choice questions are gone.
We’ve focused on helping our developers and our community create new exercises quickly. We’ve written better documentation, shipped simpler dev tools, and built solid bug reporting workflows to maintain a healthy stream of new, quality exercises. Our interns are responsible for not only porting all existing exercises to our new tools but also developing brand new frameworks to help exercises draw graphs, randomize questions, generate procedural hints, and more.

The Summer 11 interns’ recently launched new exercises have already served up over 5,000,000 math problems to Khan Academy students.
We know that at the end of the day, the only thing that matters is whether or not Khan Academy students are really learning, and a large quantity of quality interactive exercise content is core to that mission. KA wouldn’t be what it is today without the large quantity of quality videos Sal has created. As a development team and community, we should consider ourselves challenged to match his videos with quality exercises. The team’s efforts this summer have given us the tools necessary to take a crack at this considerable challenge.
In fact, exercises are so important to us that we’re now hiring full-time exercise developers to come push these tools to their limits and redefine what it means to learn online. If you want to join some very passionate devs, either apply now or join our open source exercise community.
Followers of this blog (also: leprechauns, unicorns) may remember a post in which I struggled to decide if Khan Academy should follow Stack Overflow’s registration model by allowing non-logged-in users to participate in all of our content and automatically transferring their work to a permanent account after they log in.
The more we thought about it and read posts about unregistered users like Fred Wilson’s, the more we realized that making access to our educational content as easy as possible was the right thing to do.

We track users’ progress and encourage them to log in, but we never get in anyone’s way when they’re trying to learn.
Two interns completely owned the design and implementation of this feature. If you go to Khan Academy now, you can start earning points and badges for watching videos and working on exercises without ever logging in. We encourage users to log in at various milestones in their progress, but we never stop them from continuing to use the site or force them to close a popup.
If you log in, you keep all your progress. This has significantly reduced our bounce rate by getting rid of painful login walls, and we’re continuing to watch other statistics to see the effects of this change.
I can’t count the number of times I’ve heard users ask us to display individual video progress next to each one of the 2,500+ video links on our homepage. I’m not gonna get into the various technical challenges here (maybe a different blog post), but this is nontrivial for a brain like mine.

Luckily, we hire interns that are way smarter than me and are able to solve such problems. These days, whenever you watch a piece of a Khan Academy video, skip around it, pause it, play it, or whatever, we keep track of extremely precise video progress statistics and display useful progress indicators next to every video link.
One of the earliest features launched during Summer ‘11 was the ability to share videos, exercises, and badges on Facebook and Twitter.

Facebook sharing has slowly gone up (even during the natural academic lull of summer), and we hope this trend will continue.
…and those are just the bigger changes. Our interns continue to claw for inch after inch after inch of improvements like nicer internal statistics, faster deploy scripts, performance tweaks, and better user account management.
Summer’s not over yet. Omar Rizwan, Jeff Ruberg, Joel Burget, Igor Terzic, Parker Kuivila, and Ben Alpert continue to set the current bar for Khan Academy internship classes. As an organization, we aim to beat this mark in the future. But those six are pushing hard to make it a tough task, and khanacademy.org is improving for the better, quickly.
Don’t doubt the inexperienced. Get your team’s recruiting, mentorship, and code reviews right (easier said than done), and a summer internship can be one of the best things that’s ever happened to your product. I already can’t wait to drop some major challenges in the laps of our two incoming Fall interns to see what they can build.

Big shoes to fill.
tl;dr — If you’re using task queues on App Engine and your task execution speeds vary greatly, you can get yourself into serious performance trouble. We addressed this by explicitly separating fast and slow tasks, and we released a little utility to help you do the same.

Only one user has ever earned this badge
Quick story: even though we put in a lot of work making sure Khan Academy users are awarded badges in real-time, we also run background badging processes to make sure we didn’t miss any. This process uses GAE’s mapreduce framework to map over all users and make sure their badges are up-to-date, and it looks something like:
# Called once for every user
def badge_background_check(user):
    if user.has_recent_activity():
        user.update_badges()
has_recent_activity is really fast, and update_badges can be really slow. This doesn’t play nice with App Engine’s request scheduler, and here’s why.
When you start firing off a bunch of tasks to a specific task queue, App Engine keeps a running average of how long tasks in that particular queue are taking. If they’re really fast (< 1000ms is what we’ve heard…), you can think of the task queue being painted with a shiny yellow smiley face. If they’re slow, the queue gets a sad red frowny face stamped on its forehead.

Don’t judge just because the sad one sounds like Eeyore. Neither of these situations causes problems.
If you’ve got a shiny yellow smiley face, App Engine’s scheduler will try to get your tasks done quickly by queuing up your fast tasks in the same lines that your users’ requests wait in. As long as these tasks stay fast, your app will scale when more instances are needed, your users should never notice that they’re standing in the same line as the fast tasks, and all is good.
Thing is, it’s even ok to have a frowny face. App Engine will use a different, slower scheduler to hand out your work without interrupting user facing requests. You’ll never wait in the same line as a user request, and you can take all the time in the world (as long as the world only lasts for the next 10 minutes) without worrying about someone emailing and complaining that your site’s gotten really slow lately.
The misery starts when you have a queue that’s mostly fast but hits sporadically slow tasks. Now your queue is all shiny and yellow and smiley and sharing the same checkout lines as user facing requests, but every once in a while a grumpy red frowny face stands in line, takes five full minutes at the cash register asking about the store’s raincheck coupon policy, and ruins everybody’s day. Your queue’s sporadic behavior is now directly hurting your users’ perceived performance.
This has caused us some very serious perf hiccups. 90% of our users don’t need to run update_badges on any given day, so the average task speed is very fast. When we hit a slow task, instance request queues can grind to a halt. At our worst, we’ve seen user facing requests sit waiting in the request queue for 9000ms (that’s approximately one full eternity) before our code even gets a chance to run.*

I was going to put an animated gif of a yellow smiley face that suddenly flashes a red frowny face here, but then I remembered that I’d have to kill myself for having an animated gif.
The gae_fast_slow_queue utility helps work around this as quickly as possible by making it easy to run a bunch of fast tasks while identifying slow tasks and splitting them off into a separate queue:
@fast_slow_queue.handler(lambda user: user.has_recent_activity())
def badge_background_check(user):
    user.update_badges()
Now, when badge_background_check gets queued up, it’ll always only run the fast lambda function to keep the main queue yellow and smiley. If real work needs to be done, it fires off a different task in a different queue, and this one is guaranteed by our utility to take at least 1000ms so it stays red and frowny forever. GAE’s scheduler will steer these clear of user facing requests, and you don’t have to worry about your task impacting users’ perceived perf.
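For a feel of what a decorator like that might be doing under the hood, here is a toy version with no App Engine in sight. It is a sketch under my own assumptions, not the gae_fast_slow_queue source: `FastSlowQueues` is a hypothetical class, a Python list stands in for the slow task queue, and the sleep padding imitates the "at least 1000ms" guarantee.

```python
import time


class FastSlowQueues(object):
    """Toy fast/slow splitter: fast checks run inline, slow work is queued."""

    def __init__(self, min_slow_seconds=1.0):
        self.slow_queue = []                  # stand-in for a separate task queue
        self.min_slow_seconds = min_slow_seconds

    def handler(self, fast_check):
        def decorator(slow_func):
            def fast_task(*args):
                # Only the cheap check runs in the fast queue.
                if fast_check(*args):
                    self.slow_queue.append((slow_func, args))
                    return True
                return False
            return fast_task
        return decorator

    def run_slow_tasks(self):
        """Run queued slow work, padding each task to the minimum duration."""
        ran = 0
        while self.slow_queue:
            slow_func, args = self.slow_queue.pop(0)
            started = time.time()
            slow_func(*args)
            remaining = self.min_slow_seconds - (time.time() - started)
            if remaining > 0:
                time.sleep(remaining)  # keep the slow queue reliably slow
            ran += 1
        return ran
```

Calling the decorated function only ever costs the fast check. The real work waits in `slow_queue` until `run_slow_tasks` gets to it, which is the separation that keeps slow tasks out of the line your users are standing in.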
* We noticed this after deploying gae_mini_profiler and browsing around the site — every once in a while a developer got frustrated waiting forever for a page to load, then the profiler would come back and add insult to injury by screaming, “Woohoo this page only took 150ms!” You can confirm issues like these by looking for high pending_ms values in your logs.
We just made a small change to Google App Engine Mini Profiler to deal with the fact that spitting out profiling stats on every rendered page still misses all the pages that aren’t rendered: the POST side of temporary redirects.
It’s super common to submit a POST and redirect to a GET. When using a profiler that spits out performance data about the currently rendered page, you lose all of that profiling data about the POST request, even though the user still had to wait for the request to finish.

Not no more
Sticking with the belief that making the path from developers to profiling data as short as possible will produce a faster site, we now automatically expose these redirects and all of their performance stats. App Engine Mini Profiler handles this with ajax requests as well (you’ll see two profiler entries in the corner when sending an ajax POST and getting redirected to a GET full of JSON data), and in the unfortunate event that you’re bouncing the user around a long chain of redirects (more than one and you should feel guilty), it’ll keep track of all ‘em.

Exposing the redirect’s data
Patrick Bateman: (Looking at the business card) “Look at that subtle off-white coloring. The tasteful thickness of it. Oh my god…”
I finally understood the full extent of Bateman’s jealousy recently when reading about the MVC Mini Profiler created by the Stack Exchange team. My eyes scanned all the features, and I started to lose my mind:
…my god, it even had a watermark.

As Jeff said, simply showing a render time for all pages can be critical
It really hit home for me when reading Sam Saffron’s post: “No web frameworks seem to provide a comprehensive approach to page profiling out of the box.” It’s true. This level of ubiquitous profiling baked right into all major web frameworks would really change the game. I decided I couldn’t live without something like mvc-mini-profiler any longer, so we just released gae_mini_profiler.
gae_mini_profiler is a drop-in, ubiquitous, production profiling tool for Google App Engine heavily inspired by MVC Mini Profiler. Since App Engine and MVC have so many fundamental differences, the tools aren’t identical, and it’s not really a port as much as a ridiculously kindred spirit. If you want, you can play around with a gae_mini_profiler-enabled demo of GAE’s example chess app — in this demo case, the profiler is enabled for all users.

Taking a glance at one of our slower pages in production
I’m confident that always having all of this data right in front of our developers as we browse around the live Khan Academy site will improve performance. I’ve already noticed a few problems I didn’t previously know about in just the first few hours after deploying.

AJAX calls stack up as they come in, ready for examination

You can dive deep on the details of each request
All props go to Jarrod, Sam, and probably the rest of the Stack Exchange team for blazing the path with mvc-mini-profiler and inspiring this tool. I borrowed ideas ranging from basic UI to dupe query detection to jQuery.tmpl usage to ajax request stacking to lots of other stuff I can’t remember. The sizable differences in the tools really spawn from the type of data I get from Appstats and cProfile.
The project is obviously young, but any App Engine app should be able to drop it in quite quickly with minimal configuration. Instructions are at the repo, and I hope others find it useful.
“…relief washes over me in an awesome wave.”
We’re thinking about how to connect our community of students, and there’s a lot we don’t know. Here’s the little we do:
We want to build these amplifiers directly into Khan Academy. No matter which direction we brainstorm, the possibilities for connecting our users don’t seem to stop. A small subset of what we’ve considered:
While we don’t claim to have the answers, you’ll see more and more experiments coming from our core belief that teaching others is a great way to learn. We’ve already dabbled (can’t emphasize that enough…dabbled) with this goal in the question and answer section beneath each of our videos:

Students teaching students
It seems likely that Sal’s videos each have a few common misunderstandings or logical jumps that trick the majority of viewers, and this feature is designed to let the community fill in the most important gaps.

Even little questions like these are important
Simple voting mechanics (with one or two doodads from the best voting implementation out there) hopefully surface the most common questions
This small experiment has already taught us some important lessons. I see no reason to stop the lazy list writing style I started above:
We’ll be experimenting more, and we’re always listening. The direction of educational communities online feels wide open, and every day we have to pick our path. Share your opinions; they matter a lot.
As soon as I joined KA, I began receiving emails from people asking to integrate with Sal’s content in various ways: in mobile apps, in search engine results, in other educational sites. So we threw together the quickest, ugliest API we could get away with at the time in order to give other developers access to our playlist and video library.
It’s now a few months later, and that hack of an API no longer cuts it. It lacks authentication, user-specific data, API versioning, and other basics. We need these basics for the official mobile apps we’re developing, and the development community needs them to make apps that do more than link to our videos. Which is why we just released what we’re calling API version 1.0.
The new API gives developers access to not just playlists and videos, but also exercises, badges, users, and logs of historical data about each and every problem or video students have watched. I’m sure we missed a lot of stuff, so developers: just let us know when you need access to something else.

Using the API to check out my performance in our Writing Expressions 1 exercise.
We’re slotted to be one of the featured APIs in next weekend’s hack4knowledge — I’m crossing my fingers we see some cool educational hacks using our data.