App Engine Performance Hacks You’ll Probably Never Need, Part I

Building GAE/Bingo required reaching into the bag of performance tricks a couple times. We needed long-term persistence of the data behind many A/B experiments, with stats accumulating at 500 reqs/sec, without slowing down pageloads.

Some of the wabbits we pulled out of the hat are pretty cool. Some are probably really stupid but get the job done. I’ll throw a couple your way and let you choose which is which.

Access a select few from a list of entities on every single request — and you don’t know which until the request is done

GAE/Bingo makes it easy to run A/B experiments in one line of code: ab_test("monkeys"). This means during any given request, an individual user might interact with a number of different experiments depending on various code paths. Without knowing ahead of time which of ab_test("monkeys"), ab_test("gorillas"), or ab_test("chimpanzees") are going to run, we need to minimize the number of roundtrips spent talking to memcache or the datastore.

At first it makes sense to put a collection of all Experiment models in a single memcache slot. This is kinda helpful because at the beginning of the request you can:

experiments = memcache.get("experiments")

…and then each time you need to work with an experiment you can:

experiment = experiments[experiment_name]

You’ve only got one memcache call regardless of how many experiments you’ll be interacting with on each request.

There’s a big problem. Getting a bunch of objects out of memcache also involves deserializing all of those objects, not just transferring them over the wire. And at least until the 2.7 release ships, deserializing objects innnn ppuuurreeee Pyyyyttthhhooonnnnn iiiiiiiiiisssssssssssssss rrrreaaaaaaaaaaaaalllllllllllllyyyyyyyyyyyyy ssssssssssssssssssslllllllllllllllllllllloooooooooooooooooooowwwwwwwwwwwwwwwwwwwwww. If you’re running, like, 50 live A/B experiments and your user needs to interact with, like, three of them when requesting /profile, you’ve got, like, 47 experiment-deserializations worth of performance waste.

Here’s where we pulled out the same trick we first used when building a fast real-time badging system for Khan Academy. Since you’ll almost never be interacting with all of the experiments in an individual request, you really really Really don’t want to spend time deserializing them. So what happens if you serialize the objects once (in this case using protocol buffers) before putting them in the memcache collection?

Things are a little bit slower when creating the collection:

memcache.set("experiments", {
        "monkey": db.model_to_protobuf(experiment_monkey).Encode(),
        "gorilla": db.model_to_protobuf(experiment_gorilla).Encode(),
})

Getting the collection from memcache is MUCH faster:

experiments = memcache.get("experiments")

…because instead of deserializing 50 experiment models, memcache only has to hand back 50 plain strings, each holding a protocol buffer that stays encoded. Fast. Many much fast.

Now, when you need an actual experiment model, you just:

from google.appengine.datastore import entity_pb
experiment = db.model_from_protobuf(entity_pb.EntityProto(experiments[experiment_name]))

…and only pay the deserialization penalty when you actually need to use the object.
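If you want to poke at the idea without an App Engine SDK handy, here's a rough stand-alone sketch. It assumes a cache that rebuilds every object in a collection on each read but passes plain strings through untouched. FakeMemcache is a made-up stand-in for that behavior, JSON stands in for protocol buffers, and Experiment with its encode/decode methods is hypothetical too. The decoded counter just tallies how many objects got rebuilt.

```python
import json


class Experiment(object):
    """Hypothetical stand-in for a datastore model."""

    decoded = 0  # how many instances have been rebuilt from serialized form

    def __init__(self, name, participants=0):
        self.name = name
        self.participants = participants

    def encode(self):
        # Plays the role of db.model_to_protobuf(...).Encode()
        return json.dumps({"name": self.name, "participants": self.participants})

    @classmethod
    def decode(cls, blob):
        # Plays the role of db.model_from_protobuf(...); this is the slow part
        cls.decoded += 1
        fields = json.loads(blob)
        return cls(fields["name"], fields["participants"])


class FakeMemcache(object):
    """Made-up cache: serializes objects on set(), rebuilds them on get().

    Plain strings pass through untouched in both directions.
    """

    def __init__(self):
        self._slots = {}

    def set(self, key, collection):
        self._slots[key] = dict(
            (name, ("obj", value.encode()) if isinstance(value, Experiment)
             else ("str", value))
            for name, value in collection.items())

    def get(self, key):
        stored = self._slots.get(key)
        if stored is None:
            return None
        return dict(
            (name, Experiment.decode(payload) if kind == "obj" else payload)
            for name, (kind, payload) in stored.items())


experiments = [Experiment("exp%d" % i) for i in range(50)]
cache = FakeMemcache()

# Naive: live objects in the collection, so a single get() rebuilds all 50
cache.set("experiments", dict((e.name, e) for e in experiments))
cache.get("experiments")
naive_decodes = Experiment.decoded

# Trick: pre-serialized strings in the collection, decoded only on demand
Experiment.decoded = 0
cache.set("experiments", dict((e.name, e.encode()) for e in experiments))
collection = cache.get("experiments")
needed = [Experiment.decode(collection[n]) for n in ("exp1", "exp7", "exp42")]
lazy_decodes = Experiment.decoded
```

With 50 experiments in the slot and three touched by the request, the naive read pays for 50 decodes and the pre-serialized read pays for three.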

This screenshot from my recent Khan Academy Friday tech talk should make everything totally clear. Especially without any context, I’m pretty sure seeing a picture of a gnome riding a platypus while carrying a monkey explains everything.

Bottom line: we make one single memcache call coupled with the minimum amount of experiment deserialization necessary, regardless of how many A/B experiments are running or which experiments are used by each request. There are lots of arguments over the fastest way to (de)serialize objects in App Engine — the fastest is to avoid the issue as much as possible.

Where we use this:

Split testing experiments and our real-time badge framework.

put_async without waiting for the put to finish — someone’s gonna yell at me

I’ll get in trouble for this or maybe even ruin the fun for everyone. This involves starting an asynchronous datastore put via db.put_async and then walking away with your hands in your pockets acting like nothing happened.

There are very clear instructions by the talented App Engine team that you should find some time to wait for the get_result() of any call to db.put_async(monkey). It’s also very clear that if you don’t wait for put_async to finish, App Engine is going to wait for you. In other words, you can’t magically send off a bunch of put_async’s and then send your response to the user without waiting for the put to complete.

db.put_async(monkey)  # kick off the put and never call get_result() on it
return render_template("mwuuahaha_im_not_doing_anything_else.html")

You can, however, send off a put_async and then do everything else your request could possibly think of doing (including rendering templates and such) without waiting for the put’s result. App Engine will make sure the put finishes, but if you just kick off the put_async and then walk away and handle the rest of your request, you can maximize concurrency of your request’s work w/ the asynchronous put.
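Here's the shape of that trick as a plain-Python analogy. It is only a sketch: threads stand in for the async datastore RPC, serve() plays the part of the runtime that waits on your behalf, and slow_put, handle_request, and serve are all made-up names. None of this is how App Engine implements put_async.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_put(datastore, record):
    # Pretend datastore write with some latency
    time.sleep(0.05)
    datastore.append(record)


def handle_request(executor, pending, datastore):
    # Like db.put_async(monkey): start the write and don't wait on it here
    pending.append(executor.submit(slow_put, datastore, "monkey"))
    # The rest of the request runs while the write is in flight
    return "<html>rendered while the put was running</html>"


def serve(datastore):
    # Plays the runtime's role: wait for outstanding writes before responding
    pending = []
    with ThreadPoolExecutor(max_workers=1) as executor:
        body = handle_request(executor, pending, datastore)
        for future in pending:
            future.result()
    return body
```

The handler never asks for the result itself. The write overlaps with the rendering work, and the waiting happens once at the very end.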

There are *lots* of other ways to get a very similar effect. All of them are probably more kosher. I won’t list them here. This just happens to be a neat little trick that you can trigger with one line of code without worrying about any other boilerplate.

Where we use this:

Find the spots yourself and make fun of me. I have a strong feeling this’ll be replaced in the future.

Long-term persistence without waiting on the datastore

This one doesn’t belong in a “you’ll probably never need” post. It’s extremely common and handy: throw data in memcache (fast) and then run a background task or cron job that persists the data from memcache to the datastore (slow).

We actually get a little trickier because we need to persist lots of data that’s coming in quite often: each and every user’s participation and conversions in each and every A/B test. These events could be triggered multiple times per request for each user. It’s not quite clear how we’d put this data in memcache and what scheme would be running in the background to send it all to the datastore.

We opted for a bucketing system. Every time a user participates or converts in an A/B test, we randomly choose one of 50 memcache buckets and throw their user id in the bucket. When any of those buckets begins to overflow, we fire off a deferred task queue task to poke through the overflowing bucket, pull each user’s data out of memcache, and whisk it into the datastore.

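bucket = random.randrange(50)  # one of 50 buckets, 0 through 49
key = "_gae_bingo_identity_bucket:%s" % bucket

# Current contents of the bucket (adding this user's id and saving the list back are not shown)
list_identities = memcache.get(key) or []

# Overflowing bucket: hand its identities to a background task for persistence
if len(list_identities) > 50:
    deferred.defer(persist_gae_bingo_identity_records, list_identities)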
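For the curious, here's a self-contained sketch of the whole bucket cycle, including the append and write-back steps. It makes some assumptions: a plain dict stands in for memcache, the persist callback stands in for deferred.defer, and record_identity, NUM_BUCKETS, and BUCKET_LIMIT are names invented for this example. It also ignores the race between the get and the set that a real shared cache would have.

```python
import random

NUM_BUCKETS = 50   # assumed from the "one of 50 memcache buckets" description
BUCKET_LIMIT = 50  # a bucket overflows once it holds more than this many ids


def record_identity(cache, identity, persist, rng=random):
    """Drop an identity into a random bucket, flushing the bucket if it overflows."""
    bucket = rng.randrange(NUM_BUCKETS)
    key = "_gae_bingo_identity_bucket:%s" % bucket

    identities = cache.get(key) or []
    identities.append(identity)

    if len(identities) > BUCKET_LIMIT:
        # Stand-in for deferred.defer(...): hand the full bucket to background work
        persist(identities)
        identities = []

    # Write the (possibly emptied) bucket back to the cache
    cache[key] = identities
    return key


cache = {}
flushed = []  # each entry is one overflowing bucket handed off for persistence
rng = random.Random(0)
for i in range(5000):
    record_identity(cache, "user%d" % i, flushed.append, rng)
```

Every identity ends up either in a flushed batch or still sitting in a bucket waiting for the next overflow, so anything still in the cache needs a final sweep (or a cron job) if you care about the stragglers.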

Where we use this:

GAE/Bingo and, in simpler fashions, pretty much all over the place.

More wabbits where those came from

If any of these hacks help someone else out, please let me know ASAP so I can win the bet against myself.

Regardless, expect more of these posts in the future. Tricks like these have been critical to keeping Khan Academy fast while adding new tools.

10/22/11, 1:08am