Dangers of using memcache counters for a/b tests

We ran into a hairy problem with our App Engine A/B testing framework a few months ago. It’s a problem unique to those relying on memcache to increment counters with statistical meaning in a shared resource environment like App Engine.

Since there are more live-action teenage mutant ninja turtle films than people who fit that bill (sorry, one person on the internet, I know you matter), this is more a story about an interesting problem than a service post.

The symptoms

Some of our A/B tests started looking a little…surprising. “500+ users mastered Dividing Decimals in 20 problems or fewer, and 200+ users mastered it in 30 problems or fewer,” our dashboard claimed. Hmmm. That’s impossible: anyone who mastered the skill in 20 problems or fewer has, by definition, also mastered it in 30 or fewer, so the second count can never be smaller than the first. Inconsistent results like that started popping up all over. I had that awful feeling you get in your gut when you square yourself to the possibility that the metrics you’ve been using to make decisions are straight-up broken. Trust in our A/B experiment results plummeted.

Not ideal.

The problem

Our A/B framework uses memcache as the fastest way to increment its internal counters every time somebody new participates in an experiment or triggers a conversion. For example, when a new user first comes to our exercise that helps build intuition for one-step equations, they may be handed one of two alternatives for each problem they encounter. A counter (alternative_A_participants) is incremented accordingly. Another counter keeping track of conversions (alternative_A_conversions) is incremented similarly when we think they’ve mastered the skill. This is how we keep track of our A/B alternatives and their relative performance.
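In rough pseudocode, the bookkeeping looks something like this (a dict-backed stand-in plays the part of App Engine’s memcache client here; the function names are illustrative, not our framework’s actual API):

```python
class FakeMemcache:
    """Dict-backed stand-in for the App Engine memcache client."""

    def __init__(self):
        self.store = {}

    def incr(self, key, delta=1, initial_value=0):
        # The real memcache.incr is atomic server-side; a dict is
        # enough to illustrate the counter scheme.
        self.store[key] = self.store.get(key, initial_value) + delta
        return self.store[key]


memcache = FakeMemcache()


def record_participation(alternative):
    # Bumped once when a user first joins an experiment,
    # e.g. "alternative_A_participants".
    return memcache.incr("alternative_%s_participants" % alternative)


def record_conversion(alternative):
    # Bumped when we think the user has mastered the skill.
    return memcache.incr("alternative_%s_conversions" % alternative)


record_participation("A")
record_participation("A")
record_conversion("A")
```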

Now, there are two properties of these counters that aren’t negotiable. They must be fast, because A/B testing shouldn’t degrade performance for users. And they must be atomic, because otherwise we’d lose tons of data when we get hundreds or thousands of new participants per second, say during traffic events.

In App Engine land, memcache is the clear winner when in need of a fast, atomic counter. You could build a sharded datastore counter for atomicity, but you’ll sacrifice speed because the datastore can’t keep up with memcache. You could try something like kicking off a task queue for every increment, but you may quickly breed a single user’s HTTP request into 5 separate requests for your instances to swallow. Other tools exist, of course, but they aren’t available in App Engine.

Turns out memcache.incr is hard to beat.

The astute among you are already concerned. “Memcache shouldn’t be used to store data you can’t afford to lose,” you chide, “you never know when memcache will decide to evict your data”. True. In a situation like this, you have to be constantly running a process that persists data from memcache into something more permanent, like the App Engine datastore. We do. But you still don’t have any guarantee that a malevolent memcache won’t evict unsaved data before it gets persisted. So you also have to be willing to occasionally lose some data to bad luck. We are.
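That persistence process can be sketched roughly like so (plain dicts stand in for memcache and the datastore; the function name is made up for illustration):

```python
def flush_counters(cache_store, datastore, keys):
    """Move whatever counter values survived in the cache into
    durable storage.

    cache_store and datastore are plain dicts standing in for
    memcache and the App Engine datastore. Any counter evicted
    before this runs simply reads as 0 and is lost -- the "bad
    luck" risk accepted above.
    """
    for key in keys:
        count = cache_store.pop(key, 0)  # evicted keys read as 0
        datastore[key] = datastore.get(key, 0) + count
```

A background task runs this on a schedule, accumulating each flush’s counts into the durable totals.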

We knew that we’d occasionally lose count of an experiment’s participant or two or three or twenty. This is a relatively rare problem we’re willing to swallow. Assuming there is no statistical bias in memcache’s tendency to evict the counter for alternative_A_conversions vs. the counter for alternative_B_conversions, we felt comfortable comparing an experiment’s alternatives and making meaningful decisions.

That assumption was way off. Think about the problem that a PaaS like App Engine has to solve when offering memcache to all its applications. It can’t just run one big, honkin’ instance of memcache on one bigger, honkin’er machine somewhere. It has to distribute the load across multiple completely independent instances of memcache. Which means when you call memcache.incr(key), a value could be incremented on one of any number of different App Engine memcache instances…and the instance chosen depends on a hash of the key being incremented.

In simpler words, memcache.incr("alternative_A_conversions") stores its counter on a different machine than memcache.incr("alternative_B_conversions"). Since memcache is a service App Engine shares among all its applications, it’s highly likely for one memcache machine to be under considerably different memory pressure than another.
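A toy model of that routing, with a deterministic hash and a made-up instance count (App Engine’s real scheme is internal and certainly not this code), shows why the two counter keys generally land on different machines:

```python
import hashlib

NUM_INSTANCES = 16  # illustrative; the real fleet size is unknown to us


def instance_for(key):
    # Hash the key and pick an instance: same key, same machine,
    # every time -- which is exactly what makes the eviction bias
    # consistent per key.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_INSTANCES


a = instance_for("alternative_A_conversions")
b = instance_for("alternative_B_conversions")
```

The two keys will almost always hash to different instances, each under its own memory pressure.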

Streak and Pulse are two heavy App Engine users with whom we’ve probably shared memcache space.

Long story short? Some memcache keys (alternative_A_conversions) can survive in memcache for 10 minutes before being evicted, giving our background persistence task plenty of time to run. Others (alternative_B_conversions) may only last 10 seconds under high pressure. This behavior is consistent for each key…creating a statistical bias that favors one A/B alternative over another by repeatedly evicting only one of the key’s counters before we have a chance to persist its data.

The solution

What to do? We could switch away from memcache for our counters, but as mentioned above the alternatives aren’t great. We could give up on our realtime dashboard for A/B results, stop worrying about atomic counters entirely, log events for every participant and conversion, and periodically analyze our A/B alternatives offline. This may be the future for us, but we’ve been pressed for time (pretty sure we’re unique in that respect) and hoped for a quicker fix.

We had one hack available to us thanks to App Engine’s treatment of the namespace parameter available in all memcache calls. If you provide a namespace in your call to memcache.incr, App Engine uses that namespace, *not* the key, to determine which memcache instance the value is sent to. This meant we could keep all of an experiment’s keys in a single memcache instance, which would remove the statistically biased eviction patterns. While I’m normally a lover of such simple solutions, we heard this would soon be changing, and even identically-namespace’d keys would be spread among multiple instances in coming weeks. No go.

Luckily, Google’s Fred Sauer is brilliant and had a better idea. He suggested using bit offsets to store multiple counters’ values within a single memcache key. Since memcache.incr operates on 64-bit integers, we could split each value into four 16-bit counters, each one counting up to a maximum of 2^16-1 = 65,535 before needing to be persisted over to the datastore.

This ended up working beautifully. Each time we want to increment alternative_A_conversions, we grab the memcache value stored by key all_conversions and increment the counter stored in bit positions 0-15. When we want to increment alternative_B_conversions, we grab the exact same memcache key and increment the counter in bit positions 16-31.
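A sketch of the bit-offset scheme (the dict-backed memcache stand-in, slot numbering, and helper names are illustrative; the key name mirrors the example above):

```python
BITS = 16
MASK = (1 << BITS) - 1  # each 16-bit slot holds 0..65,535


class FakeMemcache:
    """Dict-backed stand-in for App Engine's memcache client."""

    def __init__(self):
        self.store = {}

    def incr(self, key, delta=1, initial_value=0):
        # The real memcache.incr is an atomic 64-bit add.
        self.store[key] = self.store.get(key, initial_value) + delta
        return self.store[key]


def incr_slot(cache, key, slot, delta=1):
    # Adding delta << (slot * 16) bumps only that slot's bits --
    # safe as long as each slot is persisted and reset well before
    # it reaches 65,535 and would carry into its neighbor.
    return cache.incr(key, delta << (slot * BITS))


def read_slot(packed, slot):
    # Shift the slot down to the low bits and mask off the rest.
    return (packed >> (slot * BITS)) & MASK


cache = FakeMemcache()
incr_slot(cache, "all_conversions", 0)  # alternative_A_conversions
incr_slot(cache, "all_conversions", 1)  # alternative_B_conversions
incr_slot(cache, "all_conversions", 1)
packed = cache.store["all_conversions"]
# read_slot(packed, 0) == 1, read_slot(packed, 1) == 2
```

Because the whole 64-bit value lives under one key, a single atomic incr still updates one counter without disturbing the others.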

Even the most trivial bit twiddling makes web devs like me feel all hardcore. It’s a little sad.

Using a single memcache key for all of an experiment’s counters means we never have to worry about statistical bias in eviction policies affecting our ability to compare alternatives A and B. If one of an experiment’s counters is evicted at an unfortunate time, they all are.

Our A/B experiment data has looked consistent since this change, and trust in the framework is largely restored. Hats off once again to Fred and the App Engine team.

11/25/12, 7:31pm