tl;dr — If you’re using task queues on App Engine and your task execution speeds vary greatly, you can get yourself into serious performance trouble. We addressed this by explicitly separating fast and slow tasks, and we released a little utility to help you do the same.

Only one user has ever earned this badge
Quick story: even though we put in a lot of work making sure Khan Academy users are awarded badges in real-time, we also run background badging processes to make sure we didn’t miss any. This process uses GAE’s mapreduce framework to map over all users and make sure their badges are up-to-date, and it looks something like:
# Called once for every user
def badge_background_check(user):
    if user.has_recent_activity():
        user.update_badges()
has_recent_activity is really fast, and update_badges can be really slow. This doesn’t play nice with App Engine’s request scheduler, and here’s why.
When you start firing off a bunch of tasks to a specific task queue, App Engine keeps a running average of how long tasks in that particular queue are taking. If they’re really fast (< 1000ms is what we’ve heard…), you can think of the task queue being painted with a shiny yellow smiley face. If they’re slow, the queue gets a sad red frowny face stamped on its forehead.

Don’t judge just because the sad one sounds like Eeyore. Neither of these situations causes problems.
If you’ve got a shiny yellow smiley face, App Engine’s scheduler will try to get your tasks done quickly by queuing up your fast tasks in the same lines that your users’ requests wait in. As long as these tasks stay fast, your app will scale when more instances are needed, your users should never notice that they’re standing in the same line as the fast tasks, and all is good.
Thing is, it’s even ok to have a frowny face. App Engine will use a different, slower scheduler to hand out your work without interrupting user-facing requests. You’ll never wait in the same line as a user request, and you can take all the time in the world (as long as the world only lasts for the next 10 minutes) without worrying about someone emailing and complaining that your site’s gotten really slow lately.
The misery starts when you have a queue that’s mostly fast but sporadically hits slow tasks. Now your queue is all shiny and yellow and smiley and sharing the same checkout lines as user-facing requests, but every once in a while a grumpy red frowny face stands in line, takes five full minutes at the cash register asking about the store’s raincheck coupon policy, and ruins everybody’s day. Your queue’s sporadic behavior is now directly hurting your users’ perceived performance.
This has caused us some very serious perf hiccups. 90% of our users don’t need to run update_badges on any given day, so the average task speed is very fast. When we hit a slow task, instance request queues can grind to a halt. At our worst, we’ve seen user-facing requests sit waiting in the request queue for 9000ms (that’s approximately one full eternity) before our code even gets a chance to run.*
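If you want to convince yourself that a mostly fast queue can still look healthy on average, here's a rough toy model. It assumes the scheduler compares a plain mean of task durations against a 1000ms cutoff, which is only our guess at what's going on, and the 20ms and 5000ms durations are made-up numbers for illustration.

```python
# Toy model only: assumes a queue is classified by a plain mean of task
# durations against a 1000ms cutoff. The real scheduler heuristic may differ.
FAST_THRESHOLD_MS = 1000  # "what we've heard", not a documented value


def queue_looks_fast(durations_ms):
    """Return True if the mean task duration is under the assumed cutoff."""
    return sum(durations_ms) / len(durations_ms) < FAST_THRESHOLD_MS


def mixed_queue(total_tasks, slow_every, fast_ms, slow_ms):
    """Build task durations where every `slow_every`-th task is slow."""
    return [
        slow_ms if (i + 1) % slow_every == 0 else fast_ms
        for i in range(total_tasks)
    ]


# 90% fast tasks (20ms) and 10% slow tasks (5000ms): hypothetical numbers.
durations = mixed_queue(total_tasks=100, slow_every=10, fast_ms=20, slow_ms=5000)
print(queue_looks_fast(durations))  # the mean is 518ms, so the queue looks fast
print(max(durations))               # yet individual tasks still take 5000ms
```

The mean hides the outliers: the queue keeps its smiley face while one task in ten blocks the line for five seconds.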

I was going to put an animated gif of a yellow smiley face that suddenly flashes a red frowny face here, but then I remembered that I’d have to kill myself for having an animated gif.
The gae_fast_slow_queue utility helps work around this as quickly as possible by making it easy to run a bunch of fast tasks while identifying slow tasks and splitting them off into a separate queue:
@fast_slow_queue.handler(lambda user: user.has_recent_activity())
def badge_background_check(user):
    user.update_badges()
Now, when badge_background_check gets queued up, it only ever runs the fast lambda function, which keeps the main queue yellow and smiley. If real work needs to be done, it fires off a different task in a different queue, and this one is guaranteed by our utility to take at least 1000ms so it stays red and frowny forever. GAE’s scheduler will steer these clear of user-facing requests, and you don’t have to worry about your task impacting users’ perceived perf.
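To make the idea concrete, here's a minimal sketch of how a decorator like this could work. This is not the actual gae_fast_slow_queue source: the in-memory `slow_queue`, the `run_slow_task` helper, and the `MIN_SLOW_TASK_SECONDS` constant are all hypothetical stand-ins, and a real version would enqueue to a separate App Engine task queue instead of a Python deque.

```python
import functools
import time
from collections import deque

# Hypothetical: pad slow tasks so their queue always averages as "slow".
MIN_SLOW_TASK_SECONDS = 1.0

# Hypothetical in-memory stand-in for a separate, slow-only task queue.
slow_queue = deque()


def handler(needs_slow_work):
    """Decorator: run only the cheap check inline, defer the real work."""
    def decorator(slow_func):
        @functools.wraps(slow_func)
        def fast_task(*args, **kwargs):
            # Only the cheap predicate runs in the main (fast) queue.
            if needs_slow_work(*args, **kwargs):
                slow_queue.append((slow_func, args, kwargs))
                return True   # slow work was deferred
            return False      # nothing to do
        return fast_task
    return decorator


def run_slow_task(sleep=time.sleep, clock=time.monotonic):
    """Run one deferred task, padded to at least MIN_SLOW_TASK_SECONDS."""
    func, args, kwargs = slow_queue.popleft()
    start = clock()
    func(*args, **kwargs)
    remaining = MIN_SLOW_TASK_SECONDS - (clock() - start)
    if remaining > 0:
        sleep(remaining)  # keep even quick "slow" tasks above the cutoff
```

The padding is the odd-looking part: deliberately slowing a task down only makes sense because it keeps the slow queue consistently slow, so it never drifts back into sharing a line with user requests.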
* We noticed this after deploying gae_mini_profiler and browsing around the site — every once in a while a developer got frustrated waiting forever for a page to load, then the profiler would come back and add insult to injury by screaming, “Woohoo this page only took 150ms!” You can confirm issues like these by looking for high pending_ms values in your logs.
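If you've exported your request logs as text, a quick scan like the one below can surface the worst offenders. It's a sketch that assumes each log line carries a `pending_ms=<integer>` token; the sample lines and the 1000ms threshold are invented for illustration, so adjust the pattern to whatever your logs actually look like.

```python
import re

# Assumed format: a "pending_ms=<integer>" token somewhere in each log line.
PENDING_RE = re.compile(r"pending_ms=(\d+)")


def slow_pending_lines(lines, threshold_ms=1000):
    """Yield (pending_ms, line) for lines whose pending time exceeds the threshold."""
    for line in lines:
        match = PENDING_RE.search(line)
        if match and int(match.group(1)) > threshold_ms:
            yield int(match.group(1)), line


# Made-up sample lines, not real log output.
sample = [
    "GET /profile 200 ms=150 pending_ms=9000",
    "GET /about 200 ms=80 pending_ms=12",
    "GET /exercise 200 ms=210",
]
for pending, line in slow_pending_lines(sample):
    print(pending, line)
```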