Three minute quiz: App Engine datastore performance

As somebody now spending all his time in NoSQL land, my brain perked up when working through The Three Minute SQL Performance Quiz. It was a blast from the past for me, a chance to remember all the little ins’n’outs of SQL performance from a quaint old time when I used to be able to write JOIN statements.

So I thought it’d be fun to develop a similar performance challenge for this new NoSQL life of mine. Since we use App Engine at Khan Academy, we’ll focus on the App Engine datastore. Proceed.

Question 1


QueryModel
Monkey.all().filter(
    "genus IN",
    ["Ateles", "Cebus", "Aotus"]
  ).fetch(10)
class Monkey(db.Model):
  genus = db.StringProperty(
    indexed=True)

Hold on there — a major improvement is possible or All looks good to me — don’t go changin’ a thang

You’re right! Queries that use the IN operator may look like a single query, but they actually run multiple queries behind the scenes, one for each item in the list. That means if you ran the above query, opened Appstats, and looked at this request’s profile, you’d see this:

…and that’s not ideal if you really care about this request’s performance. The requests are running asynchronously and overlap as much as possible — which is great — but you’re still running three requests and increasing the likelihood that one of ‘em will slow you down.

If you want this to be blazing fast, you almost certainly want to denormalize this set membership into a property that can be queried without an IN operator. Perhaps you’d add something like is_in_favorite_genus = db.BooleanProperty(indexed=True) to Monkey, set that property to True if the genus is one you’re interested in, and then change your query to Monkey.all().filter("is_in_favorite_genus =", True). That’d be a significant improvement — especially if your IN list contained many items. That’s the App Engine way.

Question 2


QueryModel
Monkey.all().filter(
    "unique_name =",
    "bob_the_monkey"
  ).get()
class Monkey(db.Model):
  unique_name = db.StringProperty(
    indexed=True)

Hold on there — a major improvement is possible or All looks good to me — don’t go changin’ a thang

You’re right! If you’re loading a single entity using a unique identifier, queries aren’t as fast as loading by key or key name. There are multiple ways to load an entity by key that we won’t get into here — just know that if you have unique identifiers for your models, you should strongly consider constructing your entities using the unique identifier as part of your model’s key name — Monkey(key_name="bob_the_monkey", **kwds) — so you can quickly retrieve it later: db.get_by_key_name("bob_the_monkey")

If you’re curious why this is faster than a query, it makes sense if you’re willing to swallow the gross oversimplification that a query is just quickly looking up the key in an index and then fetching the entity using the key. If you’ve already got the key, why query?

Question 3


One of the coolest dogs ever or Meh, she’s so-so

You’re right! Shouldn’t need explainin’.

Question 4


Putting an entityModel
m = Monkey(
    name="Bob the Monkey",
    favorite_color="blue",
    favorite_food="pizza",
    worst_enemy="honey_badger"
  )

m.put()
class Monkey(db.Model):
  name = db.StringProperty()
  favorite_color = db.StringProperty()
  favorite_food = db.StringProperty()
  worst_enemy = db.StringProperty()

Hold on there — a major improvement is possible or All looks good to me — don’t go changin’ a thang

You’re right! Kinda. It depends how you’re querying this entity. All of the datastore properties on Monkey are indexed by default, and that means every time you put() a Monkey, you’re spending time writing new values to all of those indexes.

Now, if you need to be able to run queries to find Monkeys based on their favorite_color, favorite_food, worst_enemy, and so on — then you’re doing the right thing. You need the indexes to be able to query them. If your code base is anything like Khan Academy’s used to be, however, you may have tons of properties w/ automatic indexes that never, ever need to be queried. We fixed this by using db.StringProperty(indexed=False) in these cases (and writing a linter that requires us to explicitly specify indexed=True|False for every datastore property).

Write speed may not matter for your app, but if it does you can speed up writes by not wasting time writing to indexes you don’t need.

Question 5


QueryModel
query = Monkey.all()
query.filter("name =", "bob")
query.filter("zoo =", "Manhattan")
query.filter("hats >", 5)
query.get()
class Monkey(db.Model):
  name = db.StringProperty(indexed=True)
  zoo = db.StringProperty(indexed=True)
  hats = db.IntegerProperty(indexed=True)

Hold on there — a major improvement is possible or All looks good to me — don’t go changin’ a thang

You’re right! This query will work. And when you first start out and only have a couple hundred Monkeys, it’ll be blazing fast. But as your family of Monkeys grows, you may find yourself with a Very Serious™ performance issue. Why? You have all of the properties indexed, you’re only asking for a single entity…what’s the problem?

You don’t have a perfect index defined. Without an index that perfectly covers all of the properties involved in the query’s filters, here’s what App Engine has to do when you ask it to fetch a single entity: start querying the datastore for, say, Monkeys with name == “Bob”, retrieving them from the datastore, and filtering through them to find any with zoo == “Manhattan” and at least 5 hats. That’s not what you want, and it looks like this in Appstats:

That’s a disaster. Depending on the shape of your data, you could spend hundreds of RPCs just returning a single entity from the datastore. You want App Engine to query the datastore for “Monkeys named Bob in the Manhattan zoo who own at least 5 hats” in one swift, single blow — it should only need one RPC. And it’ll do exactly this if you have a perfect index. You’ll need an entry in index.yaml, something like:

- kind: Monkey
  properties:
  - name: name
  - name: zoo
  - name: hats

If you aren’t careful and don’t keep an eye to make sure that your important queries have indices that perfectly cover ‘em, you could have a piece of code that runs snappy one day and, months later when your datastore fills up with Monkeys, it’ll suddenly take 5+ seconds to return a single entity. We’ve experienced this at Khan Academy more than once. It’s public enemy number one due to how easy it is to not notice the problem at first and then encounter brutal performance problems later.

Question 6


QueryModel
query = Monkey.all()
query.filter("name =", "bob")
query.filter("zoo =", "Manhattan")
query.filter("hats >", 5)
query.get()
class Animal(db.PolyModel):
  name = db.StringProperty(indexed=True)
  zoo = db.StringProperty(indexed=True)

class Monkey(Animal):
  hats = db.IntegerProperty(indexed=True)
 index.yaml
 
- kind: Animal
  properties:
  - name: name
  - name: zoo
  - name: hats

Hold on there — a major improvement is possible or All looks good to me — don’t go changin’ a thang

You’re right! What now?! You’ve got your perfect index on all the queried properties! Well, not really.

This’ll suffer from the exact same Very Serious™ performance issue from Question 5. By using PolyModel and querying specifically for Monkeys — not just any old Animal — you’re implicitly adding another filter to your query. This filter will use PolyModel’s built-in special property, class, to only return Monkey results. If you don’t have an index that covers all filtered properties including class, you ain’t go no perfect index.

- kind: Animal
  properties:
  - name: class
  - name: name
  - name: zoo
  - name: hats



That’s it, you’re done! I could keep going, but by now you’ve spotted a trend in the answers and I’m pretty sure we’re past the 3-minute mark.

If you want more or have other interesting quiz questions I’d love to know. And if you geek out on perf work like me, you know what to do.

3/5/14 — 12:47am Permalink