Analyzing Billion of Postgres Query Plans

Making Froxt fast is a unique and particularly difficult adventure in performance engineering. Our customers run billions of queries per day and each one is unique. What’s more, our product is designed for rich, ad hoc analyses, so the resulting SQL is unboundedly complex.

A core concept in our product is the idea of an “event definition”, which is a mapping between a human-level event like “Signup” or “Checkout” and what that event means in terms of raw clicks. This layer of indirection between the concept being analyzed and the underlying data makes the product very powerful and easy to use: you can add a new event definition and run analysis over all of your historical data as if you had been tracking “Signups” since you first installed the product. But it comes at a cost: everything we do becomes more complicated because the analysis is in terms of a dynamic schema that isn’t fully known until reading time. Handling this indirection at all is its own fascinating topic.

To execute a query, we take the query expressed in the UI, compile it to SQL, inline the event definitions, and then run the resulting SQL query across a cluster of PostgreSQL instances. This process can easily result in incredibly hairy SQL queries.

The queries only become more complicated as they make use of the more powerful features we support. A query can include filters on the number of times a user has done a given event in an arbitrary time range as well as break down the results by multiple different fields simultaneously.

After realizing how ineffective this “whack-a-mole” performance engineering approach was, we tried to design a better way. What we really wanted was some way we could collect enough information for every query as we initially ran it, such that we can always determine why a query was slow. As it turns out, we did manage to find a way to do that with Postgres.

Postgres provides an extension called auto_explain. When enabled, auto_explain automatically logs the query plan and execution information for every query as it’s run. It’s exactly like EXPLAIN ANALYZE but automatically records the information for every query as it happens. Depending on how auto_explain is configured, you can obtain some or all of the following information:

  • How the query was executed.
  • How long it took to execute each part of the query.
  • How many rows were returned by each part of the query?
  • How many pages were read from disk or from cache from each part of the query?
  • How much time was spent performing I/O in each part of the query?

This is an absolute gold mine of information for analyzing why queries are slow. Importantly, all of this information is gathered as the query is run, so we do not run into issues where a slow query is not reproducible, since we no longer need to reproduce the slowness in order to determine why the query was slow. Even better, auto_explain has the option to obtain the output in JSON. This makes it easy to programmatically analyze the query plans.

To gather all of this information, we wrote a short job that tails our Postgres logs and watches for any Postgres plans. When the job sees a query plan, it sends the plan information to a separate Postgres instance. With all of this information gathered in one place, we can easily ask all sorts of questions. Since we now have the query plan for every single query that was ever executed in Froxt.

What’s more, we can track how these rates change over time, and prioritize work accordingly. A lot of these causes will require multi-month engineering efforts to resolve, so it is critical that we know upfront how much the overall impact will be on our query performance.