Friday, August 29, 2008

MySQL Query Analyzer: A Peek Under the Hood

I had a hard time coming up with a title for this blog post. I really wanted to capture its essence. Several discarded titles include, but are not limited to:
  • How It Works
  • The Gory Details
  • Everything You Never Wanted to Know
  • The Insider's Perspective
I picked the lamest one, hoping that it is catchy enough to ensnare the unfortunate soul that comes across my blog. Just in case you don't know by now, I work for MySQL (ok, Sun) on the Enterprise Tools team. We're approaching (some sort of) release of MySQL Enterprise Monitor 2.0, with the headline feature Query Analysis.

What I want to achieve with this article is to lay out the definitions, data, and flow that describe how we look at things internally in the architecture of the application. This should give the hapless reader an idea of what Query Analysis can do, and consequently what it cannot.

What is a query?

I actually call them statements inside the application. To me, query seemed to imply asking about something, and we track more than that. Is an INSERT or UPDATE a query? Is a CREATE TABLE a query? Not in my feeble mind. So I chose statement because it is generic enough to encompass them all, yet still seems tied to database vernacular. Too bad "statement analysis" doesn't quite have that sexy marketing appeal.

A unique statement/query is an application's query that is normalized and executes against a given database. By normalized, I mean all literals are replaced with '?', (some) comments are stripped, the SQL keywords are uppercased, and extraneous whitespace is collapsed.
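For illustration, here is roughly what that normalization might do to a made-up query (the table and column names are hypothetical and the exact output is a sketch, but note how literals become '?', keywords are uppercased, identifiers are left alone, and tokens are respaced -- the same style you will see in the example representation later in this post):

-- as sent by the application:
select id, name from users /* lookup */ where id = 42 and name = 'bob'

-- after normalization (sketch, not the exact proxy output):
SELECT id , name FROM users WHERE id = ? AND name = ?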

There is a subtle phrase in there that is very important -- "a unique statement ... executes against a given database." Basically, we identify a statement key as {database, normalized text}. That means if you have statement SELECT 1 and execute it in the context of database 'test', switch databases to webapp_foo, and execute it again, in the mind of MySQL Query Analyzer it is a different statement.

This works well for things like replication scale-out, where a slave farm will have bunches of queries executing against the same database(s). It breaks down if, somehow, your application uses multiple databases but you still want the queries to be treated the same for aggregation. We haven't found this to be a typical use case (yet), but if we get feedback we can revisit it in future releases.

A statement that executes against a particular MySQL server instance I call a 'statement summary,' because it is at this level that we actually track and summarize execution statistics. More on this later...

Where does the data come from?

Enter the MySQL Proxy, with some enterprise-only plugins and Lua scripts. The proxy sits between your application and your MySQL database server. All queries passing through it are captured, normalized, and stored until they are reported back to the Monitor server. The infamous agent is now a plugin that shares a process inside the chassis, alongside the proxy. The proxy and agent communicate, in this case via Lua, to push query data to the agent plugin so the agent can send it up to the Monitor in the same fashion as any other data it collects.

There are potentially other sources of data... but that's a story for a different day.

What data is tracked?

For a given statement, we track:
  • number of executions
  • execution time
  • min/max execution time(s)
  • rows returned/affected
  • min/max row(s)
  • result set size
  • min/max result set size
The normalized statement is tracked by the proxy. Each statement that fits in that normalized bucket is added to the ongoing statistics. The Monitor backend periodically asks for a snapshot (currently every minute), at which point the proxy resets its statistical aggregation (think "read-clear" semantics).

Also, some extra metadata is kept with the query, for housekeeping reasons:
  • database (which can actually be null - not every query executes in a database context)
  • normalized query text
  • a hash of the text
  • query type (select, insert, update, etc.)

What does the agent/proxy actually send to the Monitor?

Actually? You don't want to know. When you get your hands on it, you can turn up the agent log level to debug and see for yourself. It's not like it's a secret. It's just not worth mentioning here.

Conceptually? Now there's a better topic... Although we don't take full advantage of it in all agent communications, we've moved to a more RESTful approach to agent interaction. Internally, we take the legacy XML agent protocol and turn that into sexy 'representations' that we PUT to internal REST APIs. An example JSON representation for the query data looks something like this:
PUT /instance/mysql/statementsummary/f8ccc1d2-f63b-4c30-9d42-1d383e626510.em2.02a6ed17a7627d1777f33a6062826951
{
  "name": "f8ccc1d2-f63b-4c30-9d42-1d383e626510.em2.02a6ed17a7627d1777f33a6062826951",
  "values": {
    "bytes": 325,
    "count": 8,
    "database": "em2",
    "exec_time": 3174,
    "max_bytes": 45,
    "max_exec_time": 613,
    "max_rows": 1,
    "min_bytes": 38,
    "min_exec_time": 394,
    "min_rows": 1,
    "query_type": "SELECT",
    "rows": 8,
    "text": "SELECT hibtag0_ . tag_id AS tag1_18_ , hibtag0_ . tag AS tag18_ FROM tags hibtag0_ WHERE hibtag0_ . tag = ?",
    "text_hash": "02a6ed17a7627d1777f33a6062826951"
  }
}
This is our conceptual representation of a statement summary (remember, a statement summary is a summarized statement execution on a single mysql instance). Statements themselves actually do not exist on the proxy/agent side or in the communications, since the proxy and agent are always tracking a specific mysql instance. The statement itself is purely a Monitor-server-side concept that we track for rollup purposes. The UUID you see in the RESTy URI is the mysql.inventory unique identifier, something we had to introduce for 2.0 to identify, without any uncertainty, a specific mysql instance we are monitoring. Then come the database name and the text hash. The uuid.db.text_hash triple defines the key for this statement summary.

What does the Monitor do with the statement summaries?

From the incoming statement summary, we pull out the pieces that define a statement, which happen to be immutable: the database, query text, text hash, and query type. After storing that, we create a statement summary, which basically just ties the statement to a server instance. We also store the creation time of this object/row, so we can track the first time this query was ever seen on a server (or ever seen, period, when you take the min and group by the statement). Finally, we take the ongoing snapshot data and store that.
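To make that concrete, here is a rough SQL sketch of the three pieces just described: the immutable statement, the per-server statement summary, and the per-minute snapshot data. Every table and column name below is my own invention for illustration, not the product's actual schema:

CREATE TABLE statements (
  statement_id    BIGINT AUTO_INCREMENT PRIMARY KEY,
  database_name   VARCHAR(64),            -- can be NULL; not every query runs in a database context
  normalized_text TEXT NOT NULL,
  text_hash       CHAR(32) NOT NULL,
  query_type      VARCHAR(16) NOT NULL,   -- SELECT, INSERT, UPDATE, ...
  UNIQUE KEY (database_name, text_hash)   -- the {database, normalized text} identity
);

CREATE TABLE statement_summaries (
  summary_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
  statement_id BIGINT NOT NULL,           -- ties the statement...
  server_uuid  CHAR(36) NOT NULL,         -- ...to a server (the mysql.inventory UUID)
  created_at   DATETIME NOT NULL,         -- first time this query was seen on this server
  UNIQUE KEY (statement_id, server_uuid)
);

CREATE TABLE statement_summary_data (     -- one row per one-minute snapshot
  summary_id    BIGINT NOT NULL,
  collected     DATETIME NOT NULL,
  `count`       INT NOT NULL,             -- executions during the snapshot window
  exec_time     BIGINT NOT NULL,          -- total execution time during the window
  min_exec_time BIGINT NOT NULL,
  max_exec_time BIGINT NOT NULL,
  `rows`        BIGINT NOT NULL,          -- rows returned/affected
  min_rows      BIGINT NOT NULL,
  max_rows      BIGINT NOT NULL,
  bytes         BIGINT NOT NULL,          -- result set size
  min_bytes     BIGINT NOT NULL,
  max_bytes     BIGINT NOT NULL,
  PRIMARY KEY (summary_id, collected)
);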

Tell me how the aggregation works.

Aggregation works on two levels. Because we split the immutable statement data out from the per-server statement summary data, we can aggregate on each of them.

Let's revisit what these snapshots mean. Each statement summary data instance is a snapshot of the summarized statistics for the normalized query over the time period since the last time it was reported. This gives us the powerful ability to then aggregate these snapshots over an arbitrary time range chosen by the end user.

Consider the single server aggregation case. If count is the number of executions in a given minute, and we store this count snapshot over many minutes, then the total count for an arbitrary time range is merely the sum of the counts (to the minimum granularity of our collection frequency, which is currently one minute). Rows and result set size aggregation work exactly the same way.

The mins and maxes aren't magical either. The mins/maxes per snapshot are for that one-minute range. If you want the mins/maxes over the larger requested range, just take the min of the mins and the max of the maxes.

There is one "trick" on the Monitor side to show averages over the arbitrary range. We show average execution time, average rows returned/affected, and average result set size. What does an 'average' mean in this case? We want an average per query execution (not per unit time). Average execution time would be the total execution time divided by the total executions. Well, we aggregate execution time, and we aggregate count... so we just take those columns in the query we build, and write an expression column of time/count. There's your average. Same for rows and bytes -- just take the aggregated totals and divide by the count. The query sketch below puts all of this together.
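Against the hypothetical schema sketched earlier, the single-server aggregation for one statement over a user-chosen range might look like this (a sketch of the idea, not the actual query the Monitor builds):

SELECT SUM(`count`)                  AS total_executions,
       SUM(exec_time)                AS total_exec_time,
       SUM(exec_time) / SUM(`count`) AS avg_exec_time,   -- the expression-column "trick"
       MIN(min_exec_time)            AS min_exec_time,   -- min of the mins
       MAX(max_exec_time)            AS max_exec_time,   -- max of the maxes
       SUM(`rows`) / SUM(`count`)    AS avg_rows,
       SUM(bytes) / SUM(`count`)     AS avg_bytes
FROM statement_summary_data
WHERE summary_id = ?                 -- one statement on one server
  AND collected BETWEEN ? AND ?;     -- the user-chosen time range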

Now, what about summaries across multiple servers? Since we have the immutable statement data broken out, it's just another aggregation on top of the per-server one, grouped by the statement identity. And the same sum, min, max, and average tricks still apply.
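In the same hypothetical schema, the cross-server rollup is just that query joined up to the statement identity and grouped by it:

SELECT s.statement_id,
       s.database_name,
       SUM(d.`count`)                    AS total_executions,
       SUM(d.exec_time) / SUM(d.`count`) AS avg_exec_time,
       MIN(d.min_exec_time)              AS min_exec_time,
       MAX(d.max_exec_time)              AS max_exec_time
FROM statements s
JOIN statement_summaries ss ON ss.statement_id = s.statement_id   -- one row per server
JOIN statement_summary_data d ON d.summary_id = ss.summary_id     -- one row per snapshot
WHERE d.collected BETWEEN ? AND ?
GROUP BY s.statement_id;             -- roll up across all servers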

Can I see statistics per user?

Not at this time. The user stats might be pretty easy to tack on in the next release. It would just become another element of the {database, query} key. Then we could still selectively aggregate at the user level, the server level, the statement level, or any combination.

Sounds neat! When can I see it?

Currently, query analysis with Monitor 2.0 is in limited beta to select enterprise customers. Email beta-enterprise@mysql.com for more information. There is also a special MySQL Query Analysis forum. It has some good links, and as people get their hands on it, I bet it becomes the central place to go for all things Query Analysis.

What does Query Analysis in MySQL Enterprise Monitor 2.0 look like?

Check out our product manager's blog. He's better at doing the touchy-feely marketing fluff like screenshots. ;)

Conclusion

Here, I covered all the gory details, from what data we collect, to where we collect it, to how we conceptually view and aggregate it on our backend. I'd like to touch on how the UI interacts with the data -- specifically group views, server views, sorting, searching, and aliasing -- but I think this article has hit capacity. Perhaps a blog for another day...