Finding and killing long running InnoDB transactions with Events

I’ve seen a number of solutions for finding long running transactions or sessions within InnoDB / MySQL now. Every single one of them has (in the past by necessity) been implemented as a script (other than one, more on that one later) that is either invoked manually, or via some cron job, that then connects and tries to find, and possibly diagnose or kill, transactions that break some “long running” threshold. We also have a rule in MySQL Enterprise Monitor that flags this for further inspection.

Solutions like this have one major downfall however – they often require a new connection to the database every time that they want to perform the check (though MEM holds a connection open for monitoring).

Often what you see with many high throughput instances is that once you get a long running transaction that is holding some locks, many sessions soon start to stack up behind this, and in many cases fill up the instance to max_connections. You then play roulette on whether you can get a new connection to start finding and killing the long running transaction(s) or not.

In the past this was a necessity, because it was impossible to parse SHOW ENGINE INNODB STATUS from within the database (such as within a stored procedure) to then build a set of KILL statements programmatically – you simply couldn’t consume the SHOW output in a meaningful way.

However, with 5.1 and the InnoDB plugin (and hence, 5.5+ by default too), there is the new INFORMATION_SCHEMA.INNODB_TRX table, which lists all currently active transactions within InnoDB, including the same output as SHOW ENGINE INNODB STATUS per transaction, plus a few extras:

mysql> select * from INFORMATION_SCHEMA.INNODB_TRX\G
*************************** 1. row ***************************
                    trx_id: 7C6
                 trx_state: RUNNING
               trx_started: 2011-05-31 11:17:10
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 2
       trx_mysql_thread_id: 11706
                 trx_query: insert into t1 values (sleep(40))
       trx_operation_state: NULL
         trx_tables_in_use: 1
         trx_tables_locked: 1
          trx_lock_structs: 1
     trx_lock_memory_bytes: 376
           trx_rows_locked: 0
         trx_rows_modified: 1
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 10000

Because we have this nice shiny new table, personally I think it’s time to start throwing away most of those scripts, and to use another of the new features in 5.1 – events. The major advantage to this is that it is always running within the instance: it can’t be blocked by any max connection limit, and it doesn’t require any DBA / admin intervention – it’s the perfect BOFH tool. Developers Beware.

So here’s the basis for how I would do it:

CREATE DATABASE IF NOT EXISTS myadmin;
USE myadmin;

DROP TABLE IF EXISTS kill_long_transactions_log;

CREATE TABLE kill_long_transactions_log (
  id INT AUTO_INCREMENT PRIMARY KEY,
  ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  thd BIGINT,
  user VARCHAR(16),
  host VARCHAR(64),
  statement TEXT,
  KEY (ts, user),
  KEY (user),
  KEY (host)
) ENGINE = InnoDB;

DELIMITER $$

DROP EVENT IF EXISTS kill_long_transactions$$

CREATE EVENT kill_long_transactions
ON SCHEDULE EVERY 10 SECOND
DO
  BEGIN
	  DECLARE max_transaction_time INT DEFAULT 10; 
	  DECLARE done INT DEFAULT 0;
	  DECLARE killed_id BIGINT;
	  DECLARE killed_user VARCHAR(16);
	  DECLARE killed_host VARCHAR(64);
	  DECLARE kill_stmt VARCHAR(20);
	  DECLARE running_stmt TEXT;

	  DECLARE long_transactions CURSOR FOR
	   SELECT CONCAT('KILL ', trx.trx_mysql_thread_id) kill_statement,
	          trx.trx_mysql_thread_id thd_id,
	          ps.user,
	          ps.host,
	          trx.trx_query
	     FROM INFORMATION_SCHEMA.INNODB_TRX trx
	     JOIN INFORMATION_SCHEMA.PROCESSLIST ps ON trx.trx_mysql_thread_id = ps.id
	    WHERE (UNIX_TIMESTAMP() - UNIX_TIMESTAMP(trx.trx_started)) > max_transaction_time
	      AND user != 'system user';
	  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
	
	  OPEN long_transactions;
	  
	  kill_loop: LOOP
	    FETCH long_transactions INTO 
	      kill_stmt, killed_id, killed_user, killed_host, running_stmt;
	
	    IF done THEN
	      LEAVE kill_loop;
	    END IF;
	    
	    SET @kill := kill_stmt;
	    
	    PREPARE stmt FROM @kill;
	    EXECUTE stmt;
	    DEALLOCATE PREPARE stmt;
	    
	    INSERT INTO kill_long_transactions_log (thd, user, host, statement)
	    VALUES (killed_id, killed_user, killed_host, running_stmt);
	    
	  END LOOP;
	  CLOSE long_transactions;
	  
	  DELETE FROM kill_long_transactions_log
	   WHERE ts < NOW() - INTERVAL 7 DAY;
  END$$

DELIMITER ;

You can tweak the following from the above to fit your needs:

How often to check:

ON SCHEDULE EVERY 10 SECOND

How long should a transaction be running to get killed:

DECLARE max_transaction_time INT DEFAULT 10;

How long to keep data for before purging:

WHERE ts < NOW() - INTERVAL 7 DAY;

Once you have that created, make sure that the event scheduler is enabled:

mysql> SHOW GLOBAL VARIABLES LIKE 'event_scheduler';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | ON    |
+-----------------+-------+
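
If it reports OFF, you can switch it on at runtime (unless the server was started with event_scheduler=DISABLED, in which case it needs to be set in the config file and the server restarted):

SET GLOBAL event_scheduler = ON;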

And sit back and wait for the magic to happen.

Now instead of polling for long running transactions, you can check the kill_long_transactions_log table after the fact for anything that has been killed (rather than what should be killed):

mysql> select * from kill_long_transactions_log;
+----+---------------------+-------+----------+-----------+-----------------------------------+
| id | ts                  | thd   | user     | host      | statement                         |
+----+---------------------+-------+----------+-----------+-----------------------------------+
|  1 | 2011-05-31 11:13:15 | 11672 | msandbox | localhost | NULL                              |
|  2 | 2011-05-31 11:14:25 | 11681 | msandbox | localhost | insert into t1 values (sleep(40)) |
|  3 | 2011-05-31 11:16:15 | 11695 | msandbox | localhost | NULL                              |
|  4 | 2011-05-31 11:17:25 | 11706 | msandbox | localhost | insert into t1 values (sleep(40)) |
+----+---------------------+-------+----------+-----------+-----------------------------------+

Sometimes there may be a statement running when the session is killed; other times it could just be idle (hence the NULL), yet still have a transaction open.
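
And if you want a quick summary of who keeps getting killed, a simple aggregate over the log table does the job (just a sketch – slice it however suits you):

SELECT user, host, COUNT(*) kills, MAX(ts) last_killed
  FROM myadmin.kill_long_transactions_log
 GROUP BY user, host
 ORDER BY kills DESC;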

As a DBA, you might then want to try and find out what those sessions are doing, and in that case, you can disable/enable this event pretty easily at run time:

mysql> ALTER EVENT kill_long_transactions DISABLE;
Query OK, 0 rows affected (0.03 sec)

mysql> ALTER EVENT kill_long_transactions ENABLE;
Query OK, 0 rows affected (0.00 sec)

In <= 5.5, if you want to try and find out what these long running transactions are doing whilst this is disabled, I have a couple of recommendations:

Yoshinori’s solution is one of the only ones that doesn’t require a connection to the database, doing all of its sniffing external to the instance, somewhat like MEM’s Query Analyzer (though by sniffing network traffic with libpcap instead). Unlike Query Analyzer, however, it tries to record the entire transaction verbatim, instead of aggregating like statements together (which can happen with long running transactions that loop to insert data etc.).

I think it would be possible to do the same entirely from within performance_schema as of 5.6 – using the events_statements_history table (if enabled) from an event as well. That may serve both needs quite nicely, with no external solution required.
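
As a very rough, untested sketch of that idea (assuming the statement history consumer is enabled, and using the processlist_id column in the 5.6 threads table to tie the InnoDB transaction back to its performance_schema thread):

SELECT trx.trx_mysql_thread_id thd,
       trx.trx_started,
       esh.event_id,
       esh.sql_text
  FROM INFORMATION_SCHEMA.INNODB_TRX trx
  JOIN performance_schema.threads t
    ON t.processlist_id = trx.trx_mysql_thread_id
  JOIN performance_schema.events_statements_history esh
    ON esh.thread_id = t.thread_id
 WHERE (UNIX_TIMESTAMP() - UNIX_TIMESTAMP(trx.trx_started)) > 10
 ORDER BY trx.trx_mysql_thread_id, esh.event_id;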

Maybe I’ll follow up on that one day.. :)

Monitoring MySQL IO Latency with performance_schema

Baron pointed to two really good articles recently by Brendan Gregg on Filesystem Latency (part1 / part2), which I too would recommend everybody read.

They rightly point out that trying to correlate IO statistics from tools such as iostat to a database’s workload, especially when that database is on a shared resource, is fraught with complications – and further to that, not the full picture anyway, even on a dedicated system. Brendan points out that to get the full picture, you really need to be instrumenting your profiling at the read/write layer of the application – before getting to per filesystem / disk statistics.

Brendan has followed up on his posts with a third on how to track these filesystem calls from a running MySQL process with DTrace, which again I recommend. In this post Brendan says the following, linking to my blog on graphing statement execution in 5.6 (thanks Brendan):

If you are on MySQL 5.6 and later, it looks like you may also be able to 
get this information from the wait/io events from the new performance schema 
additions (I haven't tried).

This is true.

However I’d like to show how filesystem latency can be monitored from the current MySQL 5.5 GA version, with performance_schema, on all platforms.

I’ve written about tracking IO with performance_schema in the past, and in retrospect, realize that I should have talked more about monitoring the latency aspects of IO as well, my bad. :) That post looked at the file IO summary tables.

Now let us take a look at what we can get from the wait event summary tables for IO.

When event timing is enabled within performance_schema, every single wait event is a measure of latency, whether that be to grab a synchronization point, or perform a file IO operation, or even later in 5.6 to take a table lock, perform table IO or network IO etc. – up to executing an entire statement or stage (including all locks, file IO, synch points etc. below them).

From the file IO perspective, all of these are instrumented around the layer that Brendan points to – interactions with the filesystem, not the disks, from within MySQL. You can see exactly which file IO events are instrumented within the performance_schema.setup_instruments table:

mysql> select * from setup_instruments where name like 'wait/io/file/%';
+--------------------------------------+---------+-------+
| NAME                                 | ENABLED | TIMED |
+--------------------------------------+---------+-------+
| wait/io/file/sql/map                 | YES     | YES   |
| wait/io/file/sql/binlog              | YES     | YES   |
| wait/io/file/sql/binlog_index        | YES     | YES   |
| wait/io/file/sql/relaylog            | YES     | YES   |
| wait/io/file/sql/relaylog_index      | YES     | YES   |
...
| wait/io/file/myisam/data_tmp         | YES     | YES   |
| wait/io/file/myisam/dfile            | YES     | YES   |
| wait/io/file/myisam/kfile            | YES     | YES   |
| wait/io/file/myisam/log              | YES     | YES   |
| wait/io/file/myisammrg/MRG           | YES     | YES   |
| wait/io/file/innodb/innodb_data_file | YES     | YES   |
| wait/io/file/innodb/innodb_log_file  | YES     | YES   |
| wait/io/file/innodb/innodb_temp_file | YES     | YES   |
+--------------------------------------+---------+-------+
42 rows in set (0.01 sec)
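
These are all enabled and timed by default in 5.5 (as the output above shows for my instance), but if any have been switched off, both the instruments and the wait history consumers can be turned back on at runtime:

UPDATE performance_schema.setup_instruments
   SET enabled = 'YES', timed = 'YES'
 WHERE name LIKE 'wait/io/file/%';

UPDATE performance_schema.setup_consumers
   SET enabled = 'YES'
 WHERE name LIKE 'events_waits%';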

Each instrument of the wait/io/file type tracks a certain form of IO that is performed by MySQL, such as wait/io/file/innodb/innodb_data_file tracking datafile IO for InnoDB, and wait/io/file/sql/binlog tracking IO to binary logs at the SQL layer.

We do this in much the same way that Brendan does with DTrace, though instead by wrapping the original functions, such as os_file_read(), os_file_write() etc., with functions that do the timing around them.

Each of these original functions is now a macro within 5.5. They each call a performance_schema specific inline function that records the start and end times of the call, and collects other data specific to the type of wait event, such as which file we are operating on, the kind of file operation, the number of bytes involved, and where the operation was initiated from in the code.

These in turn then call the original, now renamed functions, such as os_file_read_func() and os_file_write_func() (meaning that if you want to use Brendan’s solution with 5.5+, you should probably switch it to use these functions directly). Here’s InnoDB’s pfs_os_file_read_func() as an inline example:

UNIV_INLINE
ibool
pfs_os_file_read_func(
/*==================*/
	os_file_t	file,	/*!< in: handle to a file */
	void*		buf,	/*!< in: buffer where to read */
	ulint		offset,	/*!< in: least significant 32 bits of file
				offset where to read */
	ulint		offset_high,/*!< in: most significant 32 bits of
				offset */
	ulint		n,	/*!< in: number of bytes to read */
	const char*	src_file,/*!< in: file name where func invoked */
	ulint		src_line)/*!< in: line where the func invoked */	
{	
	ibool	result;
	struct PSI_file_locker*	locker = NULL;
	PSI_file_locker_state	state;
	
	register_pfs_file_io_begin(&state, locker, file, n, PSI_FILE_READ,
				   src_file, src_line);
	
	result = os_file_read_func(file, buf, offset, offset_high, n);
	
	register_pfs_file_io_end(locker, n);

	return(result);
}

Now that we have the background out of the way, we can take a look at how the data is exposed.

I’m going to start with the raw stream of data that is exposed within the wait event tables, in the events_waits_history_long table specifically, as this gives the 10,000 most recent events within performance_schema (by default), giving a pretty up to date snapshot of what is happening “now”.
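
The size of this list is controlled by the performance_schema_events_waits_history_long_size server variable (set at startup), which you can check with:

SHOW GLOBAL VARIABLES LIKE 'performance_schema_events_waits_history_long_size';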

As with some of my previous examples, I’m going to use a helper view to do this. The following gets the latest IO events, also showing the originating thread, operation type, size and latency of each request:

DROP VIEW IF EXISTS latest_file_io;

CREATE VIEW latest_file_io AS
SELECT IF(id IS NULL, 
             CONCAT(SUBSTRING_INDEX(name, '/', -1), ':', thread_id), 
             CONCAT(user, '@', host, ':', id)
          ) thread, 
       SUBSTRING_INDEX(object_name, '/', -1) file, 
       timer_wait/1000000 latency_usec, 
       operation, 
       number_of_bytes bytes
  FROM performance_schema.events_waits_history_long 
  JOIN performance_schema.threads USING (thread_id)
  LEFT JOIN information_schema.processlist ON processlist_id = id
 WHERE object_name IS NOT NULL
 ORDER BY timer_start;

mysql> select * from latest_file_io;
+------------------------+--------------------+--------------+-----------+---------+
| thread                 | file               | latency_usec | operation | bytes   |
+------------------------+--------------------+--------------+-----------+---------+
| msandbox@localhost:119 | ib_logfile1        |     834.7232 | write     | 2035712 |
| msandbox@localhost:119 | ib_logfile1        |     833.2492 | write     | 2035712 |
| msandbox@localhost:119 | ib_logfile0        |      11.2195 | write     |     512 |
| msandbox@localhost:119 | ib_logfile0        |      10.8199 | write     |     512 |
| msandbox@localhost:119 | ib_logfile0        |     716.1718 | write     | 1697792 |
| msandbox@localhost:119 | ib_logfile0        |     715.5535 | write     | 1697792 |
| msandbox@localhost:119 | ib_logfile0        |    8649.3640 | sync      |    NULL |
| msandbox@localhost:119 | ib_logfile1        |   35845.0394 | sync      |    NULL |
| srv_master_thread:15   | ibdata1            |     242.0453 | write     |  540672 |
| srv_master_thread:15   | ibdata1            |     241.2235 | write     |  540672 |
| srv_master_thread:15   | ibdata1            |    3337.6979 | sync      |    NULL |
| srv_master_thread:15   | ibdata1            |       6.3638 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.4863 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5542 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.4524 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6032 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6070 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6145 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5240 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5391 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5429 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6899 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.4675 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5316 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5881 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6107 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5203 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5391 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6409 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5730 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5655 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6070 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7465 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6371 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7238 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7578 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7615 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7917 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6899 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7276 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.8219 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.8483 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7465 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7804 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      27.4456 | write     |   16384 |
| io_handler_thread:6    | ibdata1            |      20.4560 | write     |   16384 |
| io_handler_thread:7    | ibdata1            |      25.5342 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      17.5757 | write     |   16384 |
| io_handler_thread:6    | ibdata1            |      18.2317 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      19.8604 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      16.9989 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      15.9320 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      16.9801 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      14.6351 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      15.5362 | write     |   16384 |
| io_handler_thread:7    | ibdata1            |     250.6145 | write     |  360448 |
| srv_master_thread:15   | ibdata1            |    5418.0932 | sync      |    NULL |
| srv_master_thread:15   | ibdata1            |      45.0779 | write     |   49152 |
| srv_master_thread:15   | ibdata1            |      44.1052 | write     |   49152 |
| srv_master_thread:15   | ibdata1            |     468.7128 | sync      |    NULL |
| srv_master_thread:15   | ibdata1            |       3.9283 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6107 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5429 | write     |   16384 |
| io_handler_thread:6    | ibdata1            |      22.2656 | write     |   16384 |
| io_handler_thread:4    | ibdata1            |      33.1119 | write     |   32768 |
| srv_master_thread:15   | ibdata1            |     757.6984 | sync      |    NULL |
| msandbox@localhost:92  | latest_file_io.frm |       6.9217 | stat      |    NULL |
| msandbox@localhost:92  | latest_file_io.frm |      13.3496 | open      |    NULL |
| msandbox@localhost:92  | latest_file_io.frm |      18.1299 | read      |    3222 |
| msandbox@localhost:92  | latest_file_io.frm |      50.2993 | close     |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |     278.9498 | create    |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |     145.4617 | create    |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |      48.4030 | write     |     176 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       7.2120 | write     |     100 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.6456 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.6833 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.5099 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4797 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4722 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4722 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4722 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4420 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4722 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       9.8058 | chsize    |    1024 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |      15.6907 | close     |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |      98.9889 | close     |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |      13.2063 | open      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |     150.3212 | read      |      24 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       1.9265 | seek      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       6.6427 | read      |     339 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |      13.3496 | open      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |      69.0853 | write     |     696 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |       9.8020 | write     |     728 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |       6.2620 | write     |     696 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |       4.2488 | tell      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |       3.9736 | seek      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |       1.2743 | seek      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |      26.1676 | read      |    2120 |
+------------------------+--------------------+--------------+-----------+---------+
98 rows in set (0.16 sec)

Just for comparison, the first thread in the output was executing this:

mysql> insert into t1 (select * from t1);
Query OK, 65536 rows affected (0.78 sec)
Records: 65536  Duplicates: 0  Warnings: 0

mysql> select connection_id();
+-----------------+
| connection_id() |
+-----------------+
|             119 |
+-----------------+
1 row in set (0.00 sec)

In the version that I am using here (5.5.12) there is a bug in some cases of InnoDB IO (such as writing to the InnoDB log files) that double accounts for IO, so in this version you have to do some manual work to exclude certain rows (the second event of the same size, with the lower time) to find out just how much time was really spent. That annoys me. Luckily, we have since tracked down the problem, and it’s been fixed and should be released in 5.5.13.

That results in 35845.0394 + 8649.3640 + 716.1718 + 11.2195 + 834.7232 = 46056.5179 microseconds, or roughly 0.046 seconds, out of the 0.78 seconds reported to the client. Most of the time at the filesystem layer was in doing an fsync() on the log files (which forced the data to hit the platters), rather than write().
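
Once that double accounting fix is in place, the same per-file totals can be pulled straight from the helper view rather than adding the rows up by hand – something along these lines:

SELECT file,
       COUNT(*) ios,
       ROUND(SUM(latency_usec), 2) total_latency_usec
  FROM latest_file_io
 WHERE thread = 'msandbox@localhost:119'
 GROUP BY file
 ORDER BY total_latency_usec DESC;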

The rest was spent doing “other things” in memory, which are then synced out-of-band by the background threads to the data files / filesystem outside of the query execution’s runtime. This is visible with srv_master_thread first doing a big write to the doublewrite buffer, and then merging those changes to the datafiles in 16K pages.

It’s also worth noting what is happening within the msandbox@localhost:92 thread at the end of the result. That is actually the file IO that was required to return the result of the query that I was executing. There will also be a few events after that statement completed to clean those temporary files up etc.

The first 4 events against latest_file_io.frm are to get the structure of the view I was selecting against, and the rest are a result of the JOIN against INFORMATION_SCHEMA.PROCESSLIST, which in this case is forcing a temporary table to be created for the processlist result. In 5.6, there is no longer any need for this JOIN, as all the columns are available within the threads table, so the IO could be reduced to just the view definition then.
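
For reference, a 5.6-only variant of the view (an untested sketch – the latest_file_io_56 name is just mine, relying on the processlist_* columns that the threads table gains in 5.6) could drop the join entirely:

CREATE VIEW latest_file_io_56 AS
SELECT IF(processlist_id IS NULL,
             CONCAT(SUBSTRING_INDEX(name, '/', -1), ':', thread_id),
             CONCAT(processlist_user, '@', processlist_host, ':', processlist_id)
          ) thread,
       SUBSTRING_INDEX(object_name, '/', -1) file,
       timer_wait/1000000 latency_usec,
       operation,
       number_of_bytes bytes
  FROM performance_schema.events_waits_history_long
  JOIN performance_schema.threads USING (thread_id)
 WHERE object_name IS NOT NULL
 ORDER BY timer_start;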

This section also helps to point out one of the deficiencies with using Brendan’s script (at least without extending it) to really see when filesystem interactions are the causes of overall latency – a large portion of the time spent there was in creating the temporary table files, not just writing to or reading from them. Likewise with the view and opening/closing.

This also helps to illustrate a point that Ronald Bradford has asked me in the past:

My work with the PERFORMANCE_SCHEMA has been limited to experimenting such as 
described by Tracking IO with performance_schema however a recent discussion 
with Mark Leith highlighted that with the performance_schema it will be possible 
to determine the exact size of disk based temporary tables created by the MySQL 
kernel for example.

Sorry for the delay Ronald, but I hope this shows a good example of exactly what kind of IO we track against temporary tables as well. If not, here’s a better one:

mysql> select file, round(sum(bytes)/1024/1024, 2) size_mb, round(sum(latency_usec), 2) latency_usec
    ->   from latest_file_io
    ->  where file like '#sql%' and operation = 'write'
    -> group by file, operation
    -> order by file;
+---------------------+---------+--------------+
| file                | size_mb | latency_usec |
+---------------------+---------+--------------+
| #sql3a48_575_2a.MYI |    0.00 |        10.11 |
| #sql3a48_575_2b.MYD |    2.11 |      7435.86 |
| #sql3a48_575_2b.MYI |    0.00 |        10.81 |
| #sql3a48_575_2c.MYD |    0.00 |        46.16 |
| #sql3a48_575_2c.MYI |    0.00 |        61.27 |
| #sql3a48_575_2d.MYD |    2.30 |      7638.26 |
| #sql3a48_575_2d.MYI |    0.00 |        63.24 |
+---------------------+---------+--------------+
7 rows in set (0.12 sec)

This may be a bit of a hit and miss approach, as this table only stores 10,000 events by default, and when you’re playing with temporary tables the number of file IO events can sometimes easily exceed this. Temporary tables are also very transient – MySQL doesn’t hold the file handles open, nor keep much metadata around once they are removed, so we don’t aggregate their data within the file summary tables at this time.

We do however summarize the events for regular table files, and the other files held open by the instance such as binary logs and log files etc. as shown above.

Each wait is summarized at 3 different levels within 5.5 – here are some examples, starting each with the raw output for the InnoDB file IO events:

Globally by event – events_waits_summary_global_by_event_name

mysql> select * from events_waits_summary_global_by_event_name
    ->  where event_name like 'wait/io/file/innodb/%';
+--------------------------------------+------------+----------------+----------------+----------------+----------------+
| EVENT_NAME                           | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+--------------------------------------+------------+----------------+----------------+----------------+----------------+
| wait/io/file/innodb/innodb_data_file |        364 |   369575256440 |         395850 |     1015316638 |   241104845800 |
| wait/io/file/innodb/innodb_log_file  |         53 |   130454237420 |        2386410 |     2461400706 |    58808150830 |
| wait/io/file/innodb/innodb_temp_file |          0 |              0 |              0 |              0 |              0 |
+--------------------------------------+------------+----------------+----------------+----------------+----------------+
3 rows in set (0.00 sec)

These allow you to find your top IO consumers by latency, and can easily be joined against the file_summary_by_event_name table, to also show the number of bytes read and written per type of IO:

DROP VIEW IF EXISTS top_global_io_consumers_by_latency;

CREATE VIEW top_global_io_consumers_by_latency AS
SELECT event_name,
       count_star,
       ROUND(sum_timer_wait/1000000000000, 2) total_sec,
       ROUND(min_timer_wait/1000000, 2) min_usec,
       ROUND(avg_timer_wait/1000000, 2) avg_usec,
       ROUND(max_timer_wait/1000000, 2) max_usec,
       count_read,
       ROUND(sum_number_of_bytes_read/1024/1024, 2) read_mb,
       count_write,
       ROUND(sum_number_of_bytes_write/1024/1024, 2) write_mb
  FROM performance_schema.events_waits_summary_global_by_event_name 
  JOIN performance_schema.file_summary_by_event_name USING (event_name) 
 WHERE event_name LIKE 'wait/io/file/%' 
 ORDER BY sum_timer_wait DESC LIMIT 5;

mysql> select * from ps_helper.top_global_io_consumers_by_latency;
+--------------------------------------+------------+-----------+----------+----------+-----------+------------+---------+-------------+----------+
| event_name                           | count_star | total_sec | min_usec | avg_usec | max_usec  | count_read | read_mb | count_write | write_mb |
+--------------------------------------+------------+-----------+----------+----------+-----------+------------+---------+-------------+----------+
| wait/io/file/myisam/dfile            |    1828445 |     10.82 |     0.74 |     5.92 | 361525.96 |    1250257 | 2094.98 |      577762 |  1019.34 |
| wait/io/file/myisam/kfile            |       1212 |      0.34 |     1.12 |   283.54 | 109499.59 |        132 |    0.03 |         696 |     0.02 |
| wait/io/file/innodb/innodb_data_file |        407 |      0.27 |     7.36 |   652.37 |  17684.12 |        397 |   10.14 |           4 |     0.06 |
| wait/io/file/sql/file_parser         |        108 |      0.15 |     1.25 |  1355.50 |  41132.89 |         20 |    0.03 |           5 |     0.02 |
| wait/io/file/sql/FRM                 |       1027 |      0.12 |     1.08 |   115.64 |  41406.17 |        431 |    0.14 |           0 |     0.00 |
+--------------------------------------+------------+-----------+----------+----------+-----------+------------+---------+-------------+----------+
5 rows in set (0.11 sec)

This is my test instance, so there is no great amount of IO going on. It shows in this example that MyISAM data file IO is the top consumer, followed by MyISAM index IO, InnoDB data file IO, view parsing, and frm file parsing respectively.

Per Thread by event – events_waits_summary_by_thread_by_event_name

mysql> select * from events_waits_summary_by_thread_by_event_name 
    ->  where event_name like 'wait/io/file/innodb/%';
+-----------+--------------------------------------+------------+----------------+----------------+----------------+----------------+
| THREAD_ID | EVENT_NAME                           | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+-----------+--------------------------------------+------------+----------------+----------------+----------------+----------------+
|         0 | wait/io/file/innodb/innodb_data_file |        297 |    17541591340 |        9183720 |       59062597 |     2244695700 |
|         0 | wait/io/file/innodb/innodb_log_file  |         25 |      594694880 |        2386410 |       23787795 |      291044000 |
|         0 | wait/io/file/innodb/innodb_temp_file |          0 |              0 |              0 |              0 |              0 |
|         3 | wait/io/file/innodb/innodb_data_file |          0 |              0 |              0 |              0 |              0 |
|         3 | wait/io/file/innodb/innodb_log_file  |          0 |              0 |              0 |              0 |              0 |
|         3 | wait/io/file/innodb/innodb_temp_file |          0 |              0 |              0 |              0 |              0 |
...
|         1 | wait/io/file/innodb/innodb_data_file |          0 |              0 |              0 |              0 |              0 |
|         1 | wait/io/file/innodb/innodb_log_file  |          0 |              0 |              0 |              0 |              0 |
|         1 | wait/io/file/innodb/innodb_temp_file |          0 |              0 |              0 |              0 |              0 |
+-----------+--------------------------------------+------------+----------------+----------------+----------------+----------------+
51 rows in set (0.00 sec)

These are the events summarized by thread, so that you can trace which threads – either foreground or background – are likely to be consuming the most IO. This is just from a latency perspective however, and cannot easily be tied back to the file summary tables in a meaningful way, which means you cannot easily correlate the amount of IO done with the time taken per thread (unless you are polling the current / history tables, if they are enabled).

However, it can provide insight into who is suffering the most from filesystem latency:

DROP VIEW IF EXISTS top_thread_io_consumers_by_latency;

CREATE VIEW top_thread_io_consumers_by_latency AS
SELECT IF(id IS NULL, 
             CONCAT(SUBSTRING_INDEX(name, '/', -1), ':', thread_id), 
             CONCAT(user, '@', host, ':', id)
          ) thread, 
       SUM(count_star) count_star,
       ROUND(SUM(sum_timer_wait)/1000000000000, 2) total_sec,
       ROUND(MIN(min_timer_wait)/1000000, 2) min_usec,
       ROUND(AVG(avg_timer_wait)/1000000, 2) avg_usec,
       ROUND(MAX(max_timer_wait)/1000000, 2) max_usec
  FROM performance_schema.events_waits_summary_by_thread_by_event_name 
  LEFT JOIN performance_schema.threads USING (thread_id) 
  LEFT JOIN information_schema.processlist ON processlist_id = id
 WHERE event_name LIKE 'wait/io/file/%'
   AND sum_timer_wait > 0
 GROUP BY thread
 ORDER BY total_sec DESC LIMIT 5;

mysql>  select * from top_thread_io_consumers_by_latency;
+----------------------+------------+-----------+----------+----------+----------+
| thread               | count_star | total_sec | min_usec | avg_usec | max_usec |
+----------------------+------------+-----------+----------+----------+----------+
| msandbox@localhost:3 |       2159 |      0.98 |     0.05 |  1271.01 | 56968.61 |
| main:0               |        971 |      0.24 |     1.08 |   257.33 | 14270.73 |
| srv_master_thread:15 |         54 |      0.03 |     0.46 |  1104.93 |  5547.38 |
| io_handler_thread:1  |          8 |      0.02 |     9.90 |  2285.61 | 10963.86 |
| io_handler_thread:7  |          3 |      0.00 |    45.94 |    67.70 |   102.48 |
+----------------------+------------+-----------+----------+----------+----------+
5 rows in set (0.15 sec)

Per Instance – events_waits_summary_by_instance

mysql> select * from events_waits_summary_by_instance 
    ->  where event_name like 'wait/io/file/innodb/%';
+--------------------------------------+-----------------------+------------+----------------+----------------+----------------+----------------+
| EVENT_NAME                           | OBJECT_INSTANCE_BEGIN | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+--------------------------------------+-----------------------+------------+----------------+----------------+----------------+----------------+
| wait/io/file/innodb/innodb_data_file |            4680873728 |        364 |   369575256440 |         395850 |     1015316638 |   241104845800 |
| wait/io/file/innodb/innodb_log_file  |            4681729232 |         49 |   130426765430 |        2386410 |     2661770723 |    58808150830 |
| wait/io/file/innodb/innodb_log_file  |            4685789600 |          4 |       27471990 |        5877430 |        6867997 |        8271380 |
+--------------------------------------+-----------------------+------------+----------------+----------------+----------------+----------------+
3 rows in set (0.04 sec)

Finally, we end with all of the events summarized by the instance of the event. With performance_schema, if there are multiple instances of a certain event type – such as two InnoDB log files, multiple InnoDB data files, or multiple instances of a specific mutex in memory etc. – each of those is also aggregated individually within this table.

In the above example, there is a single InnoDB datafile (ibdata1 – 4680873728), and two InnoDB log files (ib_logfile0 and ib_logfile1).

We have a bit of a disconnect here however within performance_schema – it is not possible at the moment to join the events_waits_summary_by_instance table to the file_summary_by_instance table. file_summary_by_instance uses the file_name column as the key, whilst events_waits_summary_by_instance only exposes the object_instance_begin column. We hope to rectify this within the 5.6 release, by adding the object_instance_begin column to the file_summary_by_instance table (and possibly aggregating more right inline with the other file summary tables).

This would then allow us to drill in by file to both the total amount of IO done, and the total/min/max/avg etc. latency involved in doing so, per file accessed by the database. I’m putting together a worklog for this, and will post it to the comments once it is public.
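
In the meantime, the per-file byte and operation counts at least are already available in 5.5 from file_summary_by_instance, even if the latency side of the picture has to wait – for example:

SELECT file_name,
       count_read,
       ROUND(sum_number_of_bytes_read/1024/1024, 2) read_mb,
       count_write,
       ROUND(sum_number_of_bytes_write/1024/1024, 2) write_mb
  FROM performance_schema.file_summary_by_instance
 ORDER BY sum_number_of_bytes_read + sum_number_of_bytes_write DESC
 LIMIT 5;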

I’ve also now created a ps_helper page dedicated to collecting various views and procedures that help with looking in to performance_schema data. I’ve still got a number of things that I’m playing with that I hope to blog about and publish soon as well. :)