Implementing mysqldump --ignore-database

Ronald Bradford and Giuseppe Maxia (hey guys!) wrote about different ways to ignore a database when using mysqldump --all-databases over the past couple of days.

Whilst the solutions are interesting, I wondered why not attack it from the proper angle, and add the option to mysqldump itself? Honestly, the patch is trivial, and doing anything against INFORMATION_SCHEMA with lots of databases and tables … well, let's just say group_concat_max_len is the least of your worries.
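For context, the INFORMATION_SCHEMA workarounds being compared generally look something like the following (a hedged sketch, not either author's exact script): build the database list server-side, excluding the schemas you want to skip, and feed it to mysqldump --databases. With enough schemas the GROUP_CONCAT result gets silently truncated unless you raise group_concat_max_len first:

```sql
-- Sketch only: build a space-separated database list, excluding unwanted
-- schemas, for use with mysqldump --databases. GROUP_CONCAT output is
-- silently truncated at group_concat_max_len (1024 bytes by default),
-- hence the SET first.
SET SESSION group_concat_max_len = 1024 * 1024;
SELECT GROUP_CONCAT(schema_name SEPARATOR ' ') AS db_list
  FROM information_schema.schemata
 WHERE schema_name NOT IN
       ('mysql', 'information_schema', 'performance_schema');
```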

15 minutes later I had a working solution.

To my surprise, I also found Bug #3228, created a little over 8 years ago. I’ve posted the patch to the bug. It acts in exactly the same way as the --ignore-table option (specify it multiple times to ignore multiple databases, passing in only the database name for each).

Here’s my little test:

$ mysql -u root -pmysql -h 127.0.0.1 -P 3307 -e "show databases";
+--------------------+
| Database           |
+--------------------+
| information_schema |
| admin              |
| mem__inventory     |
| mysql              |
| performance_schema |
| test               |
+--------------------+
$ ./mysqldump -u root -pmysql -h 127.0.0.1 -P 3307 --all-databases --ignore-database=mysql --ignore-database=mem__inventory
Warning: Using a password on the command line interface can be insecure.
-- MySQL dump 10.13  Distrib 5.6.6-m9, for Win32 (x86)
--
-- Host: 127.0.0.1    Database:
-- ------------------------------------------------------
-- Server version       5.6.5-m8

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

--
-- Current Database: `admin`
--

CREATE DATABASE /*!32312 IF NOT EXISTS*/ `admin` /*!40100 DEFAULT CHARACTER SET latin1 */;

USE `admin`;

--
-- Current Database: `test`
--

CREATE DATABASE /*!32312 IF NOT EXISTS*/ `test` /*!40100 DEFAULT CHARACTER SET latin1 */;

USE `test`;

--
-- Table structure for table `t1`
--

DROP TABLE IF EXISTS `t1`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `t1` (
  `i` binary(16) NOT NULL,
  `k` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`i`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Dumping data for table `t1`
--

LOCK TABLES `t1` WRITE;
/*!40000 ALTER TABLE `t1` DISABLE KEYS */;
/*!40000 ALTER TABLE `t1` ENABLE KEYS */;
UNLOCK TABLES;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

-- Dump completed on 2012-04-19 22:55:02

Finding and killing long running InnoDB transactions with Events

I’ve seen a number of solutions for finding long running transactions or sessions within InnoDB / MySQL now. Every single one of them (bar one, more on that later) has, in the past by necessity, been implemented as a script that is either invoked manually, or via some cron job, which then connects and tries to find, and possibly diagnose or kill, transactions that break some “long running” threshold. We also have a rule in MySQL Enterprise Monitor that flags this for further inspection.

Solutions like this have one major downfall however – they often require a new connection to the database every time that they want to perform the check (though MEM holds a connection open for monitoring).

Often what you see with many high throughput instances is that once you get a long running transaction that is holding some locks, many sessions soon start to stack up behind this, and in many cases fill up the instance to max_connections. You then play roulette on whether you can get a new connection to start finding and killing the long running transaction(s) or not.

In the past this was a necessity because it was impossible to parse SHOW ENGINE INNODB STATUS from within the database (such as within a procedure) to then build a set of KILL statements programmatically (because you couldn’t consume the SHOW output in a meaningful way).

However, with 5.1 and the InnoDB plugin (and hence, 5.5+ by default too), there is the new INFORMATION_SCHEMA.INNODB_TRX table, which lists all currently active transactions within InnoDB, including the same output as SHOW ENGINE INNODB STATUS per transaction, plus a few extras:

mysql> select * from INFORMATION_SCHEMA.INNODB_TRX\G
*************************** 1. row ***************************
                    trx_id: 7C6
                 trx_state: RUNNING
               trx_started: 2011-05-31 11:17:10
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 2
       trx_mysql_thread_id: 11706
                 trx_query: insert into t1 values (sleep(40))
       trx_operation_state: NULL
         trx_tables_in_use: 1
         trx_tables_locked: 1
          trx_lock_structs: 1
     trx_lock_memory_bytes: 376
           trx_rows_locked: 0
         trx_rows_modified: 1
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 10000

Because we have this nice shiny new table, personally I think it’s time to start throwing away most of those scripts, and using another of the new features in 5.1 – events. The major advantage to this is that it is always running within the instance, it can’t be blocked by any max connection limit, and it doesn’t require any DBA / admin intervention, it’s the perfect BOFH tool. Developers Beware.

So here’s the basis for how I would do it:

CREATE DATABASE IF NOT EXISTS myadmin;
USE myadmin;

DROP TABLE IF EXISTS kill_long_transactions_log;

CREATE TABLE kill_long_transactions_log (
  id INT AUTO_INCREMENT PRIMARY KEY,
  ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  thd BIGINT,
  user VARCHAR(16),
  host VARCHAR(64),
  statement TEXT,
  KEY (ts, user),
  KEY (user),
  KEY (host)
) ENGINE = InnoDB;

DELIMITER $$

DROP EVENT IF EXISTS kill_long_transactions$$

CREATE EVENT kill_long_transactions
ON SCHEDULE EVERY 10 SECOND
DO
  BEGIN
	  DECLARE max_transaction_time INT DEFAULT 10; 
	  DECLARE done INT DEFAULT 0;
	  DECLARE killed_id BIGINT;
	  DECLARE killed_user VARCHAR(16);
	  DECLARE killed_host VARCHAR(64);
	  DECLARE kill_stmt VARCHAR(20);
	  DECLARE running_stmt TEXT;

	  DECLARE long_transactions CURSOR FOR
	   SELECT CONCAT('KILL ', trx.trx_mysql_thread_id) kill_statement,
	          trx.trx_mysql_thread_id thd_id,
	          ps.user,
	          ps.host,
	          trx.trx_query
	     FROM INFORMATION_SCHEMA.INNODB_TRX trx
	     JOIN INFORMATION_SCHEMA.PROCESSLIST ps ON trx.trx_mysql_thread_id = ps.id
	    WHERE (UNIX_TIMESTAMP() - UNIX_TIMESTAMP(trx.trx_started)) > max_transaction_time
	      AND user != 'system_user';
	  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
	
	  OPEN long_transactions;
	  
	  kill_loop: LOOP
	    FETCH long_transactions INTO 
	      kill_stmt, killed_id, killed_user, killed_host, running_stmt;
	
	    IF done THEN
	      LEAVE kill_loop;
	    END IF;
	    
	    SET @kill := kill_stmt;
	    
	    PREPARE stmt FROM @kill;
	    EXECUTE stmt;
	    DEALLOCATE PREPARE stmt;
	    
	    INSERT INTO kill_long_transactions_log (thd, user, host, statement)
	    VALUES (killed_id, killed_user, killed_host, running_stmt);
	    
	  END LOOP;
	  CLOSE long_transactions;
	  
	  DELETE FROM kill_long_transactions_log
	   WHERE ts < NOW() - INTERVAL 7 DAY;
  END$$

DELIMITER ;

You can tweak the following from the above to fit your needs:

How often to check:

ON SCHEDULE EVERY 10 SECOND

How long should a transaction be running to get killed:

DECLARE max_transaction_time INT DEFAULT 10;

How long to keep data for before purging:

WHERE ts < NOW() - INTERVAL 7 DAY;

Once you have that created, make sure that the event scheduler is enabled:

mysql> SHOW GLOBAL VARIABLES LIKE 'event_scheduler';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | ON    |
+-----------------+-------+
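If it shows OFF instead, it can be switched on at runtime (assuming a user with the SUPER privilege); add event_scheduler=ON to my.cnf to make it permanent across restarts:

```sql
SET GLOBAL event_scheduler = ON;
```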

And sit back and wait for the magic to happen.

Now if you want to review what happened to those long running transactions, you can check the kill_long_transactions_log table after the fact for anything that has actually been killed (not merely flagged for killing):

mysql> select * from kill_long_transactions_log;
+----+---------------------+-------+----------+-----------+-----------------------------------+
| id | ts                  | thd   | user     | host      | statement                         |
+----+---------------------+-------+----------+-----------+-----------------------------------+
|  1 | 2011-05-31 11:13:15 | 11672 | msandbox | localhost | NULL                              |
|  2 | 2011-05-31 11:14:25 | 11681 | msandbox | localhost | insert into t1 values (sleep(40)) |
|  3 | 2011-05-31 11:16:15 | 11695 | msandbox | localhost | NULL                              |
|  4 | 2011-05-31 11:17:25 | 11706 | msandbox | localhost | insert into t1 values (sleep(40)) |
+----+---------------------+-------+----------+-----------+-----------------------------------+

Sometimes there may be a statement running when the session is killed, other times it could just be idle (and hence NULL), but still have a transaction open.
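If you would rather eyeball these sessions before (or instead of) killing them, the same INNODB_TRX / PROCESSLIST join that the event uses works interactively too. A sketch along these lines shows sessions that are idle but still holding a transaction open:

```sql
-- Sketch: find sessions that are sleeping at the SQL layer but still
-- have an open InnoDB transaction (and hence possibly locks).
SELECT trx.trx_mysql_thread_id AS thd,
       ps.user, ps.host, ps.command, ps.time,
       trx.trx_started, trx.trx_query
  FROM INFORMATION_SCHEMA.INNODB_TRX trx
  JOIN INFORMATION_SCHEMA.PROCESSLIST ps
    ON trx.trx_mysql_thread_id = ps.id
 WHERE ps.command = 'Sleep'
 ORDER BY trx.trx_started;
```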

As a DBA, you might then want to try and find out what those sessions are doing, and in that case, you can disable/enable this event pretty easily at run time:

mysql> ALTER EVENT kill_long_transactions DISABLE;
Query OK, 0 rows affected (0.03 sec)

mysql> ALTER EVENT kill_long_transactions ENABLE;
Query OK, 0 rows affected (0.00 sec)

In <= 5.5, if you want to try and find out what these long running transactions are doing whilst this is disabled, I have a couple of recommendations:

Yoshinori’s solution is one of the only ones that doesn’t require a connection to the database, doing all of its sniffing external to the instance, somewhat like MEM’s Query Analyzer (though sniffing network traffic with libpcap instead). However, unlike Query Analyzer, it tries to record the entire transaction verbatim, instead of aggregating similar statements (which can happen with long running transactions that loop to insert data, etc.).

I think it would be possible to do the same entirely from within performance_schema as of 5.6, using the events_statements_history table if enabled, from an event as well. This may serve both needs quite nicely, with no external solution required.
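As a rough, untested sketch of that idea (assuming 5.6 with the events_statements_history consumer enabled), something like this could pull the recent statements for each thread that still has an InnoDB transaction open:

```sql
-- Untested 5.6 sketch: recent statements per thread with an open
-- InnoDB transaction, from the statement history ring buffer.
SELECT t.processlist_id,
       h.event_name,
       h.sql_text,
       h.timer_wait
  FROM performance_schema.events_statements_history h
  JOIN performance_schema.threads t USING (thread_id)
  JOIN INFORMATION_SCHEMA.INNODB_TRX trx
    ON trx.trx_mysql_thread_id = t.processlist_id
 ORDER BY t.processlist_id, h.timer_start;
```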

Maybe I’ll follow up on that one day… 🙂

Monitoring MySQL IO Latency with performance_schema

Baron pointed to two really good articles recently by Brendan Gregg on Filesystem Latency (part1 / part2), that I too would recommend everybody read.

They rightly point out that trying to correlate IO statistics from tools such as iostat to a databases workload, especially when that database is on a shared resource, is fraught with complications – and further to that, not the full picture anyway, even on a dedicated system. Brendan points out that to get the full picture, you really need to be instrumenting your profiling at the read/write layer of the application – before getting to per filesystem / disk statistics.

Brendan has followed up on his posts with a third on how to track these filesystem calls from a running MySQL process with DTrace, which again I recommend. In this post Brendan says the following, linking to my blog on graphing statement execution in 5.6 (thanks Brendan):

If you are on MySQL 5.6 and later, it looks like you may also be able to 
get this information from the wait/io events from the new performance schema 
additions (I haven't tried).

This is true.

However I’d like to show how filesystem latency can be monitored from the current MySQL 5.5 GA version, with performance_schema, on all platforms.

I’ve written about tracking IO with performance_schema in the past, and in retrospect, realize that I should have talked more about monitoring the latency aspects of IO as well, my bad. 🙂 That post looked at the file IO summary tables.

Now let us take a look at what we can get from the wait event summary tables for IO.

When event timing is enabled within performance_schema, every single wait event is a measure of latency, whether that be to grab a synchronization point, or perform a file IO operation, or even later in 5.6 to take a table lock, perform table IO or network IO etc. – up to executing an entire statement or stage (including all locks, file IO, synch points etc. below them).

From the file IO perspective, all of these are instrumented around the layer that Brendan points to – interactions with the filesystem, not the disks, from within MySQL. You can see exactly which file IO events are instrumented within the performance_schema.setup_instruments table:

mysql> select * from setup_instruments where name like 'wait/io/file/%';
+--------------------------------------+---------+-------+
| NAME                                 | ENABLED | TIMED |
+--------------------------------------+---------+-------+
| wait/io/file/sql/map                 | YES     | YES   |
| wait/io/file/sql/binlog              | YES     | YES   |
| wait/io/file/sql/binlog_index        | YES     | YES   |
| wait/io/file/sql/relaylog            | YES     | YES   |
| wait/io/file/sql/relaylog_index      | YES     | YES   |
...
| wait/io/file/myisam/data_tmp         | YES     | YES   |
| wait/io/file/myisam/dfile            | YES     | YES   |
| wait/io/file/myisam/kfile            | YES     | YES   |
| wait/io/file/myisam/log              | YES     | YES   |
| wait/io/file/myisammrg/MRG           | YES     | YES   |
| wait/io/file/innodb/innodb_data_file | YES     | YES   |
| wait/io/file/innodb/innodb_log_file  | YES     | YES   |
| wait/io/file/innodb/innodb_temp_file | YES     | YES   |
+--------------------------------------+---------+-------+
42 rows in set (0.01 sec)

Each instrument of the wait/io/file type tracks a certain form of IO that is performed by MySQL, such as wait/io/file/innodb/innodb_data_file tracking datafile IO for InnoDB, and wait/io/file/sql/binlog tracking IO to binary logs at the SQL layer.

We do this in much the same way that Brendan does with DTrace, though instead we do this by wrapping the original functions, such as os_file_read(), os_file_write() etc., with functions that do the timing around them.

Each of these original functions is now a macro within 5.5. Each calls a performance_schema specific inline function that records the start and end times of the call, and collects other data specific to the type of wait event, such as which file we are operating on, the kind of file operation, the number of bytes involved, and where in the code the operation was initiated.

These in turn then call the original, now renamed functions, such as os_file_read_func() and os_file_write_func() (meaning that if you want to use Brendan’s solution with 5.5+, you should probably switch it to use these functions directly). Here’s InnoDB’s pfs_os_file_read_func() as an inline example:

ibool
pfs_os_file_read_func(
/*==================*/
	os_file_t	file,	/*!< in: handle to a file */
	void*		buf,	/*!< in: buffer where to read */
	ulint		offset,	/*!< in: least significant 32 bits of file
				offset where to read */
	ulint		offset_high,/*!< in: most significant 32 bits of
				offset */
	ulint		n,	/*!< in: number of bytes to read */
	const char*	src_file,/*!< in: file name where func invoked */
	ulint		src_line)/*!< in: line where the func invoked */	
{	
	ibool	result;
	struct PSI_file_locker*	locker = NULL;
	PSI_file_locker_state	state;
	
	register_pfs_file_io_begin(&state, locker, file, n, PSI_FILE_READ,
				   src_file, src_line);
	
	result = os_file_read_func(file, buf, offset, offset_high, n);
	
	register_pfs_file_io_end(locker, n);

	return(result);
}

Now we have the background out of the way, we can take a look at how the data is exposed.

I’m going to start with the raw stream of data that is exposed within the wait event tables, in the events_waits_history_long table specifically, as this gives the 10,000 most recent events within performance_schema (by default), giving a pretty up to date snapshot of what is happening “now”.

As with some of my previous examples, I’m going to use a helper view to do this. The following gets the latest IO events, also showing the originating thread, operation type, size and latency of each request:

DROP VIEW IF EXISTS latest_file_io;

CREATE VIEW latest_file_io AS
SELECT IF(id IS NULL, 
             CONCAT(SUBSTRING_INDEX(name, '/', -1), ':', thread_id), 
             CONCAT(user, '@', host, ':', id)
          ) thread, 
       SUBSTRING_INDEX(object_name, '/', -1) file, 
       timer_wait/1000000 latency_usec, 
       operation, 
       number_of_bytes bytes
  FROM performance_schema.events_waits_history_long 
  JOIN performance_schema.threads USING (thread_id)
  LEFT JOIN information_schema.processlist ON processlist_id = id
 WHERE object_name IS NOT NULL
 ORDER BY timer_start;

mysql> select * from latest_file_io;
+------------------------+--------------------+--------------+-----------+---------+
| thread                 | file               | latency_usec | operation | bytes   |
+------------------------+--------------------+--------------+-----------+---------+
| msandbox@localhost:119 | ib_logfile1        |     834.7232 | write     | 2035712 |
| msandbox@localhost:119 | ib_logfile1        |     833.2492 | write     | 2035712 |
| msandbox@localhost:119 | ib_logfile0        |      11.2195 | write     |     512 |
| msandbox@localhost:119 | ib_logfile0        |      10.8199 | write     |     512 |
| msandbox@localhost:119 | ib_logfile0        |     716.1718 | write     | 1697792 |
| msandbox@localhost:119 | ib_logfile0        |     715.5535 | write     | 1697792 |
| msandbox@localhost:119 | ib_logfile0        |    8649.3640 | sync      |    NULL |
| msandbox@localhost:119 | ib_logfile1        |   35845.0394 | sync      |    NULL |
| srv_master_thread:15   | ibdata1            |     242.0453 | write     |  540672 |
| srv_master_thread:15   | ibdata1            |     241.2235 | write     |  540672 |
| srv_master_thread:15   | ibdata1            |    3337.6979 | sync      |    NULL |
| srv_master_thread:15   | ibdata1            |       6.3638 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.4863 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5542 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.4524 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6032 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6070 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6145 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5240 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5391 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5429 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6899 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.4675 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5316 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5881 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6107 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5203 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5391 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6409 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5730 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5655 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6070 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7465 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6371 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7238 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7578 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7615 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7917 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6899 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7276 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.8219 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.8483 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7465 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.7804 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      27.4456 | write     |   16384 |
| io_handler_thread:6    | ibdata1            |      20.4560 | write     |   16384 |
| io_handler_thread:7    | ibdata1            |      25.5342 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      17.5757 | write     |   16384 |
| io_handler_thread:6    | ibdata1            |      18.2317 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      19.8604 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      16.9989 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      15.9320 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      16.9801 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      14.6351 | write     |   16384 |
| io_handler_thread:3    | ibdata1            |      15.5362 | write     |   16384 |
| io_handler_thread:7    | ibdata1            |     250.6145 | write     |  360448 |
| srv_master_thread:15   | ibdata1            |    5418.0932 | sync      |    NULL |
| srv_master_thread:15   | ibdata1            |      45.0779 | write     |   49152 |
| srv_master_thread:15   | ibdata1            |      44.1052 | write     |   49152 |
| srv_master_thread:15   | ibdata1            |     468.7128 | sync      |    NULL |
| srv_master_thread:15   | ibdata1            |       3.9283 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.6107 | write     |   16384 |
| srv_master_thread:15   | ibdata1            |       0.5429 | write     |   16384 |
| io_handler_thread:6    | ibdata1            |      22.2656 | write     |   16384 |
| io_handler_thread:4    | ibdata1            |      33.1119 | write     |   32768 |
| srv_master_thread:15   | ibdata1            |     757.6984 | sync      |    NULL |
| msandbox@localhost:92  | latest_file_io.frm |       6.9217 | stat      |    NULL |
| msandbox@localhost:92  | latest_file_io.frm |      13.3496 | open      |    NULL |
| msandbox@localhost:92  | latest_file_io.frm |      18.1299 | read      |    3222 |
| msandbox@localhost:92  | latest_file_io.frm |      50.2993 | close     |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |     278.9498 | create    |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |     145.4617 | create    |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |      48.4030 | write     |     176 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       7.2120 | write     |     100 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.6456 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.6833 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.5099 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4797 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4722 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4722 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4722 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4420 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       3.4722 | write     |       7 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       9.8058 | chsize    |    1024 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |      15.6907 | close     |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |      98.9889 | close     |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |      13.2063 | open      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |     150.3212 | read      |      24 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       1.9265 | seek      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYI |       6.6427 | read      |     339 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |      13.3496 | open      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |      69.0853 | write     |     696 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |       9.8020 | write     |     728 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |       6.2620 | write     |     696 |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |       4.2488 | tell      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |       3.9736 | seek      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |       1.2743 | seek      |    NULL |
| msandbox@localhost:92  | #sql3a48_5c_64.MYD |      26.1676 | read      |    2120 |
+------------------------+--------------------+--------------+-----------+---------+
98 rows in set (0.16 sec)

Just for comparison, the first thread in the output was executing this:

mysql> insert into t1 (select * from t1);
Query OK, 65536 rows affected (0.78 sec)
Records: 65536  Duplicates: 0  Warnings: 0

mysql> select connection_id();
+-----------------+
| connection_id() |
+-----------------+
|             119 |
+-----------------+
1 row in set (0.00 sec)

In the version that I am using here (5.5.12) there is a bug in some cases of InnoDB IO (such as writing to the InnoDB log files) that double accounts for IO, so you have to do some manual work to exclude some rows (the second events of the same size but lower times) to find out just how much time was spent in this version. That annoys me. Luckily, we have since tracked down the problem there, and it’s been fixed and should be released in 5.5.13.

That results in 35845.0394 + 8649.3640 + 716.1718 + 11.2195 + 834.7232 = 46056.5179 microseconds, or roughly 0.046 seconds, out of the 0.78 seconds reported to the client. Most of the time at the filesystem layer was in doing an fsync() on the log files (which forced the data to hit the platters), rather than write().
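Rather than adding those numbers up by hand, the view above can total them per file and operation. A sketch (the thread value is from this particular run, and note that on 5.5.12 the double-accounted log writes mentioned above would still be included in the sums):

```sql
-- Sketch: total recent file IO latency per file and operation for one
-- session, using the latest_file_io view defined earlier.
SELECT file, operation,
       COUNT(*) AS events,
       ROUND(SUM(latency_usec), 2) AS total_usec
  FROM latest_file_io
 WHERE thread = 'msandbox@localhost:119'
 GROUP BY file, operation;
```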

The rest was spent doing “other things” in memory, which are then synced out-of-band by the background threads to the data files / filesystem outside of the query execution’s runtime. This is visible with srv_master_thread first doing a big write to the doublewrite buffer, and then merging those changes to the datafiles in 16K pages.

It’s also worth noting what is happening within the msandbox@localhost:92 thread at the end of the result. That is actually the file IO that was required to return the result of the query that I was executing. There will be a few events after that statement completed to clean them up etc.

The first 4 events against latest_file_io.frm are to get the structure of the view I was selecting against, and the rest are a result of the JOIN against INFORMATION_SCHEMA.PROCESSLIST, which in this case is forcing a temporary table to be created for the processlist result. In 5.6, there is no longer any need for this JOIN, as all the columns are available within the threads table, so the IO could be reduced to just the view definition then.

This section also helps to point out one of the deficiencies with using Brendan’s script (at least without extending it) to really see when filesystem interactions are the causes of overall latency – a large portion of the time spent there was in creating the temporary table files, not just writing to or reading from them. Likewise with the view and opening/closing.

This also helps to illustrate a point that Ronald Bradford has asked me in the past:

My work with the PERFORMANCE_SCHEMA has been limited to experimenting such as 
described by Tracking IO with performance_schema however a recent discussion 
with Mark Leith highlighted that with the performance_schema it will be possible 
to determine the exact size of disk based temporary tables created by the MySQL 
kernel for example.

Sorry for the delay Ronald, but I hope this shows a good example of exactly what kind of IO we track against temporary tables as well. If not, here’s a better one:

mysql> select file, round(sum(bytes)/1024/1024, 2) size_mb, round(sum(latency_usec), 2) latency_usec
    ->   from latest_file_io
    ->  where file like '#sql%' and operation = 'write'
    -> group by file, operation
    -> order by file;
+---------------------+---------+--------------+
| file                | size_mb | latency_usec |
+---------------------+---------+--------------+
| #sql3a48_575_2a.MYI |    0.00 |        10.11 |
| #sql3a48_575_2b.MYD |    2.11 |      7435.86 |
| #sql3a48_575_2b.MYI |    0.00 |        10.81 |
| #sql3a48_575_2c.MYD |    0.00 |        46.16 |
| #sql3a48_575_2c.MYI |    0.00 |        61.27 |
| #sql3a48_575_2d.MYD |    2.30 |      7638.26 |
| #sql3a48_575_2d.MYI |    0.00 |        63.24 |
+---------------------+---------+--------------+
7 rows in set (0.12 sec)

This may be a bit of a hit and miss approach, as this table only stores 10,000 events by default, and when you’re playing with temporary tables the number of file IO events can easily exceed this. Temporary tables are also very transient; MySQL doesn’t hold the file handles open nor keep much metadata around once they are removed, so we don’t aggregate their data within the file summary tables at this time.

We do however summarize the events for regular table files, and the other files held open by the instance such as binary logs and log files etc. as shown above.
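Those per-file aggregations survive even after the 10,000 event history buffer wraps. A sketch against 5.5’s file_summary_by_instance, which tracks counts and bytes per open file (though not latency at this level):

```sql
-- Sketch: busiest files by IO operation count, from the per-instance
-- file summary table (counts and bytes only in 5.5).
SELECT SUBSTRING_INDEX(file_name, '/', -1) AS file,
       count_read, count_write,
       ROUND(sum_number_of_bytes_read/1024/1024, 2)  AS read_mb,
       ROUND(sum_number_of_bytes_write/1024/1024, 2) AS write_mb
  FROM performance_schema.file_summary_by_instance
 ORDER BY count_read + count_write DESC
 LIMIT 5;
```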

Each wait is summarized at 3 different levels within 5.5 – here are some examples, starting each with the raw output for the InnoDB file IO events:

Globally by event – events_waits_summary_global_by_event_name

mysql> select * from events_waits_summary_global_by_event_name
    ->  where event_name like 'wait/io/file/innodb/%';
+--------------------------------------+------------+----------------+----------------+----------------+----------------+
| EVENT_NAME                           | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+--------------------------------------+------------+----------------+----------------+----------------+----------------+
| wait/io/file/innodb/innodb_data_file |        364 |   369575256440 |         395850 |     1015316638 |   241104845800 |
| wait/io/file/innodb/innodb_log_file  |         53 |   130454237420 |        2386410 |     2461400706 |    58808150830 |
| wait/io/file/innodb/innodb_temp_file |          0 |              0 |              0 |              0 |              0 |
+--------------------------------------+------------+----------------+----------------+----------------+----------------+
3 rows in set (0.00 sec)

These allow you to find your top IO consumers by latency, and can easily be joined against the file_summary_by_event_name table, to also show the number of bytes read and written per type of IO:

DROP VIEW IF EXISTS top_global_io_consumers_by_latency;

CREATE VIEW top_global_io_consumers_by_latency AS
SELECT event_name,
       count_star,
       ROUND(sum_timer_wait/1000000000000, 2) total_sec,
       ROUND(min_timer_wait/1000000, 2) min_usec,
       ROUND(avg_timer_wait/1000000, 2) avg_usec,
       ROUND(max_timer_wait/1000000, 2) max_usec,
       count_read,
       ROUND(sum_number_of_bytes_read/1024/1024, 2) read_mb,
       count_write,
       ROUND(sum_number_of_bytes_write/1024/1024, 2) write_mb
  FROM performance_schema.events_waits_summary_global_by_event_name 
  JOIN performance_schema.file_summary_by_event_name USING (event_name) 
 WHERE event_name LIKE 'wait/io/file/%' 
 ORDER BY sum_timer_wait DESC LIMIT 5;

mysql> select * from ps_helper.top_global_io_consumers_by_latency;
+--------------------------------------+------------+-----------+----------+----------+-----------+------------+---------+-------------+----------+
| event_name                           | count_star | total_sec | min_usec | avg_usec | max_usec  | count_read | read_mb | count_write | write_mb |
+--------------------------------------+------------+-----------+----------+----------+-----------+------------+---------+-------------+----------+
| wait/io/file/myisam/dfile            |    1828445 |     10.82 |     0.74 |     5.92 | 361525.96 |    1250257 | 2094.98 |      577762 |  1019.34 |
| wait/io/file/myisam/kfile            |       1212 |      0.34 |     1.12 |   283.54 | 109499.59 |        132 |    0.03 |         696 |     0.02 |
| wait/io/file/innodb/innodb_data_file |        407 |      0.27 |     7.36 |   652.37 |  17684.12 |        397 |   10.14 |           4 |     0.06 |
| wait/io/file/sql/file_parser         |        108 |      0.15 |     1.25 |  1355.50 |  41132.89 |         20 |    0.03 |           5 |     0.02 |
| wait/io/file/sql/FRM                 |       1027 |      0.12 |     1.08 |   115.64 |  41406.17 |        431 |    0.14 |           0 |     0.00 |
+--------------------------------------+------------+-----------+----------+----------+-----------+------------+---------+-------------+----------+
5 rows in set (0.11 sec)

This is my test instance, so there is not a great amount of IO going on. In this example, MyISAM data file IO is the top consumer, followed by MyISAM index IO, InnoDB data file IO, view parsing, and .frm file parsing respectively.

Per Thread by event – events_waits_summary_by_thread_by_event_name

mysql> select * from events_waits_summary_by_thread_by_event_name 
    ->  where event_name like 'wait/io/file/innodb/%';
+-----------+--------------------------------------+------------+----------------+----------------+----------------+----------------+
| THREAD_ID | EVENT_NAME                           | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+-----------+--------------------------------------+------------+----------------+----------------+----------------+----------------+
|         0 | wait/io/file/innodb/innodb_data_file |        297 |    17541591340 |        9183720 |       59062597 |     2244695700 |
|         0 | wait/io/file/innodb/innodb_log_file  |         25 |      594694880 |        2386410 |       23787795 |      291044000 |
|         0 | wait/io/file/innodb/innodb_temp_file |          0 |              0 |              0 |              0 |              0 |
|         3 | wait/io/file/innodb/innodb_data_file |          0 |              0 |              0 |              0 |              0 |
|         3 | wait/io/file/innodb/innodb_log_file  |          0 |              0 |              0 |              0 |              0 |
|         3 | wait/io/file/innodb/innodb_temp_file |          0 |              0 |              0 |              0 |              0 |
...
|         1 | wait/io/file/innodb/innodb_data_file |          0 |              0 |              0 |              0 |              0 |
|         1 | wait/io/file/innodb/innodb_log_file  |          0 |              0 |              0 |              0 |              0 |
|         1 | wait/io/file/innodb/innodb_temp_file |          0 |              0 |              0 |              0 |              0 |
+-----------+--------------------------------------+------------+----------------+----------------+----------------+----------------+
51 rows in set (0.00 sec)

These are the events summarized by thread, so that you can trace which threads – either foreground or background – are likely to be consuming the most IO. This is from a latency perspective only, however, and cannot easily be correlated back to the file summary tables in a meaningful way. This means you cannot easily correlate the amount of IO done with the time taken per thread (unless you are polling the current / history tables, if they are enabled).

However, it can provide insight into who is suffering the most from filesystem latency:

DROP VIEW IF EXISTS top_thread_io_consumers_by_latency;

CREATE VIEW top_thread_io_consumers_by_latency AS
SELECT IF(id IS NULL, 
             CONCAT(SUBSTRING_INDEX(name, '/', -1), ':', thread_id), 
             CONCAT(user, '@', host, ':', id)
          ) thread, 
       SUM(count_star) count_star,
       ROUND(SUM(sum_timer_wait)/1000000000000, 2) total_sec,
       ROUND(MIN(min_timer_wait)/1000000, 2) min_usec,
       ROUND(AVG(avg_timer_wait)/1000000, 2) avg_usec,
       ROUND(MAX(max_timer_wait)/1000000, 2) max_usec
  FROM performance_schema.events_waits_summary_by_thread_by_event_name 
  LEFT JOIN performance_schema.threads USING (thread_id) 
  LEFT JOIN information_schema.processlist ON processlist_id = id
 WHERE event_name LIKE 'wait/io/file/%'
   AND sum_timer_wait > 0
 GROUP BY thread
 ORDER BY total_sec DESC LIMIT 5;

mysql>  select * from top_thread_io_consumers_by_latency;
+----------------------+------------+-----------+----------+----------+----------+
| thread               | count_star | total_sec | min_usec | avg_usec | max_usec |
+----------------------+------------+-----------+----------+----------+----------+
| msandbox@localhost:3 |       2159 |      0.98 |     0.05 |  1271.01 | 56968.61 |
| main:0               |        971 |      0.24 |     1.08 |   257.33 | 14270.73 |
| srv_master_thread:15 |         54 |      0.03 |     0.46 |  1104.93 |  5547.38 |
| io_handler_thread:1  |          8 |      0.02 |     9.90 |  2285.61 | 10963.86 |
| io_handler_thread:7  |          3 |      0.00 |    45.94 |    67.70 |   102.48 |
+----------------------+------------+-----------+----------+----------+----------+
5 rows in set (0.15 sec)

Per Instance – events_waits_summary_by_instance

mysql> select * from events_waits_summary_by_instance 
    ->  where event_name like 'wait/io/file/innodb/%';
+--------------------------------------+-----------------------+------------+----------------+----------------+----------------+----------------+
| EVENT_NAME                           | OBJECT_INSTANCE_BEGIN | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+--------------------------------------+-----------------------+------------+----------------+----------------+----------------+----------------+
| wait/io/file/innodb/innodb_data_file |            4680873728 |        364 |   369575256440 |         395850 |     1015316638 |   241104845800 |
| wait/io/file/innodb/innodb_log_file  |            4681729232 |         49 |   130426765430 |        2386410 |     2661770723 |    58808150830 |
| wait/io/file/innodb/innodb_log_file  |            4685789600 |          4 |       27471990 |        5877430 |        6867997 |        8271380 |
+--------------------------------------+-----------------------+------------+----------------+----------------+----------------+----------------+
3 rows in set (0.04 sec)

Finally, we end with all of the events summarized by instance of the event. Within performance_schema, if there are multiple instances of a certain event type – such as two InnoDB log files, multiple InnoDB data files, or multiple instances of a specific mutex in memory – each of those is also aggregated individually within this table.

In the above example, there is a single InnoDB datafile (ibdata1 – 4680873728), and two InnoDB log files (ib_logfile0 and ib_logfile1).
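The byte counters for the same files are, however, already available keyed by file name on the file summary side. A sketch against file_summary_by_instance, which does carry a FILE_NAME column in 5.5:

```sql
-- Sketch: per-file IO volume, keyed by file name rather than
-- object_instance_begin, from the file summary side.
SELECT file_name,
       event_name,
       count_read,
       ROUND(sum_number_of_bytes_read/1024/1024, 2)  AS read_mb,
       count_write,
       ROUND(sum_number_of_bytes_write/1024/1024, 2) AS write_mb
  FROM performance_schema.file_summary_by_instance
 WHERE event_name LIKE 'wait/io/file/innodb/%'
 ORDER BY sum_number_of_bytes_read + sum_number_of_bytes_write DESC;
```

This gives you the volume side per file; what it cannot (yet) give you is the latency side held in events_waits_summary_by_instance.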

We have a bit of a disconnect here within performance_schema, however: it is not possible at the moment to join the events_waits_summary_by_instance table to the file_summary_by_instance table. file_summary_by_instance uses the file_name column as its key, whilst events_waits_summary_by_instance only exposes the object_instance_begin column. We hope to rectify this within the 5.6 release, by adding the object_instance_begin column to the file_summary_by_instance table (and possibly aggregating more data inline with the other file summary tables).

This would then allow us to drill in by file to both the total amount of IO done, and the total/min/max/avg latency involved in doing so, per file accessed by the database. I’m putting together a worklog for this, and will post it to the comments once it is public.

I’ve also now created a ps_helper page dedicated to collecting various views and procedures that help with looking into performance_schema data. I’ve still got a number of things that I’m playing with that I hope to blog about and publish soon as well. :)

Monitoring Table and Index IO with PERFORMANCE_SCHEMA

Another new feature of performance_schema within 5.6 is the ability to monitor both table and index IO. This was added in two parts: WL#4895, which instruments the individual table IO waits, and WL#5379, which aggregates them.

If you read my blog on graphing statement wait times, some of the data that WL#4895 exposes can be seen within the examples there – it is within the “light coral” coloured blocks, such as the following from the InnoDB PK lookup example I gave:

Index IO Example

This shows the level of detail you can get per table IO operation, per statement.

However, as Mark Callaghan noted recently, some of the more valuable insight on how tables / indexes are accessed can be gleaned by looking at these statistics, with some further aggregation, over time.

This is what WL#5379 covers, and what I’ll focus on here. It implements two new tables within performance_schema: table_io_waits_summary_by_table and table_io_waits_summary_by_index_usage.

Both tables have this basic structure (although table_io_waits_summary_by_index_usage also adds an INDEX_NAME column, and performs its aggregation grouping by the index name as well):

mysql> DESC table_io_waits_summary_by_table;
+------------------+---------------------+------+-----+---------+-------+
| Field            | Type                | Null | Key | Default | Extra |
+------------------+---------------------+------+-----+---------+-------+
| OBJECT_TYPE      | varchar(64)         | YES  |     | NULL    |       |
| OBJECT_SCHEMA    | varchar(64)         | YES  |     | NULL    |       |
| OBJECT_NAME      | varchar(64)         | YES  |     | NULL    |       |
| COUNT_STAR       | bigint(20) unsigned | NO   |     | NULL    |       |
| SUM_TIMER_WAIT   | bigint(20) unsigned | NO   |     | NULL    |       |
| MIN_TIMER_WAIT   | bigint(20) unsigned | NO   |     | NULL    |       |
| AVG_TIMER_WAIT   | bigint(20) unsigned | NO   |     | NULL    |       |
| MAX_TIMER_WAIT   | bigint(20) unsigned | NO   |     | NULL    |       |
| COUNT_READ       | bigint(20) unsigned | NO   |     | NULL    |       |
| SUM_TIMER_READ   | bigint(20) unsigned | NO   |     | NULL    |       |
| MIN_TIMER_READ   | bigint(20) unsigned | NO   |     | NULL    |       |
| AVG_TIMER_READ   | bigint(20) unsigned | NO   |     | NULL    |       |
| MAX_TIMER_READ   | bigint(20) unsigned | NO   |     | NULL    |       |
| COUNT_WRITE      | bigint(20) unsigned | NO   |     | NULL    |       |
| SUM_TIMER_WRITE  | bigint(20) unsigned | NO   |     | NULL    |       |
| MIN_TIMER_WRITE  | bigint(20) unsigned | NO   |     | NULL    |       |
| AVG_TIMER_WRITE  | bigint(20) unsigned | NO   |     | NULL    |       |
| MAX_TIMER_WRITE  | bigint(20) unsigned | NO   |     | NULL    |       |
| COUNT_FETCH      | bigint(20) unsigned | NO   |     | NULL    |       |
| SUM_TIMER_FETCH  | bigint(20) unsigned | NO   |     | NULL    |       |
| MIN_TIMER_FETCH  | bigint(20) unsigned | NO   |     | NULL    |       |
| AVG_TIMER_FETCH  | bigint(20) unsigned | NO   |     | NULL    |       |
| MAX_TIMER_FETCH  | bigint(20) unsigned | NO   |     | NULL    |       |
| COUNT_INSERT     | bigint(20) unsigned | NO   |     | NULL    |       |
| SUM_TIMER_INSERT | bigint(20) unsigned | NO   |     | NULL    |       |
| MIN_TIMER_INSERT | bigint(20) unsigned | NO   |     | NULL    |       |
| AVG_TIMER_INSERT | bigint(20) unsigned | NO   |     | NULL    |       |
| MAX_TIMER_INSERT | bigint(20) unsigned | NO   |     | NULL    |       |
| COUNT_UPDATE     | bigint(20) unsigned | NO   |     | NULL    |       |
| SUM_TIMER_UPDATE | bigint(20) unsigned | NO   |     | NULL    |       |
| MIN_TIMER_UPDATE | bigint(20) unsigned | NO   |     | NULL    |       |
| AVG_TIMER_UPDATE | bigint(20) unsigned | NO   |     | NULL    |       |
| MAX_TIMER_UPDATE | bigint(20) unsigned | NO   |     | NULL    |       |
| COUNT_DELETE     | bigint(20) unsigned | NO   |     | NULL    |       |
| SUM_TIMER_DELETE | bigint(20) unsigned | NO   |     | NULL    |       |
| MIN_TIMER_DELETE | bigint(20) unsigned | NO   |     | NULL    |       |
| AVG_TIMER_DELETE | bigint(20) unsigned | NO   |     | NULL    |       |
| MAX_TIMER_DELETE | bigint(20) unsigned | NO   |     | NULL    |       |
+------------------+---------------------+------+-----+---------+-------+

We aggregate usage data at three different levels within these tables, per table / index, for your convenience:

All ops, with the COUNT_STAR and *_WAIT columns
   Read ops, with the *_READ columns
       Fetch ops, with the *_FETCH columns
   Write ops, with the *_WRITE columns
       Insert ops, with the *_INSERT columns
       Update ops, with the *_UPDATE columns
       Delete ops, with the *_DELETE columns

For those of you who are used to the TABLE_STATISTICS / INDEX_STATISTICS patches from Google, or one of our “siblings” that has incorporated them, you may see the similarity in some of these statistics.

Mark also noted about these patches:

It also is very easy to use — select * from easy_to_remember_table_name. I hope the P_S can also provide that.

I think people will get used to the names of the tables over time (I certainly have, in the testing that I have been doing). However, if these tables are either forgettable (erm, SHOW TABLES, or use a client with auto-completion), or provide too much data by default, or you already have tooling around the Google patches, I would suggest using something that databases are good at – views.

Here’s how to return the same data as the TABLE_STATISTICS patch:

CREATE DATABASE IF NOT EXISTS ps_helper;
USE ps_helper;

DROP VIEW IF EXISTS table_statistics;

CREATE VIEW table_statistics AS 
SELECT pst.object_schema AS table_schema, 
       pst.object_name AS table_name, 
       pst.count_read AS rows_read, 
       pst.count_write AS rows_changed,
       (pst.count_write * COUNT(psi.index_name)) AS rows_changed_x_indexes
  FROM performance_schema.table_io_waits_summary_by_table AS pst
  LEFT JOIN performance_schema.table_io_waits_summary_by_index_usage AS psi 
    ON pst.object_schema = psi.object_schema AND pst.object_name = psi.object_name
   AND psi.index_name IS NOT NULL
 GROUP BY pst.object_schema, pst.object_name
 ORDER BY pst.sum_timer_wait DESC;

mysql> SELECT * FROM ps_helper.table_statistics LIMIT 5;
+--------------+-----------------------+-----------+--------------+------------------------+
| table_schema | table_name            | rows_read | rows_changed | rows_changed_x_indexes |
+--------------+-----------------------+-----------+--------------+------------------------+
| mysql        | time_zone_leap_second |        23 |            0 |                      0 |
| test         | t1                    |         4 |            1 |                      2 |
| mysql        | slave_master_info     |         1 |            0 |                      0 |
| mysql        | user                  |         5 |            0 |                      0 |
| mysql        | db                    |         3 |            0 |                      0 |
+--------------+-----------------------+-----------+--------------+------------------------+
5 rows in set (0.00 sec)

This solution also orders the results by the SUM_TIMER_WAIT column for each table in the background, which is where we offer more insight into table usage with performance_schema.

Not only do we count the operations against the tables, we can also time them (if timing is enabled), allowing you to track the most contended tables, not just the most heavily used ones. The result always returns the highest overhead tables at the top, making “top N” queries much easier than with TABLE_STATISTICS (which appears to return rows in some random order).

Personally I’d be tempted to add more columns from our tables though – breaking down by operation type for a start.
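A sketch of what that might look like, splitting the change counter out by operation type and adding total latency (the column aliases are my own invention):

```sql
-- Sketch: TABLE_STATISTICS-style output broken down by operation type,
-- ordered by total latency (timer columns are in picoseconds).
SELECT object_schema AS table_schema,
       object_name   AS table_name,
       count_fetch   AS rows_fetched,
       count_insert  AS rows_inserted,
       count_update  AS rows_updated,
       count_delete  AS rows_deleted,
       ROUND(sum_timer_wait/1000000000000, 2) AS total_latency_sec
  FROM performance_schema.table_io_waits_summary_by_table
 ORDER BY sum_timer_wait DESC;
```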

And for INDEX_STATISTICS:

DROP VIEW IF EXISTS index_statistics;

CREATE VIEW index_statistics AS
SELECT object_schema AS table_schema,
       object_name AS table_name,
       index_name,
       count_read AS rows_read
  FROM performance_schema.table_io_waits_summary_by_index_usage
 WHERE index_name IS NOT NULL
 ORDER BY sum_timer_wait DESC;

mysql> select * from index_statistics limit 5;
+--------------+-----------------------+------------+-----------+
| table_schema | table_name            | index_name | rows_read |
+--------------+-----------------------+------------+-----------+
| mysql        | time_zone_leap_second | PRIMARY    |        23 |
| mysql        | slave_master_info     | PRIMARY    |         1 |
| test         | t1                    | PRIMARY    |         1 |
| test         | t1                    | j          |         1 |
| mysql        | tables_priv           | PRIMARY    |         1 |
+--------------+-----------------------+------------+-----------+
5 rows in set (0.01 sec)

Again, we order by the SUM_TIMER_WAIT column, to find the highest overhead indexes first, for an easy “top N”.

However, there is another WHERE clause here – WHERE index_name IS NOT NULL – which is worth mentioning. Within the table_io_waits_summary_by_index_usage table, not only do we aggregate operations per index, we also aggregate how many operations did not use indexes when accessing the tables. We do this with an extra row per table, where the INDEX_NAME column is NULL:

mysql> SHOW CREATE TABLE test.t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `i` int(11) NOT NULL AUTO_INCREMENT,
  `j` varchar(10) DEFAULT NULL,
  `k` int(11) DEFAULT NULL,
  PRIMARY KEY (`i`),
  KEY `j` (`j`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> SELECT object_schema, 
    ->        object_name, 
    ->        index_name, 
    ->        count_fetch,
    ->        count_insert,
    ->        count_update,
    ->        count_delete
    ->   FROM performance_schema.table_io_waits_summary_by_index_usage 
    ->  WHERE object_schema = 'test' 
    ->    AND object_name = 't1';
+---------------+-------------+------------+-------------+--------------+--------------+--------------+
| object_schema | object_name | index_name | count_fetch | count_insert | count_update | count_delete |
+---------------+-------------+------------+-------------+--------------+--------------+--------------+
| test          | t1          | PRIMARY    |           9 |            0 |            0 |            0 |
| test          | t1          | j          |       30803 |            0 |            0 |            0 |
| test          | t1          | NULL       |         101 |        30722 |            4 |            3 |
+---------------+-------------+------------+-------------+--------------+--------------+--------------+
3 rows in set (0.01 sec)

INSERT statements will always be counted against the “null row” per table (an INSERT never directly uses an index, although it may add to them). Currently we also seem to be putting all of the *_UPDATE and *_DELETE aggregation into the “null row” as well – though I think this is a bug (it is at least undefined in the WL), so I’ve opened Bug#60905 for it. Fixing that would allow us to get more information on which specific access methods, per table, are causing the scans.

Like the Google patches, it is now easy to find out which indexes are not being used as well. However, the performance_schema implementation allows this in a much more elegant way. Previously you would have had to LEFT JOIN against the INFORMATION_SCHEMA.STATISTICS table, as Vadim shows in his example, which incurs a huge penalty on systems with a large number of tables – just look at the statement runtime in his examples. performance_schema has one row per index in each table, whether or not it has been read, and does not need to hit disk to gather information about the tables – all of its data is entirely in memory, and no high overhead JOIN operations are necessary.

To give a view example again:

DROP VIEW IF EXISTS unused_indexes;
 
CREATE VIEW unused_indexes AS
SELECT object_schema,
       object_name,
       index_name
  FROM performance_schema.table_io_waits_summary_by_index_usage 
 WHERE index_name IS NOT NULL
   AND count_star = 0
 ORDER BY object_schema, object_name;

mysql> SELECT * FROM ps_helper.unused_indexes WHERE object_schema = 'test';
+---------------+-------------+------------+
| object_schema | object_name | index_name |
+---------------+-------------+------------+
| test          | t2          | PRIMARY    |
| test          | t2          | j          |
+---------------+-------------+------------+
2 rows in set (0.00 sec)

But given the extra data within performance_schema, it is also possible to find the tables which cause full table scans:


DROP VIEW IF EXISTS table_scans;

CREATE VIEW table_scans AS
SELECT object_schema, 
       object_name,
       count_read AS rows_full_scanned
  FROM performance_schema.table_io_waits_summary_by_index_usage 
 WHERE index_name IS NULL
   AND count_read > 0
 ORDER BY count_read DESC;

mysql> SELECT * FROM ps_helper.table_scans WHERE object_schema = 'test';
+---------------+-------------+-------------------+
| object_schema | object_name | rows_full_scanned |
+---------------+-------------+-------------------+
| test          | t1          |                 5 |
| test          | t2          |                 3 |
+---------------+-------------+-------------------+
2 rows in set (0.01 sec)

We also track temporary tables within this instrumentation:

mysql> SELECT * 
    ->   FROM performance_schema.table_io_waits_summary_by_table 
    ->  WHERE  object_type = 'TEMPORARY TABLE'\G
*************************** 1. row ***************************
     OBJECT_TYPE: TEMPORARY TABLE
   OBJECT_SCHEMA: test
     OBJECT_NAME: t1_temp
      COUNT_STAR: 18
  SUM_TIMER_WAIT: 117369000
  MIN_TIMER_WAIT: 117180
  AVG_TIMER_WAIT: 6520500
  MAX_TIMER_WAIT: 31952340
      COUNT_READ: 12
  SUM_TIMER_READ: 94311000
  MIN_TIMER_READ: 117180
  AVG_TIMER_READ: 7858998
  MAX_TIMER_READ: 31952340
     COUNT_WRITE: 6
 SUM_TIMER_WRITE: 23058000
 MIN_TIMER_WRITE: 147420
 AVG_TIMER_WRITE: 3842748
 MAX_TIMER_WRITE: 13400100
...

All on top of the per-wait information that can be tied to threads, stages, and statements! Possibilities abound! :)

MySQL system status snapshots using the MySQL Enterprise Monitor Agent

MySQL Enterprise Monitor collects a huge amount of information about MySQL and the systems that it runs on. To do this, it employs an “Agent” to collect these statistics.

The agent can sit either local to the database server, or on a remote host – perhaps even the same host as the Dashboard server, if you decide you don’t want to distribute it to many boxes – and it checks whether it is local or remote to decide whether to collect OS statistics. A single agent can monitor one or many instances from the same process.

Not all of the information that it can collect is fully exposed within the UI yet, however (there are so many things to do lately!), so some people don’t realize the kind of data that we can gather on top of the defaults that are exposed. Another, even less obvious, point is that the MEM agent collects all of the data for the various things it has scheduled to run (whether statements against MySQL, calls to SIGAR for abstracted OS data, or data from MySQL Proxy for Query Analysis), even if it isn’t actually required to send all of that data back to the Dashboard server.

It keeps all of this as an internal “inventory”, in what we term the “item hash”. A few months ago, whilst trying to debug some memory issues, Jan Kneschke (of MySQL Proxy and lighttpd fame) implemented a way to dump this item hash in full, using YAML, to a file local to the agent – by sending a SIGUSR2 signal to the running process:

 
$ killall -USR2 mysql-monitor-agent

The agent then dumps its item hash structure to a file in the /tmp directory, with the date and time in the name, such as /tmp/mysql-monitor-agent-items.dump-20110414-124656.txt.

I’d been playing with this recently, and realized that this also allows you to maintain a local set of snapshots of performance and configuration of the entire system on a regular basis as well. Having the MEM UI is wonderful, but sometimes it’s handy to have something to {grep/sed/awk/your favorite} on the shell of the system you are trying to diagnose as well.

So a quick cron job later:

$ crontab -l
* * * * * killall -USR2 mysql-monitor-agent
$ ll /tmp/mysql-monitor-agent-items.dump*
-rw-rw-rw-  1 mark  wheel  88332 14 Apr 13:05 /tmp/mysql-monitor-agent-items.dump-20110414-130500.txt
-rw-rw-rw-  1 mark  wheel  88331 14 Apr 13:06 /tmp/mysql-monitor-agent-items.dump-20110414-130600.txt
-rw-rw-rw-  1 mark  wheel  88332 14 Apr 13:07 /tmp/mysql-monitor-agent-items.dump-20110414-130700.txt
-rw-rw-rw-  1 mark  wheel  88332 14 Apr 13:08 /tmp/mysql-monitor-agent-items.dump-20110414-130800.txt
...

Here’s an example of one of the above files, which is monitoring a stand alone 5.5.10 instance on my macbook.

The file starts off with some info on the agent process itself, followed by the output of various SQL statements that we run (those in the items-mysql-monitor.xml or custom.xml files), and then the main data collection items for the various in-built collectors. To point out some of the interesting parts, you can get:

  • All of the parsed variables from SHOW ENGINE INNODB STATUS
    "innodbstatus": 
        attributes:
          "innodb_bp_add_alloc": int
          "innodb_bp_created_per_sec": float
          "innodb_bp_db_pages": int
          "innodb_bp_dictionary_alloc": int
    ...
        instances:
          "5a41ac04-115d-448f-a0ca-144eeb5ac8b5": 
    ...
            values:
              "innodb_bp_add_alloc": 0
              "innodb_bp_created_per_sec": 0.000000
              "innodb_bp_db_pages": 191
              "innodb_bp_dictionary_alloc": 63928
    
  • SHOW GLOBAL STATUS
      "status": 
        attributes:
    ...
          "Bytes_received": int
          "Bytes_sent": int
    ...
        instances:
    ...
            values:
    ...
              "Bytes_received": 6073208
              "Bytes_sent": 70190554
    
  • SHOW GLOBAL VARIABLES
      "variables": 
        attributes:
          "auto_increment_increment": int
          "auto_increment_offset": int
    ...
        instances:
    ...
            values:
              "auto_increment_increment": 1
              "auto_increment_offset": 1
    
  • SHOW PROCESSLIST
      "processlist": 
    ...
        instances:
    ...
            values:
              "Command": "Query"
              "Host": "localhost:51682"
              "Id": 4
              "Info": "SHOW /*!40000 FULL */ PROCESSLIST"
              "State": undefined
              "Time": 0
              "User": "msandbox"
              "db": undefined
    ...
    
  • CPU details (the counters are CPU ticks), for each CPU in the system (counting from 0 upwards)
    "os": 
      "cpu": 
        attributes:
          "cpu_cache_size": int
          "cpu_idle": int
    ...
        instances:
          "ssh:{86:17:98:e8:aa:c5:55:11:f8:a4:76:07:78:fb:5a:1e}.0": 
    ...
            values:
              "cpu_cache_size": 6144
              "cpu_idle": 445505030
    ...
          "ssh:{86:17:98:e8:aa:c5:55:11:f8:a4:76:07:78:fb:5a:1e}.1": 
    ...
            values:
              "cpu_cache_size": 6144
              "cpu_idle": 442250880
    ...
    
  • Disk IO information
      "disk": 
        attributes:
          "disk_bytes_read": int
          "disk_bytes_written": int
    ...
        instances:
          "ssh:{86:17:98:e8:aa:c5:55:11:f8:a4:76:07:78:fb:5a:1e}./dev/disk0s2": 
    ...
            values:
              "disk_bytes_read": 43988611584
              "disk_bytes_written": 101020259328
    ...
          "ssh:{86:17:98:e8:aa:c5:55:11:f8:a4:76:07:78:fb:5a:1e}./dev/disk2s3": 
    ...
            values:
              "disk_bytes_read": 12241350656
              "disk_bytes_written": 18404140032
    

And that is without even showing the output you can get from SHOW SLAVE STATUS, SHOW MASTER STATUS, SHOW MASTER LOGS, mountpoint space usage information, memory usage, network interface information, load averages, OS details, all the info from the various SQL statement results, and Query Analysis data (if enabled) – all in a single snapshot of data.

So now if you’re interested in how something has been running over time, and you can’t get to a MEM UI for some reason – or want to do some local analysis using your own scripts:

$ grep -ih com_show_variables /tmp/mysql-monitor-agent-items* | grep -v int
          "Com_show_variables": 2781
          "Com_show_variables": 2782
          "Com_show_variables": 2783
          "Com_show_variables": 2784
          "Com_show_variables": 2785
...

Or if somebody tells you some hiccup happened at 12:46 today, you can go and take a look at the matching file as well!

Most of the data is collected at one-minute intervals – especially all of the OS data, and the output of SHOW [VARIABLES | STATUS | INNODB STATUS | MASTER STATUS | SLAVE STATUS]. Some of the SQL statements are run on longer intervals (some up to 6 hours or more apart), so those may not be fully up to date in the current snapshot.

SHOW PROCESSLIST by default only runs once on startup and is not scheduled for anything further by the MEM UI yet, so it will not auto-update. You can force this, however, by adding a rule that uses one of the items from its output, such as mysql:processlist:time, using a dummy “instance” for the variable such as 0 (the instance in the processlist case is the connection id from SHOW PROCESSLIST), so that no instances are ever pulled back to MEM, wasting space there. Schedule that dummy rule for once a minute as well, and you should get the full output of SHOW PROCESSLIST in your item dump too.

What a difference Atomics can make

Following up to my previous blog on graphing statement execution in performance_schema, Sunny Bains on the InnoDB team pointed out that, looking at the INSERT graph, he didn’t think I had atomic operations enabled within my build.

Particularly here (from trunk):

    225 /******************************************************************//**
    226 Increments lock_word the specified amount and returns new value.
    227 @return lock->lock_word after increment */
    228 UNIV_INLINE
    229 lint
    230 rw_lock_lock_word_incr(
    231 /*===================*/
    232         rw_lock_t*      lock,           /*!< in/out: rw-lock */
    233         ulint           amount)         /*!< in: amount of increment */
    234 {
    235 #ifdef INNODB_RW_LOCKS_USE_ATOMICS
    236         return(os_atomic_increment_lint(&lock->lock_word, amount));
    237 #else /* INNODB_RW_LOCKS_USE_ATOMICS */
    238         lint local_lock_word;
    239 
    240         mutex_enter(&(lock->mutex));    < === See this a lot in his graph
    241 
    242         lock->lock_word += amount;
    243         local_lock_word = lock->lock_word;
    244 
    245         mutex_exit(&(lock->mutex));
    246 
    247         return(local_lock_word);
    248 #endif /* INNODB_RW_LOCKS_USE_ATOMICS */
    249 }
    

    He was of course right; I was building on OS X 10.5, which ships with this gcc version by default:

    Cerberus:lp-mysql-trunk mark$ gcc --version
    i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5493)
    

    InnoDB supports atomic operations with GCC 4.1.2 or later, so after a quick update of gcc via MacPorts (sudo port install gcc45):

    Cerberus:lp-mysql-trunk mark$ export CC=/opt/local/bin/gcc-mp-4.5
    Cerberus:lp-mysql-trunk mark$ export CXX=/opt/local/bin/gcc-mp-4.5
    Cerberus:lp-mysql-trunk mark$ cmake . && make
    -- The C compiler identification is GNU
    -- The CXX compiler identification is GNU
    ...
    -- Performing Test HAVE_GCC_ATOMIC_BUILTINS
    -- Performing Test HAVE_GCC_ATOMIC_BUILTINS - Success
    ...
    -- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS
    -- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS - Success
    -- Performing Test HAVE_IB_ATOMIC_PTHREAD_T_GCC
    -- Performing Test HAVE_IB_ATOMIC_PTHREAD_T_GCC - Success
    ...
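    The HAVE_GCC_ATOMIC_BUILTINS test above essentially probes whether the GCC __sync atomic builtins compile and behave correctly. A minimal standalone check along the same lines (a sketch of the idea, not the actual cmake test):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        long lock_word = 0;

        /* Atomically add and return the new value - no mutex needed, which
         * is what rw_lock_lock_word_incr() gets to do when
         * INNODB_RW_LOCKS_USE_ATOMICS is defined. */
        long after = __sync_add_and_fetch(&lock_word, 5);

        assert(after == 5 && lock_word == 5);
        printf("atomics ok: %ld\n", after);
        return 0;
    }
    ```

    If this fails to compile, your gcc predates the atomic builtins (as the Apple gcc 4.0.1 above does), and InnoDB falls back to the mutex-protected path shown earlier.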

    Re-running the same tests that I had done before to generate the graphs really does tell a story, even when just looking at the thumbnail view:

    InnoDB INSERT no atomics
    InnoDB INSERT atomics

    The timings may not be a true reflection of the consistent overhead (this is my macbook, which does a lot of other work whilst I’m testing this stuff out), but just looking at the difference in the number of synchronization points, it’s clear that there will be far less contention (and hence potential wait time) in the atomics case.

    So what’s the lesson we can learn from this?

    If you see a large number of waits for either the wait/synch/mutex/innodb/rw_lock_mutex or wait/synch/mutex/innodb/thr_local_mutex, you may want to investigate whether the build you are using for your platform has atomic operations enabled. If not, then consider finding a platform or build that does support them.

    These are instrumented within performance_schema in the current 5.5 GA release as well, so you can tell whether they are among your top waits by simply looking at the summary of all mutex waits for InnoDB globally:

    Cerberus:~ mark$ ~/sandboxes/msb_5_5_10/use
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 2
    Server version: 5.5.10 MySQL Community Server (GPL)
    
    Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
    
    Oracle is a registered trademark of Oracle Corporation and/or its
    affiliates. Other names may be trademarks of their respective
    owners.
    
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    
    mysql> SELECT * 
        ->   FROM performance_schema.events_waits_summary_global_by_event_name 
        ->  WHERE event_name like 'wait/synch/mutex/innodb/%' 
        ->  ORDER BY sum_timer_wait DESC;
    +-------------------------------------------------------+------------+----------------+----------------+----------------+----------------+
    | EVENT_NAME                                            | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
    +-------------------------------------------------------+------------+----------------+----------------+----------------+----------------+
    | wait/synch/mutex/innodb/mutex_list_mutex              |       8367 |      382312980 |          34020 |          45692 |         158760 |
    | wait/synch/mutex/innodb/rw_lock_list_mutex            |       8286 |      380714040 |          34020 |          45946 |        3549420 |
    | wait/synch/mutex/innodb/kernel_mutex                  |        848 |      326206440 |          34020 |         384677 |       93589020 |
    | wait/synch/mutex/innodb/log_sys_mutex                 |       1658 |      222532380 |          34020 |         134217 |        3961440 |
    | wait/synch/mutex/innodb/rseg_mutex                    |       2847 |      204581160 |          34020 |          71858 |         283500 |
    | wait/synch/mutex/innodb/buf_pool_mutex                |       3204 |      160389180 |          34020 |          50059 |         298620 |
    | wait/synch/mutex/innodb/fil_system_mutex              |       2022 |      119584080 |          34020 |          59141 |         298620 |
    | wait/synch/mutex/innodb/flush_list_mutex              |        773 |       29253420 |          34020 |          37844 |         136080 |
    | wait/synch/mutex/innodb/recv_sys_mutex                |        166 |       10069920 |          45360 |          60662 |         147420 |
    | wait/synch/mutex/innodb/log_flush_order_mutex         |        147 |        6486480 |          34020 |          44125 |         173880 |
    | wait/synch/mutex/innodb/purge_sys_mutex               |         62 |        3938760 |          41580 |          63528 |         124740 |
    | wait/synch/mutex/innodb/trx_undo_mutex                |         31 |        3424680 |          34020 |         110473 |         181440 |
    | wait/synch/mutex/innodb/dict_sys_mutex                |         37 |        3073140 |          41580 |          83057 |         321300 |
    | wait/synch/mutex/innodb/trx_doublewrite_mutex         |         59 |        2683800 |          34020 |          45488 |         158760 |
    | wait/synch/mutex/innodb/autoinc_mutex                 |         24 |        2540160 |          56700 |         105840 |         215460 |
    | wait/synch/mutex/innodb/thr_local_mutex               |         11 |         774900 |          41580 |          70445 |         136080 |
    | wait/synch/mutex/innodb/srv_innodb_monitor_mutex      |         10 |         627480 |          41580 |          62748 |          94500 |
    | wait/synch/mutex/innodb/innobase_share_mutex          |          3 |         340200 |          60480 |         113400 |         173880 |
    | wait/synch/mutex/innodb/file_format_max_mutex         |          5 |         272160 |          34020 |          54432 |          86940 |
    | wait/synch/mutex/innodb/ibuf_mutex                    |          3 |         117180 |          34020 |          39060 |          41580 |
    | wait/synch/mutex/innodb/rw_lock_mutex                 |          0 |              0 |              0 |              0 |              0 |
    | wait/synch/mutex/innodb/syn_arr_mutex                 |          0 |              0 |              0 |              0 |              0 |
    | wait/synch/mutex/innodb/srv_dict_tmpfile_mutex        |          0 |              0 |              0 |              0 |              0 |
    ...
    

    This is a standard MySQL 5.5.10 build for OS X, which has had similar kinds of statements run against it. We can see that wait/synch/mutex/innodb/rw_lock_mutex has had 0 waits, and wait/synch/mutex/innodb/thr_local_mutex has had only 11 waits globally. Neither is anywhere near the top wait events. This build most certainly does have atomics enabled.

    So for those of you running 5.5 in production now:

    What are your top waits?

    A Big Bag of Epic Awesomeness

    I tried to come up with a number of titles for this post, but none seemed to convey what I really feel, because this post is really about all of them:

    • A look at PERFORMANCE_SCHEMA.STAGES and PERFORMANCE_SCHEMA.STATEMENTS
    • Graphing MySQL Statement Execution
    • A MySQL visual EXPLAIN plan with PERFORMANCE_SCHEMA
    • Tracing Sessions with PERFORMANCE_SCHEMA

    If that doesn’t whet your appetite (and trust me, I need it to; this post is long, but I feel it is worth reading all the way to the end), then let me start out by asking a question:

    Wouldn’t you like to be able to trace what a SQL statement did, either in the same or another session, on a production instance, after the fact? Wouldn’t you like to know where all of its time was spent, and get some statistics on what it was doing specifically?

    I know, I know, you can do this with the slow query log to some extent, and with external tools like MySQL Enterprise Monitor to some extent… But those all operate at a very high level. How about something like:

    InnoDB PK Lookup

    Yes? Not sure what you’re seeing here anyway? Think this is an April Fools’ joke? Well then read on…

    As Marc Alff announced recently, performance_schema development within mysql-trunk (5.6.x) has come on in leaps and bounds, with 4 new levels of instrumentation added. The first two that Marc lists, table IO and table lock instrumentation, are in the same class as previous instrumentation – wait/. This means they are tracked and aggregated in the same way as previous 5.5 instrumentation, in the wait events and events waits summary tables (although they have new summary tables of their own as well, I won’t talk about those here).

    The next two however, stage and statement instrumentation, are each new classes of instrumentation in their own right. This means they each get a new set of tables that track their events in both raw and aggregate form. Here’s a recent table listing of a build from launchpad (for comparison, performance_schema in 5.5 has 17 tables):

    mysql> show tables;
    +---------------------------------------------------+
    | Tables_in_performance_schema                      |
    +---------------------------------------------------+
    | cond_instances                                    |
    | events_stages_current                             |
    | events_stages_history                             |
    | events_stages_history_long                        |
    | events_stages_summary_by_thread_by_event_name     |
    | events_stages_summary_global_by_event_name        |
    | events_statements_current                         |
    | events_statements_history                         |
    | events_statements_history_long                    |
    | events_statements_summary_by_thread_by_event_name |
    | events_statements_summary_global_by_event_name    |
    | events_waits_current                              |
    | events_waits_history                              |
    | events_waits_history_long                         |
    | events_waits_summary_by_instance                  |
    | events_waits_summary_by_thread_by_event_name      |
    | events_waits_summary_global_by_event_name         |
    | file_instances                                    |
    | file_summary_by_event_name                        |
    | file_summary_by_instance                          |
    | mutex_instances                                   |
    | objects_summary_global_by_type                    |
    | performance_timers                                |
    | rwlock_instances                                  |
    | setup_actors                                      |
    | setup_consumers                                   |
    | setup_instruments                                 |
    | setup_objects                                     |
    | setup_timers                                      |
    | table_io_waits_summary_by_index_usage             |
    | table_io_waits_summary_by_table                   |
    | table_lock_waits_summary_by_table                 |
    | threads                                           |
    +---------------------------------------------------+
    33 rows in set (0.00 sec)
    

    Each of the non-summary “events_stages” tables has this structure:

    mysql> desc events_stages_history_long;
    +--------------------+----------------------------------+------+-----+---------+-------+
    | Field              | Type                             | Null | Key | Default | Extra |
    +--------------------+----------------------------------+------+-----+---------+-------+
    | THREAD_ID          | int(11)                          | NO   |     | NULL    |       |
    | EVENT_ID           | bigint(20) unsigned              | NO   |     | NULL    |       |
    | EVENT_NAME         | varchar(128)                     | NO   |     | NULL    |       |
    | SOURCE             | varchar(64)                      | YES  |     | NULL    |       |
    | TIMER_START        | bigint(20) unsigned              | YES  |     | NULL    |       |
    | TIMER_END          | bigint(20) unsigned              | YES  |     | NULL    |       |
    | TIMER_WAIT         | bigint(20) unsigned              | YES  |     | NULL    |       |
    | NESTING_EVENT_ID   | bigint(20) unsigned              | YES  |     | NULL    |       |
    | NESTING_EVENT_TYPE | enum('STATEMENT','STAGE','WAIT') | YES  |     | NULL    |       |
    +--------------------+----------------------------------+------+-----+---------+-------+
    9 rows in set (0.01 sec)
    

    And each of the non-summary “events_statements” tables has this structure:

    mysql> desc events_statements_history_long;
    +-------------------------+----------------------------------+------+-----+---------+-------+
    | Field                   | Type                             | Null | Key | Default | Extra |
    +-------------------------+----------------------------------+------+-----+---------+-------+
    | THREAD_ID               | int(11)                          | NO   |     | NULL    |       |
    | EVENT_ID                | bigint(20) unsigned              | NO   |     | NULL    |       |
    | EVENT_NAME              | varchar(128)                     | NO   |     | NULL    |       |
    | SOURCE                  | varchar(64)                      | YES  |     | NULL    |       |
    | TIMER_START             | bigint(20) unsigned              | YES  |     | NULL    |       |
    | TIMER_END               | bigint(20) unsigned              | YES  |     | NULL    |       |
    | TIMER_WAIT              | bigint(20) unsigned              | YES  |     | NULL    |       |
    | LOCK_TIME               | bigint(20) unsigned              | NO   |     | NULL    |       |
    | SQL_TEXT                | longtext                         | YES  |     | NULL    |       |
    | CURRENT_SCHEMA          | varchar(64)                      | YES  |     | NULL    |       |
    | OBJECT_TYPE             | varchar(64)                      | YES  |     | NULL    |       |
    | OBJECT_SCHEMA           | varchar(64)                      | YES  |     | NULL    |       |
    | OBJECT_NAME             | varchar(64)                      | YES  |     | NULL    |       |
    | OBJECT_INSTANCE_BEGIN   | bigint(20)                       | YES  |     | NULL    |       |
    | MYSQL_ERRNO             | int(11)                          | YES  |     | NULL    |       |
    | RETURNED_SQLSTATE       | varchar(5)                       | YES  |     | NULL    |       |
    | MESSAGE_TEXT            | varchar(128)                     | YES  |     | NULL    |       |
    | ERRORS                  | bigint(20) unsigned              | NO   |     | NULL    |       |
    | WARNINGS                | bigint(20) unsigned              | NO   |     | NULL    |       |
    | ROWS_AFFECTED           | bigint(20) unsigned              | NO   |     | NULL    |       |
    | ROWS_SENT               | bigint(20) unsigned              | NO   |     | NULL    |       |
    | ROWS_EXAMINED           | bigint(20) unsigned              | NO   |     | NULL    |       |
    | CREATED_TMP_DISK_TABLES | bigint(20) unsigned              | NO   |     | NULL    |       |
    | CREATED_TMP_TABLES      | bigint(20) unsigned              | NO   |     | NULL    |       |
    | SELECT_FULL_JOIN        | bigint(20) unsigned              | NO   |     | NULL    |       |
    | SELECT_FULL_RANGE_JOIN  | bigint(20) unsigned              | NO   |     | NULL    |       |
    | SELECT_RANGE            | bigint(20) unsigned              | NO   |     | NULL    |       |
    | SELECT_RANGE_CHECK      | bigint(20) unsigned              | NO   |     | NULL    |       |
    | SELECT_SCAN             | bigint(20) unsigned              | NO   |     | NULL    |       |
    | SORT_MERGE_PASSES       | bigint(20) unsigned              | NO   |     | NULL    |       |
    | SORT_RANGE              | bigint(20) unsigned              | NO   |     | NULL    |       |
    | SORT_ROWS               | bigint(20) unsigned              | NO   |     | NULL    |       |
    | SORT_SCAN               | bigint(20) unsigned              | NO   |     | NULL    |       |
    | NO_INDEX_USED           | bigint(20) unsigned              | NO   |     | NULL    |       |
    | NO_GOOD_INDEX_USED      | bigint(20) unsigned              | NO   |     | NULL    |       |
    | NESTING_EVENT_ID        | bigint(20) unsigned              | YES  |     | NULL    |       |
    | NESTING_EVENT_TYPE      | enum('STATEMENT','STAGE','WAIT') | YES  |     | NULL    |       |
    +-------------------------+----------------------------------+------+-----+---------+-------+
    37 rows in set (0.01 sec)

    So what are “stages”? As Marc says, they are “Major execution phases of a statement”. To classic MySQL users, they are what you know as the “State” column within SHOW PROCESSLIST for a running session, or the “Status” within a SHOW PROFILE snapshot when run within your own session.

    However, in performance_schema they are not only named differently (yes, who knows why), they are also tracked individually and over time for every instrumented thread (as Marc also noted, we can now filter instrumentation on threads as well, though I won’t talk about that in this post).

    So, for example, if I wanted to see the stages that “thread 15” went through:

    mysql> select * from events_stages_history_long where thread_id = 15;
    +-----------+----------+--------------------------------+--------------------+-----------------+-----------------+------------+------------------+--------------------+
    | THREAD_ID | EVENT_ID | EVENT_NAME                     | SOURCE             | TIMER_START     | TIMER_END       | TIMER_WAIT | NESTING_EVENT_ID | NESTING_EVENT_TYPE |
    +-----------+----------+--------------------------------+--------------------+-----------------+-----------------+------------+------------------+--------------------+
    |        15 |        8 | stage/sql/init                 | sql_parse.cc:936   | 169201734212160 | 169202108901150 |  374688990 |                7 | STATEMENT          |
    |        15 |       16 | stage/sql/checking permissions | sql_parse.cc:4718  | 169202108901150 | 169202121817170 |   12916020 |                7 | STATEMENT          |
    |        15 |       17 | stage/sql/Opening tables       | sql_base.cc:4805   | 169202121817170 | 169202185009910 |   63192740 |                7 | STATEMENT          |
    |        15 |       18 | stage/sql/init                 | sql_select.cc:3584 | 169202185009910 | 169202259037630 |   74027720 |                7 | STATEMENT          |
    |        15 |       20 | stage/sql/optimizing           | sql_select.cc:1761 | 169202259037630 | 169202307953380 |   48915750 |                7 | STATEMENT          |
    |        15 |       21 | stage/sql/executing            | sql_select.cc:2778 | 169202307953380 | 169202352895550 |   44942170 |                7 | STATEMENT          |
    |        15 |       23 | stage/sql/end                  | sql_select.cc:3622 | 169202352895550 | 169202374889730 |   21994180 |                7 | STATEMENT          |
    |        15 |       24 | stage/sql/query end            | sql_parse.cc:4408  | 169202374889730 | 169202381928320 |    7038590 |                7 | STATEMENT          |
    |        15 |       25 | stage/sql/closing tables       | sql_parse.cc:4460  | 169202381928320 | 169202385916980 |    3988660 |                7 | STATEMENT          |
    |        15 |       26 | stage/sql/freeing items        | sql_parse.cc:5614  | 169202385916980 | 169202440796870 |   54879890 |                7 | STATEMENT          |
    |        15 |       27 | stage/sql/cleaning up          | sql_parse.cc:1482  | 169202440796870 | 169202453663880 |   12867010 |                7 | STATEMENT          |
    ...

    If you scroll to the right, you can see something that hasn’t been talked about a lot so far with performance_schema – its concept of “event nesting”. Each of the tables has a NESTING_EVENT_ID column, which links to its parent event, and a NESTING_EVENT_TYPE column, which defines the parent event’s class as either STATEMENT, STAGE or WAIT.

    5.5 has the NESTING_EVENT_ID column in the wait tables, but it is always NULL. Now, however, it is used throughout, and the stages themselves point to a parent event – 7 in the above case, of the STATEMENT class. So let’s take a look at what that statement was:

    mysql> select * from events_statements_history_long where event_id = 7\G
    *************************** 1. row ***************************
                  THREAD_ID: 15
                   EVENT_ID: 7
                 EVENT_NAME: statement/sql/select
                     SOURCE: sql_parse.cc:935
                TIMER_START: 410446612073000
                  TIMER_END: 410447340515000
                 TIMER_WAIT: 728442000
                  LOCK_TIME: 0
                   SQL_TEXT: select @@version_comment limit 1
             CURRENT_SCHEMA: NULL
                OBJECT_TYPE: NULL
              OBJECT_SCHEMA: NULL
                OBJECT_NAME: NULL
      OBJECT_INSTANCE_BEGIN: NULL
                MYSQL_ERRNO: 0
          RETURNED_SQLSTATE: NULL
               MESSAGE_TEXT: NULL
                     ERRORS: 0
                   WARNINGS: 0
              ROWS_AFFECTED: 0
                  ROWS_SENT: 1
              ROWS_EXAMINED: 0
    CREATED_TMP_DISK_TABLES: 0
         CREATED_TMP_TABLES: 0
           SELECT_FULL_JOIN: 0
     SELECT_FULL_RANGE_JOIN: 0
               SELECT_RANGE: 0
         SELECT_RANGE_CHECK: 0
                SELECT_SCAN: 0
          SORT_MERGE_PASSES: 0
                 SORT_RANGE: 0
                  SORT_ROWS: 0
                  SORT_SCAN: 0
              NO_INDEX_USED: 0
         NO_GOOD_INDEX_USED: 0
           NESTING_EVENT_ID: NULL
         NESTING_EVENT_TYPE: NULL
    1 row in set (0.00 sec)
    

    We use the adjacency list model to link the events together, so each event just points to its parent. From a pure SQL perspective this has some downsides, but given that we don’t update this data with SQL, the only real downside is not being able to represent the tree view nicely in pure SQL output – it’s hard to find the depth of the tree, and so to indent your output in the columns.
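    Outside of SQL, finding the depth is a trivial walk up the parent pointers. A quick sketch in Python (the parent_of mapping is hypothetical sample data, mirroring the EVENT_ID to NESTING_EVENT_ID relationship, with None for the root statement):

    ```python
    def depth(event_id, parent_of):
        """Number of hops from an event up to its root statement."""
        d = 0
        while parent_of[event_id] is not None:
            event_id = parent_of[event_id]
            d += 1
        return d

    # statement 7 is the root; stages 8 and 16 nest under it; wait 100 under stage 8
    parent_of = {7: None, 8: 7, 16: 7, 100: 8}
    print([depth(e, parent_of) for e in (7, 8, 16, 100)])  # [0, 1, 1, 2]
    ```

    In SQL the same walk needs either a self-join per level or a recursive query, which is exactly why handing the hierarchy to a graphing tool is more pleasant.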

    So I had to find another way to represent this nicely, otherwise it’s just not obvious how freaking awesome an advancement in MySQL statement tracing this is. Yet I love making SQL do things for me without having to wrap a program around it, so I thought I’d do another take on Shlomi Noach’s graphs with SQL.

    My approach is slightly different though, I wanted to be able to generate a true graph, like the picture above. The DOT language is perfect for representing that hierarchy, and is supported by a number of graphing libraries and tools – including Graphviz, which I used to generate my own images here.

    And so, without further ado, here’s how to generate an entire hierarchical stack trace of what a thread has done, with what is essentially a single SQL statement, in DOT format:

    CREATE DATABASE IF NOT EXISTS ps_helper;
    USE ps_helper;
    
    DROP PROCEDURE IF EXISTS dump_thread_stack;
    
    DELIMITER $$
    
    CREATE PROCEDURE dump_thread_stack(thd_id INT, debug INT)
    BEGIN
    
    	SELECT 'digraph events { rankdir=LR; nodesep=0.10;';
    	SELECT CONCAT('// Stack created: ', NOW());
    	SELECT CONCAT('// MySQL version: ', VERSION());
    	SELECT CONCAT('// MySQL user: ', CURRENT_USER());
    
    	SELECT CONCAT(IF(nesting_event_id IS NOT NULL, CONCAT(nesting_event_id, ' -> '), ''), 
    	              event_id, '; ', event_id, ' [label="',
    	              '(',ROUND(timer_wait/1000000, 2),'μ) ',
    	              IF (event_name NOT LIKE 'wait/io%', SUBSTRING_INDEX(event_name, '/', -2), event_name),
    	              IF (event_name LIKE 'statement/%', IFNULL(CONCAT('\n', wait_info), ''), ''),
    	              IF (debug AND event_name LIKE 'wait%', wait_info, ''),
    	              '", ', 
    	              CASE WHEN event_name LIKE 'wait/io/file%' THEN 
    	                     'shape=box, style=filled, color=red'
    	                   WHEN event_name LIKE 'wait/io/table%' THEN 
    	                     'shape=box, style=filled, color=lightcoral'
    	                   WHEN event_name LIKE 'wait/synch/mutex%' THEN
    	                     'style=filled, color=lightskyblue'
    	                   WHEN event_name LIKE 'wait/synch/cond%' THEN
    	                     'style=filled, color=darkseagreen3'
    	                   WHEN event_name LIKE 'wait/synch/rwlock%' THEN
    	                     'style=filled, color=orchid'
    	                   WHEN event_name LIKE 'wait/lock%' THEN
    	                     'shape=box, style=filled, color=tan'
    	                   WHEN event_name LIKE 'statement/%' THEN
    	                     CONCAT('shape=box, style=bold', 
    	                            IF((timer_wait/1000000000000) > @@long_query_time, ' style=filled, color=red', ''))
    	                   WHEN event_name LIKE 'stage/%' THEN
    	                     'style=filled, color=slategray3'
    	                   ELSE '' END,
    	               '];') event	
    	  FROM (
    	       (SELECT thread_id, event_id, event_name, timer_wait, timer_start, nesting_event_id, 
    	               CONCAT(sql_text, '\n',
    	                      'errors: ', errors, '\n',
    	                      'warnings: ', warnings, '\n',
    	                      'lock time: ', ROUND(lock_time/1000000, 2),'μ\n',
    	                      'rows affected: ', rows_affected, '\n',
    	                      'rows sent: ', rows_sent, '\n',
    	                      'rows examined: ', rows_examined, '\n',
    	                      'tmp tables: ', created_tmp_tables, '\n',
    	                      'tmp disk tables: ', created_tmp_disk_tables, '\n'
    	                      'select scan: ', select_scan, '\n',
    	                      'select full join: ', select_full_join, '\n',
    	                      'select full range join: ', select_full_range_join, '\n',
    	                      'select range: ', select_range, '\n',
    	                      'select range check: ', select_range_check, '\n', 
    	                      'sort merge passes: ', sort_merge_passes, '\n',
    	                      'sort rows: ', sort_rows, '\n',
    	                      'sort range: ', sort_range, '\n',
    	                      'sort scan: ', sort_scan, '\n',
    	                      'no index used: ', IF(no_index_used, 'TRUE', 'FALSE'), '\n',
    	                      'no good index used: ', IF(no_good_index_used, 'TRUE', 'FALSE'), '\n'
    	                      ) AS wait_info
    	          FROM performance_schema.events_statements_history_long)
    	       UNION 
    	       (SELECT thread_id, event_id, event_name, timer_wait, timer_start, nesting_event_id, null AS wait_info
    	          FROM performance_schema.events_stages_history_long) 
    	       UNION 
    	       (SELECT thread_id, event_id, 
    	               CONCAT(event_name, 
    	                      IF(event_name NOT LIKE 'wait/synch/mutex%', IFNULL(CONCAT(' - ', operation), ''), ''), 
    	                      IF(number_of_bytes IS NOT NULL, CONCAT(' ', number_of_bytes, ' bytes'), ''),
    	                      IF(event_name LIKE 'wait/io/file%', '\n', ''),
    	                      IF(object_schema IS NOT NULL, CONCAT('\nObject: ', object_schema, '.'), ''), IFNULL(object_name, ''),
    	                      IF(index_name IS NOT NULL, CONCAT(' Index: ', index_name), ''), '\n'
    	                      ) AS event_name,
    	               timer_wait, timer_start, nesting_event_id, source AS wait_info
    	          FROM performance_schema.events_waits_history_long)) events 
    	 WHERE thread_id = thd_id
    	 ORDER BY event_id;
    	 
    	 SELECT '}';
    	
    END$$
    
    DELIMITER ;
    

    The first parameter to the procedure is the thread id that you want to dump the stack for. The second is a debug flag, which includes the source file and line number that wait events came from for each node.

    To find the thread that you are interested in, you should check the performance_schema.threads table, which includes everything you should need to link this back to a running connection:

    mysql> select thread_id, processlist_id, processlist_user, processlist_host
        ->   from threads
        ->  where type = 'foreground';
    +-----------+----------------+------------------+------------------+
    | thread_id | processlist_id | processlist_user | processlist_host |
    +-----------+----------------+------------------+------------------+
    |        18 |              4 | root             | localhost        |
    +-----------+----------------+------------------+------------------+
    1 row in set (0.01 sec)
    

    You can either call this stored procedure directly and write all of its result sets out to a file in your own language, or be like me, be lazy, and call it like this:

    $ mysql -BN -u root -e "call ps_helper.dump_thread_stack(17, 0)" > ~/Desktop/stack.dot

    Load the dot file into your favorite graphing solution (with Graphviz, for example: dot -Tpng stack.dot -o stack.png), and you will get the full stack of event information that is available within performance_schema for that thread. If you’re lucky enough to capture a full trace of everything the session did since it started, you might get a thread stack looking something like this:

    Full MyISAM Thread Stack

    Statements at the top of the graph are the most recent, with the earliest statement executed at the bottom of the graph. The legend for the graph would be:

    • Grey – Stages
    • Blue – Mutexes
    • Green – Condition Waits
    • Orchid – RW Locks
    • Tan – Table Locks
    • Light Coral – Table IO
    • Red – File IO

    If the statement box is a solid red colour, then the statement ran for longer than the long_query_time setting for the instance the trace was generated on.

    You will not always be this lucky, however; wait information within performance_schema ages out very quickly depending on how you have it configured, especially on busy systems. Each table holds this many events of each class by default:

    
    mysql> show variables like 'perf%events%size';
    +--------------------------------------------------------+-------+
    | Variable_name                                          | Value |
    +--------------------------------------------------------+-------+
    | performance_schema_events_stages_history_long_size     | 1000  |
    | performance_schema_events_stages_history_size          | 10    |
    | performance_schema_events_statements_history_long_size | 1000  |
    | performance_schema_events_statements_history_size      | 10    |
    | performance_schema_events_waits_history_long_size      | 10000 |
    | performance_schema_events_waits_history_size           | 10    |
    +--------------------------------------------------------+-------+
    6 rows in set (0.00 sec)
    

    Each statement can have 10-15 or more stages, and each of those tens to thousands of wait events, so you will need to use some good judgement on how to size these, depending on your use cases.
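    If you want to retain more history, these sizes can be raised at server startup; a my.cnf sketch might look like the following (the values here are purely illustrative, not recommendations):

    ```ini
    [mysqld]
    # Illustrative sizing only - tune to your workload and memory budget.
    # These tables are allocated up front, so larger sizes cost more memory.
    performance_schema_events_statements_history_long_size = 5000
    performance_schema_events_stages_history_long_size     = 50000
    performance_schema_events_waits_history_long_size      = 100000
    ```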

    However, even when you do have a degraded output, this still gives you quite a bit of information for the statements:

    Degraded Stack

    I’ve found it to be quite an eye opener in some respects, to see how many times certain things are locked in the background.

    Here are a few examples:

    For the eagle-eyed and in the know: these graphs actually show a bug in the underlying performance_schema instrumentation in the release I am testing against, in two areas. First, some wait events are getting linked to the wrong stages. Second, we also seem to double-account for IO in InnoDB in some cases. We’re tracking those down at the moment.

    But even with those wrinkles, I think it’s obvious what kind of value this is going to provide in 5.6. And there’s more – I’ll have to talk about filtering, and the other summary views, at some point as well!

    I’m pretty sure my topic was the right choice now. 🙂

    Quickly testing MySQL builds without “make install”

    There is a lot happening in mysql-trunk nowadays, and it’s a pain to have to keep pulling from the tree, building, installing, and configuring to some test location, etc. (unless you have that scripted, which I did in the past).

    Marc Alff showed me a tip a while ago to not have to do the install/config stage, and I thought it was worth sharing – it’s how I keep up to date with trying out the new things that I see fly by in the commits list.

    First, you of course need to be able to build the MySQL Server from source. This is not as hard as it seems on most platforms, don’t be daunted.

    Once you have the dependencies in place for this, and the mysql-trunk tree branched:

    $ cd ~/
    $ mkdir mysql
    $ cd mysql
    $ bzr init-repo .
    $ bzr branch lp:~mysql/mysql-server/mysql-trunk mysql-trunk
    

    And a build done (using your favorite build script, or following some of the above linked instructions):

    $ cd ~/mysql/mysql-trunk
    $ BUILD/compile-pentium-debug-max-no-ndb
    testing nocona ... ok
    +++ /bin/rm -rf configure
    +++ /bin/rm -rf CMakeCache.txt CMakeFiles/
    +++ path=BUILD
    ...
    Linking CXX shared module ha_example.so
    [100%] Built target example
    [100%] Building CXX object plugin/semisync/CMakeFiles/semisync_master.dir/semisync_master_plugin.cc.o
    Linking CXX shared module semisync_slave.so
    [100%] Built target semisync_slave
    Linking CXX shared module semisync_master.so
    [100%] Built target semisync_master
    

    All you have to do to try out that build now is start the test infrastructure, with the following:

    $ cd mysql-test
    $ ./mtr --start parser &
    [1] 15906
    $ Logging: ./mtr  --start parser
    110329 13:26:50 [Warning] Setting lower_case_table_names=2 because file system for /var/folders/xl/xl3jnqYbFHOYXXGB0cZuB++++TI/-Tmp-/WA3QLELTBm/ is case insensitive
    110329 13:26:50 [Note] Plugin 'FEDERATED' is disabled.
    110329 13:26:50 [Note] Binlog end
    110329 13:26:50 [Note] Shutting down plugin 'CSV'
    110329 13:26:50 [Note] Shutting down plugin 'MyISAM'
    MySQL Version 5.6.3
    Checking supported features...
     - skipping ndbcluster
     - SSL connections supported
     - binaries are debug compiled
    Collecting tests...
    vardir: /Users/mark/mysql/lp-mysql-trunk/mysql-test/var
    Checking leftover processes...
     - found old pid 50573 in 'mysqld.1.pid', killing it...
       process did not exist!
    Removing old var directory...
    Creating var directory '/Users/mark/mysql/lp-mysql-trunk/mysql-test/var'...
    Installing system database...
    Using server port 49463
    
    ==============================================================================
    
    TEST                                      RESULT   TIME (ms) or COMMENT
    --------------------------------------------------------------------------
    
    worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
    worker[1] 
    Started [mysqld.1 - pid: 15918, winpid: 15918]
    worker[1] Using config for test main.parser
    worker[1] Port and socket path for server(s):
    worker[1] mysqld.1  13000  /Users/mark/mysql/lp-mysql-trunk/mysql-test/var/tmp/mysqld.1.sock
    worker[1] Waiting for server(s) to exit...
    $
    

    This starts up an instance, based on your build tree, that will wait around for new interactive connections, rather than running a specific set of tests and exiting. To connect to the instance:

    $ ../client/mysql -u root -S ./var/tmp/mysqld.1.sock 
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 1
    Server version: 5.6.3-m5-debug-log Source distribution
    
    Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
    
    Oracle is a registered trademark of Oracle Corporation and/or its
    affiliates. Other names may be trademarks of their respective
    owners.
    
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    
    mysql> show databases;
    +--------------------+
    | Database           |
    +--------------------+
    | information_schema |
    | mtr                |
    | mysql              |
    | performance_schema |
    | test               |
    +--------------------+
    5 rows in set (0.00 sec)
    

    For most testing that doesn’t involve replication I’ve found this the best way to try and keep up to date with the main line. It means you can maintain a “default install” location for more serious testing (the one in which you might “make install” to), whilst continually building and testing with the current tree.

    SHOW ENGINE INNODB STATUS truncation, innodb_truncated_status_writes

    Another piece of good news for MySQL 5.5: the maximum size of the SHOW ENGINE INNODB STATUS output has been increased from 64kB to 1MB. For those running systems that have thousands of running transactions, or large lock outputs, it should take quite a bit more to force truncation now. 🙂

    We also added a new status variable, innodb_truncated_status_writes, to track when truncation happens, so you can detect it if you have automated monitoring that depends on this output.
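    As a sketch of the kind of check a monitoring script might perform, the function below compares two samples of SHOW GLOBAL STATUS output, represented here as plain dicts; fetching the real values from the server is left to your connector of choice, and the sample numbers are made up:

    ```python
    def truncation_occurred(prev_status, curr_status):
        """Given two samples of SHOW GLOBAL STATUS (as name -> value dicts),
        report whether SHOW ENGINE INNODB STATUS was truncated in between."""
        key = "Innodb_truncated_status_writes"
        return int(curr_status.get(key, 0)) > int(prev_status.get(key, 0))

    # Two hypothetical samples taken one polling interval apart.
    before = {"Innodb_truncated_status_writes": "0"}
    after = {"Innodb_truncated_status_writes": "2"}

    if truncation_occurred(before, after):
        print("InnoDB status output was truncated since the last poll")
    ```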

    See Bug#56922 for details.

    MySQL Enterprise Monitor Learns PERFORMANCE_SCHEMA

    Leading up to my previous post, I had been working on integrating PERFORMANCE_SCHEMA data with MySQL Enterprise Monitor, including some new graphs based on the data I talked about there.

    A picture tells a thousand words:

    io_graphs

    This is only scratching the surface; more to come, watch this space! 🙂