Tracking IO with PERFORMANCE_SCHEMA

Mark Callaghan over at Facebook recently wrote a note about InnoDB disk IO counters in SHOW STATUS, with some extra things that he wanted to track. I posted a quick comment over there, but I thought this deserved its own write-up.

MySQL 5.5’s PERFORMANCE_SCHEMA has had a fair bit written about it in terms of tracking synchronization point contention (mutexes etc.), but it currently tracks two orders within the wait class – /wait/synch and /wait/io.

Actually, allow me to detour first – this is not clear from the documentation, though it is clear in the worklog. Each event within PERFORMANCE_SCHEMA is a specific instrumentation point, tracked in some way, such as recording the time taken to complete the event (a specific code section). The naming convention for each event follows a Linnaean taxonomy – /Class/Order/Family/Genus/Species. We currently have one major class – /wait.
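
If you want to see this taxonomy on a running server, the SETUP_INSTRUMENTS table lists every instrument name. Here’s a minimal sketch (run, like the other queries in this post, with performance_schema as the default database) that groups the wait instruments by their class/order/family prefix:

-- Count the instrumented waits per class/order/family prefix
SELECT SUBSTRING_INDEX(NAME, '/', 3) AS CLASS_ORDER_FAMILY, 
       COUNT(*) AS INSTRUMENTS 
  FROM SETUP_INSTRUMENTS 
 WHERE NAME LIKE 'wait/%' 
 GROUP BY CLASS_ORDER_FAMILY;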

The /wait/synch order is already fleshed out fully, including /wait/synch/cond (condition signalling), /wait/synch/mutex (mutex locks) and /wait/synch/rwlock (read/write locks) families. Detour done..

The /wait/io order currently has one family within it – /wait/io/file – which instruments all file IO for the SQL layer (the /wait/io/file/sql genus), tracking binary logs, frm file access, info files, etc., as well as all table/index file IO done by instrumented storage engines (such as the /wait/io/file/innodb and /wait/io/file/myisam genera).

Mark shows in his blog how you can already get some pretty good statistics on what is taking up IO on the server, so let’s look at how we can fix some of the deficiencies mentioned, such as:

“Innodb_data_writes describes the number of writes done but all writes are not created equal. Writes to the log file are small and use buffered IO. These are much less expensive than writes done to update InnoDB pages in place. But Innodb_data_writes does not distinguish between them.”

“Innodb_pages_written describes the number of pages written but not the number of IO requests used to write them as write requests for adjacent pages can be merged and some of the code that schedules pages to be written attempts to find adjacent dirty pages to be written at the same time.”

This is actually really easy to track with PERFORMANCE_SCHEMA now that InnoDB has been fully instrumented. InnoDB has created 3 individual species within its own genus:

/wait/io/file/innodb/innodb_data_file – file IO for central or single tablespace data files
/wait/io/file/innodb/innodb_log_file – redo log file IO
/wait/io/file/innodb/innodb_temp_file – IO against a sort merge file that is used when adding indexes

You can track how much time has been spent on these events with the EVENTS_WAITS_SUMMARY_* tables, which I won’t go into detail on here – see my earlier post for examples.
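
As a quick taster though, something like this against the global summary gives the time view of the same events (summary timer values are normalized to picoseconds, hence the division to seconds):

-- Total time (rather than bytes) spent in each InnoDB file IO event
SELECT EVENT_NAME, 
       COUNT_STAR, 
       ROUND(SUM_TIMER_WAIT/1000000000000, 2) AS TOTAL_SEC 
  FROM EVENTS_WAITS_SUMMARY_GLOBAL_BY_EVENT_NAME 
 WHERE EVENT_NAME LIKE 'wait/io/file/innodb/%';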

You can also track IO use in terms of bytes however, using the FILE_SUMMARY_* tables. There are currently two of these:

FILE_SUMMARY_BY_EVENT_NAME – summarizes file IO events per instrumentation point
FILE_SUMMARY_BY_INSTANCE – summarizes file IO events per file instance
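
Both expose per-operation read/write counts and byte totals – FILE_SUMMARY_BY_INSTANCE just adds a FILE_NAME column so you get one row per file. If you want to double-check the exact columns before writing queries against them:

-- Column layout of the two file IO summary tables in 5.5
SHOW COLUMNS FROM FILE_SUMMARY_BY_EVENT_NAME;
SHOW COLUMNS FROM FILE_SUMMARY_BY_INSTANCE;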

Mark’s problem focuses on the high level – looking at IO as single counters for what are effectively specific species of IO – which maps to FILE_SUMMARY_BY_EVENT_NAME. Here’s an example of the data that you can get:

SELECT EVENT_NAME, 
       COUNT_READ, 
       IFNULL(ROUND(SUM_NUMBER_OF_BYTES_READ/1024/1024/1024, 2), 0.00) AS READ_GB,
       IFNULL(ROUND((SUM_NUMBER_OF_BYTES_READ/COUNT_READ)/1024, 2), 0.00) AS AVG_READ_KB,
       COUNT_WRITE, 
       IFNULL(ROUND(SUM_NUMBER_OF_BYTES_WRITE/1024/1024/1024, 2), 0.00) AS WRITE_GB,
       IFNULL(ROUND((SUM_NUMBER_OF_BYTES_WRITE/COUNT_WRITE)/1024, 2), 0.00) AS AVG_WRITE_KB,
       IFNULL(ROUND((SUM_NUMBER_OF_BYTES_READ + SUM_NUMBER_OF_BYTES_WRITE)/1024/1024/1024, 2), 0.00) AS TOTAL_GB, 
       IFNULL(100-((SUM_NUMBER_OF_BYTES_READ/(SUM_NUMBER_OF_BYTES_READ+SUM_NUMBER_OF_BYTES_WRITE))*100), 0.00) AS WRITE_PCT 
  FROM FILE_SUMMARY_BY_EVENT_NAME
 WHERE EVENT_NAME IN ('wait/io/file/innodb/innodb_log_file', 'wait/io/file/innodb/innodb_data_file');

+--------------------------------------+------------+---------+-------------+-------------+----------+--------------+----------+-----------+
| EVENT_NAME                           | COUNT_READ | READ_GB | AVG_READ_KB | COUNT_WRITE | WRITE_GB | AVG_WRITE_KB | TOTAL_GB | WRITE_PCT |
+--------------------------------------+------------+---------+-------------+-------------+----------+--------------+----------+-----------+
| wait/io/file/innodb/innodb_data_file |     359183 |    3.82 |       11.15 |    39642666 |  1438.62 |        38.05 |  1442.44 |   99.7353 |
| wait/io/file/innodb/innodb_log_file  |         12 |    0.00 |       11.33 |    23337913 |    21.43 |         0.96 |    21.43 |   99.9994 |
+--------------------------------------+------------+---------+-------------+-------------+----------+--------------+----------+-----------+
2 rows in set (0.00 sec)

But wait, it doesn’t end there. Mark also wants to be able to break out writes to things like the doublewrite buffer. This is hard to do when tracking data file IO alone as single counters as above, because all writes to the doublewrite buffer also go to the central tablespace’s data file.

Of course, you can SELECT the value for Innodb_dblwr_writes as Mark does and subtract that from the above counters if you want to, but I have a slightly different example using the FILE_SUMMARY_BY_INSTANCE table.
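
For reference, the doublewrite counters Mark uses are just the regular status variables:

-- Innodb_dblwr_pages_written * 16KB (the default page size) approximates the
-- bytes the doublewrite buffer contributes to the data file writes above
SHOW GLOBAL STATUS LIKE 'Innodb_dblwr%';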

When using innodb_file_per_table (and if you are not, why not? ;)), you also get the added benefit that any data file IO for tables is tracked individually. Here’s an example of finding the top 10 consumers:

SELECT SUBSTRING_INDEX(FILE_NAME, '/', -2) AS DATA_FILE, 
       COUNT_READ, 
       IFNULL(ROUND(SUM_NUMBER_OF_BYTES_READ/1024/1024/1024, 2), 0.00) AS READ_GB,
       IFNULL(ROUND((SUM_NUMBER_OF_BYTES_READ/COUNT_READ)/1024, 2), 0.00) AS AVG_READ_KB,
       COUNT_WRITE, 
       IFNULL(ROUND(SUM_NUMBER_OF_BYTES_WRITE/1024/1024/1024, 2), 0.00) AS WRITE_GB,
       IFNULL(ROUND((SUM_NUMBER_OF_BYTES_WRITE/COUNT_WRITE)/1024, 2), 0.00) AS AVG_WRITE_KB,
       IFNULL(ROUND((SUM_NUMBER_OF_BYTES_READ + SUM_NUMBER_OF_BYTES_WRITE)/1024/1024/1024, 2), 0.00) AS TOTAL_GB, 
       IFNULL(100-((SUM_NUMBER_OF_BYTES_READ/(SUM_NUMBER_OF_BYTES_READ+SUM_NUMBER_OF_BYTES_WRITE))*100), 0.00) AS WRITE_PCT 
  FROM FILE_SUMMARY_BY_INSTANCE 
 ORDER BY TOTAL_GB DESC LIMIT 10;

+---------------------------------------+------------+---------+-------------+-------------+----------+--------------+----------+-----------+
| DATA_FILE                             | COUNT_READ | READ_GB | AVG_READ_KB | COUNT_WRITE | WRITE_GB | AVG_WRITE_KB | TOTAL_GB | WRITE_PCT |
+---------------------------------------+------------+---------+-------------+-------------+----------+--------------+----------+-----------+
| data/ibdata1                          |       3201 |    0.05 |       17.26 |    22978247 |  1265.68 |        57.76 |  1265.74 |   99.9958 |
| mem/inventory_instance_attributes.ibd |       6015 |    0.09 |       16.00 |     2104313 |    32.11 |        16.00 |    32.20 |   99.7150 |
| mem/statement_examples.ibd            |      53171 |    0.81 |       16.00 |     1831051 |    29.12 |        16.68 |    29.94 |   97.2897 |
| mem/loghistogram_data.ibd             |      33687 |    0.51 |       16.00 |     1333759 |    20.51 |        16.13 |    21.03 |   97.5556 |
| data/ib_logfile0                      |         12 |    0.00 |       11.33 |    11735008 |    10.75 |         0.96 |    10.75 |   99.9988 |
| data/ib_logfile1                      |          0 |    0.00 |        0.00 |    11645624 |    10.71 |         0.96 |    10.71 |  100.0000 |
| mem/statement_summary_data.ibd        |      26205 |    0.40 |       16.00 |      547254 |     8.47 |        16.22 |     8.87 |   95.4898 |
| mem/dc_p_string#P#p4.ibd              |         61 |    0.00 |        8.13 |      462098 |     3.53 |         8.00 |     3.53 |   99.9866 |
| mem/dc_p_string#P#p5.ibd              |         59 |    0.00 |        8.14 |      457733 |     3.49 |         8.00 |     3.49 |   99.9869 |
| mem/dc_p_string#P#p6.ibd              |         46 |    0.00 |        8.00 |      452797 |     3.45 |         8.00 |     3.45 |   99.9898 |
+---------------------------------------+------------+---------+-------------+-------------+----------+--------------+----------+-----------+
10 rows in set (0.00 sec)

The database instance the above comes from is using only innodb_file_per_table tables, so why all the writes to the central tablespace? Two major reasons: the doublewrite buffer and the undo log.

Also note that we can get down to the partition level here as well, so we can find hot partitions too.
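
A minimal variation of the query above that ranks just the per-partition files (each partition gets its own .ibd, with #P# in the name):

-- Rank partition data files by bytes written to spot hot partitions
SELECT SUBSTRING_INDEX(FILE_NAME, '/', -2) AS PARTITION_FILE, 
       COUNT_WRITE, 
       IFNULL(ROUND(SUM_NUMBER_OF_BYTES_WRITE/1024/1024/1024, 2), 0.00) AS WRITE_GB 
  FROM FILE_SUMMARY_BY_INSTANCE 
 WHERE FILE_NAME LIKE '%#P#%' 
 ORDER BY SUM_NUMBER_OF_BYTES_WRITE DESC 
 LIMIT 10;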

Subtracting the doublewrite overhead from the central tablespace data file IO also lets you drill into the undo log write IO (perhaps with some extra fuzz factor for the other writes to the central tablespace that happen).
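
As a rough sketch of that idea – assuming a single ibdata1 file, the default 16KB page size, and the INFORMATION_SCHEMA.GLOBAL_STATUS table that 5.5 provides – you could estimate the non-doublewrite portion like this:

-- ibdata1 bytes written, the estimated doublewrite share (pages * 16KB), and
-- the remainder (largely undo log, plus other system page writes)
SELECT f.SUM_NUMBER_OF_BYTES_WRITE                              AS IBDATA1_BYTES,
       s.VARIABLE_VALUE * 16384                                 AS DBLWR_BYTES_EST,
       f.SUM_NUMBER_OF_BYTES_WRITE - (s.VARIABLE_VALUE * 16384) AS OTHER_BYTES_EST
  FROM FILE_SUMMARY_BY_INSTANCE f
  JOIN INFORMATION_SCHEMA.GLOBAL_STATUS s
    ON s.VARIABLE_NAME = 'Innodb_dblwr_pages_written'
 WHERE f.FILE_NAME LIKE '%ibdata1';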

11 thoughts on “Tracking IO with PERFORMANCE_SCHEMA”

  1. Indeed Paul, though this does not explain the structure of the naming, only that some structure is defined:

    “Instrument names consist of a prefix with a structure defined by the Performance Schema implementation and a suffix defined by the developer implementing the instrument code.”

    Feel free to add the above to the docs if you think it’s appropriate. 🙂

  2. Just installed 5.5 in production, but before we got to really try it out, it crashes every few minutes.

    101120 8:29:27 InnoDB: Operating system error number 22 in a file operation.
    InnoDB: Error number 22 means 'Invalid argument'.
    InnoDB: Some operating system error numbers are described at
    InnoDB: http://dev.mysql.com/doc/refman/5.1/en/operating-system-error-codes.html
    InnoDB: File name /tmpfs/#sql6cf3_5c_0.ibd
    InnoDB: File operation call: 'aio write'.
    InnoDB: Cannot continue operation.

    I’m still looking, but I wonder if there is any way to track MySQL internal tmp tables that are being created (/tmpfs is a 16G filesystem specifically assigned and configured for this).
    Curious what tracking is available for these.

  3. Do you have a test case for the above Ronald? I’ve looked around the bugs system and the only one close that I can find is Bug#54044 – it has a similar report, but that was never verified..

    As for tracking temporary tables – you can get a pretty good idea with something like:

    SELECT * FROM FILE_SUMMARY_BY_INSTANCE WHERE FILE_NAME LIKE '%#sql%';

    That gives output like:


    *************************** 9. row ***************************
    FILE_NAME: /home/mark/sandboxes/Complex/mysqld1/data/mem/#sql-9bc_1.MYI
    EVENT_NAME: wait/io/file/myisam/kfile
    COUNT_READ: 2
    COUNT_WRITE: 27
    SUM_NUMBER_OF_BYTES_READ: 554
    SUM_NUMBER_OF_BYTES_WRITE: 530
    *************************** 10. row ***************************
    FILE_NAME: /home/mark/sandboxes/Complex/mysqld1/data/mem/#sql-dc_p_string.par
    EVENT_NAME: wait/io/file/sql/partition
    COUNT_READ: 0
    COUNT_WRITE: 30
    SUM_NUMBER_OF_BYTES_READ: 0
    SUM_NUMBER_OF_BYTES_WRITE: 5896
    *************************** 11. row ***************************
    FILE_NAME: /home/mark/sandboxes/Complex/mysqld1/data/mem/#sql2-9bc-66c9.ibd
    EVENT_NAME: wait/io/file/innodb/innodb_data_file
    COUNT_READ: 0
    COUNT_WRITE: 3
    SUM_NUMBER_OF_BYTES_READ: 0
    SUM_NUMBER_OF_BYTES_WRITE: 49152

    You might want to filter out the .frm and .par files etc. too.
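
    A minimal variation that does that filtering (the NOT LIKE patterns are just illustrative):

    -- Temporary table IO, skipping the .frm and .par bookkeeping files
    SELECT FILE_NAME, EVENT_NAME, COUNT_READ, COUNT_WRITE, 
           SUM_NUMBER_OF_BYTES_READ, SUM_NUMBER_OF_BYTES_WRITE 
      FROM FILE_SUMMARY_BY_INSTANCE 
     WHERE FILE_NAME LIKE '%#sql%' 
       AND FILE_NAME NOT LIKE '%.frm' 
       AND FILE_NAME NOT LIKE '%.par';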
