Finding and killing long running InnoDB transactions with Events

I’ve now seen a number of solutions for finding long running transactions or sessions within InnoDB / MySQL. Every single one of them has (in the past, by necessity) been implemented as a script (other than one, more on that one later) that is either invoked manually, or via some cron job, that then connects and tries to find, and possibly diagnose or kill, transactions that break some “long running” threshold. We also have a rule in MySQL Enterprise Monitor that flags this for further inspection.

Solutions like this have one major downside however – they often require a new connection to the database every time that they want to perform the check (though MEM holds a connection open for monitoring).

Often what you see with many high throughput instances is that once you get a long running transaction that is holding some locks, many sessions soon start to stack up behind this, and in many cases fill up the instance to max_connections. You then play roulette on whether you can get a new connection to start finding and killing the long running transaction(s) or not.

In the past this was a necessity because it was impossible to parse SHOW ENGINE INNODB STATUS from within the database (such as within a procedure), to then build a set of KILL statements programmatically (because you couldn’t consume the SHOW output in a meaningful way).

However, with 5.1 and the InnoDB plugin (and hence, 5.5+ by default too), there is the new INFORMATION_SCHEMA.INNODB_TRX table, which lists all currently active transactions within InnoDB, including the same output as SHOW ENGINE INNODB STATUS per transaction, plus a few extras:

mysql> select * from INFORMATION_SCHEMA.INNODB_TRX\G
*************************** 1. row ***************************
                    trx_id: 7C6
                 trx_state: RUNNING
               trx_started: 2011-05-31 11:17:10
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 2
       trx_mysql_thread_id: 11706
                 trx_query: insert into t1 values (sleep(40))
       trx_operation_state: NULL
         trx_tables_in_use: 1
         trx_tables_locked: 1
          trx_lock_structs: 1
     trx_lock_memory_bytes: 376
           trx_rows_locked: 0
         trx_rows_modified: 1
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 10000

Because we have this nice shiny new table, personally I think it’s time to start throwing away most of those scripts, and using another of the new features in 5.1 – events. The major advantages are that an event is always running within the instance, it can’t be blocked by the max_connections limit, and it doesn’t require any DBA / admin intervention – it’s the perfect BOFH tool. Developers Beware.

So here’s the basis for how I would do it:

CREATE DATABASE IF NOT EXISTS myadmin;
USE myadmin;

DROP TABLE IF EXISTS kill_long_transactions_log;

CREATE TABLE kill_long_transactions_log (
  id INT AUTO_INCREMENT PRIMARY KEY,
  ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  thd BIGINT,
  user VARCHAR(16),
  host VARCHAR(64),
  statement TEXT,
  KEY (ts, user),
  KEY (user),
  KEY (host)
) ENGINE = InnoDB;

DELIMITER $$

DROP EVENT IF EXISTS kill_long_transactions$$

CREATE EVENT kill_long_transactions
ON SCHEDULE EVERY 10 SECOND
DO
  BEGIN
    -- Transactions running for longer than this (in seconds) get killed
    DECLARE max_transaction_time INT DEFAULT 10;
    DECLARE done INT DEFAULT 0;
    DECLARE killed_id BIGINT;
    DECLARE killed_user VARCHAR(16);
    DECLARE killed_host VARCHAR(64);
    DECLARE kill_stmt VARCHAR(20);
    DECLARE running_stmt TEXT;

    DECLARE long_transactions CURSOR FOR
     SELECT CONCAT('KILL ', trx.trx_mysql_thread_id) kill_statement,
            trx.trx_mysql_thread_id thd_id,
            ps.user,
            ps.host,
            trx.trx_query
       FROM INFORMATION_SCHEMA.INNODB_TRX trx
       JOIN INFORMATION_SCHEMA.PROCESSLIST ps ON trx.trx_mysql_thread_id = ps.id
      WHERE (UNIX_TIMESTAMP() - UNIX_TIMESTAMP(trx.trx_started)) > max_transaction_time
        -- Never kill replication threads, which show up as 'system user'
        AND ps.user != 'system user';
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

    OPEN long_transactions;

    kill_loop: LOOP
      FETCH long_transactions INTO
        kill_stmt, killed_id, killed_user, killed_host, running_stmt;

      IF done THEN
        LEAVE kill_loop;
      END IF;

      -- KILL does not accept a variable directly, so prepare it dynamically
      SET @kill := kill_stmt;

      PREPARE stmt FROM @kill;
      EXECUTE stmt;
      DEALLOCATE PREPARE stmt;

      INSERT INTO kill_long_transactions_log (thd, user, host, statement)
      VALUES (killed_id, killed_user, killed_host, running_stmt);

    END LOOP;
    CLOSE long_transactions;

    -- Purge log entries older than the retention period
    DELETE FROM kill_long_transactions_log
     WHERE ts < NOW() - INTERVAL 7 DAY;
  END$$

DELIMITER ;

You can tweak the following from the above to fit your needs:

How often to check:

ON SCHEDULE EVERY 10 SECOND

How long should a transaction be running to get killed:

DECLARE max_transaction_time INT DEFAULT 10;

How long to keep data for before purging:

WHERE ts < NOW() - INTERVAL 7 DAY;

Once you have that created, make sure that the event scheduler is enabled:

mysql> SHOW GLOBAL VARIABLES LIKE 'event_scheduler';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | ON    |
+-----------------+-------+

And sit back and wait for the magic to happen.

Now if you want to poll for long running transactions, you can check the kill_long_transactions_log table after the fact, for anything that has already been killed (rather than polling for what should be killed):

mysql> select * from kill_long_transactions_log;
+----+---------------------+-------+----------+-----------+-----------------------------------+
| id | ts                  | thd   | user     | host      | statement                         |
+----+---------------------+-------+----------+-----------+-----------------------------------+
|  1 | 2011-05-31 11:13:15 | 11672 | msandbox | localhost | NULL                              |
|  2 | 2011-05-31 11:14:25 | 11681 | msandbox | localhost | insert into t1 values (sleep(40)) |
|  3 | 2011-05-31 11:16:15 | 11695 | msandbox | localhost | NULL                              |
|  4 | 2011-05-31 11:17:25 | 11706 | msandbox | localhost | insert into t1 values (sleep(40)) |
+----+---------------------+-------+----------+-----------+-----------------------------------+

Sometimes there may be a statement running when the session is killed, other times it could just be idle (and hence NULL), but still have a transaction open.

As a DBA, you might then want to try and find out what those sessions are doing, and in that case, you can disable/enable this event pretty easily at run time:

mysql> ALTER EVENT kill_long_transactions DISABLE;
Query OK, 0 rows affected (0.03 sec)

mysql> ALTER EVENT kill_long_transactions ENABLE;
Query OK, 0 rows affected (0.00 sec)

In <= 5.5, if you want to try and find out what these long running transactions are doing whilst the event is disabled, I have a couple of recommendations:

Yoshinori’s solution is one of the only ones that doesn’t require a connection to the database, doing all of its sniffing external to the instance, somewhat like MEM’s Query Analyzer (though sniffing network traffic with libpcap instead). However, unlike Query Analyzer, it tries to record the entire transaction verbatim, instead of aggregating like statements (which can happen with long running transactions that loop to insert data etc.).

I think it would be possible to do the same entirely from within performance_schema as of 5.6, using the events_statements_history table if enabled, from an event as well. This may serve both needs quite nicely, with no external solution required.

Maybe I’ll follow up on that one day.. :)


  1. @Giuseppe Maxia: Hey, thanks for the references!

    The nice thing with mine is that it can look at the entire transaction length now rather than just a current query (and in effect acts for both of them).

    Another thing of note: both of those are written in the non-lazy way that I should have used (and that yet another colleague pointed out to me a short while ago) – putting the logic into a stored procedure, and just calling the procedure from the event. Shame on me. :)
