MySQL下xtrabackup与MTS造成的死锁

测试背景 MySQL的物理备份在企业版中是有的,但是community版本的话是不提供这个版本的,但是percona作为MySQL的技术服务公司,开源了自己的物理备份工具--xtrabackup。

测试背景

MySQL的物理备份在企业版中是有的,但是community版本的话是不提供这个版本的,但是percona作为MySQL的技术服务公司,开源了自己的物理备份工具--xtrabackup。本文主要是针对生产环境使用xtrabackup遇到的坑。

测试环境

  • Centos 7.4
  • MySQL 5.7.25
  • Xtrabackup 2.4.12

准备工作

  • 在slave上使用物理备份
  • MySQL开启MTS

生产环境

现象

image

分析

  • 会话4为Coordinator线程分发的事务还不能执行,进入waiting for dependent transaction to commit的状态,会话3290897、4、3线程之间形成死锁,其中会话3290897占有全局读锁,获取全局commit锁的时候进入阻塞,等待会话4释放事务涉及到表的commit锁;会话3、4的事务可以并行复制,会话4占有表级commit锁,但是事务对应的binlog在后面(在MTS下,slave_preserve_commit_order=1时,保证事务是按照relay log中记录的顺序来回放 ),阻塞等待会话3先提交进入waiting for preceding transaction to commit的状态;会话3事务执行时提交要获得表级commit锁,但已经被会话3290897占有,所以阻塞等待。这样形成了3290897->4->3->3290897的死锁。

解决方案1:设置ftwrl-wait-timeout

参数解析

* --ftwrl-wait-query-type=all|update
 * 该选项指定flush table with read lock;语句执行之前必须等待什么类型的查询执行完成,默认值是all,有效值为:all和update,其中update包含UPDATE、ALTER、REPLACE、INSERT等类型语句。
 * 该选项执行等待的时间由选项--ftwrl-wait-threshold设置,默认值为60秒,要注意:如果--ftwrl-wait-timeout选项设置为非零值,则--ftwrl-wait-threshold选项不起作用,以--ftwrl-wait-timeout选项设置的值为准,如果--ftwrl-wait-timeout选项为0秒(默认值),则以--ftwrl-wait-threshold选项指定的值为准
 * 执行该选项需要有process和super权限

* --ftwrl-wait-timeout=SECONDS
 * 该选项指定innobackupex阻塞执行flush table with read lock;语句的时长以等待查询执行完成,如果该选项指定一个非零值,则如果超过指定时间之后仍然有查询在执行,则报错终止备份过程,如果使用零值,则不等待查询执行完成,立即执行flush table with read lock;语句
 * 该选项默认值为0

* --ftwrl-wait-threshold=SECONDS
 * 该选项指定innobackupex检测超过该选项指定的阀值的时间运行的查询,如果选项--ftwrl-wait-timeout指定为一个非零值,则--ftwrl-wait-threshold选项不会执行FTWRL,直到长时间执行的查询执行完成并退出之后才会执行FTWRL,如果--ftwrl-wait-timeout指定为一个零值,则--ftwrl-wait-threshold选项不起作用。个人觉得从官方文档上的解释来看, 以--ftwrl-wait-timeout选项指定的值为准即可,--ftwrl-wait-threshold选项在--ftwrl-wait-timeout选项为零值与非零值时都不起作用。
 * --ftwrl-wait-threshold选项默认值为60秒

ftwrl-wait-timeout=20

root@localhost : (none) 07:35:44> show processlist;
+----+-------------+-----------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-------------+-----------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
| 4 | system user | | NULL | Connect | 34367 | executing | NULL |
| 7 | system user | | NULL | Connect | 34257 | Slave has read all relay log; waiting for more updates | NULL |
| 8 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 9 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 10 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 11 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 12 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 13 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 14 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 15 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 16 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 17 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 18 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 19 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 20 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 21 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 22 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 23 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 45 | root | localhost | test | Query | 23 | User sleep | select sleep(200),id from sbtest1 |
| 55 | root | localhost | NULL | Query | 0 | starting | show processlist |
+----+-------------+-----------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
20 rows in set (0.00 sec)

# /usr/local/xtrabackup-2.4/bin/innobackupex --defaults-file=/etc/mysql/my.cnf --user=powdba --password=abc123 --host=127.0.0.1 --ftwrl-wait-timeout=20 /data/backup/
...
190319 19:57:40 >> log scanned up to (5875859480)
190319 19:57:40 Waiting 20 seconds for queries running longer than 60 seconds to finish
190319 19:57:40 Waiting for query 45 (duration 138 sec): select sleep(200),id from sbtest1190319 19:57:41 >> log scanned up to (5875859480)
190319 19:57:41 Waiting for query 45 (duration 139 sec): select sleep(200),id from sbtest1190319 19:57:42 >> log scanned up to (5875859480)
190319 19:57:42 Waiting for query 45 (duration 140 sec): select sleep(200),id from sbtest1190319 19:57:43 >> log scanned up to (5875859480)
190319 19:57:43 Waiting for query 45 (duration 141 sec): select sleep(200),id from sbtest1190319 19:57:44 >> log scanned up to (5875859480)
190319 19:57:44 Waiting for query 45 (duration 142 sec): select sleep(200),id from sbtest1190319 19:57:45 >> log scanned up to (5875859480)
190319 19:57:45 Waiting for query 45 (duration 143 sec): select sleep(200),id from sbtest1190319 19:57:46 >> log scanned up to (5875859480)
190319 19:57:46 Waiting for query 45 (duration 144 sec): select sleep(200),id from sbtest1190319 19:57:47 >> log scanned up to (5875859480)
190319 19:57:47 Waiting for query 45 (duration 145 sec): select sleep(200),id from sbtest1190319 19:57:48 >> log scanned up to (5875859480)
190319 19:57:48 Waiting for query 45 (duration 146 sec): select sleep(200),id from sbtest1190319 19:57:49 >> log scanned up to (5875859480)
190319 19:57:49 Waiting for query 45 (duration 147 sec): select sleep(200),id from sbtest1190319 19:57:50 >> log scanned up to (5875859480)
190319 19:57:50 Waiting for query 45 (duration 148 sec): select sleep(200),id from sbtest1190319 19:57:51 >> log scanned up to (5875859480)
190319 19:57:51 Waiting for query 45 (duration 149 sec): select sleep(200),id from sbtest1190319 19:57:52 >> log scanned up to (5875859480)
190319 19:57:52 Waiting for query 45 (duration 150 sec): select sleep(200),id from sbtest1190319 19:57:53 >> log scanned up to (5875859480)
190319 19:57:53 Waiting for query 45 (duration 151 sec): select sleep(200),id from sbtest1190319 19:57:54 >> log scanned up to (5875859480)
190319 19:57:54 Waiting for query 45 (duration 152 sec): select sleep(200),id from sbtest1190319 19:57:55 >> log scanned up to (5875859480)
190319 19:57:55 Waiting for query 45 (duration 153 sec): select sleep(200),id from sbtest1190319 19:57:56 >> log scanned up to (5875859480)
190319 19:57:56 Waiting for query 45 (duration 154 sec): select sleep(200),id from sbtest1190319 19:57:57 >> log scanned up to (5875859480)
190319 19:57:57 Waiting for query 45 (duration 155 sec): select sleep(200),id from sbtest1190319 19:57:58 >> log scanned up to (5875859480)
190319 19:57:58 Waiting for query 45 (duration 156 sec): select sleep(200),id from sbtest1190319 19:57:59 >> log scanned up to (5875859480)
190319 19:57:59 Waiting for query 45 (duration 157 sec): select sleep(200),id from sbtest1190319 19:58:00 >> log scanned up to (5875859480)
190319 19:58:00 Waiting for query 45 (duration 158 sec): select sleep(200),id from sbtest1190319 19:58:01 >> log scanned up to (5875859480)
190319 19:58:01 Unable to obtain lock. Please try again later.
备份期间其他session
root@localhost : test 07:57:45> show processlist;
+----+-------------+-----------------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-------------+-----------------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
| 4 | system user | | NULL | Connect | 34494 | executing | NULL |
| 7 | system user | | NULL | Connect | 34384 | Slave has read all relay log; waiting for more updates | NULL |
| 8 | system user | | NULL | Connect | 34494 | Waiting for an event from Coordinator | NULL |
| 9 | system user | | NULL | Connect | 34494 | Waiting for an event from Coordinator | NULL |
| 10 | system user | | NULL | Connect | 34494 | Waiting for an event from Coordinator | NULL |
| 11 | system user | | NULL | Connect | 34494 | Waiting for an event from Coordinator | NULL |
| 12 | system user | | NULL | Connect | 34494 | Waiting for an event from Coordinator | NULL |
| 13 | system user | | NULL | Connect | 34494 | Waiting for an event from Coordinator | NULL |
| 14 | system user | | NULL | Connect | 34494 | Waiting for an event from Coordinator | NULL |
| 15 | system user | | NULL | Connect | 34493 | Waiting for an event from Coordinator | NULL |
| 16 | system user | | NULL | Connect | 34493 | Waiting for an event from Coordinator | NULL |
| 17 | system user | | NULL | Connect | 34493 | Waiting for an event from Coordinator | NULL |
| 18 | system user | | NULL | Connect | 34493 | Waiting for an event from Coordinator | NULL |
| 19 | system user | | NULL | Connect | 34493 | Waiting for an event from Coordinator | NULL |
| 20 | system user | | NULL | Connect | 34493 | Waiting for an event from Coordinator | NULL |
| 21 | system user | | NULL | Connect | 34493 | Waiting for an event from Coordinator | NULL |
| 22 | system user | | NULL | Connect | 34493 | Waiting for an event from Coordinator | NULL |
| 23 | system user | | NULL | Connect | 34493 | Waiting for an event from Coordinator | NULL |
| 45 | root | localhost | test | Query | 150 | User sleep | select sleep(200),id from sbtest1 |
| 55 | root | localhost | test | Query | 0 | starting | show processlist |
| 57 | powdba | 127.0.0.1:35810 | NULL | Sleep | 1 | | NULL |
+----+-------------+-----------------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
21 rows in set (0.00 sec)

ftwrl-wait-timeout=20

root@localhost : (none) 07:55:44> show processlist;
+----+-------------+-----------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-------------+-----------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
| 4 | system user | | NULL | Connect | 34367 | executing | NULL |
| 7 | system user | | NULL | Connect | 34257 | Slave has read all relay log; waiting for more updates | NULL |
| 8 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 9 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 10 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 11 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 12 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 13 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 14 | system user | | NULL | Connect | 34367 | Waiting for an event from Coordinator | NULL |
| 15 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 16 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 17 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 18 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 19 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 20 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 21 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 22 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 23 | system user | | NULL | Connect | 34366 | Waiting for an event from Coordinator | NULL |
| 45 | root | localhost | test | Query | 1 | User sleep | select sleep(200),id from sbtest1 |
| 55 | root | localhost | NULL | Query | 0 | starting | show processlist |
+----+-------------+-----------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
20 rows in set (0.00 sec)


# /usr/local/xtrabackup-2.4/bin/innobackupex --defaults-file=/etc/mysql/my.cnf --user=powdba --password=abc123 --host=127.0.0.1 --ftwrl-wait-timeout=20 /data/backup/
...
190319 20:00:29 >> log scanned up to (5875859480)
190319 20:00:29 Waiting 20 seconds for queries running longer than 60 seconds to finish
190319 20:00:29 Executing FLUSH TABLES WITH READ LOCK...
190319 20:00:30 >> log scanned up to (5875859480)
190319 20:00:31 >> log scanned up to (5875859480)
190319 20:00:32 >> log scanned up to (5875859480)
190319 20:00:33 >> log scanned up to (5875859480)
190319 20:00:34 >> log scanned up to (5875859480)
190319 20:00:35 >> log scanned up to (5875859480)
190319 20:00:36 >> log scanned up to (5875859480)
190319 20:00:37 >> log scanned up to (5875859480)
190319 20:00:38 >> log scanned up to (5875859480)
190319 20:00:39 >> log scanned up to (5875859480)
190319 20:00:40 >> log scanned up to (5875859480)
190319 20:00:41 >> log scanned up to (5875859480)
190319 20:00:42 >> log scanned up to (5875859480)
190319 20:00:43 >> log scanned up to (5875859480)
190319 20:00:44 >> log scanned up to (5875859480)
190319 20:00:45 >> log scanned up to (5875859480)
190319 20:00:46 >> log scanned up to (5875859480)
190319 20:00:47 >> log scanned up to (5875859480)
190319 20:00:48 >> log scanned up to (5875859480)
190319 20:00:49 >> log scanned up to (5875859480)
190319 20:00:50 >> log scanned up to (5875859480)
190319 20:00:51 >> log scanned up to (5875859480)
190319 20:00:52 >> log scanned up to (5875859480)
备份期间其他session
root@localhost : test 08:00:32> show processlist;
+----+-------------+-----------------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-------------+-----------------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
| 4 | system user | | NULL | Connect | 34655 | executing | NULL |
| 7 | system user | | NULL | Connect | 34545 | Slave has read all relay log; waiting for more updates | NULL |
| 8 | system user | | NULL | Connect | 34655 | Waiting for an event from Coordinator | NULL |
| 9 | system user | | NULL | Connect | 34655 | Waiting for an event from Coordinator | NULL |
| 10 | system user | | NULL | Connect | 34655 | Waiting for an event from Coordinator | NULL |
| 11 | system user | | NULL | Connect | 34655 | Waiting for an event from Coordinator | NULL |
| 12 | system user | | NULL | Connect | 34655 | Waiting for an event from Coordinator | NULL |
| 13 | system user | | NULL | Connect | 34655 | Waiting for an event from Coordinator | NULL |
| 14 | system user | | NULL | Connect | 34655 | Waiting for an event from Coordinator | NULL |
| 15 | system user | | NULL | Connect | 34654 | Waiting for an event from Coordinator | NULL |
| 16 | system user | | NULL | Connect | 34654 | Waiting for an event from Coordinator | NULL |
| 17 | system user | | NULL | Connect | 34654 | Waiting for an event from Coordinator | NULL |
| 18 | system user | | NULL | Connect | 34654 | Waiting for an event from Coordinator | NULL |
| 19 | system user | | NULL | Connect | 34654 | Waiting for an event from Coordinator | NULL |
| 20 | system user | | NULL | Connect | 34654 | Waiting for an event from Coordinator | NULL |
| 21 | system user | | NULL | Connect | 34654 | Waiting for an event from Coordinator | NULL |
| 22 | system user | | NULL | Connect | 34654 | Waiting for an event from Coordinator | NULL |
| 23 | system user | | NULL | Connect | 34654 | Waiting for an event from Coordinator | NULL |
| 45 | root | localhost | test | Query | 34 | User sleep | select sleep(200),id from sbtest1 |
| 55 | root | localhost | test | Query | 0 | starting | show processlist |
| 59 | powdba | 127.0.0.1:35812 | NULL | Query | 4 | Waiting for table flush | FLUSH TABLES WITH READ LOCK |
+----+-------------+-----------------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
21 rows in set (0.00 sec)

###一直处于等待中
root@localhost : test 08:00:33> select * from sbtest1 limit 10;

小结

  • 设置--ftwrl-wait-timeout=20时。在执行flush table with read lock时,如果堵塞SQL此时执行时间大于参数ftwrl-wait-timeout(即10s) ,xtrabackup线程在执行flush table with read lock时会超时退出。
  • 设置--ftwrl-wait-timeout=20时。在执行flush table with read lock时,如果堵塞SQL此时执行时间小于参数ftwrl-wait-timeout(即10s) ,xtrabackup线程在执行flush table with read lock时不会超时退出,会一直备份,但是不影响业务运行。

风险

  • 当一条长查询刚开始执行,还没有到达ftwrl-wait-threshold指定的时间时,备份的FTWRL发送到了数据库,这时xtrabackup不会认为这条长查询是一条长查询。

解决方案2:设置kill-long-queries-timeout

参数解析

  • --kill-long-queries-timeout=SECONDS

    • 该选项指定innobackupex在执行FLUSH TABLES WITH READ LOCK时碰到阻塞其获得锁的查询时,等待该参数指定的秒数之后,如果仍然有查询在运行,则执行kill掉这些查询
    • 默认值为0,表示innobackupex 不启用尝试kill掉任何查询的功能
  • --kill-long-query-type=all|select

    • 该选项指定哪些类型的查询在指定的查询时间之后还没有执行完成时被kill掉,以释放阻塞加全局读锁的锁,默认值为all,有效值有:all和select
    • 执行该选项需要有process和super权限
    • 执行kill的超时时间由选项--kill-long-queries-timeout指定,默认值为0,表示禁止kill掉超时查询的功能

kill-long-queries-timeout=20

# /usr/local/xtrabackup-2.4/bin/innobackupex --defaults-file=/etc/mysql/my.cnf --user=powdba --password=abc123 --host=127.0.0.1 --kill-long-queries-timeout=20 /data/backup/
...
190319 20:07:43 Executing FLUSH TABLES WITH READ LOCK...
190319 20:07:43 Kill query timeout 20 seconds.
190319 20:07:44 >> log scanned up to (5875859480)
190319 20:07:45 >> log scanned up to (5875859480)
190319 20:07:46 >> log scanned up to (5875859480)
190319 20:07:47 >> log scanned up to (5875859480)
190319 20:07:48 >> log scanned up to (5875859480)
190319 20:07:49 >> log scanned up to (5875859480)
190319 20:07:50 >> log scanned up to (5875859480)
190319 20:07:51 >> log scanned up to (5875859480)
190319 20:07:52 >> log scanned up to (5875859480)
190319 20:07:53 >> log scanned up to (5875859480)
190319 20:07:54 >> log scanned up to (5875859480)
190319 20:07:55 >> log scanned up to (5875859480)
190319 20:07:56 >> log scanned up to (5875859480)
190319 20:07:57 >> log scanned up to (5875859480)
190319 20:07:58 >> log scanned up to (5875859480)
190319 20:07:59 >> log scanned up to (5875859480)
190319 20:08:00 >> log scanned up to (5875859480)
190319 20:08:01 >> log scanned up to (5875859480)
190319 20:08:02 >> log scanned up to (5875859480)
190319 20:08:03 Connecting to MySQL server host: 127.0.0.1, user: powdba, password: set, port: 3306, socket: /var/lib/mysql/data/sock/mysql.sock
190319 20:08:03 Killing query 45 (duration 61 sec): select sleep(200),id from sbtest1
190319 20:08:03 Kill query thread stopped
190319 20:08:03 Starting to backup non-InnoDB tables and files

root@localhost : test 08:07:01> select sleep(200),id from sbtest1;
ERROR 2013 (HY000): Lost connection to MySQL server during query
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id: 64
Current database: test
...
备份期间其他会话
root@localhost : test 08:13:38> show processlist;
+----+-------------+-----------------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-------------+-----------------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
| 4 | system user | | NULL | Connect | 35442 | executing | NULL |
| 7 | system user | | NULL | Connect | 35332 | Slave has read all relay log; waiting for more updates | NULL |
| 8 | system user | | NULL | Connect | 35442 | Waiting for an event from Coordinator | NULL |
| 9 | system user | | NULL | Connect | 35442 | Waiting for an event from Coordinator | NULL |
| 10 | system user | | NULL | Connect | 35442 | Waiting for an event from Coordinator | NULL |
| 11 | system user | | NULL | Connect | 35442 | Waiting for an event from Coordinator | NULL |
| 12 | system user | | NULL | Connect | 35442 | Waiting for an event from Coordinator | NULL |
| 13 | system user | | NULL | Connect | 35442 | Waiting for an event from Coordinator | NULL |
| 14 | system user | | NULL | Connect | 35442 | Waiting for an event from Coordinator | NULL |
| 15 | system user | | NULL | Connect | 35441 | Waiting for an event from Coordinator | NULL |
| 16 | system user | | NULL | Connect | 35441 | Waiting for an event from Coordinator | NULL |
| 17 | system user | | NULL | Connect | 35441 | Waiting for an event from Coordinator | NULL |
| 18 | system user | | NULL | Connect | 35441 | Waiting for an event from Coordinator | NULL |
| 19 | system user | | NULL | Connect | 35441 | Waiting for an event from Coordinator | NULL |
| 20 | system user | | NULL | Connect | 35441 | Waiting for an event from Coordinator | NULL |
| 21 | system user | | NULL | Connect | 35441 | Waiting for an event from Coordinator | NULL |
| 22 | system user | | NULL | Connect | 35441 | Waiting for an event from Coordinator | NULL |
| 23 | system user | | NULL | Connect | 35441 | Waiting for an event from Coordinator | NULL |
| 55 | root | localhost | test | Query | 0 | starting | show processlist |
| 70 | root | localhost | test | Query | 40 | User sleep | select sleep(200),id from sbtest1 |
| 71 | powdba | 127.0.0.1:35826 | NULL | Query | 8 | Waiting for table flush | FLUSH TABLES WITH READ LOCK |
+----+-------------+-----------------+------+---------+-------+--------------------------------------------------------+-----------------------------------+
21 rows in set (0.00 sec)

### 再次其他表数据时会堵塞,这是flush table的作用
root@localhost : test 08:13:40> select * from test limit 10;
+----+------+------+------+
| id | sid | mid | name |
+----+------+------+------+
| 1 | 86 | 0 | OU |
| 8 | 38 | 51 | Ai |
| 15 | 82 | 24 | AM |
| 22 | 68 | 47 | T7 |
| 29 | 83 | 72 | tj |
| 36 | 48 | 35 | gw |
| 43 | 89 | 73 | uJ |
| 50 | 0 | 63 | aX |
| 57 | 42 | 83 | iU |
| 64 | 43 | 7 | 3b |
+----+------+------+------

### 再次查找该表数据时会堵塞
root@localhost : test 08:11:58> select * from sbtest1 limit 10;

小结

  • 当使用xtrabackup参数kill-long-queries-timeout时,会在超过等待时间后kill掉堵塞的SQL
  • 在堵塞flush table with read lock期间,可以访问非堵塞SQL所在的表,不能访问堵塞SQL所在的表。

风险

  • 这种方式对堵塞SQL影响很大,会强制kill掉,对业务会有一定的影响

解决方案3:safe-slave-backup-timeout

参数解析

  • --safe-slave-backup-timeout=SECONDS

    • 该选项在启用 --safe-slave-backup选项时,等待状态变量Slave_open_temp_tables 转变为0值的等待时间,默认值为300秒,如果超过指定的时间状态变量Slave_open_temp_tables 还未转变为0值,则启动复制并报错终止备份

风险

  • 这种方式通过关闭sql thread的方式避免死锁的发生,但是会加大主从的延迟

解决方案4:使用脚本kill xtrabackup query的线程

image

风险:

kill query时,如果存在flush tables with read lock被一条SQL阻塞,那么kill query后,这个SQL不进行完或者被kill掉,那么之后对这条SQL所对应的表做的任何操作都会被阻塞,需要等待这条SQL执行完毕或者被kill掉或者等待lock_wait_timeout时间。