【作者】

王栋:携程技术保障中心数据库专家,对数据库疑难问题的排查和数据库自动化智能化运维工具的开发有强烈的兴趣。

【问题描述】

最近碰到有台MySQL实例出现了MySQL服务短暂hang死,表现为瞬间的并发线程上升,连接数暴增。
排查Error Log文件中有page_cleaner超时的信息,引起我们的关注:

2019-08-24T23:47:09.361836+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 24915ms. The settings might not be optimal. (flushed=182 and evicted=0, during the time.)
2019-08-24T23:47:16.211740+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4849ms. The settings might not be optimal. (flushed=240 and evicted=0, during the time.)
2019-08-24T23:47:23.362286+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6151ms. The settings might not be optimal. (flushed=215 and evicted=0, during the time.)

 

【问题分析】

1、 error log中page_cleaner信息是如何产生的

通过源码storage/innobase/buf/buf0flu.cc可以看出,当满足curr_time > next_loop_time + 3000条件(大于4秒)时,会打印上面信息到error log中。next_loop_time为1000ms,即1秒钟做一次刷新页的操作。

 if (ret_sleep == OS_SYNC_TIME_EXCEEDED) {                      ulint curr_time = ut_time_ms();                       if (curr_time > next_loop_time + 3000) {                             if (warn_count == 0) {                                    ib::info() << "page_cleaner: 1000ms"                                           " intended loop took "                                           << 1000 + curr_time                                              - next_loop_time                                           << "ms. The settings might not"                                           " be optimal. (flushed="                                           << n_flushed_last                                           << " and evicted="                                           << n_evicted                                           << ", during the time.)";                                    if (warn_interval > 300) {                                           warn_interval = 600;                                    } else {                                           warn_interval *= 2;                                    }                                     warn_count = warn_interval;                             } else {                                    --warn_count;                             }                      } else {                             /* reset counter */                             warn_interval = 1;                             warn_count = 0;                      }                       next_loop_time = curr_time + 1000;                      n_flushed_last = n_evicted = 0;               }  

后半部分(flushed=182 and evicted=0, during the time.)日志,对应n_flushed_lastn_evicted两个变量,这两个变量来自于下面2个变量。

 n_evicted += n_flushed_lru;
n_flushed_last += n_flushed_list;

 

从源码注释中可以看出,n_flushed_lru表示从LRU list尾部刷新的页数,也就是日志中如evicted=0指标的所表示的值。
n_flushed_list:这个是从flush_list刷新列表中刷新的页数,也就是脏页数,日志中flushed=182的值。