当前位置：首页 > news >正文

一条慢 SQL 引发的血案，索引优化远比你想象的复杂

news 2026/6/7 21:01:22

上周五下午正摸鱼呢，报警就来了：某业务接口响应时间突然飙到十几秒。顺着链路一摸，好家伙，是一条查用户订单的 SQL 在作妖。原以为加个索引就完事，结果折腾了一下午，才把这条慢 SQL 真正摁住。回头想想，索引优化这玩意儿，真的比想象中复杂得多。

那条 SQL 长这样：

SELECT * FROM orders 
WHERE status = 'paid' 
AND created_at > '2025-01-01' 
ORDER BY id DESC LIMIT 20;

当时第一反应是：status 和 created_at 都有单独索引，怎么还能慢？EXPLAIN 一看却傻了眼——type=ALL，全表扫描。细查才发现，status 字段只有三个值（'pending'、'paid'、'cancelled'），基数值极低，优化器觉得用索引还不如直接扫表。但 created_at 的范围查询又没被用上，因为执行计划最终只选了 status 索引，一过滤剩下几十万行，再按 id 倒序排序，不慢才怪。

这个教训让我决定写个小工具，专门帮自己（也帮团队里不爱看执行计划的同事）从慢查询日志里揪出索引的坑，然后给出可落地的优化建议。脚本是 Python 写的，核心逻辑不长，但藏着不少坑。

先从解析慢查询日志开始。MySQL 的慢日志格式比较固定，我用正则把 Query_time、Lock_time、具体的 SQL 和所在的库名抽出来：

import redef parse_slow_log(log_path):queries = []with open(log_path) as f:current_query = {}for line in f:# 时间信息行，例如：# Time: 2025-01-15T08:20:42.123456Zif line.startswith('# Time:'):if current_query:queries.append(current_query)current_query = {'sql': []}elif line.startswith('# Query_time:'):# 提取查询耗时和锁等待match = re.search(r'Query_time:\s*([\d.]+)\s+Lock_time:\s*([\d.]+)', line)if match:current_query['query_time'] = float(match.group(1))current_query['lock_time'] = float(match.group(2))elif line.startswith('use '):current_query['database'] = line.split()[1].strip(';')elif line.startswith('SET timestamp'):continueelse:# 去除行首注释符clean = re.sub(r'^#\s*', '', line).strip()if clean and not clean.startswith('Time:') and not clean.startswith('User@Host'):current_query['sql'].append(clean)if current_query:queries.append(current_query)# 把多行 SQL 拼起来for q in queries:q['sql'] = ' '.join(q['sql'])return [q for q in queries if q.get('sql')]

拿到 SQL 列表后，第二步是对每一条执行 EXPLAIN，而且要拿 JSON 格式的，方便读成本、执行计划细节：

import pymysqldef get_explain_json(conn, database, sql):if database:conn.select_db(database)with conn.cursor() as cursor:cursor.execute(f"EXPLAIN FORMAT=JSON {sql}")result = cursor.fetchone()return result[0] if result else None

接下来就是重头戏：生成索引建议。官方文档光说“看 type 和 key”，但实际问题太刁钻了。我给自己写了个 suggest_index 函数，它不仅看全表扫描，还会去 information_schema 里查列的基数、覆盖索引的可能性、是否有排序或临时表的坑，然后把建议打印出来：

def suggest_index(conn, database, sql, explain_json):# 获取查询中涉及的表和 WHERE 条件字段（这里简化处理，实际需用 sqlparse）tables = ...  # 假设已解析出表名和别名where_cols = ...  # 解析出条件中的列advice = []query_cost = explain_json['query_block']['cost_info']['query_cost']print(f"查询成本: {query_cost}")# 检查是否使用到临时表或文件排序extra_info = explain_json['query_block'].get('ordering_operation', {})if 'using_filesort' in str(extra_info).lower():advice.append("检测到 Using filesort，建议在 ORDER BY 字段上创建索引，避免排序。")if 'using_temporary' in str(extra_info).lower():advice.append("检测到 Using temporary，考虑为 GROUP BY 和 ORDER BY 字段建联合索引。")# 如果索引使用情况不理想（全表扫描或索引类型太差）table_scans = [t for t in explain_json['query_block'].get('table', [])]for tbl in table_scans:if tbl.get('access_type') in ('ALL', 'index'):table_name = tbl['table_name']adv = f"表 `{table_name}` 发生全扫描（type={tbl['access_type']}），"# 查看条件列的选择性if where_cols:for col in where_cols:sel = get_selectivity(conn, database, table_name, col)if sel:adv += f"列 `{col}` 选择性为 {sel:.4f}。"if sel > 0.1:  # 选择性 > 10%，索引有效adv += " 建议为该列创建索引。"else:adv += " 选择性过低，单列索引可能无效，考虑联合索引或改写 SQL。"else:adv += " 无法获取基数信息。"advice.append(adv)return advicedef get_selectivity(conn, database, table, column):"""从 information_schema 获取列的选择性（区分度）"""sql = f"""SELECT (SELECT COUNT(DISTINCT {column}) FROM {database}.{table}) / (SELECT COUNT(*) FROM {database}.{table})"""try:with conn.cursor() as cur:cur.execute(sql)result = cur.fetchone()return float(result[0]) if result and result[0] else Noneexcept:return None

这个脚本的亮点就是它不会无脑让你加索引，而是会看选择性。像之前 status 那种低基数列，它就会提醒你注意，甚至建议考虑联合索引把高基数列放在前面。而且它还会抓出 Using filesort 这种隐蔽的性能杀手——有时候就算用了索引，排序照样走全表，就是因为没建到排序字段上。

我当时跑了一遍慢查询日志，它给我那订单 SQL 的建议是：“status 选择性 0.0033，单列索引无效；建议创建 (created_at, status) 的联合索引，同时考虑将 ORDER BY id 改为 ORDER BY created_at 或为 id 增加覆盖。” 照着改完，查询时间从十几秒降到 0.03 秒，瞬间清净。

现在我把这个脚本挂在巡检流程里，每次发版前跑一下慢查询日志，有潜在问题就拦下来。索引优化就是这样，不是简单加个 KEY 就好，得看基数、看回表、看排序，甚至还得猜优化器那点小心思。希望这个工具能帮各位少踩点坑。

（我把完整脚本放附件了，有需要的自行下载折腾）

查看全文

http://www.jsqmd.com/news/970366/