Query Rewrite (for Banking)

1 数据分布分析(确认数据特征适合重写,例如有足够多的不同索引;数据达到一定量级、分布有一定的diverse);
2 查询聚类、模板抽取(正则匹配);
3 针对模板进行重写(calcite重写规则、学习型规则应用算法);
Group | Rule |
Aggregate_group | AGGREGATE_EXPAND_DISTINCT_AGGREGATES |
| AGGREGATE_EXPAND_DISTINCT_AGGREGATES_TO_JOIN |
| AGGREGATE_JOIN_TRANSPOSE_EXTENDED |
| AGGREGATE_PROJECT_MERGE |
| AGGREGATE_ANY_PULL_UP_CONSTANTS |
| AGGREGATE_UNION_AGGREGATE |
| AGGREGATE_UNION_TRANSPOSE |
| AGGREGATE_VALUES |
| AGGREGATE_INSTANCE |
Filter_group | FILTER_AGGREGATE_TRANSPOSE |
| FILTER_CORRELATE |
| FILTER_INTO_JOIN |
| JOIN_CONDITION_PUSH |
| FILTER_MERGE |
| FILTER_MULTI_JOIN_MERGE |
| FILTER_PROJECT_TRANSPOSE |
| FILTER_SET_OP_TRANSPOSE |
| FILTER_TABLE_FUNCTION_TRANSPOSE |
| FILTER_SCAN |
| FILTER_REDUCE_EXPRESSIONS |
| PROJECT_REDUCE_EXPRESSIONS |
| FILTER_INSTANCE |
Join_group | JOIN_EXTRACT_FILTER |
| JOIN_PROJECT_BOTH_TRANSPOSE |
| JOIN_PROJECT_LEFT_TRANSPOSE |
| JOIN_PROJECT_RIGHT_TRANSPOSE |
| JOIN_LEFT_UNION_TRANSPOSE |
| JOIN_RIGHT_UNION_TRANSPOSE |
| SEMI_JOIN_REMOVE |
| JOIN_REDUCE_EXPRESSIONS |
| JOIN_LEFT_INSTANCE |
| JOIN_RIGHT_INSTANCE |
Project_group | PROJECT_CALC_MERGE |
| PROJECT_CORRELATE_TRANSPOSE |
| PROJECT_MERGE |
| PROJECT_MULTI_JOIN_MERGE |
| PROJECT_REMOVE |
| PROJECT_TO_CALC |
| PROJECT_SUB_QUERY_TO_CORRELATE |
| PROJECT_REDUCE_EXPRESSIONS |
| PROJECT_INSTANCE |
Calculate_group | CALC_MERGE |
| CALC_REMOVE |
Sort_group | SORT_JOIN_TRANSPOSE |
| SORT_PROJECT_TRANSPOSE |
| SORT_UNION_TRANSPOSE |
| SORT_REMOVE_CONSTANT_KEYS |
| SORT_REMOVE |
| SORT_INSTANCE |
| SORT_FETCH_ZERO_INSTANCE |
Set_group | UNION_MERGE |
| UNION_REMOVE |
| UNION_TO_DISTINCT |
| UNION_PULL_UP_CONSTANTS |
| UNION_INSTANCE |
| INTERSECT_INSTANCE |
| MINUS_INSTANCE |
4 针对所有查询进行重写(直接进行模板替换);
5 report重写结果。
数据库 | 数据集大小(采样) | 数据表数目 | SQL数目 |
orders | 1.95GB | 35 | 1328 |
表名 | 数据量 | 索引数目 |
order_operate_log | 7484302 | 3 |
order_sale_serial | 1137396 | 4 |
order_auditbiz_exam_item | 863317 | 4 |
order_auditbiz_info | 259631 | 3 |
order_info | 243344 | 7 |
order_auditbiz_exam | 225858 | 3 |
order_sub_info | 223923 | 6 |
order_after_sale_log | 17343 | 1 |
order_exchange_serial | 12020 | 5 |
order_after_sale | 4316 | 6 |
order_refund | 3864 | 4 |
order_refund_item | 3684 | 7 |
order_returns_item | 2688 | 7 |
order_returns | 2642 | 4 |
order_coupon | 641 | 3 |
jd_order_item | 288 | 4 |
order_auditbiz_detail | 278 | 3 |
jd_order_info | 272 | 2 |
order_config | 192 | 0 |
order_dict | 142 | 0 |
order_returns_photo | 118 | 2 |
order_config_log | 49 | 3 |
message_no_read | 18 | 1 |
message_record | 6 | 1 |
order_sequence | 5 | 1 |
sale_cat_dic | 0 | 1 |
order_time_stock | 0 | 0 |
order_item_stock | 0 | 3 |
order_item_appraise | 0 | 3 |
order_invoice | 0 | 0 |
order_info_old | 0 | 2 |
order_channel_dic | 0 | 1 |
order_auditbiz_item | 0 | 0 |
order_appraise | 0 | 0 |
order_address | 0 | 0 |
O1. Top 18大表都至少创建了一个索引,便于进行查询重写获得收益;
O2. 有十张表为空,相关的查询语句应该被移除,因为q这些查询执行结果大概率为0;
O3. 有两张非空表没有创建索引,相关数据列不应该进行“谓词优化”
基于第一步中的数据分析结果(如不能包含10张空表),对于所有的1328条SQL语句,我们过滤出83条模板:

因为4-5个tokens以内的SQL一般不需要复杂的查询重写能力,本次我们主要提取 token数目在10以及10以上 的SQL模板(共计53条)进行重写。
针对筛选出来的53条查询模板,基于calcite中的31条查询重写规则(自定义为规则组),我们利用自研的MCTS查询重写算法进行模板重写,总体收益如下:
Template 1 (6条对应SQL)
重写前
select count(*) from (select a.order_id from order_info a left join order_sub_info b on a.order_id = b.order_id WHERE a.status = 0 and ( b.employee_dept_code like '2311%' or b.employee_dept_code like '3311%' ) group by order_id ) t1;
重写后
SELECT COUNT(*) FROM (SELECT order_id FROM order_info WHERE status = 0 GROUP BY order_id) AS t61 INNER JOIN (SELECT order_id FROM order_sub_info WHERE employee_dept_code LIKE '2311%' OR employee_dept_code LIKE '3311%' GROUP BY order_id) AS t63 ON t61.order_id = t63.order_id;
Template 2 (1条对应SQL)
重写前
SELECT count( case when a.sub_order_status = '01' then 1 end ) as un_paid_count, count( case when a.sub_order_status = '02' then 1 end ) as un_delivered_count, count( case when a.sub_order_status = '03' then 1 end) as delivered_count from (select t1.sub_order_status, t1.order_id from order_sub_info t1 LEFT JOIN order_info t2 on t1.order_id = t2.order_id WHERE t2.customer_id = 'ae5962b755f74142b38e4edcb7d92d49' and t1.province_info = 'HN' and t1.sub_order_type != '04' GROUP BY t1.sub_order_status, t1.order_id) a;
重写后
SELECT
COUNT(
CASE
WHEN t1236.sub_order_status = '01' THEN 1
ELSE NULL
END
) AS un_paid_count,
COUNT(
CASE
WHEN t1236.sub_order_status = '02' THEN 1
ELSE NULL
END
) AS un_delivered_count,
COUNT(
CASE
WHEN t1236.sub_order_status = '03' THEN 1
ELSE NULL
END
) AS delivered_count
FROM
(
SELECT
order_id,
sub_order_status
FROM
order_sub_info
WHERE
province_info = 'HN'
AND sub_order_type <> '04'
GROUP BY
order_id,
sub_order_status
) AS t1236
INNER JOIN (
SELECT
order_id
FROM
order_info
WHERE
customer_id = 'ae5962b755f74142b38e4edcb7d92d49'
GROUP BY
order_id
) AS t1238 ON t1236.order_id = t1238.order_id;
Template 3 (10条对应SQL)
重写前
SELECT
osi.order_id,
osi.sub_order_id,
osi.order_total_pay_price,
osi.create_time
FROM
order_sub_info osi
WHERE
order_total_pay_price > 0
AND order_id IN (
SELECT
order_id
FROM
order_info
WHERE
is_invoices = '32'
AND order_id IN (
SELECT
DISTINCT order_id
FROM
order_auditbiz_info
WHERE
consignee_telphone = '15888276378'))
AND sub_order_status IN ('02', '03', '05', '08');
重写后
SELECT
t933.order_id,
t933.sub_order_id,
t933.order_total_pay_price,
t933.create_time
FROM (
SELECT
*
FROM
order_sub_info
WHERE
order_total_pay_price > 0
AND sub_order_status IN ('02', '03', '05', '08') ) AS t933,
(SELECT
t934.order_id
FROM (
SELECT
*
FROM
order_info
WHERE
is_invoices = '32'
) AS t934, (
SELECT
order_id
FROM (
SELECT
order_id
FROM
order_auditbiz_info
WHERE
consignee_telphone = '15888276378'
GROUP BY
order_id ) AS t937
GROUP BY
order_id ) AS t938
WHERE
t934.order_id = t938.order_id
GROUP BY
t934.order_id ) AS t941
WHERE
t933.order_id = t941.order_id;
Template 4 (2条对应SQL)
重写前
select
t.*,
h.store_id,
q.create_time,
q.order_delevery_cat
from
order_auditbiz_info t
left join order_sub_info h on t.sub_order_id = h.sub_order_id
left join order_info q on q.order_id = t.order_id
where
1 = 1
and (t.employee_dept_code like '38718770900000000000')
order by
t.create_time DESC
limit
0, 10;
重写后
SELECT
"order_auditbiz_info89".*,
"order_sub_info91"."store_id",
"order_info47"."create_time" AS "create_time0",
"order_info47"."order_delevery_cat"
FROM
"order_auditbiz_info" AS "order_auditbiz_info89"
LEFT JOIN "order_sub_info" AS "order_sub_info91" ON "order_auditbiz_info89"."sub_order_id" = "order_sub_info91"."sub_order_id"
LEFT JOIN "order_info" AS "order_info47" ON "order_auditbiz_info89"."order_id" = "order_info47"."order_id"
WHERE
1 = 1
AND "order_auditbiz_info89"."employee_dept_code" LIKE '38718770900000000000'
ORDER BY
"order_auditbiz_info89"."create_time" DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;
Last modified 1mo ago