Query Rewrite (for Banking)

技术框架

1 数据分布分析(确认数据特征适合重写,例如有足够多的不同索引;数据达到一定量级、分布有一定的diverse);
2 查询聚类、模板抽取(正则匹配);
3 针对模板进行重写(calcite重写规则、学习型规则应用算法);
Group
Rule
Aggregate_group
AGGREGATE_EXPAND_DISTINCT_AGGREGATES
AGGREGATE_EXPAND_DISTINCT_AGGREGATES_TO_JOIN
AGGREGATE_JOIN_TRANSPOSE_EXTENDED
AGGREGATE_PROJECT_MERGE
AGGREGATE_ANY_PULL_UP_CONSTANTS
AGGREGATE_UNION_AGGREGATE
AGGREGATE_UNION_TRANSPOSE
AGGREGATE_VALUES
AGGREGATE_INSTANCE
Filter_group
FILTER_AGGREGATE_TRANSPOSE
FILTER_CORRELATE
FILTER_INTO_JOIN
JOIN_CONDITION_PUSH
FILTER_MERGE
FILTER_MULTI_JOIN_MERGE
FILTER_PROJECT_TRANSPOSE
FILTER_SET_OP_TRANSPOSE
FILTER_TABLE_FUNCTION_TRANSPOSE
FILTER_SCAN
FILTER_REDUCE_EXPRESSIONS
PROJECT_REDUCE_EXPRESSIONS
FILTER_INSTANCE
Join_group
JOIN_EXTRACT_FILTER
JOIN_PROJECT_BOTH_TRANSPOSE
JOIN_PROJECT_LEFT_TRANSPOSE
JOIN_PROJECT_RIGHT_TRANSPOSE
JOIN_LEFT_UNION_TRANSPOSE
JOIN_RIGHT_UNION_TRANSPOSE
SEMI_JOIN_REMOVE
JOIN_REDUCE_EXPRESSIONS
JOIN_LEFT_INSTANCE
JOIN_RIGHT_INSTANCE
Project_group
PROJECT_CALC_MERGE
PROJECT_CORRELATE_TRANSPOSE
PROJECT_MERGE
PROJECT_MULTI_JOIN_MERGE
PROJECT_REMOVE
PROJECT_TO_CALC
PROJECT_SUB_QUERY_TO_CORRELATE
PROJECT_REDUCE_EXPRESSIONS
PROJECT_INSTANCE
Calculate_group
CALC_MERGE
CALC_REMOVE
Sort_group
SORT_JOIN_TRANSPOSE
SORT_PROJECT_TRANSPOSE
SORT_UNION_TRANSPOSE
SORT_REMOVE_CONSTANT_KEYS
SORT_REMOVE
SORT_INSTANCE
SORT_FETCH_ZERO_INSTANCE
Set_group
UNION_MERGE
UNION_REMOVE
UNION_TO_DISTINCT
UNION_PULL_UP_CONSTANTS
UNION_INSTANCE
INTERSECT_INSTANCE
MINUS_INSTANCE
4 针对所有查询进行重写(直接进行模板替换);
5 report重写结果。

数据集概览

数据库
数据集大小(采样)
数据表数目
SQL数目
orders
1.95GB
35
1328

第一步 数据分布

表名
数据量
索引数目
order_operate_log
7484302
3
order_sale_serial
1137396
4
order_auditbiz_exam_item
863317
4
order_auditbiz_info
259631
3
order_info
243344
7
order_auditbiz_exam
225858
3
order_sub_info
223923
6
order_after_sale_log
17343
1
order_exchange_serial
12020
5
order_after_sale
4316
6
order_refund
3864
4
order_refund_item
3684
7
order_returns_item
2688
7
order_returns
2642
4
order_coupon
641
3
jd_order_item
288
4
order_auditbiz_detail
278
3
jd_order_info
272
2
order_config
192
0
order_dict
142
0
order_returns_photo
118
2
order_config_log
49
3
message_no_read
18
1
message_record
6
1
order_sequence
5
1
sale_cat_dic
0
1
order_time_stock
0
0
order_item_stock
0
3
order_item_appraise
0
3
order_invoice
0
0
order_info_old
0
2
order_channel_dic
0
1
order_auditbiz_item
0
0
order_appraise
0
0
order_address
0
0
O1. Top 18大表都至少创建了一个索引,便于进行查询重写获得收益;
O2. 有十张表为空,相关的查询语句应该被移除,因为q这些查询执行结果大概率为0;
O3. 有两张非空表没有创建索引,相关数据列不应该进行“谓词优化”

第二步 查询聚类、模板抽取

基于第一步中的数据分析结果(如不能包含10张空表),对于所有的1328条SQL语句,我们过滤出83条模板:
因为4-5个tokens以内的SQL一般不需要复杂的查询重写能力,本次我们主要提取 token数目在10以及10以上 的SQL模板(共计53条)进行重写。

第三步 针对模板进行查询重写

针对筛选出来的53条查询模板,基于calcite中的31条查询重写规则(自定义为规则组),我们利用自研的MCTS查询重写算法进行模板重写,总体收益如下:
Template 1 (6条对应SQL)
重写前
select count(*) from (select a.order_id from order_info a left join order_sub_info b on a.order_id = b.order_id WHERE a.status = 0 and ( b.employee_dept_code like '2311%' or b.employee_dept_code like '3311%' ) group by order_id ) t1;
重写后
SELECT COUNT(*) FROM (SELECT order_id FROM order_info WHERE status = 0 GROUP BY order_id) AS t61 INNER JOIN (SELECT order_id FROM order_sub_info WHERE employee_dept_code LIKE '2311%' OR employee_dept_code LIKE '3311%' GROUP BY order_id) AS t63 ON t61.order_id = t63.order_id;
Template 2 (1条对应SQL)
重写前
SELECT count( case when a.sub_order_status = '01' then 1 end ) as un_paid_count, count( case when a.sub_order_status = '02' then 1 end ) as un_delivered_count, count( case when a.sub_order_status = '03' then 1 end) as delivered_count from (select t1.sub_order_status, t1.order_id from order_sub_info t1 LEFT JOIN order_info t2 on t1.order_id = t2.order_id WHERE t2.customer_id = 'ae5962b755f74142b38e4edcb7d92d49' and t1.province_info = 'HN' and t1.sub_order_type != '04' GROUP BY t1.sub_order_status, t1.order_id) a;
重写后
SELECT
COUNT(
CASE
WHEN t1236.sub_order_status = '01' THEN 1
ELSE NULL
END
) AS un_paid_count,
COUNT(
CASE
WHEN t1236.sub_order_status = '02' THEN 1
ELSE NULL
END
) AS un_delivered_count,
COUNT(
CASE
WHEN t1236.sub_order_status = '03' THEN 1
ELSE NULL
END
) AS delivered_count
FROM
(
SELECT
order_id,
sub_order_status
FROM
order_sub_info
WHERE
province_info = 'HN'
AND sub_order_type <> '04'
GROUP BY
order_id,
sub_order_status
) AS t1236
INNER JOIN (
SELECT
order_id
FROM
order_info
WHERE
customer_id = 'ae5962b755f74142b38e4edcb7d92d49'
GROUP BY
order_id
) AS t1238 ON t1236.order_id = t1238.order_id;
Template 3 (10条对应SQL)
重写前
SELECT
osi.order_id,
osi.sub_order_id,
osi.order_total_pay_price,
osi.create_time
FROM
order_sub_info osi
WHERE
order_total_pay_price > 0
AND order_id IN (
SELECT
order_id
FROM
order_info
WHERE
is_invoices = '32'
AND order_id IN (
SELECT
DISTINCT order_id
FROM
order_auditbiz_info
WHERE
consignee_telphone = '15888276378'))
AND sub_order_status IN ('02', '03', '05', '08');
重写后
SELECT
t933.order_id,
t933.sub_order_id,
t933.order_total_pay_price,
t933.create_time
FROM (
SELECT
*
FROM
order_sub_info
WHERE
order_total_pay_price > 0
AND sub_order_status IN ('02', '03', '05', '08') ) AS t933,
(SELECT
t934.order_id
FROM (
SELECT
*
FROM
order_info
WHERE
is_invoices = '32'
) AS t934, (
SELECT
order_id
FROM (
SELECT
order_id
FROM
order_auditbiz_info
WHERE
consignee_telphone = '15888276378'
GROUP BY
order_id ) AS t937
GROUP BY
order_id ) AS t938
WHERE
t934.order_id = t938.order_id
GROUP BY
t934.order_id ) AS t941
WHERE
t933.order_id = t941.order_id;
Template 4 (2条对应SQL)
重写前
select
t.*,
h.store_id,
q.create_time,
q.order_delevery_cat
from
order_auditbiz_info t
left join order_sub_info h on t.sub_order_id = h.sub_order_id
left join order_info q on q.order_id = t.order_id
where
1 = 1
and (t.employee_dept_code like '38718770900000000000')
order by
t.create_time DESC
limit
0, 10;
重写后
SELECT
"order_auditbiz_info89".*,
"order_sub_info91"."store_id",
"order_info47"."create_time" AS "create_time0",
"order_info47"."order_delevery_cat"
FROM
"order_auditbiz_info" AS "order_auditbiz_info89"
LEFT JOIN "order_sub_info" AS "order_sub_info91" ON "order_auditbiz_info89"."sub_order_id" = "order_sub_info91"."sub_order_id"
LEFT JOIN "order_info" AS "order_info47" ON "order_auditbiz_info89"."order_id" = "order_info47"."order_id"
WHERE
1 = 1
AND "order_auditbiz_info89"."employee_dept_code" LIKE '38718770900000000000'
ORDER BY
"order_auditbiz_info89"."create_time" DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;