一、概述

为了加速在线分析处理,除了物化视图,Calcite 还引入 Lattice 和 Tile 的概念。Lattice 通过定义 model 的方式,来描述星型查询,之后通过 model 自动生成多个物化结果,查询改写时会使用这些物化结果。Lattice 可以看做是在星型数据模型下对物化视图的推荐、创建和识别的机制。这种推荐可以根据查询的频次统计,也可以基于某些分析维度的重要等级。

二、测试

创建一个包含 Lattice 格的模型:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
{
"version": "1.0",
"defaultSchema": "foodmart",
"schemas": [ {
"type": "jdbc",
"name": "foodmart",
"jdbcUser": "FOODMART",
"jdbcPassword": "FOODMART",
"jdbcUrl": "jdbc:hsqldb:res:foodmart",
"jdbcSchema": "foodmart"
},
{
"name": "adhoc",
"lattices": [ {
"name": "star",
"sql": [
"select 1 from \"foodmart\".\"sales_fact_1997\" as \"s\"",
"join \"foodmart\".\"product\" as \"p\" using (\"product_id\")",
"join \"foodmart\".\"time_by_day\" as \"t\" using (\"time_id\")",
"join \"foodmart\".\"product_class\" as \"pc\" on \"p\".\"product_class_id\" = \"pc\".\"product_class_id\""
],
"auto": true,
"algorithm": true,
"rowCountEstimate": 86837,
"defaultMeasures": [ {
"agg": "count"
} ]
} ]
} ]
}

连接时 Calcite 会运行优化算法,并创建和填充物化视图

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
sqlline> select "the_year","the_month", count(*) as c
. . . .> from "sales_fact_1997"
. . . .> join "time_by_day" using ("time_id")
. . . .> group by "the_year","the_month";
+----------+-----------+------+
| the_year | the_month | C |
+----------+-----------+------+
| 1997 | September | 6663 |
| 1997 | April | 6590 |
| 1997 | January | 7034 |
| 1997 | June | 6912 |
| 1997 | August | 7038 |
| 1997 | February | 6844 |
| 1997 | March | 7710 |
| 1997 | October | 6479 |
| 1997 | May | 6866 |
| 1997 | December | 8717 |
| 1997 | July | 7752 |
| 1997 | November | 8232 |
+----------+-----------+------+
12 rows selected (0.147 seconds)

sqlline> explain plan for
. . . .> select "the_year","the_month", count(*) as c
. . . .> from "sales_fact_1997"
. . . .> join "time_by_day" using ("time_id")
. . . .> group by "the_year","the_month";
+--------------------------------------------------------------------------------+
| PLAN |
+--------------------------------------------------------------------------------+
| EnumerableCalc(expr#0..2=[{inputs}], the_year=[$t1], the_month=[$t0], C=[$t2]) |
| EnumerableAggregate(group=[{3, 4}], C=[$SUM0($7)]) |
| EnumerableTableScan(table=[[adhoc, m{16, 17, 27, 31, 32, 36, 37}]]) |
+--------------------------------------------------------------------------------+

查询给出了正确的答案,但是不读取 sales_fact_1997time_by_day 表,而是从名为 m{16, 17, 27, 31, 32, 36, 37} 的表中读取。这是在连接开始时创建的 Tile 块之一。

这是一个真实的表,可以直接查询。它只有 120 行,因此是返回查询的更有效方法:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
sqlline> !describe "adhoc"."m{16, 17, 27, 31, 32, 36, 37}"
+-------------+-------------------------------+--------------------+-----------+-----------------+
| TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | DATA_TYPE | TYPE_NAME |
+-------------+-------------------------------+--------------------+-----------+-----------------+
| adhoc | m{16, 17, 27, 31, 32, 36, 37} | recyclable_package | 16 | BOOLEAN |
| adhoc | m{16, 17, 27, 31, 32, 36, 37} | low_fat | 16 | BOOLEAN |
| adhoc | m{16, 17, 27, 31, 32, 36, 37} | product_family | 12 | VARCHAR(30) |
| adhoc | m{16, 17, 27, 31, 32, 36, 37} | the_month | 12 | VARCHAR(30) |
| adhoc | m{16, 17, 27, 31, 32, 36, 37} | the_year | 5 | SMALLINT |
| adhoc | m{16, 17, 27, 31, 32, 36, 37} | quarter | 12 | VARCHAR(30) |
| adhoc | m{16, 17, 27, 31, 32, 36, 37} | fiscal_period | 12 | VARCHAR(30) |
| adhoc | m{16, 17, 27, 31, 32, 36, 37} | m0 | -5 | BIGINT NOT NULL |
+-------------+-------------------------------+--------------------+-----------+-----------------+

sqlline> select count(*) as c
. . . .> from "adhoc"."m{16, 17, 27, 31, 32, 36, 37}";
+-----+
| C |
+-----+
| 120 |
+-----+
1 row selected (0.12 seconds)

列出表格,看到更多的 Tile 块。还有模式表 foodmart、系统表 TABLESCOLUMNS,以及 Lattice 格本身,它显示的名称为 star

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
sqlline> !tables
+-------------+-------------------------------+--------------+
| TABLE_SCHEM | TABLE_NAME | TABLE_TYPE |
+-------------+-------------------------------+--------------+
| adhoc | m{16, 17, 18, 32, 37} | TABLE |
| adhoc | m{16, 17, 19, 27, 32, 36, 37} | TABLE |
| adhoc | m{4, 7, 16, 27, 32, 37} | TABLE |
| adhoc | m{4, 7, 17, 27, 32, 37} | TABLE |
| adhoc | m{7, 16, 17, 19, 32, 37} | TABLE |
| adhoc | m{7, 16, 17, 27, 30, 32, 37} | TABLE |
| adhoc | star | STAR |
| foodmart | customer | TABLE |
| foodmart | product | TABLE |
| foodmart | product_class | TABLE |
| foodmart | promotion | TABLE |
| foodmart | region | TABLE |
| foodmart | sales_fact_1997 | TABLE |
| foodmart | store | TABLE |
| foodmart | time_by_day | TABLE |
| metadata | COLUMNS | SYSTEM_TABLE |
| metadata | TABLES | SYSTEM_TABLE |
+-------------+-------------------------------+--------------+

三、实现

Lattice 格是用于创建和填充物化视图以及识别物化视图可用于解决特定查询的框架。Calcite 针对物化视图对 Lattice 进行扩展,根据用户定义的关联和聚合要求,划分出多个物化视图来适应不同类别的查询,支持自动划分物化视图。

  • 可声明主键和外键约束
  • 帮助查询优化器将用户查询映射到过滤器连接聚合物化视图
  • 为 Calcite 提供了一个框架,用于收集有关数据量和用户查询的统计信息;
  • 允许 Calcite 自动设计和填充物化视图。

3.1. Lattice 生成

3.1.1. 自动生成 LatticeSuggester

如果你定义了一个 Lattice 格,Calcite 将在该 Lattice 格内进行自调整。但是如果没有输入格子建议器,它根据传入的查询构建 Lattice 格,并创建一个具有以下架构的模型 "autoLattice": true

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
{
"version": "1.0",
"defaultSchema": "foodmart",
"schemas": [ {
"type": "jdbc",
"name": "foodmart",
"jdbcUser": "FOODMART",
"jdbcPassword": "FOODMART",
"jdbcUrl": "jdbc:hsqldb:res:foodmart",
"jdbcSchema": "foodmart"
}, {
"name": "adhoc",
"autoLattice": true
} ]
}

当执行查询时,Calcite 将开始根据这些查询构建 Lattice 格。每个格都基于特定的事实表。当它在该事实表上看到更多查询时,它将演化 Lattice 格,将更多维度表连接到星形,并添加度量。

每个 Lattice 将根据数据和查询进行自我优化。目标是创建合理小的汇总表(tiles),这些表基于更频繁使用的属性和度量。

此功能仍处于实验阶段,但有潜力使数据库比以前更具“自我调优”能力。

3.2. Tile 生成

Tile 是 Lattice 中的一个逻辑的物化视图,它可以通过三种方法来实体化

  1. 在 Lattice 中声明
  2. 通过推荐算法实现
  3. 在响应查询时创建

3.2.1. Lattice 中声明

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
{
"version": "1.0",
"defaultSchema": "foodmart",
"schemas": [ {
"type": "jdbc",
"name": "foodmart",
"jdbcUser": "FOODMART",
"jdbcPassword": "FOODMART",
"jdbcUrl": "jdbc:hsqldb:res:foodmart",
"jdbcSchema": "foodmart"
},
{
"name": "adhoc",
"lattices": [ {
...
} ],
"tiles": [ {
"dimensions": [ "the_year", ["t", "quarter"] ],
"measures": [ {
"agg": "sum",
"args": "unit_sales"
}, {
"agg": "sum",
"args": "store_sales"
}, {
"agg": "count"
} ]
} ]
} ]
} ]
}

3.2.2. 推荐算法实现 TileSuggester

  1. 统计信息 SqlStatistic

    选择要具体化 Lattice 格的哪些 Tile 块的算法取决于大量统计信息。它需要知道它正在考虑实现的每个列组合 (a, b, c) 的 select count(distinct a, b, c) from star 。因此,该算法在具有许多行和列的模式上需要很长时间。

3.2.3. 响应查询时创建