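This one has me stumped. I have a dimension table with about 30 million rows. It is a clustered columnstore. The table also has a primary key constraint on its surrogate key, which is of type INT.
The query that retrieves the MIN() of the surrogate key for a given date range looks like this: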
SELECT
MIN(DIM.OrderId)
FROM
dbo.Dim_Order AS DIM
WHERE
DIM.OrderDate >= CAST('2016-06-01' AS DATE)
AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1);
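Here is the output:
Table 'Dim_Order'. Scan count 2, logical reads 833, physical reads 0, read-ahead reads 0, lob logical reads 1702561, lob physical reads 0, lob read-ahead reads 0.
Table 'Dim_Order'. Segment reads 304001, segment skipped 0.
(1 row affected)
SQL Server Execution Times: CPU time = 2829 ms, elapsed time = 2876 ms.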
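Instead of using the columnstore, the optimizer chooses the nonclustered primary key and performs key lookups through a nested loops join. Worse, it grossly underestimates the number of rows returned.
Oddly, the row estimate appears to be inversely proportional to the size of the date range.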
╔════════════╦══════════════════════════╗
║ Date Range ║ Estimated Number of Rows ║
╠════════════╬══════════════════════════╣
║ 1 year ║ 2.00311 ║
║ 6 months ║ 3.41584 ║
║ 1 month ║ 24.4459 ║
║ 2 weeks ║ 52.093 ║
║ 1 week ║ 99.9055 ║
║ 3 days ║ 217.632 ║
║ 1 day ║ 1088.16 ║
╚════════════╩══════════════════════════╝
This version, with an INDEX hint, runs almost instantly:
SELECT
MIN(DIM.OrderId)
FROM
dbo.Dim_Order AS DIM WITH(INDEX=CCI_Dim_Order)
WHERE
DIM.OrderDate >= CAST('2016-06-01' AS DATE)
AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1);
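Table 'Dim_Order'. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 1004, lob physical reads 0, lob read-ahead reads 0.
Table 'Dim_Order'. Segment reads 2, segment skipped 0.
(1 row affected)
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 1 ms.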
I have observed this behavior on the following versions:
Microsoft SQL Server 2016 (RTM) - 13.0.1601.5 (X64)
Microsoft SQL Server 2016 (SP1-CU5) (KB4040714) - 13.0.4451.0 (X64)
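The repro script below creates a sample table and populates it with 2 years of orders for 2,000 customers, one order per customer per day. That gives 1,462,000 sample orders spanning 24 months, roughly 60,000 rows per month. The sample queries at the bottom of the script demonstrate the behavior. As you will see, the row estimate is very low for some reason, and the optimizer refuses to use the clustered columnstore unless hinted.
I would appreciate any input or suggestions. Here is the sample script.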
DROP TABLE IF EXISTS dbo.Dim_Order
CREATE TABLE dbo.Dim_Order
(
OrderId INT NOT NULL
, CustomerId INT NOT NULL
, OrderDate DATE NOT NULL
, OrderTotal decimal(5,2) NOT NULL
);
WITH CTE_DATE AS
(
SELECT CAST('2016-01-01' AS DATE) AS DateValue
UNION ALL
SELECT
DATEADD(DAY, 1, D.DateValue)
FROM
CTE_DATE AS D
WHERE
D.DateValue < CAST('2017-12-31' AS DATE)
),
CTE_CUSTOMER AS
(
SELECT 1 AS CustomerId
UNION ALL
SELECT
CustomerId + 1
FROM
CTE_CUSTOMER AS D
WHERE
D.CustomerId < 2000
)
, CTE_FINAL
AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY DateValue ASC, CustomerId ASC) AS OrderId
, CustomerId
, DateValue AS OrderDate
, CAST(ROUND(RAND(CHECKSUM(NEWID()))*(100-1)+1, 2) AS DECIMAL(5,2)) AS OrderTotal
FROM
CTE_DATE
CROSS JOIN CTE_CUSTOMER
)
INSERT INTO
dbo.Dim_Order
(
OrderId
, CustomerId
, OrderDate
, OrderTotal
)
SELECT
ORD.OrderId
, ORD.CustomerId
, ORD.OrderDate
, ORD.OrderTotal
FROM
CTE_FINAL AS ORD
OPTION (MAXRECURSION 32767);
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Dim_Order ON dbo.Dim_Order;
ALTER INDEX CCI_Dim_Order ON dbo.Dim_Order
REORGANIZE
WITH (COMPRESS_ALL_ROW_GROUPS = ON)
ALTER TABLE dbo.Dim_Order
ADD CONSTRAINT PK_Dim_Order PRIMARY KEY NONCLUSTERED (OrderId ASC);
RETURN;
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT
MIN(DIM.OrderId)
FROM
dbo.Dim_Order AS DIM
WHERE
DIM.OrderDate >= CAST('2016-06-01' AS DATE)
AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1);
SELECT
MIN(DIM.OrderId)
FROM
dbo.Dim_Order AS DIM WITH(INDEX=CCI_Dim_Order)
WHERE
DIM.OrderDate >= CAST('2016-06-01' AS DATE)
AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1);
Answer 0 (score: 2)
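This is a classic row goal cardinality estimation problem. You can add USE HINT ('DISABLE_OPTIMIZER_ROWGOAL') to disable the row goal, and you should find that the clustered columnstore is then costed lower and gets chosen.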
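As a sketch of what that looks like, here is the question's query with the hint added. USE HINT is only available from SQL Server 2016 SP1 onward, so it will not parse on the RTM build listed in the question. The second statement uses trace flag 4138 as an alternative; treat that as an assumption to verify on your build, and note that QUERYTRACEON normally requires elevated permissions.

```sql
-- SQL Server 2016 SP1 and later: disable the row goal for this query only.
SELECT
    MIN(DIM.OrderId)
FROM
    dbo.Dim_Order AS DIM
WHERE
    DIM.OrderDate >= CAST('2016-06-01' AS DATE)
    AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1, USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));

-- Assumed alternative for builds without USE HINT (for example 2016 RTM):
-- trace flag 4138 also turns off row goal adjustments.
SELECT
    MIN(DIM.OrderId)
FROM
    dbo.Dim_Order AS DIM
WHERE
    DIM.OrderDate >= CAST('2016-06-01' AS DATE)
    AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1, QUERYTRACEON 4138);
```

Compare the estimated plans with and without the hint to confirm that the columnstore scan is picked without needing the INDEX hint.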
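The plan has an ordered scan on PK_Dim_Order. Because it processes rows in OrderId order and is looking for MIN(DIM.OrderId), it can stop as soon as it finds the first row that matches the predicate on OrderDate. It assumes that the 60,000 rows matching the month predicate are spread evenly throughout the index. In fact they all sit in one contiguous range of Ids, 304001 to 364000.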
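To check the contiguous range yourself, a query along these lines can be run against the repro table. The column aliases are made up for illustration; the expected values in the comments follow from the script's data (2,000 orders per day starting 2016-01-01, with 152 days before June 1).

```sql
-- If the matching rows form one contiguous block of Ids, then
-- LastMatchingId - FirstMatchingId + 1 equals MatchingRows.
SELECT
    MIN(DIM.OrderId) AS FirstMatchingId   -- 304001 on the repro data
  , MAX(DIM.OrderId) AS LastMatchingId    -- 364000 on the repro data
  , COUNT(*)         AS MatchingRows      -- 60000 on the repro data
FROM
    dbo.Dim_Order AS DIM
WHERE
    DIM.OrderDate >= CAST('2016-06-01' AS DATE)
    AND DIM.OrderDate < CAST('2016-07-01' AS DATE);
```

So an ordered scan of PK_Dim_Order has to read and look up more than 300,000 rows before it reaches the first June order, rather than the handful the optimizer expects.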
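This assumption that the two columns are uncorrelated is also why the estimated row count falls as the date range gets larger. If you double the number of rows matching the date predicate, and they really were spread evenly through the index, the scan would only need to read half as many rows before hitting one that matches and stopping.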
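The arithmetic behind that can be sketched as total rows divided by matching rows. This is a simplification for illustration, not the optimizer's actual formula, and the variable and column names are hypothetical.

```sql
-- Under a uniform-spread assumption, an ordered scan expects to read
-- about (total rows / matching rows) rows before its first match.
DECLARE @TotalRows DECIMAL(18, 4) = 1462000;  -- rows in the repro table

SELECT
    R.DateRange
  , R.MatchingRows
  , CAST(@TotalRows / R.MatchingRows AS DECIMAL(18, 2)) AS ApproxRowsBeforeFirstMatch
FROM
    (VALUES
        ('1 month',  60000)    -- about 24.37 rows
      , ('2 months', 120000)   -- about 12.18 rows
    ) AS R (DateRange, MatchingRows);
```

The one-month figure is in the same ballpark as the 24.4459 estimate in the question's table, assuming that estimate was taken against the repro table. The optimizer's numbers come from its histogram, so they will not match this back-of-the-envelope ratio exactly.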