Question

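I have tables of data samples, each with a timestamp and some data. Each table has a clustered index on the timestamp, followed by a data-specific key. The samples are not necessarily equidistant.

I need to downsample the data in a given time range in order to draw graphs, for example going from 100,000 rows to N, where N is about 50. I may have to compromise on the "correctness" of the algorithm from a DSP point of view, but I would like to keep this in SQL for performance reasons.

My current idea is to group the samples in the time range into N buckets and then take the average of each group. One way to do this in SQL is to apply a partition function to the date that maps it onto the range 0 to N-1 (inclusive), and then GROUP BY and AVG.

I think this GROUP BY could be performed without a sort, because the dates come from a clustered index and the partition function is monotonic. However, SQL Server does not seem to notice this, and it issues a sort that accounts for 78% of the execution cost (in the example below). Assuming I am right and the sort is unnecessary, skipping it could make the query roughly 5 times faster.

Is there any way to force SQL Server to skip the sort? Or is there a better way to approach the problem?

Cheers, Ben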

IF EXISTS(SELECT name FROM sysobjects WHERE name = N'test') DROP TABLE test

CREATE TABLE test
(
  date DATETIME NOT NULL,
  v FLOAT NOT NULL,
  CONSTRAINT PK_test PRIMARY KEY CLUSTERED (date ASC, v ASC)
)

INSERT INTO test (date, v) VALUES ('2009-08-22 14:06:00.000', 1)
INSERT INTO test (date, v) VALUES ('2009-08-22 17:09:00.000', 8)
INSERT INTO test (date, v) VALUES ('2009-08-24 00:00:00.000', 2)
INSERT INTO test (date, v) VALUES ('2009-08-24 03:00:00.000', 9)
INSERT INTO test (date, v) VALUES ('2009-08-24 14:06:00.000', 7)

-- the lower bound is set to the table min for demo purposes; in reality
-- it could be any date
declare @min float
set @min = cast((select min(date) from test) as float)

-- similarly for max
declare @max float
set @max = cast((select max(date) from test) as float)

-- the number of results to return (assuming enough data is available)
declare @count int
set @count = 3

-- precompute scale factor
declare @scale float
set @scale =  (@count - 1) / (@max - @min)
select @scale

-- this scales the dates from 0 to n-1
select (cast(date as float) - @min) * @scale, v from test

-- this rounds the scaled dates to the nearest partition,
-- groups by the partition, and then averages values in each partition
select round((cast(date as float) - @min) * @scale, 0), avg(v) from test
group by round((cast(date as float) - @min) * @scale, 0)

Answer 1

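There is really no way for SQL Server to know that the clustered key on date can be used to guarantee the order of an expression like round(cast(.. as float)). That alone would throw it off track. Add in the (... - @min) * @scale and you have a perfect mess. If you need to sort and group by such expressions, store them in persisted computed columns and index those columns. You probably want to use DATEPART, though, because going through an imprecise type like float is likely to make the expression unusable in a persisted computed column.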
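A minimal sketch of the persisted-computed-column idea, under some loud assumptions: the column name hour_bucket and the index name IX_test_hour_bucket are made up for illustration, the buckets are fixed one-hour slots rather than the variable @count buckets from the question, and DATEDIFF is used instead of the DATEPART mentioned above because it yields a single integer per bucket. The GO lines are batch separators for SSMS or sqlcmd, so that each statement sees the new column.

```sql
-- Hypothetical names; assumes fixed one-hour buckets.
-- DATEDIFF counts whole hour boundaries since day 0 (1900-01-01) and stays
-- in integer arithmetic, so no imprecise float type is involved.
ALTER TABLE test
  ADD hour_bucket AS DATEDIFF(hour, 0, [date]) PERSISTED
GO

-- Index the bucket and carry v along so the aggregate can be answered
-- from this index alone.
CREATE INDEX IX_test_hour_bucket ON test (hour_bucket) INCLUDE (v)
GO

SELECT hour_bucket, AVG(v) AS avg_v
FROM test
GROUP BY hour_bucket
ORDER BY hour_bucket
```

Because the index is already ordered by hour_bucket, the optimizer has a chance to aggregate without a separate sort, but that is worth confirming in the actual execution plan. The trade-off is that the bucket width is fixed when the column is defined.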

Update:

On the subject of date and float being equivalent:

declare @f float, @d datetime;
select @d = cast(1 as datetime);
select @f = cast(1 as float);
select cast(@d as varbinary(8)), cast(@f as varbinary(8)), @d, cast(@d as float)

This produces:

0x0000000100000000  0x3FF0000000000000  1900-01-02 00:00:00.000 1

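So you can see that although both are stored in 8 bytes (at least for float(25...53)), the internal representation of datetime is not a float whose integer part is the day and whose fractional part is the time (as is often assumed).

To give another example: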

declare @d datetime;
select @d = '1900-01-02 12:00 PM';
select cast(@d as varbinary(8)), cast(@d as float)

0x0000000100C5C100  1.5

Again, the result of casting @d to float is 1.5, but the internal datetime representation 0x0000000100C5C100, read as an IEEE double, would be 2.1284E-314, not 1.5.
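To make the byte layout concrete, here is a small sketch that splits the 8 bytes of the same datetime value into two 4-byte integers. It relies on datetime being stored as a day count since 1900-01-01 followed by a count of 1/300-second ticks since midnight; the variable and column alias names are only illustrative.

```sql
DECLARE @d DATETIME
SET @d = '1900-01-02T12:00:00'

DECLARE @b BINARY(8)
SET @b = CAST(@d AS BINARY(8))

SELECT
  CAST(SUBSTRING(@b, 1, 4) AS INT) AS days_since_1900,       -- first 4 bytes
  CAST(SUBSTRING(@b, 5, 4) AS INT) AS ticks_since_midnight,  -- last 4 bytes, in 1/300 s units
  CAST(SUBSTRING(@b, 5, 4) AS INT) / (300.0 * 3600) AS hours_since_midnight
```

For this value the three columns come out as 1, 12960000 and 12: one day after 1900-01-01, and 12 hours x 3600 seconds x 300 ticks after midnight. The 1.5 returned by cast(@d as float) is therefore computed by the conversion, not read directly from the stored bytes.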

Answer 2

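Yes, SQL Server has always had some trouble with this kind of time-partitioning summary SELECT. Analysis Services has a variety of ways to handle it, but the database engine side is more limited.

What I would suggest you try (I cannot try or test anything from here) is to create an auxiliary "partition table" that contains your partition definitions, and then join to it. You will need some matching indexes for this to have a chance of working: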
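A minimal sketch of what such a partition table and join might look like. This is not the answerer's own code: the table name partitions, its columns, and the one-day bucket boundaries are all hypothetical, chosen to match the sample data in the question. Each bucket is a half-open range, start_date inclusive and end_date exclusive.

```sql
-- Hypothetical helper table: one row per bucket, [start_date, end_date)
CREATE TABLE partitions
(
  bucket     INT      NOT NULL,
  start_date DATETIME NOT NULL,
  end_date   DATETIME NOT NULL,
  CONSTRAINT PK_partitions PRIMARY KEY CLUSTERED (start_date)
)

INSERT INTO partitions (bucket, start_date, end_date) VALUES (0, '2009-08-22T00:00:00', '2009-08-23T00:00:00')
INSERT INTO partitions (bucket, start_date, end_date) VALUES (1, '2009-08-23T00:00:00', '2009-08-24T00:00:00')
INSERT INTO partitions (bucket, start_date, end_date) VALUES (2, '2009-08-24T00:00:00', '2009-08-25T00:00:00')

-- Range join: each sample falls into at most one bucket.
-- Buckets with no samples do not appear because this is an inner join.
SELECT p.bucket, AVG(t.v) AS avg_v
FROM partitions AS p
JOIN test AS t
  ON t.[date] >= p.start_date
 AND t.[date] <  p.end_date
GROUP BY p.bucket
ORDER BY p.bucket
```

Both sides of the join are ordered by date through their clustered indexes, which is what gives the optimizer something to work with. Whether the resulting plan actually avoids the sort has to be checked against the real data.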

Answer 3

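Two questions:

How long does this query take?

And are you sure it is sorting the date? Where in the plan does it sort the date? After it partitions? That would be my guess. I doubt it is the first thing it does... maybe the way it partitions or groups means it needs to sort again.

Anyway, even if it did sort an already sorted list, I would not expect that to take very long, because the list is already sorted...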

Avoiding unnecessary sort in SQL Server GROUP BY?
