Efficient way to select a large number of rows and scale the selection chronologically?

Time: 2011-12-14 02:11:16

Tags: sql time time-series

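I insert data into a table every 5 minutes; the columns hold a timestamp and the data. I want to select data for a given time range and, for performance and chronological scaling, thin it out so that the query returns at most 32 rows.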

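For example, I have 2 weeks of data, or 4032 records spaced 5 minutes apart. I want to select from start to end and reduce the result set to 32 records, scaled chronologically so that the 32 entries are as close to equidistant as possible, while keeping the edge records (the first and last records in the set) intact.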

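I already have code that fetches the full set and iterates over it with a calculated skip interval, dropping records as needed and performing the edge checks. I would like to know whether there is a faster way to do this in the query instead of in server code. I am using MySQL, but I will also accept MsSQL answers.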
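For reference, the server-side thinning described above can be sketched roughly as follows. This is only an illustration: the function name `downsample` and its `max_results` parameter are hypothetical, and Python stands in for whatever server language is actually used.

```python
def downsample(rows, max_results=32):
    """Keep at most max_results rows, evenly spaced, always keeping both ends.

    Hypothetical helper: `rows` is any sequence already sorted by time.
    """
    if max_results < 2:
        raise ValueError("max_results must be at least 2 to keep both edges")
    n = len(rows)
    if n <= max_results:
        return list(rows)  # nothing to drop
    last = n - 1
    steps = max_results - 1
    # Round i * last / steps to the nearest index using integer arithmetic.
    # i = 0 maps to index 0 and i = steps maps to index `last`.
    picks = [(i * last + steps // 2) // steps for i in range(max_results)]
    return [rows[p] for p in picks]


two_weeks = list(range(4032))  # 4032 five-minute samples
sampled = downsample(two_weeks)
print(len(sampled), sampled[0], sampled[-1])
```

With 4032 rows and 32 results, the kept rows end up 130 or 131 positions apart.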

Thanks.

3 answers:

Answer 0: (score: 0)

Something along these lines, where the inputs are the two dates, the 5-minute interval, and the 32 samples wanted from the range:

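-- Keep every @sampleRate-th row, starting with the first one.
-- CEIL keeps the rate a whole number when the range does not divide evenly.
-- Note: clientpc is read without an ORDER BY, so chronological order is not guaranteed.
SELECT rownum
  FROM (SELECT @row := @row + 1 AS rownum
              ,@sampleRate AS sampleRate
          FROM (SELECT @row := 0
                      ,@sampleRate := CEIL(TIMESTAMPDIFF(MINUTE,'2011-12-01 00:00:00','2011-12-15 00:00:00') / 5 / 32) ) r
              ,clientpc
         ) ranked
 WHERE (rownum - 1) % @sampleRate = 0;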
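The arithmetic behind this query can be checked outside the database. The sketch below is an assumption-laden illustration, not part of the answer: `modulo_sample` is a hypothetical name, and the filter `(r - 1) % rate == 0` is equivalent to `rownum % rate = 1` whenever the rate is greater than 1.

```python
from datetime import datetime


def modulo_sample(start, end, interval_minutes=5, samples=32):
    """Return (rate, kept_row_numbers) for a modulo-based sample.

    Hypothetical helper mirroring TIMESTAMPDIFF(MINUTE, start, end) / 5 / 32.
    """
    minutes = int((end - start).total_seconds() // 60)
    expected_rows = minutes // interval_minutes
    # Ceiling division, so the rate is a whole number and at least 1.
    rate = max(1, -(-expected_rows // samples))
    kept = [r for r in range(1, expected_rows + 1) if (r - 1) % rate == 0]
    return rate, kept


rate, kept = modulo_sample(datetime(2011, 12, 1), datetime(2011, 12, 15))
print(rate, len(kept), kept[0], kept[-1])
```

For the two-week range this gives a rate of 126 and keeps rows 1, 127, ..., 3907. The last row (4032) is not among them, so this approach alone does not preserve the closing edge record the question asks for.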

Answer 1: (score: 0)

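OK, don't say I didn't warn you (see the comments section above). This is written for MSSQL; I am not familiar with MySQL, so I tried to keep the super-proprietary stuff to a minimum. It could probably all be done in one big ugly query, but that would be harder to understand, so I broke it down into steps.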

First, set up some variables:

DECLARE
  @Items  real  = 32   -- How many items you wish to display
 ,@From   int = 16000  --  Low range delimiter on your target data set
 ,@Thru   int = 17500  --  High range delimiter on your target data set
 ,@Total  real         --  Used to store how many items are actually in the target range

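Brief testing shows that this fails if @Items is less than 2 (the formula divides by @Items - 1) or greater than some large multiple of @Total. Error handling or input checks are needed. I used the real data type so that division produces decimal values rather than truncated integers; be sure to set these variables to whole-number values, otherwise I don't know what will happen.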

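The next bit creates a "Tally" table, or "table of numbers". It is just a one-column table of ascending integers, starting at 1 and going up to your upper limit. Here I capped it at 256, since 32 seems to be your maximum. (This particular code is very obtuse, but it can generate millions of rows in a disturbingly short time, so I cut and paste it whenever I need something like this.)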

CREATE TABLE #Tally (Num  int  not null)

--  "Table of numbers" data generator, as per Itzik Ben-Gan (from multiple sources)
--  Modified to generate 1 through 256
;WITH
  L0 AS (SELECT 1 AS C UNION ALL SELECT 1), --2 rows
  L1 AS (SELECT 1 AS C FROM L0 AS A, L0 AS B),--4 rows
  L2 AS (SELECT 1 AS C FROM L1 AS A, L1 AS B),--16 rows
  L3 AS (SELECT 1 AS C FROM L2 AS A, L2 AS B),--256 rows
  num AS (SELECT ROW_NUMBER() OVER(ORDER BY C) AS N FROM L3)
 insert #Tally (Num)
  select N FROM num
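The row counts in the comments come from repeatedly cross-joining a set with itself, which squares its size at each level. A small sketch of that idea (the function name `tally` and its `levels` parameter are hypothetical; Python is used only for illustration):

```python
from itertools import product


def tally(levels=3):
    """Build 1..N by repeated self cross joins: 2 -> 4 -> 16 -> 256 rows."""
    rows = [1, 1]  # L0: two rows
    for _ in range(levels):
        # Cross join the current level with itself; the row count is squared.
        rows = [1 for _ in product(rows, rows)]
    # ROW_NUMBER() equivalent: number the rows starting from 1.
    return list(range(1, len(rows) + 1))


print(len(tally()), tally()[:5])
```

Adding one more level would give 65536 rows, which is why the pattern scales to millions of rows with only a few more CTEs.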

Get the number of rows in the target data set:

SELECT @Total = count(*)
 from Time
 where TimeId between @From and @Thru

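A query for review only. It lists the target range in order, together with each row's ranking (its position in the set: 1, 2, 3, 4, and so on). This takes care of duplicate values. (My testing was based on our generic "Time" table, which looks like most time dimension tables in any data warehouse.)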

SELECT
   row_number() over (order by TimeId) Ranking
  ,TimeId
 from Time
 where TimeId between @From and @Thru

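Another review query. This returns the set of numbers that identify the "breakpoints" of the final set. For example, if you have 30 items and want 7, the values that fall inside the range are {5, 10, 15, 20, 25, 30}; combined with 1, those are the seven you want (if I have understood the problem correctly). The query also returns 0 and values above @Total, which simply match no ranking.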

SELECT distinct ceiling((Num - 1) * @Total / (@Items - 1)) from #Tally
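To see which rankings this formula ends up selecting, here is a minimal sketch of the same calculation. The function name `breakpoints` is hypothetical, and the sketch already applies two assumptions taken from the rest of this answer: values outside 1..total are discarded, and ranking 1 is always added.

```python
import math


def breakpoints(total, items, tally_size=256):
    """Rankings kept by ceiling((Num - 1) * total / (items - 1)), plus ranking 1."""
    if items < 2:
        raise ValueError("items must be at least 2 (the formula divides by items - 1)")
    values = {
        math.ceil((num - 1) * total / (items - 1))
        for num in range(1, tally_size + 1)
    }
    values.add(1)  # the first row is always kept
    # Only rankings that actually exist in the target range can match.
    return sorted(v for v in values if 1 <= v <= total)


print(breakpoints(30, 7))
print(len(breakpoints(4032, 32)))
```

For 30 rows and 7 items this yields 1, 5, 10, 15, 20, 25, 30. For 4032 rows and 32 items it yields 32 rankings, the first being 1 and the last 4032, so both edges are kept.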

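This is the workhorse, which incorporates the two queries above. Basically, take the rows from the first query whose ranking/position equals one of the "breakpoints" identified by the second. I used an OR for the first item, because that was simpler than trying to fudge it mathematically.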

SELECT xx.Ranking, xx.TimeId
 from (select
          row_number() over (order by TimeId) Ranking
         ,TimeId
        from Time
        where TimeId between @From and @Thru) xx
 where Ranking in (select distinct ceiling((Num - 1) * @Total / (@Items - 1)) from #Tally)
  or Ranking = 1

As I said, it is overly complex and may well break on some inputs, but it should run faster than a procedural alternative.

Answer 2: (score: 0)

I came up with this procedure; any cleanup is appreciated. After some blind debugging, it works the way I want. Times are stored as UTC timestamps.

DELIMITER $$

CREATE PROCEDURE `SelectChronoRange`(IN timeBegin BIGINT,
    IN timeEnd BIGINT)
BEGIN
    DECLARE totalAvail, skip, insideResultMax INT;
    SET @maxResults = 64;

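    -- Count only the rows inside the requested range, so that skip matches the result set.
    SELECT count(*)
    INTO totalAvail
    FROM `dediwatcherstats`
    WHERE (timeBegin IS NULL OR `Time` >= timeBegin)
      AND (timeEnd IS NULL OR `Time` <= timeEnd);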

    SET insideResultMax:= @maxResults - 2;
    SET skip := CEIL(totalAvail / insideResultMax);
    SET @firstpid = 0;
    SET @lastpid = 0;

    SELECT `pid` INTO @firstpid
    FROM `dediwatcherstats`
    WHERE
        CASE
            WHEN timeBegin IS NOT NULL AND timeEnd IS NOT NULL THEN
                `Time`>=timeBegin AND `Time`<=timeEnd
            WHEN timeEnd IS NOT NULL THEN
                `Time`<=timeEnd
            WHEN timeBegin IS NOT NULL THEN
                `Time`>=timeBegin
            ELSE
                TRUE
        END
    ORDER BY `Time` ASC, `pid` ASC LIMIT 1;

    SELECT `pid` INTO @lastpid
    FROM `dediwatcherstats`
    WHERE
        CASE
            WHEN timeBegin IS NOT NULL AND timeEnd IS NOT NULL THEN
                `Time`>=timeBegin AND `Time`<=timeEnd
            WHEN timeEnd IS NOT NULL THEN
                `Time`<=timeEnd
            WHEN timeBegin IS NOT NULL THEN
                `Time`>=timeBegin
            ELSE
                TRUE
        END
    ORDER BY `Time` DESC, `pid` DESC LIMIT 1;

    SELECT * FROM
    (
        (
            SELECT * FROM `dediwatcherstats`
            WHERE `pid`=@firstpid
        )
    UNION
        (
        SELECT * FROM `dediwatcherstats`
        WHERE
            CASE
                WHEN timeBegin IS NOT NULL AND timeEnd IS NOT NULL THEN
                    `Time`>=timeBegin AND `Time`<=timeEnd
                WHEN timeEnd IS NOT NULL THEN
                    `Time`<=timeEnd
                WHEN timeBegin IS NOT NULL THEN
                    `Time`>=timeBegin
                ELSE
                    TRUE
            END
            AND `pid` % skip=0
        LIMIT 62
        )
    ) AS notused
    UNION
        SELECT * FROM `dediwatcherstats`
        WHERE `pid`=@lastpid;
END$$

DELIMITER ;

It works with this simple table:

CREATE TABLE `dediwatcherstats` (
  `pid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `Time` bigint(20) unsigned NOT NULL,
  `Data` text,
  PRIMARY KEY (`pid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

I wish the LIMIT clause allowed parameter variables. In the code I posted I used a limit of 64 instead of 32, for anyone who might want to use it.
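The selection logic of the procedure can be sketched outside MySQL to see how many rows it returns. This is an illustration under stated assumptions, not the procedure itself: `select_chrono_range` is a hypothetical name, rows are modeled as `(pid, time)` tuples, and the row count is taken over the requested range only.

```python
def select_chrono_range(rows, time_begin=None, time_end=None, max_results=64):
    """Emulate: first row UNION rows with pid % skip = 0 (limited) UNION last row."""
    if max_results < 3:
        raise ValueError("max_results must leave room for at least one inner row")
    in_range = [
        r for r in rows
        if (time_begin is None or r[1] >= time_begin)
        and (time_end is None or r[1] <= time_end)
    ]
    if not in_range:
        return []
    inside_max = max_results - 2                     # slots left after both edges
    skip = max(1, -(-len(in_range) // inside_max))   # CEIL(total / inside_max)
    ordered = sorted(in_range, key=lambda r: (r[1], r[0]))
    first, last = ordered[0], ordered[-1]
    middle = [r for r in ordered if r[0] % skip == 0][:inside_max]
    # UNION removes duplicates, so a set models the combined result.
    return sorted({first, *middle, last}, key=lambda r: (r[1], r[0]))


rows = [(pid, 1322697600 + 300 * pid) for pid in range(1, 4033)]
result = select_chrono_range(rows)
print(len(result), result[0][0], result[1][0], result[-1][0])
```

With 4032 contiguous pids and a maximum of 64, skip comes out as 66 and the result has 63 rows: pid 1, the multiples of 66 up to 4026, and pid 4032. Because the filter works on `pid` values rather than on positions, gaps in `pid` (for example after deletes) would make the spacing uneven.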