SQL: a SELECT that eliminates redundant records occurring within X minutes

Asked: 2011-10-05 12:45:32

Tags: sql database firebird

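The database in use is Firebird 2.1. If you are not familiar with it, here is the SQL reference for the SELECT statement:
http://ibexpert.net/ibe/index.php?n=Doc.DataRetrieval
Function reference: http://www.firebirdsql.org/file/documentation/reference_manuals/reference_material/html/langrefupd21.html

I am happy with an answer in any SQL dialect [I will convert it].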

Table schema:

CREATE TABLE EVENT_MASTER (
EVENT_ID                BIGINT NOT NULL,
EVENT_TIME              BIGINT NOT NULL,
DATA_F1                 VARCHAR(40),
DATA_F2                 VARCHAR(40),
PRIMARY KEY (EVENT_ID)
);

The bad news is that EVENT_TIME is stored as the number of seconds elapsed since the epoch.
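One upside of this format: because EVENT_TIME is a plain count of seconds, the distance between two events is an ordinary subtraction, and "X minutes" is simply X*60. A minimal Python sketch of that arithmetic follows. The helper names `to_utc` and `within_window` are hypothetical, and it assumes "the epoch" means the Unix epoch, which the sample values suggest but the question does not state.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds: int) -> datetime:
    """Interpret an EVENT_TIME value as seconds since the Unix epoch (assumption)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def within_window(t1: int, t2: int, minutes: int = 5) -> bool:
    """True when two EVENT_TIME values are at most `minutes` apart."""
    return abs(t1 - t2) <= minutes * 60


print(to_utc(1297824698).isoformat())         # first sample row, EVENT_ID 25327
print(within_window(1297824698, 1297824770))  # EVENT_IDs 25327 and 25328, 72 s apart
```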

Sample data:

"EVENT_ID","EVENT_TIME","DATA_F1","DATA_F2"
25327,1297824698,"8604","A"
25328,1297824770,"8604","I"
25329,1297824773,"8604","A"
25330,1297824793,"8604","A"
25331,1297824809,"8604","1"
25332,1297824811,"8604","GREY"
25333,1297824812,"8604","A"
25334,1297824825,"8604","GREY"
25335,1297824831,"8604","A"
25336,1297824833,"8604","GREY"
25337,1297824838,"8604","A"
25338,1297824840,"8604","1"
25339,1297824850,"8604","A"
25340,1297824864,"8604","A"
25341,1297824875,"8804","GREY" //notice DATA_F1 is different
25342,1297824876,"8604","G"
25343,1297824877,"8604","A"
25344,1297824880,"8604","GREY"
25345,1297824895,"8604","1"
25346,1297824899,"8604","A"
25347,1297824918,"8604","GREY"
25348,1297824930,"8604","YELLOW"
25349,1297824939,"8604","GREY"
25350,1297824940,"8604",""
25351,1297824944,"8604","A"
25352,1297824945,"8604","1"
25353,1297824954,"8604","B"
25354,1297824958,"8604",""
25355,1297824964,"8604","1"
25356,1297824966,"8604","GREY"
25357,1297824974,"8604","1"
25358,1297824981,"8604","GREY"
25359,1297824983,"8604",""
25360,1297824998,"8604","GREY"
25361,1297825003,"8604","2"
25362,1297825009,"8604","G"
25363,1297825018,"8604","GREY"
25364,1297825026,"8604","F"
25365,1297825045,"8604","GREY"
25366,1297825046,"8604","1"

Expected output:
Rows that are distinct on DATA_F1 and DATA_F2 within X minutes, judged by EVENT_TIME. For example:

25341,1297824875,"8804","GREY"
25327,1297824698,"8604","A"
25328,1297824770,"8604","I"
25332,1297824811,"8604","GREY"
25348,1297824930,"8604","YELLOW"
..etc

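Requirement: a SELECT that eliminates redundant records occurring within 5 minutes of each other [measured as a range over the EVENT_TIME column].

So far I have been trying to follow this pattern: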

SELECT * FROM EVENT_MASTER inner join (
SELECT distinct  DATA_F1, DATA_F2 FROM EVENT_MASTER where /*the hard stuff that i need help with: (EVENT_TIME difference within X minutes)*/
) as RemovedDup ON /*EVENT_MASTER.EVENT_ID = problem is i cant select RemovedDup ID otherwise distinct becomes useless!!*/

Please help as soon as possible.

Thanks,

Edit

Output of Andrei K's answer, added for reference:

25331,1297824809,"8604","1"
25327,1297824698,"8604","A"
25342,1297824876,"8604","G"
25332,1297824811,"8604","GREY"
25328,1297824770,"8604","I"
25341,1297824875,"8804","GREY"
25350,1297824940,"8604",""
25352,1297824945,"8604","1" /*bug: time still within 300 seconds, this same as first record*/
25361,1297825003,"8604","2"
25351,1297824944,"8604","A"
25353,1297824954,"8604","B"
25364,1297825026,"8604","F"
25362,1297825009,"8604","G"
25347,1297824918,"8604","GREY"
25372,1297825087,"8604","ORANGE"
25348,1297824930,"8604","YELLOW"
25382,1297825216,"8604","1"
25387,1297825270,"8604","B"
25394,1297825355,"8604","BLUE"
25381,1297825211,"8604","GREY"

Edit 2: Output of Russell's query. The output is wrong and the query is very slow.

1297824698,"8604","A"
1297824770,"8604","I"
1297824809,"8604","1"
1297824811,"8604","GREY"
1297824825,"8604","GREY"
1297824840,"8604","1"
1297824875,"8804","GREY"
1297824876,"8604","G"
1297824880,"8604","GREY"
1297824918,"8604","GREY"
1297824930,"8604","YELLOW"
1297824939,"8604","GREY"
1297824940,"8604",""
1297824945,"8604","1"
1297824954,"8604","B"
1297824964,"8604","1"
1297824998,"8604","GREY"
1297825003,"8604","2"
1297825018,"8604","GREY"
1297825026,"8604","F"
1297825045,"8604","GREY"
1297825046,"8604","1"
1297825063,"8604","1"
1297825079,"8604","GREY"
1297825087,"8604","ORANGE"
1297825094,"8604","GREY"
1297825100,"8604","1"
1297825133,"8604","GREY"
1297825176,"8604","GREY"
1297825216,"8604","1"

Edit 3:

As Russell requested, here are all rows WHERE DATA_F1 = '8604' AND DATA_F2 = 'GREY':

25332,1297824811,"8604","GREY"
25334,1297824825,"8604","GREY"
25336,1297824833,"8604","GREY"
25344,1297824880,"8604","GREY"
25347,1297824918,"8604","GREY"
25349,1297824939,"8604","GREY"
25356,1297824966,"8604","GREY"
25358,1297824981,"8604","GREY"
25360,1297824998,"8604","GREY"
25363,1297825018,"8604","GREY"
25365,1297825045,"8604","GREY"
25367,1297825059,"8604","GREY"
25371,1297825079,"8604","GREY"
25373,1297825094,"8604","GREY"
25376,1297825116,"8604","GREY"
25378,1297825133,"8604","GREY"
25380,1297825176,"8604","GREY"
25381,1297825211,"8604","GREY"
25384,1297825234,"8604","GREY"
25389,1297825286,"8604","GREY"
25390,1297825314,"8604","GREY"
25391,1297825323,"8604","GREY"
25393,1297825343,"8604","GREY"
25396,1297825370,"8604","GREY"
25397,1297825387,"8604","GREY"
25399,1297825416,"8604","GREY"
25401,1297825436,"8604","GREY"
25402,1297825445,"8604","GREY"
25404,1297825454,"8604","GREY"
50282,1299137344,"8604","GREY"
380151,1309849420,"8604","GREY"

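As of now [2011-10-11, 5 AM GMT] no completely correct answer has been posted, and Andrei K's is still the best attempt. So, SQL experts, please help me find a solution, otherwise I will start to think that SQL cannot handle this requirement. Can it?

Note: event_time is not unique, so multiple events can occur within the same second.

7 Answers:

Answer 0: (score: 4)

If redundant rows are rows registered within the same 5 minutes that have the same data_f1 and data_f2, try something like this: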

SELECT
  e2.event_id,
  e2.event_time,
  e2.data_f1,
  e2.data_f2
FROM
  (SELECT trunc(event_time / 300) AS time_bucket, data_f1, data_f2, min(event_id) as e_id
   FROM event_master
   GROUP BY 1, 2, 3) e1 
  JOIN 
    event_master e2 ON e1.e_id = e2.event_id
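
Note that `trunc(event_time / 300)` assigns rows to fixed 300-second buckets, so two rows less than five minutes apart can still fall into different buckets and both be returned. A small Python sketch of the same bucketing arithmetic (the `bucket` helper is hypothetical), using two `"8604","1"` rows from the sample data:

```python
def bucket(event_time: int, width: int = 300) -> int:
    """Fixed-width bucket number, mirroring trunc(event_time / 300)."""
    return event_time // width


t_first, t_second = 1297824809, 1297824945  # EVENT_IDs 25331 and 25352

print(t_second - t_first)                 # seconds between the two rows
print(bucket(t_first), bucket(t_second))  # the two bucket numbers differ
```

The rows are 136 seconds apart, yet a bucket boundary lies between them, so grouping by bucket keeps both.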

Answer 1: (score: 2)

You can try this (MySQL syntax):

SELECT * FROM EVENT_MASTER
WHERE EVENT_TIME > UNIX_TIMESTAMP() - 300  -- only events from the last 300 seconds
GROUP BY DATA_F1, DATA_F2

Hope this helps.

Answer 2: (score: 2)

I am not familiar with Firebird, but I worked from the documentation, so if I read it correctly this should work.

SELECT DISTINCT MIN(A.EVENT_TIME) as MINEVENT_TIME, B.DATA_F1, B.DATA_F2 
FROM EVENT_MASTER as A 
JOIN EVENT_MASTER as B ON A.EVENT_TIME BETWEEN B.EVENT_TIME-299 AND B.EVENT_TIME 
AND B.DATA_F1 = A.DATA_F1 AND B.DATA_F2 = A.DATA_F2 
GROUP BY B.DATA_F1, B.DATA_F2, B.EVENT_TIME 

This is syntax-checked but untested.

Answer 3: (score: 2)

This assumes all records have distinct values in event_time (otherwise they would exclude each other).

SELECT
  *
FROM
  event_master AS data
WHERE
  NOT EXISTS (
    SELECT * FROM event_master
    WHERE event_time >  data.event_time - 300
      AND event_time <= data.event_time
      AND event_id   <> data.event_id  -- do not compare a row with itself
  )

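If multiple events occur with the same value in event_time, can we assume that an event with a higher event_id never happened before an event with a lower event_id? If so, you can modify the above as follows: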

SELECT
  *
FROM
  event_master AS data
WHERE
  NOT EXISTS (
    SELECT * FROM event_master
    WHERE event_time >  data.event_time - 300
      AND event_time <= data.event_time
      AND event_id   <  data.event_id
  )

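If multiple events occur at the same time, the one with the lowest event_id is selected.


For performance, make sure the table has an index in which event_time is the first indexed field.

Answer 4: (score: 2)

As I understand it, you want distinct values of DATA_F1 and DATA_F2, but only within a 5-minute "window"; after that, a value may appear again, right? (Sorry if I have misunderstood the question, it has been a long day...) I do not know much about Firebird, but this is how you would do it in MS SQL Server: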

SELECT a.EVENT_ID, a.DATA_F1, a.DATA_F2, a.EVENT_TIME FROM
  EVENT_MASTER AS a LEFT JOIN EVENT_MASTER AS b 
    ON a.DATA_F1=b.DATA_F1 AND 
      a.DATA_F2=b.DATA_F2 AND 
      a.EVENT_TIME<b.EVENT_TIME AND 
      b.EVENT_TIME-a.EVENT_TIME<=5*60
WHERE
  b.EVENT_ID IS NULL

While testing, please also try the modified version below. Hope this helps!

SELECT a.EVENT_ID, a.DATA_F1, a.DATA_F2, a.EVENT_TIME FROM
  EVENT_MASTER AS a LEFT JOIN EVENT_MASTER AS b 
    ON a.DATA_F1=b.DATA_F1 AND 
      a.DATA_F2=b.DATA_F2 AND
      a.EVENT_ID<b.EVENT_ID AND 
      a.EVENT_TIME<=b.EVENT_TIME AND 
      b.EVENT_TIME-a.EVENT_TIME<=5*60
WHERE
  b.EVENT_ID IS NULL

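Added: OK, it looks like we have the right result. Here is how I suggest optimizing this baby (since I see that Firebird supports the EXISTS keyword, I have rewritten the query below):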

SELECT a.EVENT_ID, a.DATA_F1, a.DATA_F2, a.EVENT_TIME FROM EVENT_MASTER AS a 
WHERE NOT EXISTS (SELECT * FROM EVENT_MASTER AS b 
    WHERE a.DATA_F1=b.DATA_F1 AND 
      a.DATA_F2=b.DATA_F2 AND
      a.EVENT_ID<b.EVENT_ID AND 
      a.EVENT_TIME<=b.EVENT_TIME AND 
      b.EVENT_TIME-a.EVENT_TIME<=5*60)

Also, please add the following index:

CREATE INDEX IX_SPEED ON EVENT_MASTER (EVENT_ID DESC, EVENT_TIME ASC, DATA_F1 ASC, DATA_F2 ASC)

Hope this helps!
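
One thing to check when testing: the queries above drop a row whenever a later matching row follows within five minutes, so a long run of closely spaced events collapses to its last row only. Whether that is the wanted behaviour depends on how "redundant" is defined. A minimal Python sketch contrasting that rule with one that keeps a row when it is more than the window after the last kept row. The helper names are hypothetical and the timestamps are made up, standing for a single DATA_F1/DATA_F2 pair.

```python
def keep_if_no_later_neighbour(times, window=300):
    """Mirror of the NOT EXISTS rule: keep t unless a later event is within `window`."""
    return [t for i, t in enumerate(times)
            if not any(0 <= u - t <= window for u in times[i + 1:])]


def keep_anchored(times, window=300):
    """Keep t only when it is more than `window` after the last kept event."""
    kept = []
    for t in times:
        if not kept or t - kept[-1] > window:
            kept.append(t)
    return kept


times = [0, 200, 400, 600]  # made-up values, sorted ascending

print(keep_if_no_later_neighbour(times))  # [600]
print(keep_anchored(times))               # [0, 400]
```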

Answer 5: (score: 1)

Try:

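SELECT T1.* FROM EVENT_MASTER T1 WHERE NOT EXISTS (
    SELECT * FROM EVENT_MASTER T2 
    WHERE T2.DATA_F1=T1.DATA_F1 
    AND T2.DATA_F2=T1.DATA_F2 
    AND T2.EVENT_ID<T1.EVENT_ID  -- compare only against earlier rows, never the row itself
    AND (T1.EVENT_TIME-T2.EVENT_TIME) BETWEEN 0 AND 299
)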

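Answer 6: (score: 1)

You would need a pretty nasty recursive query to do this in a purely "functional" way. I am not confident I could construct such a query cleverly, let alone make it perform well.

On the other hand, allowing side effects (i.e. a temporary table) simplifies things considerably. You can even make it fast by adding appropriate indexes on the temporary table (not shown here). Here is the actual SQL: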

CREATE GLOBAL TEMPORARY TABLE EVENT_MASTER_TMP (
    EVENT_ID                BIGINT NOT NULL,
    EVENT_TIME              BIGINT NOT NULL,
    DATA_F1                 VARCHAR(40),
    DATA_F2                 VARCHAR(40),
    PRIMARY KEY (EVENT_ID)
);

INSERT INTO EVENT_MASTER_TMP
SELECT * FROM
    (SELECT * FROM EVENT_MASTER ORDER BY EVENT_TIME) E
WHERE
    NOT EXISTS (
        SELECT *
        FROM EVENT_MASTER_TMP T
        WHERE
            E.DATA_F1 = T.DATA_F1
            AND E.DATA_F2 = T.DATA_F2
            AND E.EVENT_TIME - T.EVENT_TIME <= 5*60
    );

SELECT * FROM EVENT_MASTER_TMP;

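In plain English:

  • Go through the events from oldest to newest.
  • For each event, check whether it is redundant relative to the rows already in the temporary table.
  • If it is not, insert it into the temporary table so that it serves as a criterion for the remaining events.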
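
The three steps above can also be sketched procedurally. The following Python is only an illustration of the same greedy pass, not a replacement for the SQL. The function name `drop_redundant` and the tuple layout are hypothetical, and the rows are a subset of the sample data from the question.

```python
def drop_redundant(rows, window=5 * 60):
    """Greedy pass over (event_id, event_time, data_f1, data_f2) tuples.

    A row is kept unless an already kept row with the same
    (data_f1, data_f2) is at most `window` seconds older.
    """
    last_kept_time = {}  # (data_f1, data_f2) -> event_time of the newest kept row
    kept = []
    # Oldest first; event_id breaks ties between events in the same second.
    for event_id, event_time, f1, f2 in sorted(rows, key=lambda r: (r[1], r[0])):
        previous = last_kept_time.get((f1, f2))
        if previous is None or event_time - previous > window:
            last_kept_time[(f1, f2)] = event_time
            kept.append(event_id)
    return kept


rows = [  # subset of the question's sample data
    (25327, 1297824698, "8604", "A"),
    (25328, 1297824770, "8604", "I"),
    (25329, 1297824773, "8604", "A"),
    (25331, 1297824809, "8604", "1"),
    (25332, 1297824811, "8604", "GREY"),
    (25341, 1297824875, "8804", "GREY"),
    (25351, 1297824944, "8604", "A"),
    (25365, 1297825045, "8604", "GREY"),
    (25366, 1297825046, "8604", "1"),
]

print(drop_redundant(rows))       # [25327, 25328, 25331, 25332, 25341]
print(drop_redundant(rows, 233))  # the above plus 25351, 25365, 25366
```

Because events are visited in ascending time order, comparing against the newest kept row per (DATA_F1, DATA_F2) pair is enough; any older kept row is further away still.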

Running the SQL above on your test data produces:

25327   1297824698  8604    A
25328   1297824770  8604    I
25331   1297824809  8604    1
25332   1297824811  8604    GREY
25341   1297824875  8804    GREY
25342   1297824876  8604    G
25348   1297824930  8604    YELLOW
25350   1297824940  8604    
25353   1297824954  8604    B
25361   1297825003  8604    2
25364   1297825026  8604    F

Lowering the time threshold from 5*60 to, say, 233 produces this:

25327   1297824698  8604    A
25328   1297824770  8604    I
25331   1297824809  8604    1
25332   1297824811  8604    GREY
25341   1297824875  8804    GREY
25342   1297824876  8604    G
25348   1297824930  8604    YELLOW
25350   1297824940  8604    
25351   1297824944  8604    A       <-- 246s difference
25353   1297824954  8604    B
25361   1297825003  8604    2
25364   1297825026  8604    F
25365   1297825045  8604    GREY    <-- 234s difference
25366   1297825046  8604    1       <-- 237s difference