Detecting near-duplicates above a threshold

Posted: 2013-05-15 09:55:53

Tags: mysql

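I'd like to be able to query a table for records that I suspect may be near-duplicates.

I've racked my brain and can't work out where to start, so I've simplified the problem as far as I can and come here to ask!

Here is my simplified table: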

CREATE TABLE sales
(
  `id1` int auto_increment primary key, 
  `amount` decimal(6,2),
  `date` datetime
);

Here are some test values:

INSERT INTO sales
(`amount`, `date`)
VALUES
(10, '2013-05-15T11:11:00'),
(11, '2013-05-15T11:11:11'),
(20, '2013-05-15T11:22:00'),
(3,  '2013-05-15T12:12:00'),
(4,  '2013-05-15T12:12:12'),
(45, '2013-05-15T12:22:00'),
(4,  '2013-05-15T12:24:00'),
(8,  '2013-05-15T13:00:00'),
(9,  '2013-05-15T13:01:00'),
(10, '2013-05-15T14:00:00');

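The problem

I want to return every sale above Y that has a neighbouring sale also above Y, where the two are within X minutes of each other.

That is, from this data: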

amt, date
(10, '2013-05-15T11:11:00'),
(11, '2013-05-15T11:11:11'),
(20, '2013-05-15T11:22:00'),
(3,  '2013-05-15T12:12:00'),
(4,  '2013-05-15T12:12:12'),
(45, '2013-05-15T12:22:00'),
(4,  '2013-05-15T12:24:00'),
(8,  '2013-05-15T13:00:00'),
(9,  '2013-05-15T13:01:00'),
(10, '2013-05-15T14:00:00');

where @yVal = 5 and @xMins = 10

the expected result would be:

(10, '2013-05-15T11:11:00'),
(11, '2013-05-15T11:11:11'),
(20, '2013-05-15T11:22:00'),
(8,  '2013-05-15T13:00:00'),
(9,  '2013-05-15T13:01:00'),

I've put the above into a fiddle: http://sqlfiddle.com/#!2/cf8fe

Any help would be greatly appreciated!
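The rule stated in the question can be sketched outside the database to check it against the expected rows. This is only an illustration in Python, not part of the original post; the names `near_duplicates` and `whole_minutes_apart` are made up for the example. One assumption is worth calling out: the expected output includes the 20.00 sale at 11:22:00, which is 10 minutes 49 seconds after the 11.00 sale, so the sketch assumes "within X minutes" is measured in whole minutes with the seconds truncated.

```python
from datetime import datetime

# Sample rows from the question: (amount, date).
SALES = [
    (10, "2013-05-15T11:11:00"),
    (11, "2013-05-15T11:11:11"),
    (20, "2013-05-15T11:22:00"),
    (3,  "2013-05-15T12:12:00"),
    (4,  "2013-05-15T12:12:12"),
    (45, "2013-05-15T12:22:00"),
    (4,  "2013-05-15T12:24:00"),
    (8,  "2013-05-15T13:00:00"),
    (9,  "2013-05-15T13:01:00"),
    (10, "2013-05-15T14:00:00"),
]


def whole_minutes_apart(a, b):
    """Absolute gap between two ISO timestamps in whole minutes (assumed: seconds truncated)."""
    gap = abs(datetime.fromisoformat(a) - datetime.fromisoformat(b))
    return int(gap.total_seconds()) // 60


def near_duplicates(sales, y_val, x_mins):
    """Keep sales above y_val that have another sale above y_val within x_mins."""
    big = [(i, s) for i, s in enumerate(sales) if s[0] > y_val]
    return [
        s for i, s in big
        if any(j != i and whole_minutes_apart(s[1], t[1]) <= x_mins for j, t in big)
    ]


result = near_duplicates(SALES, y_val=5, x_mins=10)
for amount, date in result:
    print(amount, date)
```

Under that assumption the sketch returns the five rows listed as the expected result. The 45.00 sale is dropped because its only neighbours within ten minutes are below Y, and the 14:00:00 sale is dropped because nothing is close to it in time.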

1 Answer:

Answer 0 (score: 0)

Try something like this:

-- Self-join: keep each sale that has another sale (s2) whose amount is
-- within @xVal of it and whose date is within @xMins minutes of it.
SELECT DISTINCT s1.* FROM sales s1
    LEFT JOIN sales s2 
    ON (s1.id1 != s2.id1 
            AND s1.amount >= s2.amount - @xVal AND s1.amount <= s2.amount + @xVal
            AND s1.date >= DATE_SUB(s2.date, INTERVAL @xMins minute) AND s1.date <= DATE_ADD(s2.date, INTERVAL @xMins minute)
    )
    WHERE
    s2.id1 is not null

Update

Fixed some mistakes.

The result for your data is:

+-----+--------+---------------------+
| id1 | amount | date                |
+-----+--------+---------------------+
|   1 |  10.00 | 2013-05-15 11:11:00 | 
|   2 |  11.00 | 2013-05-15 11:11:11 |
|   4 |   3.00 | 2013-05-15 12:12:00 |
|   5 |   4.00 | 2013-05-15 12:12:12 |
|   8 |   8.00 | 2013-05-15 13:00:00 |
|   9 |   9.00 | 2013-05-15 13:01:00 |
+-----+--------+---------------------+
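The result table can be cross-checked by replaying the join condition of the first query in plain Python. This is a sketch rather than the answerer's code: the function name `first_query_ids` is invented for the example, and it assumes `@xVal = 5` and `@xMins = 10`, since the answer does not say which values were used.

```python
from datetime import datetime, timedelta

# (id1, amount, date) as the auto-increment key would number the question's rows.
ROWS = [
    (1, 10, "2013-05-15T11:11:00"),
    (2, 11, "2013-05-15T11:11:11"),
    (3, 20, "2013-05-15T11:22:00"),
    (4, 3,  "2013-05-15T12:12:00"),
    (5, 4,  "2013-05-15T12:12:12"),
    (6, 45, "2013-05-15T12:22:00"),
    (7, 4,  "2013-05-15T12:24:00"),
    (8, 8,  "2013-05-15T13:00:00"),
    (9, 9,  "2013-05-15T13:01:00"),
    (10, 10, "2013-05-15T14:00:00"),
]


def first_query_ids(rows, x_val, x_mins):
    """Ids of rows that have another row within x_val in amount and x_mins in time."""
    window = timedelta(minutes=x_mins)
    parsed = [(i, amt, datetime.fromisoformat(d)) for i, amt, d in rows]
    ids = []
    for id1, amount1, date1 in parsed:
        for id2, amount2, date2 in parsed:
            if (id1 != id2
                    and abs(amount1 - amount2) <= x_val      # amount tolerance
                    and abs(date1 - date2) <= window):       # exact time window
                ids.append(id1)
                break  # one match is enough, like SELECT DISTINCT
    return ids


ids = first_query_ids(ROWS, x_val=5, x_mins=10)
print(ids)
```

With those assumed values the sketch selects ids 1, 2, 4, 5, 8 and 9, the same rows as the table. Rows 4 and 5 appear because this query compares amounts with each other and never against a minimum, which is what the second update addresses.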

Update 2

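    -- Here @xVal is the minimum amount (the question's @yVal): both the
    -- sale (s1) and its neighbour (s2) must be at least @xVal.
    SELECT DISTINCT s1.* FROM sales s1
    LEFT JOIN sales s2
    ON (s1.id1 != s2.id1
        AND s2.amount >= @xVal
        AND s1.date >= DATE_SUB(s2.date, INTERVAL @xMins minute) AND s1.date <= DATE_ADD(s2.date, INTERVAL @xMins minute)
    )
    WHERE
    -- requiring a matched s2 row makes the LEFT JOIN behave like an INNER JOIN
    s2.id1 is not null 
    AND s1.amount >= @xVal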
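To try this self-join on the sample data without a MySQL server, it can be ported to SQLite through Python's `sqlite3` module. The port below is an approximation, not the query above: `strftime('%s', ...)` stands in for MySQL's date arithmetic, the threshold is passed as `:y_val`, and the helper `matching_ids` is a name made up for this sketch. It runs the join twice, once with an exact window in seconds (as the query above does) and once with the gap truncated to whole minutes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id1 INTEGER PRIMARY KEY, amount NUMERIC, date TEXT)")
conn.executemany(
    "INSERT INTO sales (amount, date) VALUES (?, ?)",
    [
        (10, "2013-05-15T11:11:00"), (11, "2013-05-15T11:11:11"),
        (20, "2013-05-15T11:22:00"), (3, "2013-05-15T12:12:00"),
        (4, "2013-05-15T12:12:12"), (45, "2013-05-15T12:22:00"),
        (4, "2013-05-15T12:24:00"), (8, "2013-05-15T13:00:00"),
        (9, "2013-05-15T13:01:00"), (10, "2013-05-15T14:00:00"),
    ],
)

# Gap between the two sales in seconds (SQLite stand-in for the MySQL date maths).
GAP = ("ABS(CAST(strftime('%s', s1.date) AS INTEGER)"
       " - CAST(strftime('%s', s2.date) AS INTEGER))")


def matching_ids(gap_condition, y_val=5, x_mins=10):
    """Run the self-join with the given time condition and return the matching ids."""
    sql = f"""
        SELECT DISTINCT s1.id1
        FROM sales s1
        JOIN sales s2
          ON s1.id1 != s2.id1
         AND s2.amount >= :y_val
         AND {gap_condition}
        WHERE s1.amount >= :y_val
        ORDER BY s1.id1
    """
    return [row[0] for row in conn.execute(sql, {"y_val": y_val, "x_mins": x_mins})]


exact_ids = matching_ids(f"{GAP} <= :x_mins * 60")   # exact window, like DATE_SUB/DATE_ADD
whole_ids = matching_ids(f"{GAP} / 60 <= :x_mins")   # integer division: whole minutes
print(exact_ids, whole_ids)
```

On the sample data the exact window returns ids 1, 2, 8 and 9, while the whole-minute version also returns id 3 (the 20.00 sale at 11:22:00), which the question lists as expected. If that row matters, comparing whole minutes in MySQL, for instance with `TIMESTAMPDIFF(MINUTE, ...)`, may be worth trying; that variant has not been run here.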