I have a table like this:
id start_date end_date
01 2012-03-15 2012-04-02
01 2012-04-05 2012-05-12
01 2012-04-12 2012-05-21
02 2012-03-05 2012-06-13
03 2012-03-09 2012-03-19
03 2012-05-15 2012-08-01
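Here, duplicate records are defined as any number of rows that share the same id and whose start_date values fall within 31 days of each other. In the example above, rows 2 and 3 are duplicates of row 1, while row 6 is not a duplicate of row 5. I want to remove the duplicates to get: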
id start_date end_date
01 2012-03-15 2012-04-02
02 2012-03-05 2012-06-13
03 2012-03-09 2012-03-19
03 2012-05-15 2012-08-01
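I was wondering whether there is a simple way to do this in SQL, or whether it is better to do this kind of filtering in Python after I have collected the data with the duplicates still in it.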
Answer 0 (score: 0)
I think that whichever language you choose, you will have to use some form of recursion. Here is an example of how to do it in SQL:
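-- Sample data in a table variable (SQL Server syntax)
DECLARE @Periods TABLE (id INT, start_date DATE, end_date DATE);
INSERT INTO @Periods (id, start_date, end_date) VALUES
      (1, '2012-03-15', '2012-04-02')
    , (1, '2012-04-05', '2012-05-12')
    , (1, '2012-04-12', '2012-05-21')
    , (2, '2012-03-05', '2012-06-13')
    , (3, '2012-03-09', '2012-03-19')
    , (3, '2012-04-03', '2012-05-02')
    , (3, '2012-05-01', '2012-08-01')
    , (3, '2012-05-16', '2012-08-02')
    , (3, '2012-06-08', '2012-09-09');

-- Note: SQL Server stops a recursive CTE after 100 levels by default.
-- If an id can have more than 100 rows, append OPTION (MAXRECURSION 0) to the final SELECT.
WITH T AS (
    -- Anchor: the earliest row for each id
    SELECT id, start_date, end_date, RN
    FROM (
        SELECT id, start_date, end_date,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date) AS RN
        FROM @Periods
    ) S
    WHERE RN = 1
    UNION ALL
    -- Recursive step: move to the next row of the same id. If it starts within
    -- 31 days of the row currently kept, carry the kept row forward;
    -- otherwise the next row becomes the new kept row.
    SELECT S.id
         , CASE WHEN DATEDIFF(day, T.start_date, S.start_date) <= 31 THEN T.start_date ELSE S.start_date END
         , CASE WHEN DATEDIFF(day, T.start_date, S.start_date) <= 31 THEN T.end_date ELSE S.end_date END
         , S.RN
    FROM (
        SELECT id, start_date, end_date,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date) AS RN
        FROM @Periods
    ) S
    JOIN T ON T.id = S.id
    WHERE S.RN = T.RN + 1
)
-- A kept row appears once for itself and once per duplicate it absorbed;
-- GROUP BY collapses those repeats.
SELECT id, start_date, end_date
FROM T
GROUP BY id, start_date, end_date;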
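This works fine for a sample that is not too large, but it may not be the most efficient approach if you are dealing with a large number of rows.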
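For comparison, here is a rough sketch of the same rule applied in Python after the rows have been fetched. It assumes the rows arrive as (id, start_date, end_date) tuples holding `datetime.date` values; the names `drop_near_duplicates` and `window_days` are made up for this example. It sorts once and then makes a single pass, keeping a row only when it starts more than 31 days after the last kept row for the same id, which is the same "kept row" logic as the recursive query above.

```python
from datetime import date


def drop_near_duplicates(rows, window_days=31):
    """Keep the first row per id, then only rows that start more than
    `window_days` days after the most recently kept row of that id.

    `rows` is an iterable of (id, start_date, end_date) tuples
    (an assumed shape, not something the question specifies).
    """
    kept = []
    last_kept_start = {}  # id -> start_date of the last row kept for that id
    # Sort by id, then start_date, so each id is scanned in date order.
    for row_id, start, end in sorted(rows, key=lambda r: (r[0], r[1])):
        previous = last_kept_start.get(row_id)
        if previous is None or (start - previous).days > window_days:
            kept.append((row_id, start, end))
            last_kept_start[row_id] = start
    return kept


# The table from the question
rows = [
    ("01", date(2012, 3, 15), date(2012, 4, 2)),
    ("01", date(2012, 4, 5), date(2012, 5, 12)),
    ("01", date(2012, 4, 12), date(2012, 5, 21)),
    ("02", date(2012, 3, 5), date(2012, 6, 13)),
    ("03", date(2012, 3, 9), date(2012, 3, 19)),
    ("03", date(2012, 5, 15), date(2012, 8, 1)),
]

for row_id, start, end in drop_near_duplicates(rows):
    print(row_id, start.isoformat(), end.isoformat())
```

On the question's data this keeps the four rows listed in the desired output. Since the work is one sort plus one pass, it may cope better with large inputs than the recursive query, though that depends on how much data you are willing to pull out of the database first.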