SQLite greatest-n-per-group for cleaning up imported data

Time: 2014-06-25 15:54:20

Tags: sql sqlite greatest-n-per-group

I have been given the task of writing SQL to clean up and de-duplicate an imported data set.

 pk1 | pk2 | timestamp | value1 | value2 | value3 | etc
-----+-----+-----------+--------+--------+--------+----- 

  1  |  2  |    123    |   1    |   2    |   5    |  ...
  1  |  2  |    124    |   1    |   2    |   4    |  ...
  1  |  2  |    125    |   1    |   2    |   3    |  ...   Either this row
  1  |  2  |    125    |   1    |   2    |   2    |  ...   Or this row (arbitrary)

  3  |  2  |    123    |   1    |   2    |   5    |  ...
  3  |  2  |    123    |   1    |   2    |   4    |  ...
  3  |  2  |    124    |   1    |   2    |   3    |  ...
  3  |  2  |    125    |   1    |   2    |   2    |  ...   Only this row

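The two pk fields form a composite primary key.

The timestamp field identifies when the data was generated.

I need one row per pk1, pk2 pair, with the highest timestamp taking priority. Duplicates may still remain (1, 2, 125 appears twice in the set above); in that case an arbitrary row should be chosen, and a field should be set to indicate that the choice was arbitrary.

I have answers for MySQL and for RDBMSs that support ANALYTICAL_FUNCTIONS()...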


MySQL:

SELECT
  import.*,
  CASE WHEN COUNT(*) = 1 THEN 0 ELSE 1 END   AS duplicate_warning
FROM
  import
INNER JOIN
(
  SELECT pk1, pk2, MAX(timestamp) AS timestamp
    FROM import
GROUP BY pk1, pk2
)
  AS import_lookup
    ON  import.pk1       = import_lookup.pk1
    AND import.pk2       = import_lookup.pk2
    AND import.timestamp = import_lookup.timestamp
GROUP BY
  import.pk1,
  import.pk2

ANALYTICAL_FUNCTIONS():

SELECT
  sorted_import.*
FROM
(
  SELECT
    import.*,
    CASE WHEN
      COUNT(*)       OVER (PARTITION BY pk1, pk2, timestamp) = 1
      AND
      MAX(timestamp) OVER (PARTITION BY pk1, pk2)            = timestamp
    THEN
      0
    ELSE
      ROW_NUMBER() OVER (PARTITION BY pk1, pk2 ORDER BY timestamp DESC)
    END  AS duplicate_warning
  FROM
    import
)
  AS sorted_import
WHERE
  sorted_import.duplicate_warning IN (0, 1)


How can I achieve this with SQLite?

One constraint (I don't make the rules): temporary tables and auto-increment fields cannot be used.

1 answer:

Answer 0: (score: 2)

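In SQLite 3.7.11 or later, when a query contains a single MIN() or MAX() aggregate, the values of the non-aggregated columns are guaranteed to come from the row that holds that minimum or maximum: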

SELECT *, MAX(timestamp)
FROM import
GROUP BY pk1, pk2
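
The query above can be tried out with a small script. This is only a sketch, using Python's built-in sqlite3 module and assuming the sample rows from the question with the elided columns left out. The ORDER BY clauses, the `newest` alias, and the `duplicate_warning` subquery are additions for illustration and are not part of the answer. The subquery is one possible way to get the "arbitrary choice" flag the question asks for, without temporary tables or auto-increment fields.

```python
import sqlite3

# Hypothetical in-memory copy of the question's sample table
# (only the columns shown in the question; "etc" is omitted).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE import ("
    "pk1 INTEGER, pk2 INTEGER, timestamp INTEGER, "
    "value1 INTEGER, value2 INTEGER, value3 INTEGER)"
)
conn.executemany(
    "INSERT INTO import VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 2, 123, 1, 2, 5),
        (1, 2, 124, 1, 2, 4),
        (1, 2, 125, 1, 2, 3),
        (1, 2, 125, 1, 2, 2),
        (3, 2, 123, 1, 2, 5),
        (3, 2, 123, 1, 2, 4),
        (3, 2, 124, 1, 2, 3),
        (3, 2, 125, 1, 2, 2),
    ],
)

# The answer's query; ORDER BY is added only to make the row order stable.
latest = conn.execute(
    "SELECT *, MAX(timestamp) FROM import GROUP BY pk1, pk2 ORDER BY pk1, pk2"
).fetchall()

# Assumed extension: flag groups whose newest timestamp occurs more than once,
# meaning the row that was returned is an arbitrary pick among the ties.
flagged = conn.execute(
    """
    SELECT latest.pk1, latest.pk2, latest.newest, latest.value3,
           (SELECT COUNT(*)
              FROM import AS dup
             WHERE dup.pk1 = latest.pk1
               AND dup.pk2 = latest.pk2
               AND dup.timestamp = latest.newest) > 1 AS duplicate_warning
    FROM (SELECT *, MAX(timestamp) AS newest
            FROM import
           GROUP BY pk1, pk2) AS latest
    ORDER BY latest.pk1, latest.pk2
    """
).fetchall()

for row in flagged:
    print(row)
```

For the pair (1, 2), either of the two rows with timestamp 125 may come back, so value3 can be 2 or 3 there, and the flag is set to 1. For (3, 2) the newest row is unique and the flag is 0.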