Compressing tables after normalization

Time: 2011-09-19 16:59:08

Tags: mysql database-normalization

I recently increased the level of normalization in my database, going from this:

+--------------------------------------+
| state_changes                        |
+----+-------+-----------+------+------+
| ID | Name  | Timestamp | Val1 | Val2 |
+----+-------+-----------+------+------+
| 0  | John  | 17:19:01  |  A   |  X   |
| 1  | Bob   | 17:19:02  |  E   |  W   |
| 2  | John  | 17:19:05  |  E   |  Y   |
| 3  | John  | 17:19:06  |  B   |  Y   |
| 4  | John  | 17:19:12  |  C   |  Z   |
| 5  | John  | 17:19:15  |  A   |  Z   |
+----+-------+-----------+------+------+

to something more like this:

+-------------------------------+   +-------------------------------+
| state_changes_1               |   | state_changes_2               |
+----+-------+-----------+------+   +----+-------+-----------+------+
| ID | Name  | Timestamp | Val1 |   | ID | Name  | Timestamp | Val2 |
+----+-------+-----------+------+   +----+-------+-----------+------+
| 0  | John  | 17:19:01  |  A   |   | 0  | John  | 17:19:01  |  X   |
| 1  | Bob   | 17:19:02  |  E   |   | 1  | Bob   | 17:19:02  |  W   |
| 2  | John  | 17:19:05  |  E   |   | 2  | John  | 17:19:05  |  Y   |
| 3  | John  | 17:19:06  |  B   |   | 3  | John  | 17:19:06  |  Y   |
| 4  | John  | 17:19:12  |  C   |   | 4  | John  | 17:19:12  |  Z   |
| 5  | John  | 17:19:15  |  A   |   | 5  | John  | 17:19:15  |  Z   |
+----+-------+-----------+------+   +----+-------+-----------+------+

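How can I now write a query to "compress" the two resulting tables wherever values are repeated?

  • I want to ignore the ID field when considering row uniqueness;
  • I want to ignore the Timestamp when considering row uniqueness;
  • But rows must be consecutive (when ordered by Name, Timestamp) to be considered duplicates.

In this example, the result should be: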

+-------------------------------+   +-------------------------------+
| state_changes_1               |   | state_changes_2               |
+----+-------+-----------+------+   +----+-------+-----------+------+
| ID | Name  | Timestamp | Val1 |   | ID | Name  | Timestamp | Val2 |
+----+-------+-----------+------+   +----+-------+-----------+------+
| 0  | John  | 17:19:01  |  A   |   | 0  | John  | 17:19:01  |  X   |
| 1  | Bob   | 17:19:02  |  E   |   | 1  | Bob   | 17:19:02  |  W   |
| 3  | John  | 17:19:06  |  B   |   | 2  | John  | 17:19:05  |  Y   |
| 4  | John  | 17:19:12  |  C   |   | 4  | John  | 17:19:12  |  Z   |
| 5  | John  | 17:19:15  |  A   |   +----+-------+-----------+------+
+----+-------+-----------+------+
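
For readers who want the rule spelled out, here is a minimal Python sketch of it, applied to the state_changes_2 sample rows. The helper name `compress` and the tuple layout are assumptions made for illustration; this is not a solution that scales to the real tables.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows copied from state_changes_2: (ID, Name, Timestamp, Val2).
ROWS = [
    (0, "John", "17:19:01", "X"),
    (1, "Bob",  "17:19:02", "W"),
    (2, "John", "17:19:05", "Y"),
    (3, "John", "17:19:06", "Y"),
    (4, "John", "17:19:12", "Z"),
    (5, "John", "17:19:15", "Z"),
]

def compress(rows):
    """Keep the first row of each run of equal values per Name (hypothetical helper)."""
    # Equivalent of ORDER BY Name, Timestamp.
    ordered = sorted(rows, key=itemgetter(1, 2))
    kept = []
    # groupby only merges adjacent items, so each group is one consecutive run
    # of rows sharing the same (Name, Val2).
    for _, run in groupby(ordered, key=itemgetter(1, 3)):
        kept.append(next(run))
    return sorted(kept)  # back to ID order

kept_ids = [row[0] for row in compress(ROWS)]
print(kept_ids)  # [0, 1, 2, 4]
```

The surviving IDs match the expected state_changes_2 table above.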

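My tables have billions of rows, so I am looking for something that takes efficiency into account. That said, I am realistic: if need be, I am happy for the query to take an hour or two to run (including rebuilding the indexes).

2 answers:

Answer 0 (score: 1)

I tried this on MySQL 5.1.58 and it seems to work with your test data.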

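-- Hold the previous row's Name and Val1 while scanning in (Name, Timestamp) order.
SET @name = NULL;
SET @val1 = NULL;

-- Mark a row by setting Val1 to NULL when it repeats the previous row's Name and Val1.
UPDATE state_changes_1
SET Val1 = IF(Name=@name AND Val1=@val1, NULL, (@val1:=Val1)),
    Name = (@name:=Name)
ORDER BY Name, `Timestamp`;

-- Remove the marked rows (this assumes Val1 is never legitimately NULL).
DELETE FROM state_changes_1 WHERE Val1 IS NULL;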
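
One caveat worth checking: as far as I know, the MySQL manual does not guarantee the order in which user-variable assignments are evaluated within a statement, so the result is worth verifying independently. The sketch below swaps in a different technique, the `LAG()` window function, to list the IDs that should survive. It runs against an in-memory SQLite database from Python purely for illustration and uses the state_changes_2 sample rows. Assumptions: SQLite 3.25 or later (MySQL only gained window functions in 8.0, so this would not run on 5.1.58), and no NULLs in Val2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_changes_2 "
    "(ID INTEGER PRIMARY KEY, Name TEXT, Timestamp TEXT, Val2 TEXT)"
)
conn.executemany(
    "INSERT INTO state_changes_2 VALUES (?, ?, ?, ?)",
    [
        (0, "John", "17:19:01", "X"),
        (1, "Bob",  "17:19:02", "W"),
        (2, "John", "17:19:05", "Y"),
        (3, "John", "17:19:06", "Y"),
        (4, "John", "17:19:12", "Z"),
        (5, "John", "17:19:15", "Z"),
    ],
)

# A row survives when it is the first row for its Name, or when its value
# differs from the previous row of the same Name (ordered by Timestamp).
query = """
SELECT ID FROM (
    SELECT ID, Val2,
           LAG(Val2) OVER (PARTITION BY Name ORDER BY Timestamp) AS prev_val
    FROM state_changes_2
) AS t
WHERE prev_val IS NULL OR prev_val <> Val2
ORDER BY ID
"""
ids_to_keep = [row[0] for row in conn.execute(query)]
print(ids_to_keep)  # [0, 1, 2, 4]
```

Comparing this list with the rows left after the UPDATE and DELETE is a cheap sanity check on a small sample before touching the full table.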

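Answer 1 (score: 0)

Your problem is that relational algebra has no concept of "order" or of consecutive duplicates, so this cannot be done in SQL.

You can easily get the latest timestamp for each state with: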
SELECT name, state, MAX(`timestamp`) AS ts
FROM states
GROUP BY name, state
ORDER BY ts;
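
To illustrate why grouping alone does not give the "consecutive" behaviour the question asks for, here is a small sketch that runs a query of this shape against the state_changes_1 sample rows in an in-memory SQLite database. Mapping the answer's `states` table and `state` column onto state_changes_1 and Val1 is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_changes_1 (ID INTEGER, Name TEXT, Timestamp TEXT, Val1 TEXT)"
)
conn.executemany(
    "INSERT INTO state_changes_1 VALUES (?, ?, ?, ?)",
    [
        (0, "John", "17:19:01", "A"),
        (1, "Bob",  "17:19:02", "E"),
        (2, "John", "17:19:05", "E"),
        (3, "John", "17:19:06", "B"),
        (4, "John", "17:19:12", "C"),
        (5, "John", "17:19:15", "A"),
    ],
)

# One output row per (Name, Val1) pair, carrying the latest Timestamp seen.
latest = conn.execute(
    "SELECT Name, Val1, MAX(Timestamp) AS ts "
    "FROM state_changes_1 GROUP BY Name, Val1 ORDER BY ts"
).fetchall()
for row in latest:
    print(row)
```

Six input rows produce five groups: John's two A states (17:19:01 and 17:19:15) are merged into one row even though other states lie between them, so the information about which rows were adjacent is lost.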

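However, you can do what you want by dumping the table to a text file and running a simple script over it, written in a language such as Perl, Ruby, or Python. Even on a million-row table this can be done quite quickly.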
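
A minimal sketch of such a script in Python follows. It assumes the dump is a CSV file with a header row, already sorted by Name and then Timestamp (for example, exported with `ORDER BY Name, Timestamp`). The function name `compress_dump` and the column names are illustrative assumptions. It reads one row at a time, so memory use does not grow with the size of the table.

```python
import csv
import io

def compress_dump(src, dst, value_col="Val2"):
    """Copy rows from src to dst, skipping consecutive repeats per Name.

    src and dst are open text streams; src must be sorted by Name, Timestamp.
    Returns the number of rows written (hypothetical helper).
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    prev = None  # (Name, value) of the previous input row
    kept = 0
    for row in reader:
        key = (row["Name"], row[value_col])
        if key != prev:  # first row of a new run: keep it
            writer.writerow(row)
            kept += 1
        prev = key
    return kept

# In-memory stand-in for the dumped state_changes_2 table.
dump = io.StringIO(
    "ID,Name,Timestamp,Val2\n"
    "1,Bob,17:19:02,W\n"
    "0,John,17:19:01,X\n"
    "2,John,17:19:05,Y\n"
    "3,John,17:19:06,Y\n"
    "4,John,17:19:12,Z\n"
    "5,John,17:19:15,Z\n"
)
out = io.StringIO()
kept = compress_dump(dump, out)
print(kept)            # 4
print(out.getvalue())
```

With real files, the two `StringIO` objects would be replaced by `open(...)` handles, and the compressed output could then be loaded back into a fresh table.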