Based on fields

Time: 2018-04-24 00:52:01

Tags: sql data-warehouse

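I have a table with 3 key columns, called services (service1, service2, service3), plus other value columns. I want to delete all duplicate records from the table based on the combination of the 3 key fields, in any order. For example, records with the key fields 'car, truck, bike' and 'bike, car, truck' are duplicates even though the values sit in different positions. Note: I edited the post in response to the comments to give a more detailed statement.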

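1 answer:

Answer 0: (score: 0)

It sounds like the table is poorly designed, so I would consider restructuring it completely.

But to deal with it as it is (and without using a cursor), I think the fastest way is to list every possible permutation to find the duplicates, and then assign row numbers.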

Example:

The digits 1 2 3 have 6 permutations:

123, 132, 213, 231, 312, 321

The same goes for your 'bike' 'car' 'truck':

'bike' 'car' 'truck', 'car' 'bike' 'truck', ... etc.
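
To make the enumeration concrete, here is a minimal sketch that lists every ordering of the three values. Python is used purely for illustration (it is not part of the Oracle solution), and the variable names are made up for the example.

```python
from itertools import permutations

values = ("bike", "car", "truck")

# Every ordering of the three key values; there are 3! = 6 of them.
orderings = list(permutations(values))

for ordering in orderings:
    print(ordering)
```

Any two rows whose key columns appear somewhere in this list count as duplicates of each other.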

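So we want to partition the data in the table into groups of duplicates (based on all possible permutations) and assign a row number to every row within its partition.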


Sample table and data:

CREATE TABLE services
  (  service1 VARCHAR(10),
     service2 VARCHAR(10),
     service3 VARCHAR(10) 
  ); 

-- These first three rows are permutations of one another, so they should
-- end up in the same partition in our query.
INSERT INTO services VALUES ('bike', 'car', 'truck');
INSERT INTO services VALUES ('truck', 'bike', 'car');
INSERT INTO services VALUES ('car', 'truck', 'bike');
-- This fourth row should be in a partition of its own.
INSERT INTO services VALUES ('moped', 'car', 'truck');

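Run this query to see the resulting partitions. It essentially creates one partition for all rows whose three columns are a permutation of one another: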

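-- The partition key is the smallest ROWID among all rows whose three
-- columns are a permutation of this row's columns (the row itself included),
-- so every duplicate group gets its own partition.
SELECT s.*,
       ROW_NUMBER() OVER (PARTITION BY (SELECT MIN(s1.rowid)
                                        FROM   services s1
                                        WHERE (    s1.service1 = s.service1
                                               AND s1.service2 = s.service2
                                               AND s1.service3 = s.service3)
                                           OR (    s1.service1 = s.service1
                                               AND s1.service2 = s.service3
                                               AND s1.service3 = s.service2)
                                           OR (    s1.service1 = s.service2
                                               AND s1.service2 = s.service1
                                               AND s1.service3 = s.service3)
                                           OR (    s1.service1 = s.service2
                                               AND s1.service2 = s.service3
                                               AND s1.service3 = s.service1)
                                           OR (    s1.service1 = s.service3
                                               AND s1.service2 = s.service1
                                               AND s1.service3 = s.service2)
                                           OR (    s1.service1 = s.service3
                                               AND s1.service2 = s.service2
                                               AND s1.service3 = s.service1) )
                          ORDER BY NULL) AS rownumber
FROM   services s;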
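
Typing out every ordering gets unwieldy as the number of key columns grows (n columns give n! orderings). A common alternative, which swaps in a different technique from the query above, is to sort each row's key values into one canonical order and group on that. The sketch below is only an illustration under assumptions: it uses Python with an in-memory SQLite database instead of Oracle, and it relies on SQLite's implicit `rowid` to identify rows.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (service1 TEXT, service2 TEXT, service3 TEXT)")
conn.executemany(
    "INSERT INTO services VALUES (?, ?, ?)",
    [
        ("bike", "car", "truck"),
        ("truck", "bike", "car"),
        ("car", "truck", "bike"),
        ("moped", "car", "truck"),
    ],
)

# Rows that are permutations of one another share the same sorted tuple,
# so the sorted tuple can serve as the group (partition) key.
groups = defaultdict(list)
for rowid, s1, s2, s3 in conn.execute(
    "SELECT rowid, service1, service2, service3 FROM services ORDER BY rowid"
):
    groups[tuple(sorted((s1, s2, s3)))].append(rowid)

for key, rowids in groups.items():
    print(key, rowids)
```

With the sample data, the first three rows fall under one key and the moped row under another, which mirrors the partitions the query is meant to produce.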

Now that you have the result, you can see that you need to delete every row whose rownumber is greater than 1:

-- Oracle cannot delete through an inline view that contains an analytic
-- function, so collect the ROWIDs of the surplus rows and delete by ROWID.
DELETE FROM services
WHERE  rowid IN
       (SELECT rid
        FROM  (SELECT s.rowid AS rid,
                      ROW_NUMBER() OVER (PARTITION BY (SELECT MIN(s1.rowid)
                                                       FROM   services s1
                                                       WHERE (    s1.service1 = s.service1
                                                              AND s1.service2 = s.service2
                                                              AND s1.service3 = s.service3)
                                                          OR (    s1.service1 = s.service1
                                                              AND s1.service2 = s.service3
                                                              AND s1.service3 = s.service2)
                                                          OR (    s1.service1 = s.service2
                                                              AND s1.service2 = s.service1
                                                              AND s1.service3 = s.service3)
                                                          OR (    s1.service1 = s.service2
                                                              AND s1.service2 = s.service3
                                                              AND s1.service3 = s.service1)
                                                          OR (    s1.service1 = s.service3
                                                              AND s1.service2 = s.service1
                                                              AND s1.service3 = s.service2)
                                                          OR (    s1.service1 = s.service3
                                                              AND s1.service2 = s.service2
                                                              AND s1.service3 = s.service1) )
                                         ORDER BY NULL) AS rownumber
               FROM   services s)
        WHERE  rownumber > 1);
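
The OR branches do not have to be typed by hand; they can be generated. The following is a rough sketch under assumptions rather than the Oracle statement itself: it uses Python with an in-memory SQLite database, builds the permutation predicate with `itertools.permutations`, and deletes a row whenever an earlier row (smaller SQLite `rowid`) is a permutation of it. The alias `d` and the variable names are hypothetical.

```python
import sqlite3
from itertools import permutations

cols = ("service1", "service2", "service3")

# One AND-branch per ordering of the columns (identity included), comparing
# a candidate row "d" with the row being considered for deletion.
branches = [
    "(" + " AND ".join(f"d.{c} = services.{p}" for c, p in zip(cols, perm)) + ")"
    for perm in permutations(cols)
]
match = " OR ".join(branches)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (service1 TEXT, service2 TEXT, service3 TEXT)")
conn.executemany(
    "INSERT INTO services VALUES (?, ?, ?)",
    [
        ("bike", "car", "truck"),
        ("truck", "bike", "car"),
        ("car", "truck", "bike"),
        ("moped", "car", "truck"),
    ],
)

# Delete a row when a row with a smaller rowid is a permutation of it,
# so the first row of each duplicate group survives.
conn.execute(
    "DELETE FROM services WHERE EXISTS ("
    "SELECT 1 FROM services AS d "
    f"WHERE d.rowid < services.rowid AND ({match}))"
)

remaining = conn.execute(
    "SELECT service1, service2, service3 FROM services ORDER BY rowid"
).fetchall()
print(remaining)
```

Because the row with the smallest rowid in each group never matches an earlier row, exactly one row per group is kept.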

Side note: I wrote this for Oracle. I have never used Teradata, so it may have a different way of doing the partitioning. See http://www.bikinfo.com/HTML/TD/TD_vs_Oracle.html#_Toc_Qualify