Finding duplicate rows across multiple columns

Time: 2014-11-28 17:21:13

Tags: php mysql sql

Here is my SQL data:

id      location1    location2      distance
--------------------------------------------
1       Paris        Marseille      150km
2       Paris        Lyon           200km
3       Paris        Strasbourg     300km
4       Paris        Toulouse       350km
5       Marseille    Paris          150km  <-(almost) duplicate of row 1
6       Marseille    Lyon           250km
...

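Because the distance Paris -> Marseille is equal to the distance Marseille -> Paris, I want to delete one row of each such duplicate pair.

The table contains nearly 1M rows, half of which are duplicates. How can I delete these duplicates from such a large table?

4 answers:

Answer 0: (score: 2)

In this case, you can join the table to itself: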

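-- MySQL does not allow a subquery to read the DELETE target directly
-- (error 1093), so the self-join is wrapped in a derived table.
DELETE FROM city WHERE id IN (
  SELECT id FROM (
    SELECT c1.id  -- qualified: a bare "id" is ambiguous in a self-join
    FROM city c1, city c2
    WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
    AND c1.id < c2.id) AS dup);

I am assuming your table is named city.

As miszyman pointed out, avoiding the subquery is more efficient: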

  DELETE c1
  FROM city c1, city c2
  WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
  AND c1.id < c2.id

Answer 1: (score: 1)

What if you simply select the rows where location1 < location2?

Answer 2: (score: 0)
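
One way to read this suggestion, as a rough sketch rather than the answerer's own code: treat the row whose location1 sorts before location2 as the canonical copy of each pair. The table name city is borrowed from answer 0. The DELETE variant assumes every pair is stored in both directions; a pair stored only once, in the "wrong" order, would be lost.

```sql
-- Read-only variant: return one row per pair without touching the table.
SELECT id, location1, location2, distance
FROM city
WHERE location1 < location2;

-- Destructive variant: drop the mirrored half of each pair.
-- Assumption: every (a, b) row has a matching (b, a) row. If that does not
-- hold, rows stored only with location1 > location2 are deleted as well.
DELETE FROM city
WHERE location1 > location2;
```

The string comparison follows the column's collation, so the choice of which row counts as canonical is arbitrary but consistent. The approach needs no self-join, which matters on a table of this size.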

DROP TABLE IF EXISTS my_table;

CREATE TABLE my_table
(id      INT NOT NULL
,location1    varchar(20) not null
,location2      varchar(20) not null
,distance INT NOT NULL
,UNIQUE(location1,location2)
);

INSERT INTO my_table VALUES
(1 ,'Paris','Marseille',150),
(2 ,'Paris','Lyon',200),
(3 ,'Paris','Strasbourg',300),
(4 ,'Paris','Toulouse',350),
(5 ,'Marseille','Paris',150),
(6 ,'Marseille','Lyon',250);

DELETE x 
  FROM my_table x 
  JOIN my_table y 
    ON y.location2 = x.location1 
   AND y.location1 = x.location2 
   AND y.distance = x.distance 
   AND y.id < x.id;
Query OK, 1 row affected (0.00 sec)

SELECT * 
  FROM my_table;
+----+-----------+------------+----------+
| id | location1 | location2  | distance |
+----+-----------+------------+----------+
|  6 | Marseille | Lyon       |      250 |
|  2 | Paris     | Lyon       |      200 |
|  1 | Paris     | Marseille  |      150 |
|  3 | Paris     | Strasbourg |      300 |
|  4 | Paris     | Toulouse   |      350 |
+----+-----------+------------+----------+

Answer 3: (score: 0)

If half (or nearly half) of the rows are duplicates, I would create a temporary table and re-insert the data:

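create temporary table tempt as
    -- keep every row that is already stored in canonical order
    select location1, location2, distance
    from mydata t
    where location1 < location2
    union all
    -- keep reverse-ordered rows only when no mirrored row exists
    select location1, location2, distance
    from mydata t
    where t.location1 > t.location2
      and not exists (select 1 from mydata t2 where t2.location1 = t.location2 and t2.location2 = t.location1);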

truncate table mydata;

insert into mydata(location1, location2, distance)
    select location1, location2, distance
    from tempt;

For performance, you will want an index on mydata(location1, location2):

create index idx_mydata_location1_location2 on mydata(location1, location2)
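
Once the table has been deduplicated, a quick sanity check along these lines may be useful. It is a sketch that reuses the mydata table name from this answer; the alias names and the column label remaining_mirrored_pairs are made up for illustration. It counts pairs that are still stored in both directions, so the expected result after a successful cleanup is 0.

```sql
-- Count pairs that still exist in both directions.
-- The filter a.location1 < a.location2 counts each mirrored pair once.
SELECT COUNT(*) AS remaining_mirrored_pairs
FROM mydata a
JOIN mydata b
  ON b.location1 = a.location2
 AND b.location2 = a.location1
WHERE a.location1 < a.location2;
```

The join condition matches the (location1, location2) index above, so the check should stay reasonably cheap even at around 1M rows.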