Here is my SQL data:
id location1 location2 distance
--------------------------------------------
1 Paris Marseille 150km
2 Paris Lyon 200km
3 Paris Strasbourg 300km
4 Paris Toulouse 350km
5 Marseille Paris 150km <-(almost) duplicate of row 1
6 Marseille Lyon 250km
...
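Since the distance Paris -> Marseille equals the distance Marseille -> Paris, I want to delete one of the two duplicate rows.
The table contains nearly 1M rows, and half of them are duplicates. How can I delete these duplicates from a table this large?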
Answer 0 (score: 2)
In this case you can join the table with itself:
DELETE FROM city WHERE id IN (
-- id must be qualified: both c1 and c2 have an id column
SELECT c1.id
FROM city c1, city c2
WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
AND c1.id < c2.id)
I'm assuming your table is named city.
As miszyman pointed out, avoiding the subquery is more efficient:
DELETE c1
FROM city c1, city c2
WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
AND c1.id < c2.id
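To sanity-check the rule before running it on the real table, here is a minimal sketch against the question's six sample rows. It is an assumption-laden stand-in, not the MySQL statement itself: it uses Python's sqlite3 with an in-memory table, and because SQLite has no multi-table DELETE, the same condition is written with EXISTS. Storing distance as a plain integer is also my assumption.

```python
import sqlite3

# Hypothetical in-memory copy of the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city (id INTEGER PRIMARY KEY, location1 TEXT, "
    "location2 TEXT, distance INTEGER)"
)
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?, ?)",
    [
        (1, "Paris", "Marseille", 150),
        (2, "Paris", "Lyon", 200),
        (3, "Paris", "Strasbourg", 300),
        (4, "Paris", "Toulouse", 350),
        (5, "Marseille", "Paris", 150),
        (6, "Marseille", "Lyon", 250),
    ],
)

# Same rule as the self-join: of two mirrored rows, delete the one with the lower id.
conn.execute(
    """
    DELETE FROM city
    WHERE EXISTS (
        SELECT 1 FROM city AS c2
        WHERE c2.location1 = city.location2
          AND c2.location2 = city.location1
          AND city.id < c2.id)
    """
)

# Count mirrored pairs that are still present; 0 means the cleanup is complete.
mirrored_left = conn.execute(
    """
    SELECT COUNT(*) FROM city AS c1
    JOIN city AS c2
      ON c1.location1 = c2.location2 AND c1.location2 = c2.location1
    """
).fetchone()[0]
remaining_ids = [r[0] for r in conn.execute("SELECT id FROM city ORDER BY id")]
```

Note that with c1.id < c2.id it is the older row (id 1) that goes and id 5 that stays; flip the comparison if you would rather keep the lower id.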
Answer 1 (score: 1)
What if you simply select the rows where location1 < location2?
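One caveat with this idea: a row in reverse alphabetical order that has no mirrored counterpart (Paris/Lyon in the sample) would be lost by a bare location1 < location2 filter. A rough sketch of the filter with that case handled, again using Python's sqlite3 and an in-memory table as stand-ins (the table name city and the integer distance column are assumptions):

```python
import sqlite3

# Hypothetical in-memory copy of the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city (id INTEGER PRIMARY KEY, location1 TEXT, "
    "location2 TEXT, distance INTEGER)"
)
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?, ?)",
    [
        (1, "Paris", "Marseille", 150),
        (2, "Paris", "Lyon", 200),
        (3, "Paris", "Strasbourg", 300),
        (4, "Paris", "Toulouse", 350),
        (5, "Marseille", "Paris", 150),
        (6, "Marseille", "Lyon", 250),
    ],
)

# Keep rows already in alphabetical order, plus any row whose mirror is absent.
kept = conn.execute(
    """
    SELECT id, location1, location2, distance
    FROM city AS c1
    WHERE c1.location1 < c1.location2
       OR NOT EXISTS (
            SELECT 1 FROM city AS c2
            WHERE c2.location1 = c1.location2
              AND c2.location2 = c1.location1)
    ORDER BY id
    """
).fetchall()
kept_ids = [row[0] for row in kept]
```

On the sample data only row 1 (Paris/Marseille) is filtered out, because its mirror, row 5, is the one in alphabetical order.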
Answer 2 (score: 0)
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(id INT NOT NULL
,location1 varchar(20) not null
,location2 varchar(20) not null
,distance INT NOT NULL
,UNIQUE(location1,location2)
);
INSERT INTO my_table VALUES
(1 ,'Paris','Marseille',150),
(2 ,'Paris','Lyon',200),
(3 ,'Paris','Strasbourg',300),
(4 ,'Paris','Toulouse',350),
(5 ,'Marseille','Paris',150),
(6 ,'Marseille','Lyon',250);
DELETE x
FROM my_table x
JOIN my_table y
ON y.location2 = x.location1
AND y.location1 = x.location2
AND y.distance = x.distance
AND y.id < x.id;
Query OK, 1 row affected (0.00 sec)
SELECT *
FROM my_table;
+----+-----------+------------+----------+
| id | location1 | location2 | distance |
+----+-----------+------------+----------+
| 6 | Marseille | Lyon | 250 |
| 2 | Paris | Lyon | 200 |
| 1 | Paris | Marseille | 150 |
| 3 | Paris | Strasbourg | 300 |
| 4 | Paris | Toulouse | 350 |
+----+-----------+------------+----------+
Answer 3 (score: 0)
If half (or close to half) of the rows are duplicates, I would create a temporary table and re-insert the data:
create temporary table tempt as
    select location1, location2, distance
    from mydata t
    where location1 < location2
    union all
    -- reverse-ordered rows are kept only when no mirrored row exists
    select location1, location2, distance
    from mydata t1
    where t1.location1 > t1.location2
      and not exists (select 1 from mydata t2 where t2.location1 = t1.location2 and t2.location2 = t1.location1);
truncate table mydata;
insert into mydata(location1, location2, distance)
select location1, location2, distance
from tempt;
For performance, you will want an index on mydata(location1, location2):
create index idx_mydata_location1_location2 on mydata(location1, location2);
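If you want to confirm that the mirrored-row lookup can actually use such an index, you can ask the database for its query plan. The sketch below does this with Python's sqlite3 and EXPLAIN QUERY PLAN purely as an illustration; MySQL has its own EXPLAIN with different output, and its optimizer may choose differently on your data. The table definition here is an assumption.

```python
import sqlite3

# Hypothetical empty table with the same column names as in the answer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mydata (id INTEGER PRIMARY KEY, location1 TEXT, "
    "location2 TEXT, distance INTEGER)"
)
conn.execute(
    "CREATE INDEX idx_mydata_location1_location2 ON mydata(location1, location2)"
)

# This is the lookup the NOT EXISTS subquery performs for each outer row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT 1 FROM mydata WHERE location1 = ? AND location2 = ?",
    ("Marseille", "Paris"),
).fetchall()

# The last column of each plan row is a human-readable description of the step.
plan_text = " ".join(row[3] for row in plan)
uses_index = "idx_mydata_location1_location2" in plan_text
```

If uses_index comes back false on your own engine and data, the subquery is probably scanning the table for every outer row, which is what makes this approach slow at 1M rows.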