I have a table with about 100,000 rows, and I want to delete the duplicate records.
To do this I am using a JOIN.
Query:
DELETE c1 FROM contacts c1
INNER JOIN contacts c2
WHERE
c1.id > c2.id AND
c1.email = c2.email;
However, when I run the query above, the system hangs because of the cross join.
What should I do?
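Answer 0: (score: 1)
You are using the INNER join as a CROSS join. Put the join conditions in an ON clause instead of a WHERE clause.
Your code should be: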
DELETE c1
FROM contacts c1 INNER JOIN contacts c2
ON c1.id > c2.id AND c1.email = c2.email;
You can also try getting the minimum id for each email and deleting every row whose id is not in that set:
DELETE FROM contacts
WHERE id NOT IN (
SELECT t.id FROM (
SELECT MIN(id) id FROM contacts
GROUP BY email
) t
);
Also create an index on the column email if one does not already exist.
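As a rough way to try the minimum-id approach on a small copy of the data before touching the real table, here is a minimal sketch. It assumes SQLite through Python's sqlite3 module rather than MySQL, and the function name dedupe_keep_min_id, the index name idx_contacts_email, and the sample rows are all hypothetical.

```python
import sqlite3


def dedupe_keep_min_id(conn):
    """Delete every row whose id is not the smallest id for its email."""
    # Create the index on email if it does not exist yet.
    # The index name is hypothetical.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_contacts_email ON contacts (email)"
    )
    cur = conn.execute(
        """
        DELETE FROM contacts
        WHERE id NOT IN (
            SELECT t.id FROM (
                SELECT MIN(id) AS id FROM contacts GROUP BY email
            ) t
        )
        """
    )
    conn.commit()
    return cur.rowcount  # number of rows removed


# Hypothetical sample data: three emails, two of them duplicated.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO contacts (id, email) VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "b@example.com"),
        (3, "a@example.com"),
        (4, "a@example.com"),
        (5, "b@example.com"),
        (6, "c@example.com"),
    ],
)

deleted = dedupe_keep_min_id(conn)
remaining = conn.execute(
    "SELECT id, email FROM contacts ORDER BY id"
).fetchall()
print(deleted, remaining)
```

Only the lowest id per email survives. Timings on SQLite say little about MySQL, so this checks the logic only, not the performance.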
Answer 1: (score: 0)
Try using aggregation to get the minimum id, and then delete everything except that. Assuming id is unique:
delete c
from contacts c left join
(select c2.email, min(c2.id) as min_id
from contacts c2
group by c2.email
) c2
on c2.min_id = c.id
where c2.min_id is null;
This deletes every row in contacts whose id is not the minimum id for its email. It is also compatible with the MySQL limitations on delete statements.
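Before running a delete like this, the same left join can be used in a SELECT to preview which ids would be removed. The sketch below assumes SQLite through Python's sqlite3 module (SQLite has no multi-table DELETE, so only the preview is shown); the function name ids_to_delete, the alias m, and the sample rows are hypothetical.

```python
import sqlite3


def ids_to_delete(conn):
    """Return the ids that are not the minimum id for their email."""
    rows = conn.execute(
        """
        SELECT c.id
        FROM contacts c
        LEFT JOIN (
            SELECT c2.email, MIN(c2.id) AS min_id
            FROM contacts c2
            GROUP BY c2.email
        ) m ON m.min_id = c.id
        WHERE m.min_id IS NULL   -- no match means c.id is not a minimum
        ORDER BY c.id
        """
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical sample data: three emails, two of them duplicated.
preview_conn = sqlite3.connect(":memory:")
preview_conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
)
preview_conn.executemany(
    "INSERT INTO contacts (id, email) VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "b@example.com"),
        (3, "a@example.com"),
        (4, "a@example.com"),
        (5, "b@example.com"),
        (6, "c@example.com"),
    ],
)

doomed = ids_to_delete(preview_conn)
print(doomed)
```

The preview leaves the table unchanged, so it can be compared against the expected duplicates before the real delete is issued.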