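I recently built a data warehouse on Postgres. One particular table holds about 29 million rows.
I am trying to identify identical rows through a generated MD5 hash. The problem is that it takes more than a day to process and remove the duplicates. All of the columns involved are indexed.
Query: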
DELETE FROM
elos_sched_2 es
WHERE
es.SCHED_ID IN
( SELECT
MIN(SCHED_ID)
FROM
ELOS_SCHED_2
GROUP BY
HASHID
HAVING
COUNT(1) > 1 )
This is the EXPLAIN output for the query:
Delete on elos_sched_2 es (cost=7190318.45..7191769.30 rows=11673374 width=38)
-> Nested Loop (cost=7190318.45..7191769.30 rows=11673374 width=38)
-> HashAggregate (cost=7190317.88..7190319.88 rows=200 width=40)
Group Key: "ANY_subquery".min
-> Subquery Scan on "ANY_subquery" (cost=6618114.99..7152680.62 rows=15054907 width=40)
-> GroupAggregate (cost=6618114.99..7002131.55 rows=15054907 width=41)
Group Key: elos_sched_2.hashid
Filter: (count(1) > 1)
-> Sort (cost=6618114.99..6676481.86 rows=23346749 width=41)
Sort Key: elos_sched_2.hashid
-> Seq Scan on elos_sched_2 (cost=0.00..1606287.49 rows=23346749 width=41)
-> Index Scan using idx_sched_id_elos_sched_2 on elos_sched_2 es (cost=0.56..8.58 rows=1 width=14)
Index Cond: (sched_id = "ANY_subquery".min)
Is there any way to improve on this?
Thanks!
Answer 0 (score: 0)
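This should be faster. First extract the SCHED_ID values and materialize them, then delete them.
If your Postgres version is below 12, remove the MATERIALIZED keyword from the query, because in those versions a CTE is always materialized.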
with delete_list(id_to_delete) as MATERIALIZED
(
select MIN(SCHED_ID)
from elos_sched_2
group by HASHID
having COUNT(1) > 1
)
delete from elos_sched_2
where SCHED_ID in (select id_to_delete from delete_list);
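Edit
By the way, what if a hashid has more than one extra copy? Deleting only MIN(SCHED_ID) removes one row per hashid per run, so the query logic should be inverted: keep one row per hashid and delete the rest.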
with keep_list(id_to_keep) as MATERIALIZED
(
select MAX(sched_id)
from elos_sched_2
group by hashid
)
delete from elos_sched_2
where sched_id NOT in (select id_to_keep from keep_list);
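To check the inverted logic on a small scale before running it against the real table, something like the following sketch may help. The temp table demo_sched and its sample rows are made up for illustration; only the column names follow the question. MATERIALIZED is left out so the script also runs on versions below 12.

```sql
-- Hypothetical scratch table; not part of the original schema.
create temp table demo_sched (
    sched_id int primary key,
    hashid   text not null
);

insert into demo_sched (sched_id, hashid) values
    (1, 'a'), (2, 'a'), (3, 'a'),  -- three copies of the same hash
    (4, 'b'),                      -- already unique
    (5, 'c'), (6, 'c');            -- two copies

-- Keep the highest sched_id per hashid and delete everything else.
with keep_list(id_to_keep) as
(
    select max(sched_id)
    from demo_sched
    group by hashid
)
delete from demo_sched
where sched_id not in (select id_to_keep from keep_list);

-- One row per hashid remains: (3, 'a'), (4, 'b'), (6, 'c').
select sched_id, hashid
from demo_sched
order by sched_id;
```

With the MIN-based delete list, a single run on the same data would remove only rows 1 and 5 and leave hashid 'a' still duplicated. Note also that NOT IN matches nothing if the subquery returns a NULL, so this form assumes sched_id is never NULL.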