我必须每20秒阅读一次CSV。每个CSV包含min。 500到最大60000行。我必须在Postgres表中插入数据,但在此之前我需要检查项是否已经插入,因为很有可能获得重复项。检查唯一性的字段也被编入索引。
因此,我以块的形式读取文件,并使用IN子句获取数据库中已有的项目。
有更好的方法吗?
答案 0 :(得分:3)
这应该表现良好:
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0 -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' FORMAT csv;
INSERT INTO tbl
SELECT tmp.*
FROM tmp
LEFT JOIN tbl USING (tbl_id)
WHERE tbl.tbl_id IS NULL;
DROP TABLE tmp; -- else dropped at end of session automatically
与this answer密切相关。
答案 1 :(得分:1)
首先,为了完整性,我改变了Erwin的代码以使用except
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0 -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' FORMAT csv;
INSERT INTO tbl
SELECT tmp.*
FROM tmp
except
select *
from tbl
DROP TABLE tmp;
然后我决定自己测试一下。我在9.1中对它进行了测试,其中大部分都是未触及的postgresql.conf
。目标表包含1000万行,原始表包含3万行。目标表中已存在15千个。
create table tbl (id integer primary key)
;
insert into tbl
select generate_series(1, 10000000)
;
create temp table tmp as select * from tbl limit 0
;
insert into tmp
select generate_series(9985000, 10015000)
;
我只询问了选择部分的说明。 except
版本:
explain
select *
from tmp
except
select *
from tbl
;
QUERY PLAN
----------------------------------------------------------------------------------------
HashSetOp Except (cost=0.00..270098.68 rows=200 width=4)
-> Append (cost=0.00..245018.94 rows=10031897 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..771.40 rows=31920 width=4)
-> Seq Scan on tmp (cost=0.00..452.20 rows=31920 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..244247.54 rows=9999977 width=4)
-> Seq Scan on tbl (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)
outer join
版本:
explain
select *
from
tmp
left join
tbl using (id)
where tbl.id is null
;
QUERY PLAN
--------------------------------------------------------------------------
Nested Loop Anti Join (cost=0.00..208142.58 rows=15960 width=4)
-> Seq Scan on tmp (cost=0.00..452.20 rows=31920 width=4)
-> Index Scan using tbl_pkey on tbl (cost=0.00..7.80 rows=1 width=4)
Index Cond: (tmp.id = id)
(4 rows)