Question

我必须每20秒阅读一次CSV。每个CSV包含min。 500到最大60000行。我必须在Postgres表中插入数据，但在此之前我需要检查项是否已经插入，因为很有可能获得重复项。检查唯一性的字段也被编入索引。

因此，我以块的形式读取文件，并使用IN子句获取数据库中已有的项目。

有更好的方法吗？

Answer 1

这应该表现良好：

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0 -- copy layout, but no data

COPY tmp FROM '/absolute/path/to/file' FORMAT csv;

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
LEFT   JOIN tbl USING (tbl_id)
WHERE  tbl.tbl_id IS NULL;

DROP TABLE tmp; -- else dropped at end of session automatically

与this answer密切相关。

Answer 2

首先，为了完整性，我改变了Erwin的代码以使用except

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0 -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' FORMAT csv;

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
except
select *
from tbl

DROP TABLE tmp;

然后我决定自己测试一下。我在9.1中对它进行了测试，其中大部分都是未触及的postgresql.conf。目标表包含1000万行，原始表包含3万行。目标表中已存在15千个。

create table tbl (id integer primary key)
;
insert into tbl
select generate_series(1, 10000000)
;
create temp table tmp as select * from tbl limit 0
;
insert into tmp
select generate_series(9985000, 10015000)
;

我只询问了选择部分的说明。 except版本：

explain
select *
from tmp
except
select *
from tbl
;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..270098.68 rows=200 width=4)
   ->  Append  (cost=0.00..245018.94 rows=10031897 width=4)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..771.40 rows=31920 width=4)
               ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..244247.54 rows=9999977 width=4)
               ->  Seq Scan on tbl  (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)

outer join版本：

explain
select *
from 
    tmp
    left join
    tbl using (id)
where tbl.id is null
;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.00..208142.58 rows=15960 width=4)
   ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
   ->  Index Scan using tbl_pkey on tbl  (cost=0.00..7.80 rows=1 width=4)
         Index Cond: (tmp.id = id)
(4 rows)

检查Postgres表中是否存在记录

2 个答案: