I have a large amount of data (100 GB) that I want to load into Postgres. I have been reading the documentation, and it suggests dropping indexes and foreign keys before loading:
http://www.postgresql.org/docs/current/interactive/populate.html
I also want a unique constraint on some of the table's fields (i.e. a combination of 3 columns must be unique). How should I load the data?
I can see a few different options:
A) Load it normally through Python or something similar (slow; probably not worth doing).
B) Drop the unique constraint, load the data, then re-apply the constraint (in that case, what happens if duplicates exist?).
C) Load the data into a temporary table (with no unique constraint), do something clever in SQL to remove the duplicates, and copy the result into the main table.
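For option C, the de-duplication step could look something like the following (just a sketch: `staging`, `main_table`, and the columns `a`, `b`, `c`, `payload` are placeholders for the real schema):

```sql
-- Load the raw data into a constraint-free staging table first,
-- e.g. with: COPY staging FROM '/path/to/data.csv' CSV;

-- Then keep one arbitrary row per (a, b, c) combination and
-- insert the result into the main table.
INSERT INTO main_table (a, b, c, payload)
SELECT DISTINCT ON (a, b, c) a, b, c, payload
FROM staging;
```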
Answer 0 (score: 3)
You can load it with pg_bulkload. pg_bulkload supports direct loading: the data does not need to go through the shared buffers, and it supports parallel loading, so it is much faster than even an unlogged table. You can create the unique constraint first and then load with pg_bulkload; it writes the rows that fail to load to a log file and loads only the valid rows, so you can deal with the errors after the load. For example:
wget http://pgfoundry.org/frs/download.php/3566/pg_bulkload-3.1.5.tar.gz
[root@db-172-16-3-150 ~]# export PATH=/home/pg93/pgsql9.3.3/bin:$PATH
[root@db-172-16-3-150 ~]# cd /opt/soft_bak/pg_bulkload-3.1.5
[root@db-172-16-3-150 pg_bulkload-3.1.5]# which pg_config
/home/pg93/pgsql9.3.3/bin/pg_config
[root@db-172-16-3-150 pg_bulkload-3.1.5]# make
[root@db-172-16-3-150 pg_bulkload-3.1.5]# make install
pg93@db-172-16-3-150-> psql
psql (9.3.3)
Type "help" for help.
digoal=# truncate test;
TRUNCATE TABLE
digoal=# create extension pg_bulkload;
pg_bulkload -i /ssd3/pg93/test.dmp -O test -l /ssd3/pg93/test.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -h $PGDATA -p $PGPORT -d $PGDATABASE
[root@db-172-16-3-150 pg93]# cat test.log
pg_bulkload 3.1.5 on 2014-03-28 13:32:31.32559+08
INPUT = /ssd3/pg93/test.dmp
PARSE_BADFILE = /ssd4/pg93/pg_root/pg_bulkload/20140328133231_digoal_public_test.prs.dmp
LOGFILE = /ssd3/pg93/test.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = ,
QUOTE = "\""
ESCAPE = "\""
NULL =
OUTPUT = public.test
MULTI_PROCESS = YES
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /ssd4/pg93/pg_root/pg_bulkload/20140328133231_digoal_public_test.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = NO
0 Rows skipped.
50000000 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
0 Rows replaced with new rows.
Run began on 2014-03-28 13:32:31.32559+08
Run ended on 2014-03-28 13:35:13.019018+08
CPU 1.55s/128.55u sec elapsed 161.69 sec
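The rows rejected for duplicate-key violations end up in the `DUPLICATE_BADFILE` shown in the log, so (assuming that file is still plain CSV) they could be pulled back in afterwards for inspection; something like this hypothetical follow-up:

```sql
-- Sketch: copy the rejected rows into a scratch table for inspection
-- (path taken from DUPLICATE_BADFILE in the log above).
CREATE TABLE test_dups (LIKE test);
COPY test_dups FROM '/ssd4/pg93/pg_root/pg_bulkload/20140328133231_digoal_public_test.dup.csv' CSV;
-- Examine test_dups and decide which rows, if any, to merge back.
```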