Question

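I have a large (about 60 million rows) fixed-width source file with about 1,800 records per row.

I need to load this file into 5 different tables on an instance of Postgres 8.3.9.

My dilemma is that, because the file is so large, I would like to read it only once.

This is straightforward with INSERT or COPY as normal, but I am trying to get a load speed boost by putting my COPY FROM statements in a transaction that includes a TRUNCATE. That avoids logging, which is supposed to speed up the load (according to http://www.cirrusql.com/node/3). As I understand it, you can disable logging in Postgres 9.x, but I don't have that option on 8.3.9.

The script below has me reading the input file twice, which I want to avoid... Any ideas on how I could accomplish this by reading the input file only once? It doesn't have to be bash. I also tried psycopg2, but couldn't figure out how to stream file output into the COPY statement as I'm doing below. I can't COPY FROM the file directly because I need to parse it on the fly.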

#!/bin/bash

table1="copytest1"
table2="copytest2"

#note: $1 refers to the first argument used when invoking this script
#which should be the location of the file one wishes to have python
#parse and stream out into psql to be copied into the data tables

( echo 'BEGIN;'
  echo "TRUNCATE TABLE ${table1};"
  echo "COPY ${table1} FROM STDIN WITH NULL AS '';"
  python2.5 ~/parse_${table1}.py < "$1"
  echo '\.'
  echo "TRUNCATE TABLE ${table2};"
  echo "COPY ${table2} FROM STDIN WITH NULL AS '';"
  python2.5 ~/parse_${table2}.py < "$1"
  echo '\.'
  echo 'COMMIT;'
) | psql -U postgres -h chewy.somehost.com -p 5473 -d db_name

exit 0

Thanks!

Answer 1

Why use COPY for the second table? I would assume that doing a:

INSERT INTO table2 (...)
SELECT ...
FROM table1;

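is faster than using COPY.

Edit
If you need to import different rows into different tables, but from the same source file, it may be faster to insert everything into a staging table and then insert the rows from there into the target tables:

Import the *whole* text file into one staging table: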

COPY staging_table FROM STDIN ...;

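After that step, the whole input file is in the table staging_table.

Then copy the rows from the staging table to the individual target tables by selecting only those that qualify for the corresponding table:

INSERT INTO table_1 (...)
SELECT ...
FROM staging_table
WHERE (conditions for table_1);

INSERT INTO table_2 (...)
SELECT ...
FROM staging_table
WHERE (conditions for table_2);

Of course, this is only feasible if you have enough space in your database to hold the staging table.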
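
As a rough illustration of the staging approach for a fixed-width file, the INSERT ... SELECT statements could be generated from a field layout. This is only a sketch under assumptions not stated in the thread: the staging table is assumed to have a single text column called line holding the raw row, the layout tuples and the build_insert helper are hypothetical, and the code is written for Python 3 rather than the Python 2.5 used in the question. It relies on the SQL substr(string, from, count) function to slice each field.

```python
def build_insert(target, fields, condition,
                 staging="staging_table", raw_col="line"):
    """Build an INSERT ... SELECT that slices fixed-width fields.

    target    -- name of the destination table
    fields    -- list of (column_name, start, width); start is 1-based,
                 matching SQL substr()
    condition -- SQL text for the WHERE clause
    staging   -- staging table name (assumed, see the note above)
    raw_col   -- text column holding the raw line (assumed)

    Names are interpolated as-is, so only use this with a trusted,
    hand-written layout, never with user input.
    """
    columns = ", ".join(name for name, _, _ in fields)
    slices = ", ".join(
        "substr({0}, {1}, {2})".format(raw_col, start, width)
        for _, start, width in fields
    )
    return "INSERT INTO {0} ({1}) SELECT {2} FROM {3} WHERE {4};".format(
        target, columns, slices, staging, condition
    )


if __name__ == "__main__":
    # Hypothetical layout: a 5-character id followed by a 10-character name.
    layout = [("id", 1, 5), ("name", 6, 10)]
    print(build_insert("table_1", layout, "substr(line, 1, 1) = 'A'"))
```

The sliced values are still text, so real columns would probably need casts and trimming added to each expression.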

Answer 2

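You could use named pipes instead of your anonymous pipe. With this approach, your python script can fill the tables with the corresponding data through separate psql processes.

Create the pipes: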

mkfifo fifo_table1
mkfifo fifo_table2

Run the psql instances:

psql db_name < fifo_table1 &
psql db_name < fifo_table2 &

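Your python script would then look roughly like this (pseudocode):

SQL_BEGIN = """
BEGIN;
TRUNCATE TABLE %s;
COPY %s FROM STDIN WITH NULL AS '';
"""
# COPY data must end with a line containing only \. before any further SQL
SQL_END = "\\.\nCOMMIT;\n"

fifo1 = open('fifo_table1', 'w')
fifo2 = open('fifo_table2', 'w')

bigfile = open('mybigfile', 'r')

# write() instead of print, so no empty line is sent as the first COPY row
fifo1.write(SQL_BEGIN % ('table1', 'table1')) # ugly, with python2.6 you could use .format()-Syntax
fifo2.write(SQL_BEGIN % ('table2', 'table2'))

for line in bigfile:
    # parse() and belongs_to_table1() are placeholders for your own code,
    # which builds the COPY row and decides where the data belongs
    data = parse(line)
    if belongs_to_table1(line):
        print >> fifo1, data
    else:
        print >> fifo2, data

fifo1.write(SQL_END)
fifo2.write(SQL_END)

fifo1.close()
fifo2.close()
bigfile.close()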

Maybe this is not the most elegant solution, but it should work.
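
The single-pass routing idea above can be tried out without a database by writing to in-memory streams in place of the FIFOs. The following is a minimal sketch, assuming Python 3 (the thread itself uses Python 2.5); split_once and pick_table are hypothetical names, and the routing rule in the demo is invented purely for illustration.

```python
import io

COPY_HEADER = (
    "BEGIN;\n"
    "TRUNCATE TABLE {table};\n"
    "COPY {table} FROM STDIN WITH NULL AS '';\n"
)
# A line containing only \. ends the COPY data; COMMIT follows it.
COPY_FOOTER = "\\.\nCOMMIT;\n"


def split_once(source, sinks, pick_table):
    """Read `source` a single time and route each line to one sink.

    source     -- iterable of text lines, for example an open file
    sinks      -- dict mapping table name -> writable text stream
                  (an opened FIFO, or io.StringIO for testing)
    pick_table -- hypothetical callback: line -> table name
    """
    for table, sink in sinks.items():
        sink.write(COPY_HEADER.format(table=table))
    for line in source:
        # Make sure every data row ends with exactly one newline.
        row = line if line.endswith("\n") else line + "\n"
        sinks[pick_table(line)].write(row)
    for sink in sinks.values():
        sink.write(COPY_FOOTER)


if __name__ == "__main__":
    # In-memory stand-ins for fifo_table1 and fifo_table2.
    demo_sinks = {"table1": io.StringIO(), "table2": io.StringIO()}
    demo_input = io.StringIO("A\tfoo\nB\tbar\nA\tbaz\n")
    split_once(
        demo_input,
        demo_sinks,
        lambda line: "table1" if line.startswith("A") else "table2",
    )
    print(demo_sinks["table1"].getvalue())
```

With real FIFOs, the sinks would be the files returned by open('fifo_table1', 'w') and so on, opened after the psql readers have been started, since opening a FIFO for writing blocks until a reader is attached.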
