I'm working on a project that decrypts a GPG-encrypted CSV, splits each row into two sets of columns, and loads each set into its own Postgres table with the copy_from command.
The source stream is very large, so managing memory appropriately across all of the streams is important.
I'm stuck on two main areas:
1) Ensuring that set_one_stream and set_two_stream have data available when copy_from reads from them, without having to load the entire contents into memory (see the sketch after this list).
2) How best to split the source stream into the two sets.
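To make question 1 concrete: since copy_from only needs read and readline on its source (as noted in the code below), what I picture is a queue-backed file-like object instead of a fully buffered StringIO, roughly like the following. This is an untested sketch; the QueueStream name, the bounded queue size, and the None end-of-data sentinel are just conventions I made up:

import Queue  # `queue` on Python 3

class QueueStream(object):
    # minimal file-like object: copy_from only ever calls read()/readline()
    def __init__(self, maxsize=1000):
        # bounded queue, so the producer blocks instead of buffering the whole file
        self.queue = Queue.Queue(maxsize=maxsize)
        self._done = False

    def write(self, line):
        # producer side: one delimited row per put()
        self.queue.put(line)

    def finish(self):
        # sentinel meaning "no more rows"
        self.queue.put(None)

    def readline(self, size=-1):
        if self._done:
            return ''
        line = self.queue.get()
        if line is None:
            self._done = True
            return ''
        return line

    def read(self, size=-1):
        # short reads are fine for copy_from; '' signals end-of-data
        return self.readline(size)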
Here's a rough, non-working version of what I'm attempting:
import csv
import gnupg
import psycopg2
from StringIO import StringIO

DELIMITER = '|'
SOURCE_FILE = open('/large.csv.gpg', 'rb')
TABLE_ONE = 'set_one'
TABLE_TWO = 'set_two'
SET_ONE_END_COL = 5
SET_TWO_BEGIN_COL = 5

gpg_buffer = StringIO()
set_one_stream = StringIO()
set_two_stream = StringIO()

def write_gpg_data(chunk):
    # note: gpg_buffer is reassigned below, so it has to be declared global
    # here or it is "referenced before assignment"
    global gpg_buffer
    # look for a newline in the decrypted chunk. if one exists, combine the
    # buffered partial row with the start of the chunk to build a full row
    if '\n' in chunk:
        index = chunk.index('\n')
        gpg_buffer.write(chunk[0:index])
        csv_row = gpg_buffer.getvalue()
        csv_reader = csv.reader([csv_row], delimiter=DELIMITER)
        # there must be a better way to split each row into two streams?
        for row in csv_reader:
            set_one_row = DELIMITER.join(row[0:SET_ONE_END_COL]) + '\n'
            set_two_row = DELIMITER.join(row[SET_TWO_BEGIN_COL:]) + '\n'
            set_one_stream.write(set_one_row)
            set_two_stream.write(set_two_row)
        gpg_buffer.close()
        gpg_buffer = StringIO()
        # keep whatever follows the newline as the start of the next row
        gpg_buffer.write(chunk[index + 1:])
    else:
        # write the partial row until the next chunk arrives
        gpg_buffer.write(chunk)

gpg_client = gnupg.GPG()
gpg_client.on_data = write_gpg_data
gpg_client.decrypt_file(SOURCE_FILE)

sql_connection = psycopg2.connect(database='csv_test')
with sql_connection as conn:
    with conn.cursor() as curs:
        # fyi, copy_from requires `read` and `readline` be defined
        # on the source streams
        set_one_stream.seek(0)  # without seeking, the streams would be read
        set_two_stream.seek(0)  # from the end, but this also means the whole
                                # decrypted file is sitting in memory
        curs.copy_from(set_one_stream, TABLE_ONE, delimiter=DELIMITER, null='')
        curs.copy_from(set_two_stream, TABLE_TWO, delimiter=DELIMITER, null='')
I think I may need threads and locking to somehow coordinate the writing and reading of the data, but I can't seem to come up with a mental model of how that would work.
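The furthest I've gotten is a rough shape like the one below: the decrypt callback writes split rows into two bounded queues, and two worker threads each run copy_from against a queue-backed stream. It is an untested sketch that reuses the constants, write_gpg_data, and the hypothetical QueueStream helper from above:

import threading

import gnupg
import psycopg2

set_one_stream = QueueStream()  # hypothetical helper sketched earlier
set_two_stream = QueueStream()

def load(stream, table):
    # each copy_from blocks while reading its stream, so each one gets
    # its own connection and its own thread
    conn = psycopg2.connect(database='csv_test')
    with conn:
        with conn.cursor() as curs:
            curs.copy_from(stream, table, delimiter=DELIMITER, null='')
    conn.close()

workers = [
    threading.Thread(target=load, args=(set_one_stream, TABLE_ONE)),
    threading.Thread(target=load, args=(set_two_stream, TABLE_TWO)),
]
for worker in workers:
    worker.start()

# main thread: decrypt and feed the queues. write_gpg_data stays the same
# except that set_one_stream / set_two_stream are now QueueStream objects,
# so their bounded queues throttle decryption when the loaders fall behind
gpg_client = gnupg.GPG()
gpg_client.on_data = write_gpg_data
gpg_client.decrypt_file(SOURCE_FILE)

# signal end-of-data so read()/readline() return '' and copy_from finishes
set_one_stream.finish()
set_two_stream.finish()
for worker in workers:
    worker.join()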
Any ideas on how to approach this would be greatly appreciated.