Streaming large data to CSV with Pandas

Time: 2019-01-20 21:50:24

Tags: python postgresql pandas csv

So, I'm trying to dump a large amount of data from Postgres (on Heroku) to disk via Pandas' CSV dump functionality without running out of memory (the whole table doesn't fit in memory).

I thought I could stream it straight into a CSV file 100 rows at a time with the following code:

import psycopg2
import pandas as pd
from sqlalchemy import create_engine

connuri = "MY_DATABASE_CONNECTION_URL"

# stream_results asks for a server-side cursor so the whole result set
# isn't pulled into memory at once
engine = create_engine(connuri, execution_options={'stream_results': True})
raw_engine = engine.raw_connection()

sql = "SELECT * FROM giant_table;"

firstchunk = True

for chunk in pd.read_sql_query(sql, raw_engine, chunksize=100):
    if firstchunk:
        # first chunk creates the file and writes the header
        chunk.to_csv("bigtable.csv", index=False)
        firstchunk = False
    else:
        # subsequent chunks append without repeating the header
        chunk.to_csv("bigtable.csv", mode="a", index=False, header=False)

Based mainly on this answer and this one.

However, it still ran out of memory.

Judging from the traceback, it seems to be streaming the data correctly, but it runs out of memory when trying to write to the file, i.e.:

Traceback (most recent call last):
  File "download_table.py", line 22, in <module>
    chunk.to_csv("bigtable.csv", mode="a", index=False, header=False)
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/core/frame.py", line 1745, in to_csv
    formatter.save()
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 171, in save
    self._save()
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 286, in _save
    self._save_chunk(start_i, end_i)
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 313, in _save_chunk
    self.cols, self.writer)
  File "pandas/_libs/writers.pyx", line 84, in pandas._libs.writers.write_csv_rows
MemoryError

This strikes me as strange. I assumed append mode would just put the cursor at the end of the file, rather than reading the whole thing into memory and inserting the data at the cursor's position. But maybe it needs to read the entire file to do that (?!).
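As a sanity check on that assumption: a plain append in Python just positions at the end of the file without reading it, which is easy to confirm with a minimal sketch:

# opening in append mode seeks to the end; the existing contents are not read
with open("bigtable.csv", "a") as f:
    print(f.tell())   # already positioned at the end of the file
    f.write("extra,row\n")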

I tried reducing the chunk size to 10, and creating the connection with engine.connect() instead of engine.raw_connection(), in case the problem was that I wasn't actually streaming the data from the db at all. That didn't work either.
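One more check I have in mind (not run yet) to confirm the rows really do stream from Postgres: a named (server-side) cursor straight from psycopg2, skipping SQLAlchemy and pandas entirely. A minimal sketch, assuming the same connuri as above:

import psycopg2

conn = psycopg2.connect(connuri)
cur = conn.cursor(name="giant_cursor")  # named cursor => server-side, streamed
cur.itersize = 100                      # rows fetched per network round trip

cur.execute("SELECT * FROM giant_table;")
for i, row in enumerate(cur):
    if i >= 5:  # just peek at a few rows to confirm streaming works
        break
conn.close()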

I also tried opening a single file handle and writing chunk by chunk, as in

with open("attempted_download.csv", "w") as csv:
    for chunk in pd.read_sql_query(sql, raw_engine, chunksize=10):
        if firstchunk:
            mystring = chunk.to_csv(index=False)
            csv.write(mystring)
            firstchunk = False
        else:
            mystring = chunk.to_csv(index=False, header=False)
            csv.write(mystring)

But I got the same memory error. Am I missing something obvious here?
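One variant I haven't tried yet is passing the open handle to to_csv directly, which would at least avoid materializing each chunk as an intermediate string (just a sketch, not verified against this table):

firstchunk = True
with open("attempted_download.csv", "w") as f:
    # to_csv accepts a file-like object, so each chunk is written straight
    # to the handle; the header is emitted only for the first chunk
    for chunk in pd.read_sql_query(sql, raw_engine, chunksize=10):
        chunk.to_csv(f, index=False, header=firstchunk)
        firstchunk = False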

Edit

I also just tried saving to a bunch of separate files, i.e.:

def countermaker():
    count = 0
    def counter():
        nonlocal count
        count += 1
        return "partial_csv{}.csv".format(count)
    return counter

counter = countermaker()  # yields partial_csv1.csv, partial_csv2.csv, ...

firstchunk = True
for chunk in pd.read_sql_query(sql, raw_engine, chunksize=10):
    if firstchunk:
        chunk.to_csv(counter(), index=False)
        firstchunk = False
    else:
        chunk.to_csv(counter(), index=False, header=False)

and got exactly the same error, although it did manage to create 578 files that way.
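The fallback I'm now considering is skipping pandas for the export entirely and letting Postgres produce the CSV via COPY, using psycopg2's copy_expert, which just relays bytes to the file handle. A sketch, assuming the same connuri:

import psycopg2

conn = psycopg2.connect(connuri)
cur = conn.cursor()
with open("bigtable.csv", "w") as f:
    # COPY formats the CSV server-side; the client only streams the bytes out
    cur.copy_expert("COPY giant_table TO STDOUT WITH (FORMAT CSV, HEADER)", f)
conn.close()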

0 Answers:

No answers yet.