So, I'm trying to dump a large amount of data from Postgres (on Heroku) to disk as CSV via pandas, on a machine that doesn't have enough memory to hold it all (the entire table won't fit in memory).
I thought I could stream it straight into a CSV file 100 rows at a time with the following code:
import psycopg2
import pandas as pd
from sqlalchemy import create_engine
connuri = "MY_DATABASE_CONNECTION_URL"
engine = create_engine(connuri, execution_options={'stream_results': True})
raw_engine = engine.raw_connection()
sql = "SELECT * FROM giant_table;"
firstchunk = True
for chunk in pd.read_sql_query(sql, raw_engine, chunksize=100):
    if firstchunk:
        chunk.to_csv("bigtable.csv", index=False)
        firstchunk = False
    else:
        chunk.to_csv("bigtable.csv", mode="a", index=False, header=False)
based mainly on this answer and this one.
However, it still runs out of memory.
Judging from the traceback, it seems to be streaming the data correctly, but it runs out of memory when it tries to write to the file, i.e.:
Traceback (most recent call last):
  File "download_table.py", line 22, in <module>
    chunk.to_csv("bigtable.csv", mode="a", index=False, header=False)
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/core/frame.py", line 1745, in to_csv
    formatter.save()
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 171, in save
    self._save()
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 286, in _save
    self._save_chunk(start_i, end_i)
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 313, in _save_chunk
    self.cols, self.writer)
  File "pandas/_libs/writers.pyx", line 84, in pandas._libs.writers.write_csv_rows
MemoryError
That strikes me as odd. I thought append mode would just put the file cursor at the end of the file, without having to read the whole thing into memory and then insert the data at the cursor's position. But maybe it needs to read the whole file to do that (?!).
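(Just to illustrate what I mean, here is a trivial standalone snippet, not from my actual script and with a made-up filename, showing how I understood append mode: the write should only seek to the end of the existing file, however large it is, without pulling its contents into memory.)

# Standalone illustration (hypothetical filename): my understanding is that an
# append-mode write only seeks to the end of the existing file and writes there,
# so memory use should not depend on how big the file already is.
with open("some_existing_big_file.csv", "a") as f:
    f.write("one,small,extra,row\n")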
I tried reducing the chunk size to 10 and creating the connection with engine.connect() instead of engine.raw_connection(), in case the problem was that I wasn't actually streaming the data from the db after all. That didn't work either.
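For reference, that attempt looked roughly like the following (reconstructed from memory rather than copied verbatim; it's the same query and writing loop as above, just with a smaller chunk size and a SQLAlchemy Connection instead of the raw DBAPI connection):

import pandas as pd
from sqlalchemy import create_engine

connuri = "MY_DATABASE_CONNECTION_URL"
# stream_results so a server-side cursor streams rows instead of fetching them all
engine = create_engine(connuri, execution_options={'stream_results': True})

sql = "SELECT * FROM giant_table;"
firstchunk = True
with engine.connect() as conn:  # Connection instead of engine.raw_connection()
    for chunk in pd.read_sql_query(sql, conn, chunksize=10):
        if firstchunk:
            chunk.to_csv("bigtable.csv", index=False)  # write header once
            firstchunk = False
        else:
            chunk.to_csv("bigtable.csv", mode="a", index=False, header=False)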
I also tried just opening a single file handle and writing chunk by chunk, like
with open("attempted_download.csv", "w") as csv:
for chunk in pd.read_sql_query(sql, raw_engine, chunksize=10):
if firstchunk:
mystring = chunk.to_csv(index=False)
csv.write(mystring)
firstchunk = False
else:
mystring = chunk.to_csv(index=False, header=False)
csv.write(mystring)
but I get the same memory error. Am I missing something obvious here?
EDIT:
I also just tried saving to a bunch of separate files, i.e.:
def countermaker():
    count = 0
    def counter():
        nonlocal count
        count += 1
        return "partial_csv{}.csv".format(count)
    return counter

counter = countermaker()
firstchunk = True
for chunk in pd.read_sql_query(sql, raw_engine, chunksize=10):
    if firstchunk:
        chunk.to_csv(counter(), index=False)
        firstchunk = False
    else:
        chunk.to_csv(counter(), index=False, header=False)
and got exactly the same error, although it did manage to create 578 files that way.