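When I use tempfile.TemporaryFile as an intermediate while writing a pandas DataFrame to CSV (to_csv), some rows are silently dropped from the end of the DataFrame.
The number of dropped rows depends on the length and width of the DataFrame and looks unpredictable.
In fact, if the DataFrame is short enough, all rows are dropped and an empty file is produced (no rows are written to disk).
The evidence suggests this is a bug in pandas, but I may be making a mistake in the pure Python code.
See the code and results below (Python 3.5, pandas 0.20):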
import shutil
from io import TextIOWrapper, BufferedWriter
from tempfile import TemporaryFile

import pandas as pd


def pd_bug(size, rep_str):
    out_path = "bug_{}_{}".format(rep_str, size)
    diags = {"str": [rep_str] * size, "num": list(range(size))}
    df = pd.DataFrame(diags)
    # Write the CSV through text and buffer wrappers around a temporary file
    with TemporaryFile() as f, BufferedWriter(f) as bw, TextIOWrapper(bw) as iow:
        df.to_csv(iow, index=False)
        # Copy the temporary file to out_path while the wrappers are still open
        with open(out_path, "w+b") as t:
            f.seek(0)
            shutil.copyfileobj(f, t)
Results:
pd_bug(100, "abc") # tail -n1 bug_abc_100 -> <EMPTY>
pd_bug(1000, "abc") # tail -n1 bug_abc_1000 -> <EMPTY>
pd_bug(2000, "abc") # tail -n1 bug_abc_2000 -> 1943,abc <57 dropped>
pd_bug(5000, "abc") # tail -n1 bug_abc_5000 -> 4676,abc <324 dropped>
pd_bug(10000, "abc") # tail -n1 bug_abc_10000 -> 9231,abc <769 dropped>
pd_bug(50000, "abc") # tail -n1 bug_abc_50000 -> 49488,abc <512 dropped>
pd_bug(100, "pandas") # tail -n1 bug_pandas_100 -> <EMPTY>
pd_bug(1000, "pandas") # tail -n1 bug_pandas_1000 -> 754,pandas <246 dropped>
pd_bug(2000, "pandas") # tail -n1 bug_pandas_2000 -> 1458,pandas <542 dropped>
pd_bug(5000, "pandas") # tail -n1 bug_pandas_5000 -> 4873,pandas <127 dropped>
pd_bug(10000, "pandas") # tail -n1 bug_pandas_10000 -> 9654,pandas <346 dropped>
pd_bug(50000, "pandas") # tail -n1 bug_pandas_50000 -> 49433,pandas <567 dropped>
This only seems to happen when a file-like object is passed in; the standard call with a path,
df.to_csv("/a/path/ok")
works fine and no rows are dropped.
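
To check whether pandas is involved at all, here is a minimal pandas-free sketch. It assumes the missing rows are text still held in the TextIOWrapper and BufferedWriter buffers when the temporary file is copied, which has not been confirmed against to_csv itself. The helper copy_rows and its flush_first parameter are hypothetical names made up for this illustration, and plain write calls stand in for to_csv.

```python
import shutil
from io import BufferedWriter, BytesIO, TextIOWrapper
from tempfile import TemporaryFile


def copy_rows(size, flush_first):
    """Write `size` CSV-like rows through the same wrappers, then copy the raw file.

    Hypothetical helper: returns the bytes that actually reached the copy.
    """
    dest = BytesIO()
    with TemporaryFile() as f, BufferedWriter(f) as bw, TextIOWrapper(bw) as iow:
        iow.write("num,str\n")
        for i in range(size):
            iow.write("{},abc\n".format(i))
        if flush_first:
            # Push pending text out of the TextIOWrapper and the BufferedWriter
            # into the temporary file before reading it back.
            iow.flush()
        f.seek(0)
        shutil.copyfileobj(f, dest)
    return dest.getvalue()


if __name__ == "__main__":
    for flush_first in (False, True):
        data = copy_rows(10, flush_first)
        print(flush_first, len(data.splitlines()))
```

In this sketch, copy_rows(10, flush_first=False) returns empty bytes, while flush_first=True returns the header plus all ten rows. If to_csv behaves the same way, calling iow.flush() before f.seek(0) would be the first thing to try.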