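I noticed that the write performance of to_csv varies a lot, so I did some investigating. It turns out that the structure of the index seems to have a large impact on performance, largely independent of the dtype of the index columns.
If the index is a MultiIndex, writing is much slower than when the index consists of a single column. Given this observation, I am not sure whether I should remove the index from the DataFrame before writing (in case it is a MultiIndex), or whether there is a better way to handle this.
Any comments or suggestions are welcome.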
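To make the "remove the index before writing" idea concrete, here is a minimal sketch of what I have in mind. The helper name `to_csv_flat` is hypothetical (it is not part of pandas), and it assumes every index level has a name that does not clash with an existing column. I have not confirmed that this avoids the slowdown; the snippet only checks that the CSV text comes out the same.

```python
import pandas as pd

def to_csv_flat(df, path_or_buf=None, **kwargs):
    """Write df with its index levels moved back into regular columns.

    Hypothetical helper, not part of pandas. Assumes every index level has a
    name that does not clash with an existing column.
    """
    flat = df.reset_index()  # index levels become the leading columns
    return flat.to_csv(path_or_buf, index=False, **kwargs)

# Tiny stand-in for the real 10-million-row frame
small = pd.DataFrame({
    'no': range(6),
    'subno': [0, 1, 2] * 2,
    'string': ['john', 'paul', 'ringo', 'george', 'john', 'paul'],
    'rand1': [5, 17, 42, 8, 99, 23],
})
indexed = small.set_index(['string', 'no'])

# With no path given, to_csv returns the CSV text, so both variants can be compared
same_output = to_csv_flat(indexed) == indexed.to_csv()
print(same_output)
```

Note that reset_index returns a new DataFrame, so on a frame of this size the approach costs extra memory for the copy.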
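Here are the test results I got on my machine with pandas 0.24.2 for 10 million records (see below for the test set). Note that the number of columns and the data I write are always the same. The only thing that changes is which columns are in the index and which are regular columns: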
# Writing my DataFrame indexed on an int64 value
test_write(df.set_index('no'), '/scratch/work/testwrite.csv')
# takes: 0:00:27.557405
# Writing my DataFrame indexed on a string column
test_write(df.set_index('string'), '/scratch/work/testwrite.csv')
# takes: 0:00:29.037307
# Writing my DataFrame multi-indexed on a string + int64 column
test_write(df.set_index(['string', 'no']), '/scratch/work/testwrite.csv')
# takes 0:21:55.131867, roughly 45 times longer!
# By the way, if the 'no' column comes first, it takes about the same time (I guess the differences are just fluctuations)
# Writing my DataFrame multi-indexed on two int64 columns
test_write(df.set_index(['no', 'subno']), '/scratch/work/testwrite.csv')
# takes 0:22:31.104502 (so it's not the fault of the string column)
# Writing my DataFrame multi-indexed on a category column based on a string value + int64 column
df2 = df.copy()
df2['string'] = df2['string'].astype('category')
test_write(df2.set_index(['string', 'no']), '/scratch/work/testwrite.csv')
# takes 0:23:05.459367 (so category doesn't make a difference here)
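For reference, the slowdown factors can be worked out directly from the timings above. This is only arithmetic on the numbers already listed; the choice of the single string-indexed run as the baseline and the helper name `parse_elapsed` are my own.

```python
from datetime import timedelta

def parse_elapsed(text):
    """Parse an 'H:MM:SS.ffffff' string as printed by timedelta."""
    hours, minutes, seconds = text.split(':')
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))

# Timings copied from the runs above
timings = {
    "int64 index ('no')": '0:00:27.557405',
    "string index ('string')": '0:00:29.037307',
    "MultiIndex ('string', 'no')": '0:21:55.131867',
    "MultiIndex ('no', 'subno')": '0:22:31.104502',
    "MultiIndex (category 'string', 'no')": '0:23:05.459367',
}

# Baseline is an assumption: the single string-indexed run
baseline = parse_elapsed(timings["string index ('string')"])
ratios = {label: parse_elapsed(t) / baseline for label, t in timings.items()}
for label, ratio in ratios.items():
    print(f'{label}: {ratio:.1f}x')
```

All three MultiIndex runs land in the same range, which is why I suspect the index structure rather than the dtypes.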
Test set / code. Note that the resulting CSV takes up less than 200 MB on disk.
import numpy as np
import pandas as pd
from random import choices
from datetime import datetime

pd.__version__
# 0.24.2

main_len = 100000
sub_len = 100
frame_len = main_len * sub_len

df = pd.DataFrame()
df['no'] = range(frame_len)
df['subno'] = list(range(sub_len)) * main_len
df['string'] = choices(['john', 'paul', 'ringo', 'george'], k=frame_len)
df['rand1'] = np.random.randint(1, 100, frame_len)

def test_write(df, file):
    # Time a single to_csv call and print the elapsed wall-clock time
    before = datetime.now()
    df.to_csv(file, chunksize=1000)  # write to the path passed in
    after = datetime.now()
    print(after - before)

In [45]: df.dtypes
Out[45]:
no         int64
subno      int64
string    object
rand1      int64
dtype: object
In [46]: df.index.names
Out[46]: FrozenList([None])
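
For anyone who wants to try this without waiting twenty minutes, below is a scaled-down sketch of the same comparison. The row count (20,000 instead of 10 million), the temporary file location, the fixed random seed, and the helper name `time_to_csv` are my own choices for illustration, so the gap it shows may differ from the full-size runs.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

def time_to_csv(df, chunksize=1000):
    """Return the seconds taken by df.to_csv on a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'testwrite.csv')
        start = time.perf_counter()
        df.to_csv(path, chunksize=chunksize)
        return time.perf_counter() - start

# Scaled-down version of the test set (20,000 rows instead of 10 million)
main_len, sub_len = 200, 100
frame_len = main_len * sub_len
rng = np.random.RandomState(0)
small = pd.DataFrame({
    'no': np.arange(frame_len),
    'subno': np.tile(np.arange(sub_len), main_len),
    'string': rng.choice(['john', 'paul', 'ringo', 'george'], frame_len),
    'rand1': rng.randint(1, 100, frame_len),
})

# Same index layouts as in the timings above (without the category variant)
index_layouts = {
    'no': 'no',
    'string': 'string',
    'string + no': ['string', 'no'],
    'no + subno': ['no', 'subno'],
}
results = {name: time_to_csv(small.set_index(keys)) for name, keys in index_layouts.items()}
for name, seconds in results.items():
    print(f'{name}: {seconds:.3f} s')
```

Raising main_len back toward 100000 moves this closer to the original test set.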