Question

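I noticed that the write performance of to_csv varies a lot, so I investigated a bit. It turns out that the structure of the index seems to have a large impact on performance, and this is largely independent of the dtype of the index columns.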

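If the index is a MultiIndex, writing is much slower than when the index consists of a single column. Given this observation, I am not sure whether I should drop the index from the DataFrame before writing (in case it is a MultiIndex), or whether there is a better way to handle this.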
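To make the "drop the index before writing" idea concrete, here is a minimal sketch. The helper name to_csv_flat and the tiny sample frame are hypothetical and only for illustration; the sketch assumes all index levels are named, because reset_index() would otherwise generate column names such as level_0 that differ from what to_csv writes for unnamed levels. Whether this actually avoids the slowdown on the full data set is not verified here.

```python
import io

import pandas as pd


def to_csv_flat(df, path_or_buf, **kwargs):
    """Write df after turning its index levels into ordinary columns.

    Hypothetical helper: reset_index() moves every index level into a
    regular column, so to_csv does not have to format a MultiIndex.
    """
    flat = df.reset_index()
    flat.to_csv(path_or_buf, index=False, **kwargs)


# Tiny illustrative frame with the same column names as the test set.
small = pd.DataFrame({
    'string': ['john', 'paul', 'ringo'],
    'no': [0, 1, 2],
    'rand1': [10, 20, 30],
}).set_index(['string', 'no'])

# Write once through the helper and once directly with the MultiIndex.
buf_flat = io.StringIO()
to_csv_flat(small, buf_flat)

buf_indexed = io.StringIO()
small.to_csv(buf_indexed)

# Compare the two results; with named index levels they are expected to match.
same_output = buf_flat.getvalue() == buf_indexed.getvalue()
print(same_output)
```

The column order in the file stays the same because reset_index() puts the former index levels first, in level order, which is also where to_csv places them.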

Any comments or suggestions are welcome.

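Here are the test results I got on my machine with pandas 0.24.2 for 10 million records (see below for the test set). Note that the number of columns and the data I write are always the same; the only thing that changes is which columns are in the index and which are regular columns: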

# Writing my DataFrame indexed on an int64 value
test_write(df.set_index('no'), '/scratch/work/testwrite.csv') 
# takes: 0:00:27.557405

# Writing my DataFrame indexed on a string column
test_write(df.set_index('string') , '/scratch/work/testwrite.csv') 
# takes: 0:00:29.037307

# Writing my DataFrame multi-indexed on a string + int64 column
test_write(df.set_index(['string', 'no']), '/scratch/work/testwrite.csv') 
# takes 0:21:55.131867  ~ 44 times more!
# btw, if the no column comes first, it takes around the same time (guess the differences are just fluctuations)

# Writing my DataFrame multi-indexed on two int64 columns
test_write(df.set_index(['no', 'subno']), '/scratch/work/testwrite.csv') 
# takes 0:22:31.104502 (so it's not the fault of the string column)

# Writing my DataFrame multi-indexed on a category column based on a string value + int64 column
df2= pd.DataFrame(df, copy=True)
df2['string']= df2['string'].astype('category')
test_write(df2.set_index(['string', 'no']), '/scratch/work/testwrite.csv') 
# takes 0:23:05.459367 (so category doesn't make a difference here)

Test set / code. Note that the resulting CSV takes less than 200 MB of disk space.

import numpy as np
import pandas as pd
import itertools as it
from random import choices
from datetime import datetime

pd.__version__
# 0.24.2

main_len= 100000
sub_len= 100
frame_len=main_len * sub_len
#data=dict()
df= pd.DataFrame()
df['no']=    range(frame_len)
df['subno']= list(range(sub_len)) * main_len
df['string']= choices(['john', 'paul', 'ringo', 'george'], k=frame_len)
df['rand1']= np.random.randint(1, 100, frame_len)
#np.random.randn(frame_len)

def test_write(df, file):
    before= datetime.now()
    df.to_csv(file, chunksize=1000)  # use the path passed in instead of a hardcoded one
    after= datetime.now()
    print(after-before)
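
For anyone who wants to reproduce the comparison without waiting 20 minutes, the following is a scaled-down sketch of the same benchmark. The helper time_write, the reduced sizes (200 x 50 rows) and the temporary directory are assumptions made for illustration; the absolute numbers will differ from the ones above, and the gap may look different at this size and on other pandas versions.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_write(frame, path, **to_csv_kwargs):
    """Return the wall-clock seconds needed to write frame to path."""
    start = time.perf_counter()
    frame.to_csv(path, **to_csv_kwargs)
    return time.perf_counter() - start


# Same column layout as the test set, but only 10,000 rows.
n_main, n_sub = 200, 50
n_rows = n_main * n_sub
rng = np.random.default_rng(0)
small_df = pd.DataFrame({
    'no': np.arange(n_rows),
    'subno': np.tile(np.arange(n_sub), n_main),
    'string': rng.choice(['john', 'paul', 'ringo', 'george'], size=n_rows),
    'rand1': rng.integers(1, 100, size=n_rows),
})

# (label, frame, write the index?) for each variant to compare.
variants = [
    ('single index (no)', small_df.set_index('no'), True),
    ('multi index (no, subno)', small_df.set_index(['no', 'subno']), True),
    ('no index (plain columns)', small_df, False),
]

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'testwrite.csv')
    for label, frame, write_index in variants:
        timings[label] = time_write(frame, path, index=write_index,
                                    chunksize=1000)

for label, seconds in timings.items():
    print(f'{label}: {seconds:.3f} s')
```

time.perf_counter() is used instead of datetime.now() because it is a monotonic clock intended for measuring short durations.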

In [45]: df.dtypes
Out[45]: 
no         int64
subno      int64
string    object
rand1      int64
dtype: object
In [46]: df.index.names
Out[46]: FrozenList([None])

pandas to_csv performance for multi-indexed DataFrames

0 answers: