I am reading a CSV file of about 2 GB with Dask. I want to write each of its rows into one of 256 CSV files based on a hash function like the one below.
from dask import dataframe as dd

if __name__ == '__main__':
    df = dd.read_csv('train.csv', header=None, dtype='str')
    df = df.fillna('')  # fillna needs an explicit fill value
    for _, line in df.iterrows():
        number = hash(line[2]) % 256
        with open("{}.csv".format(number), 'a+') as f:
            f.write(', '.join(line) + '\n')  # newline so each row ends up on its own line
This approach takes about 15 minutes. Is there any way to do it faster?
Answer 0 (score: 2)
Since your procedure is dominated by IO, it is very unlikely that Dask would do anything but add overhead in this case, unless your hash function is really really slow. I assume that is not the case.
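To convince yourself that the hashing is cheap, a quick check like the following (a minimal sketch; the one-million-string sample size is just an illustrative guess, not taken from the question) will typically finish in a fraction of a second:

import timeit

# Hypothetical sanity check: hash a million short sample strings and time it.
sample_keys = ["row-{}".format(i) for i in range(10**6)]
seconds = timeit.timeit(lambda: [hash(k) % 256 for k in sample_keys], number=1)
print("hashing 1M keys took {:.3f}s".format(seconds))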
@zwer's solution would look something like this:

files = [open("{}.csv".format(number), 'a+') for number in range(256)]  # 256 buckets, not 255
for _, line in df.iterrows():
    number = hash(line[2]) % 256
    files[number].write(', '.join(line) + '\n')
for f in files:
    f.close()
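If the rows can contain commas or quotes, a variant of the same idea that delegates quoting to the standard-library csv module may be safer (this is my own sketch, not part of @zwer's suggestion):

import csv

# Keep all 256 handles open and let csv.writer handle field quoting/escaping.
files = [open("{}.csv".format(number), 'a+', newline='') for number in range(256)]
writers = [csv.writer(f) for f in files]
for _, line in df.iterrows():
    writers[hash(line[2]) % 256].writerow(line)
for f in files:
    f.close()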
However, your data appears to fit in memory, so you may find much better performance
# assumes df has been brought into memory (e.g. a pandas DataFrame, or df.compute())
for number, group in df.groupby(df.iloc[:, 2].map(hash) % 256):
    group.to_csv("{}.csv".format(number), index=False, header=False)
because you write to each file continuously rather than jumping between them. Depending on your IO device and buffering, the difference can range from negligible to huge.
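For completeness, here is a minimal end-to-end sketch of that approach with plain pandas, assuming the 2GB file really does fit in memory (the file name 'train.csv' is taken from the question):

import pandas as pd

# Load everything into memory once; Dask adds no benefit for a file this size.
df = pd.read_csv('train.csv', header=None, dtype=str)
df = df.fillna('')

# Group rows by the hash bucket of column 2 and write each bucket sequentially.
for number, group in df.groupby(df.iloc[:, 2].map(hash) % 256):
    group.to_csv("{}.csv".format(number), index=False, header=False)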