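I'm using Dask to process several large CSV files. I need to mask these files according to some condition and save only the rows that were not masked. This is what I currently do: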
from dask import dataframe as dd
df = dd.read_csv(filename)
msk = ... # some condition
df = df.mask(msk).compute()
df.to_csv("{}_sample.csv".format(filename), index=False)
The masking works, but the resulting file still contains the masked rows as empty rows, i.e.:
...
18.702003,0.005,79.428,9.999001250124936,0.5203728231202968,0.2673634806190893,-0.58664254749603
19.102915,0.069,77.81,9.999238070973211,-0.6233755821087494,0.3886258651317274,-3.88229321744741
,,,,,,,,,,,,
,,,,,,,,,,,,
,,,,,,,,,,,,
,,,,,,,,,,,,
20.388945,0.08199999999999999,77.50999999999998,9.999336227970336,0.35936464745549523,1.23090232
,,,,,,,,,,,,
...
I looked at the to_csv function, but I can't see an option to drop these empty/masked rows.
Answer 0 (score: 1)
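There is no need to call .compute() (which loads the masked dataframe into memory). Note that mask replaces the values in the matching rows with NaN rather than removing those rows, which is why they are written out as empty lines. You can use standard Pandas syntax to subset the relevant rows instead.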
df = df[~msk] # no need to call compute here (this is to drop masked rows)
df.to_csv("{}_sample_*.csv".format(filename), index=False) # the * is needed for multiple partitions
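To make the difference concrete, here is a minimal sketch in plain pandas (Dask DataFrame follows the same semantics for mask and boolean indexing, as far as I know). The sample data, the column names x and y, and the helper to_csv_text are made up for illustration and are not from the question.

```python
import io

import pandas as pd

# Hypothetical sample data; the real files have different columns.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 40.0]})
msk = df["x"] > 2.0  # stand-in for "some condition": rows to discard


def to_csv_text(frame):
    """Illustrative helper: write a frame to an in-memory CSV string."""
    buf = io.StringIO()
    frame.to_csv(buf, index=False)
    return buf.getvalue()


# mask keeps every row and blanks the matching ones with NaN,
# so they end up in the CSV as lines made only of separators.
masked_csv = to_csv_text(df.mask(msk))

# Boolean indexing removes the matching rows entirely.
filtered_csv = to_csv_text(df[~msk])

# Alternative if the frame has already been masked: drop all-NaN rows.
dropped_csv = to_csv_text(df.mask(msk).dropna(how="all"))

print(masked_csv)
print(filtered_csv)
```

With Dask the filtered frame stays lazy until to_csv runs, and the * in the file name yields one output file per partition. If a single output file is preferred, Dask's to_csv also accepts single_file=True, at the cost of writing the partitions sequentially.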