I'm trying out Dask just for fun and to pick up good practices. After some trial and error I got the hang of Dask Array. Now, with Dask DataFrame, I can't seem to get a DataFrame to scale out across workers in a balanced way.
Here is an example. I'm running some dummy tests on a small dataset on my laptop, with 8 workers (one per CPU core, 2 GB of RAM each).
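(For reference, this is roughly how that worker layout can be requested explicitly; a minimal sketch, since Client() with no arguments picks a similar layout on an 8-core machine on its own:)

from dask.distributed import Client

# Sketch: 8 single-threaded workers with a 2 GB memory limit each.
# These keyword arguments are forwarded to the underlying LocalCluster.
client = Client(n_workers=8, threads_per_worker=1, memory_limit='2GB')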
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client
client = Client()
df = pd.read_csv('dataset.csv',
                 sep=';',
                 index_col=0,
                 parse_dates=[0],
                 infer_datetime_format=True)
# The index is a datetime index.
ddf = dd.from_pandas(df, npartitions=8)
del df
ddf = ddf.persist()
# My workers are evenly filled, ~76 MB each.
# Now I add new columns derived from the date attributes.
ddf['date'] = ddf.index
ddf["yearly"] = ddf['date'].dt.week
ddf["weekly"] = ddf['date'].dt.weekday
ddf["daily"] = ddf['date'].dt.time.astype('str')
ddf = ddf.drop(labels=['date'], axis=1)
ddf = ddf.persist()
# OK, now one of the workers holds 130 MB; the others took no load.
# Let's create more columns by one-hot encoding those attributes.
dum = list()
dum.append(ddf['yearly'].unique().compute())
dum.append(ddf['weekly'].unique().compute())
dum.append(ddf['daily'].unique().compute())
for e in dum:
    for i in e[1:].index:
        ddf['{}_{}'.format(e.name, i)] = (ddf[e.name] == e[i]).astype('int')
ddf = ddf.persist()
During this last persist() call, I can see that only one worker is doing any CPU work, and only its memory fills up (to 337 MiB).
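Besides the dashboard, this is roughly how I check where the persisted partitions actually live (a minimal sketch; has_what() lists the in-memory keys per worker, and map_partitions(len) gives the per-partition row counts):

# Which worker holds which in-memory pieces of the persisted frame?
for worker, keys in client.has_what().items():
    print(worker, len(keys))

# Row count per partition, to confirm the data itself is split evenly.
print(ddf.map_partitions(len).compute())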
As I see it, the columns I create are derived row by row from the original index, which is already partitioned. I expected each partition to grow evenly, each one using the memory of the worker it was assigned to. Am I missing something here, or is this a limitation of Dask DataFrame?
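To make that expectation concrete: I assumed the column assignments above behave like the per-partition function below, so every worker would grow its own partitions in place (a sketch mirroring the logic from the example, not something I ran instead of it):

def add_date_cols(part):
    # part is one partition, i.e. a plain pandas DataFrame.
    part = part.copy()
    date = part.index.to_series()
    part['yearly'] = date.dt.week
    part['weekly'] = date.dt.weekday
    part['daily'] = date.dt.time.astype('str')
    return part

ddf2 = ddf.map_partitions(add_date_cols)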