ValueError: Not all divisions are known, can't align partitions error on dask dataframe

Time: 2017-07-11 09:32:02

Tags: python dataframe dask dask-distributed

I have a pandas dataframe with the following columns:

user_id user_agent_id requests

All columns contain integers. I want to perform some operations on them and run those operations using a dask dataframe. This is what I do:

user_profile = cache_records_dataframe[['user_id', 'user_agent_id', 'requests']] \
    .groupby(['user_id', 'user_agent_id']) \
    .size().to_frame(name='appearances') \
    .reset_index() # I am not sure I can run this on dask dataframe

import dask.dataframe as df  # assumption: `df` is the question's alias for dask.dataframe

user_profile_ddf = df.from_pandas(user_profile, npartitions=4)
user_profile_ddf['percent'] = user_profile_ddf.groupby('user_id')['appearances'] \
    .apply(lambda x: x / x.sum(), meta=float)  # Percentage of appearances within each user group

But I get the following error:

raise ValueError("Not all divisions are known, can't align "
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

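Am I doing something wrong? In pure pandas this works fine, but it is slow with many rows (although they fit in memory), so I want to parallelize the computation.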
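One possible reason the pure pandas version is slow is that apply with a Python lambda is called once per group. A minimal sketch of a vectorized alternative follows, assuming the goal is each row's share of its user's total appearances. The sample data is hypothetical and only stands in for user_profile from the code above.

```python
import pandas as pd

# Hypothetical sample data standing in for `user_profile` from the question.
user_profile = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "user_agent_id": [10, 11, 10, 12, 13],
    "appearances": [3, 1, 2, 2, 4],
})

# transform("sum") returns one value per original row (the total of that
# row's group), so the division below is a plain element-wise operation.
group_totals = user_profile.groupby("user_id")["appearances"].transform("sum")
user_profile["percent"] = user_profile["appearances"] / group_totals

print(user_profile)
```

Because the result of transform keeps the original index, no per-group Python function call and no realignment step is involved.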

1 Answer:

Answer 0 (score: 0)

When creating the dask dataframe, add reset_index():

user_profile_ddf = df.from_pandas(user_profile, npartitions=4).reset_index()