Dask: assigning to a DataFrame column by index throws ValueError

Time: 2018-03-22 09:04:36

Tags: python pandas dataframe dask dask-distributed

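I have a transformation pipeline that works on a grouped dataframe. Every function receives a DataFrameGroupBy object and computes some features, which are then stored in a DataFrame. The index of that DataFrame is always the same, because all features are derived from the same DataFrameGroupBy object. The functions look like this: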

def function(group_by_df, features_df=None):
    # actions to perform to group_by_df e.g
    feature_max = group_by_df.column.max() # This is a series object with index the same as group_by_df
    if features_df is not None:
        features_df['feature_name'] = feature_max
    else:
        features_df = feature_max.to_frame(name='feature_name')
    return features_df

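Because this is iterative, features_df is None the first time, so the dataframe is created. On every later iteration, features_df already holds a column for each previous feature. When I try to assign the series produced from group_by_df to features_df, one particular step fails with the following error: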

ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

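Strangely, running the following code:

# feature_series_with_issue is the Dask series whose assignment fails
features_pandas = features_df.compute()
feature_series_with_issue_pandas = feature_series_with_issue.compute()
features_pandas['feature_name'] = feature_series_with_issue_pandas

works. This isolates the failing feature and assigns it to the dataframe built so far, but in pandas, where it works. Am I doing something wrong?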
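As a small illustration of why the pandas version can succeed, here is a sketch with made-up data (the index labels and values are assumptions, not taken from the pipeline). pandas holds the whole index in memory, so column assignment simply aligns the series to the frame by label:

```python
import pandas as pd

# Hypothetical in-memory data: same labels, different order.
features = pd.DataFrame({"feature_one": [True, False]}, index=["a", "b"])
feature_two = pd.Series([0.4, 0.3], index=["b", "a"])

# Assignment aligns on index labels, not on position.
features["feature_two"] = feature_two
result = features["feature_two"].tolist()
print(result)  # [0.3, 0.4]
```

Dask has to perform the same alignment partition by partition, which is where the divisions come in.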

Adding an MCVE:

import pandas as pd
import dask.dataframe as dd

raw_data = pd.DataFrame({'username': list('ab')*10, 'user_agent': list('cdef')*5, 'method': ['POST']*20, 'dst_port': [80]*20, 'dst': ['1.1.1.1']*20})
# One percent value per user agent (the original list had a stray comma: 0,3)
past = pd.DataFrame({'user_agent': list('cde'), 'percent': [0.3, 0.3, 0.4]})
dask_raw = dd.from_pandas(raw_data, npartitions=4)
dask_past = dd.from_pandas(past, npartitions=4)
dask_past = dask_past.set_index('user_agent')
merged_raw = dask_raw.merge(dask_past, how='left', left_on='user_agent', right_index=True)
grouped_by_df = merged_raw.groupby(['username', 'dst', 'dst_port'])
feature_one = grouped_by_df.apply(lambda x: 'POST' in x.values, meta=('feature_one', '?'))
features = feature_one.to_frame(name='feature_one')
feature_two = grouped_by_df.percent.min()
feature_two = feature_two.fillna(0)
features['feature_two'] = feature_two  # raises the ValueError below

Traceback (most recent call last):
  File "/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-33-92f58e5ed5a0>", line 1, in <module>
    features['feature_two'] = feature_two
  File "/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/dask/dataframe/core.py", line 2319, in __setitem__
    df = self.assign(**{key: value})
  File "/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/dask/dataframe/core.py", line 2498, in assign
    return elemwise(methods.assign, self, *pairs, meta=df2)
  File "/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/dask/dataframe/core.py", line 3028, in elemwise
    args = _maybe_align_partitions(args)
  File "/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/dask/dataframe/multi.py", line 147, in _maybe_align_partitions
    dfs2 = iter(align_partitions(*dfs)[0])
  File "/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/dask/dataframe/multi.py", line 103, in align_partitions
    raise ValueError("Not all divisions are known, can't align "
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

0 answers:

No answers