Question

I'm trying to use pivot_table in dask while keeping a sorted index. I have a simple pandas dataframe that looks like this:

import pandas as pd
import dask.dataframe

# make dataframe, first in pandas and then in dask
df = pd.DataFrame({'A':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], 'B': ['a', 'b', 'c', 'a', 'b', 'c', 'a','b', 'c'], 'dist': [0, .1, .2, .1, 0, .3, .4, .1,  0]})

df.sort_values(by='A', inplace=True)
dd = dask.dataframe.from_pandas(df, chunksize=3)  # just for demo's sake, you obviously don't ever want a chunksize of 3
print(dd.known_divisions)  # Here I get True, which means my data is sorted

# now pivot and see if the index remains sorted
dd = dd.categorize('B')
pivot_dd = dd.pivot_table(index='A', columns='B', values='dist')
print(pivot_dd.known_divisions) # Here I get False, which makes me sad

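I'd love to find a way to give pivot_dd a sorted index, but I don't see a sort_index method in dask, and I can't set 'A' as the index without getting a KeyError (it already is the index!).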

In this toy example I could pivot the pandas table first and then sort it. The real application I have in mind won't allow that.

Thanks in advance for any help or suggestions.

Answer 1

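This may not be what you were hoping for, and it may not even be the best answer, but it does seem to work. The first wrinkle is that the pivot operation creates a categorical index for the columns, which is annoying because it gets in the way of reset_index. Converting the columns to a plain list first gets around that, after which you can reset the index and set it again with sorted=True: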

>>> pivot_dd = dd.pivot_table(index='A', columns='B', values='dist')
>>> pivot_dd.columns = list(pivot_dd.columns)
>>> pivot_dd = pivot_dd.reset_index().set_index('A', sorted=True)
>>> pivot_dd.known_divisions
True
