Simple way to reverse a dask distributed dataframe

Date: 2019-04-01 23:06:53

Tags: python dask

I tried to reverse the order of a dask dataframe with [::-1], but I get a NotImplementedError saying that only iloc indexing such as [:, ['foo']] can be used.

For example:

import pandas as pd
import dask.dataframe as dd

tmp = pd.DataFrame(dict(a=[0,1,1,1,0,1,0,1], b=[0,0,0,0,1,0,0,1]))
tmp = dd.from_pandas(tmp, npartitions=4)
tmp[::-1]

How can I easily reverse the order of a sorted dataframe without loading the whole dataframe into memory?

2 answers:

Answer 0: (score: 2)

I finally found a decent way to do this: use the integer index and multiply it by -1.

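import pandas as pd
import dask.dataframe as dd

tmp = pd.DataFrame(dict(a=[0,1,1,1,0,1,0,1], b=[0,0,0,0,1,0,0,1]))
tmp = dd.from_pandas(tmp, npartitions=4)
tmp = tmp.reset_index()            # move the original index into an 'index' column
tmp['index'] = tmp['index'] * -1   # negate it, so ascending order is the reverse of the original
tmp = tmp.set_index('index')       # set_index sorts the rows by the new index
tmp.compute()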

Answer 1: (score: 1)

Here is a solution that leaves the index unchanged:

import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def reverse_pdf(pdf):
    '''delayed function to reverse a pandas dataframe'''
    return pdf[::-1]

# generating testdata
tmp=pd.DataFrame(dict(a=[0,1,1,1,0,1,0,1], b=[0,0,0,0,1,0,0,1]))
tmp_dd=dd.from_pandas(tmp, npartitions=4)

# reversing tmp_dd
ds = tmp_dd.to_delayed() # one delayed object per partition
ds = [reverse_pdf(d) for d in ds] # reverse each partition
ds = list(reversed(ds)) # reverse the order of the partitions (kept as a list for from_delayed)
tmp_dd_reversed = dd.from_delayed(ds) # construct a new dask dataframe