How to do a groupby filter in Dask

Asked: 2019-03-21 21:20:24

Tags: dask

I am trying to take a Dask dataframe, group it by column 'A', and remove the groups that have fewer than MIN_SAMPLE_COUNT rows.

For example, the following code works in pandas:

import pandas as pd
import dask as da

MIN_SAMPLE_COUNT = 1

x = pd.DataFrame([[1,2,3], [1,5,6], [2,8,9], [1,3,5]])
x.columns = ['A', 'B', 'C']

grouped = x.groupby('A')
x = grouped.filter(lambda x: x['A'].count().astype(int) > MIN_SAMPLE_COUNT)

However, in Dask, if I try something similar:

import pandas as pd
import dask.dataframe  # a bare "import dask" does not load the dataframe submodule

MIN_SAMPLE_COUNT = 1

x = pd.DataFrame([[1,2,3], [1,5,6], [2,8,9], [1,3,5]])
x.columns = ['A', 'B', 'C']

x = dask.dataframe.from_pandas(x, npartitions=2)

grouped = x.groupby('A')
x = grouped.filter(lambda x: x['A'].count().astype(int) > MIN_SAMPLE_COUNT)

I get the following error message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\groupby.py in __getattr__(self, key)
   1162         try:
-> 1163             return self[key]
   1164         except KeyError as e:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\groupby.py in __getitem__(self, key)
   1153         # error is raised from pandas
-> 1154         g._meta = g._meta[key]
   1155         return g

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\base.py in __getitem__(self, key)
    274             if key not in self.obj:
--> 275                 raise KeyError("Column not found: {key}".format(key=key))
    276             return self._gotitem(key, ndim=1)

KeyError: 'Column not found: filter'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-55-d8a969cc041b> in <module>()
      1 # Remove sixty second blocks that have fewer than MIN_SAMPLE_COUNT samples.
      2 grouped = dat.groupby('KPI_60_seconds')
----> 3 dat = grouped.filter(lambda x: x['KPI_60_seconds'].count().astype(int) > MIN_SAMPLE_COUNT)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\groupby.py in __getattr__(self, key)
   1163             return self[key]
   1164         except KeyError as e:
-> 1165             raise AttributeError(e)
   1166 
   1167     @derived_from(pd.core.groupby.DataFrameGroupBy)

AttributeError: 'Column not found: filter'

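The error message suggests that the filter method available on pandas groupby objects is not implemented in Dask (I did not find it when searching either).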
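The wording "Column not found: filter" follows from the traceback: the groupby object's `__getattr__` falls back to a column lookup, and the resulting `KeyError` is re-raised as an `AttributeError`. A minimal sketch of that mechanism, using a hypothetical `ToyGroupBy` class that only imitates the pattern visible in the traceback (it is not Dask's actual implementation):

```python
class ToyGroupBy:
    """Hypothetical stand-in for a groupby object; not Dask's real class."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, key):
        # Column selection: unknown names raise KeyError.
        if key not in self.columns:
            raise KeyError("Column not found: {key}".format(key=key))
        return key

    def __getattr__(self, key):
        # Python calls this only when normal attribute lookup fails,
        # i.e. when the class has no method or attribute of that name.
        try:
            return self[key]
        except KeyError as e:
            raise AttributeError(e)


g = ToyGroupBy(["A", "B", "C"])
selected = g.B  # attribute access resolves to the column "B"

try:
    g.filter  # no such method, so the name is treated as a column
except AttributeError as err:
    message = str(err)

print(message)
```

So the error does not mean a column is missing from the data; it means no attribute named `filter` exists on the object, and the column lookup is simply the last thing tried.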

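Is there Dask functionality that does what I am trying to do? I have gone through the Dask API, but nothing stood out as what I need. I am currently using Dask '1.1.1'.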

Thank you for your help.

1 Answer:

Answer 0 (score: 0)

I am fairly new to Dask myself. One way to achieve what you are attempting is as follows:

Dask version: 0.17.3

import pandas as pd
import dask.dataframe as dd

MIN_SAMPLE_COUNT = 1

x = pd.DataFrame([[1,2,3], [1,5,6], [2,8,9], [1,3,5]])
x.columns = ['A', 'B', 'C']
print("x (before):")
print(x)  # still pandas
x = dd.from_pandas(x, npartitions=2)

grouped = x.groupby('A').B.count().reset_index()

grouped = grouped.rename(columns={'B': 'Count'})

y = dd.merge(x, grouped, on=['A'])
y = y[y.Count > MIN_SAMPLE_COUNT]
x = y[['A', 'B', 'C']]
print("x (after):")
print(x.compute())  # needs compute for conversion to pandas df

Output:

x (before):
   A  B  C
0  1  2  3
1  1  5  6
2  2  8  9
3  1  3  5
x (after):
   A  B  C
0  1  2  3
1  1  5  6
1  1  3  5