Dask groupby unique on a DataFrame: how?

Time: 2018-12-29 12:55:15

Tags: python-3.x pandas pandas-groupby dask

I have a couple of DataFrames:

import pandas as pd
import numpy as np

router = pd.DataFrame([
    ['2018-01-01 00:00:00', '1', 5],
    ['2018-01-01 00:30:00', '1', 7],
    ['2018-01-01 01:00:00', '1', 25],
    ['2018-01-01 01:30:00', '1', 3],
    ['2018-01-01 00:00:00', '2', 25],
    ['2018-01-01 00:30:00', '2', 7],
    ['2018-01-01 01:00:00', '2', 25],
    ['2018-01-01 01:30:00', '2', 35],
], columns=['time', 'cust_id', 'errors'])
router

devices = pd.DataFrame([
    ['2018-01-01 00:00:00', '1', 'dev_1'],
    ['2018-01-01 00:30:00', '1', 'dev_1'],
    ['2018-01-01 00:30:00', '1', 'dev_2'],
    ['2018-01-01 01:00:00', '1', 'dev_1'],
    ['2018-01-01 01:00:00', '1', 'dev_2'],
    ['2018-01-01 01:00:00', '1', 'dev_3'],
    ['2018-01-01 01:30:00', '1', 'dev_2'],
    ['2018-01-01 00:00:00', '2', 'dev_1'],
    ['2018-01-01 00:00:00', '2', 'dev_2'],
    ['2018-01-01 00:30:00', '2', 'dev_1'],
    ['2018-01-01 01:00:00', '2', 'dev_2'],
    ['2018-01-01 01:00:00', '2', 'dev_3'],
    ['2018-01-01 01:30:00', '2', 'dev_2'],
    ['2018-01-01 01:30:00', '2', 'dev_4'],
], columns=['time', 'cust_id', 'device_id'])
devices

With pandas, I can group the rows and collect the unique devices per group:

devices_per_time = devices.groupby(['cust_id', 'time'])['device_id'].unique().to_frame()
devices_per_time

I tried to do the same with dask:

[Screenshot of the dask attempt; the image was not preserved]

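I have the following questions:

  1. Why can't I use devices.groupby(['cust_id','time'])['device_id'].unique()?
  2. I managed to get a result, but I am not sure it is the best approach. Can someone confirm that I am doing this the right way?

Thanks.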

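1 Answer:

Answer 0: (score: 0)

You cannot call .unique() because it has not been implemented for dask's grouped Series yet. Check the available functions: SeriesGroupBy

Here is another way to get the result, using apply with set, which dask runs in parallel: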

(devices
.groupby(['time','cust_id'])['device_id']
.apply(set, meta=object)
.apply(list,meta=object)
.compute()
.reset_index())

If you do not care about the final type (set or list), you can drop .apply(list, meta=object).