I have a couple of DataFrames:
import pandas as pd
import numpy as np
router = pd.DataFrame([
['2018-01-01 00:00:00', '1', 5],
['2018-01-01 00:30:00', '1', 7],
['2018-01-01 01:00:00', '1', 25],
['2018-01-01 01:30:00', '1', 3],
['2018-01-01 00:00:00', '2', 25],
['2018-01-01 00:30:00', '2', 7],
['2018-01-01 01:00:00', '2', 25],
['2018-01-01 01:30:00', '2', 35],
], columns=['time', 'cust_id', 'errors'])
router
devices = pd.DataFrame([
['2018-01-01 00:00:00', '1', 'dev_1'],
['2018-01-01 00:30:00', '1', 'dev_1'],
['2018-01-01 00:30:00', '1', 'dev_2'],
['2018-01-01 01:00:00', '1', 'dev_1'],
['2018-01-01 01:00:00', '1', 'dev_2'],
['2018-01-01 01:00:00', '1', 'dev_3'],
['2018-01-01 01:30:00', '1', 'dev_2'],
['2018-01-01 00:00:00', '2', 'dev_1'],
['2018-01-01 00:00:00', '2', 'dev_2'],
['2018-01-01 00:30:00', '2', 'dev_1'],
['2018-01-01 01:00:00', '2', 'dev_2'],
['2018-01-01 01:00:00', '2', 'dev_3'],
['2018-01-01 01:30:00', '2', 'dev_2'],
['2018-01-01 01:30:00', '2', 'dev_4'],
], columns=['time', 'cust_id', 'device_id'])
devices
Using pandas, I can group and get the unique devices per customer and time:
devices_per_time = devices.groupby(['cust_id', 'time'])['device_id'].unique().to_frame()
devices_per_time
I tried to do the same with dask:
I have the following questions:
Thanks.
Answer 0 (score: 0)
You can't call .unique() here because it is not yet implemented for a dask grouped Series. Check the available methods: SeriesGroupBy
Here is another way to get the result, using apply (which runs in parallel) together with set:
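# Assumes `devices` is a dask DataFrame.
(devices
 .groupby(['time', 'cust_id'])['device_id']
 .apply(set, meta=object)    # one set of device ids per group
 .apply(list, meta=object)   # optional: convert each set to a list
 .compute()                  # materialize the result as a pandas object
 .reset_index())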
If you don't care about the final type (set or list), you can drop .apply(list, meta=object).
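To see why collapsing each group to a set is a fair substitute for .unique(), here is a minimal sketch in plain pandas (dask is not exercised). It uses a shortened copy of the devices sample, with one duplicate row added on purpose as an assumption for illustration, and compares both routes after sorting, since a set does not preserve order.

```python
import pandas as pd

# Shortened sample in the same shape as the `devices` frame from the question.
devices = pd.DataFrame([
    ['2018-01-01 00:00:00', '1', 'dev_1'],
    ['2018-01-01 00:30:00', '1', 'dev_1'],
    ['2018-01-01 00:30:00', '1', 'dev_2'],
    ['2018-01-01 00:00:00', '2', 'dev_1'],
    ['2018-01-01 00:00:00', '2', 'dev_2'],
    ['2018-01-01 00:00:00', '2', 'dev_2'],  # duplicate row, added for illustration
], columns=['time', 'cust_id', 'device_id'])

grouped = devices.groupby(['time', 'cust_id'])['device_id']

# Reference result: pandas' built-in unique() per group, sorted for comparison.
via_unique = grouped.unique().apply(sorted)

# Workaround from the answer: collapse each group to a set, then to a sorted list.
via_set = grouped.apply(set).apply(sorted)

# Both routes should yield the same devices for every (time, cust_id) group.
same = via_unique.tolist() == via_set.tolist()

print(via_set.reset_index())
print(same)
```

Note that the set route gives no ordering guarantee, so sort the values if the order of devices within a group matters.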