Question

Using the Pandas API for Python, for each timestamp I want to count the number of unique devices seen on the account in the last 48 hours.

This is my data:

╔═════════════════════╦══════════╦═══════════╗
║ timestamp           ║ device   ║ accountid ║
╠═════════════════════╬══════════╬═══════════╣
║ 2018-10-29 18:52:30 ║ d1ed6e6  ║ DhHUXPw   ║
║ 2018-11-01 18:52:30 ║ d123ff96 ║ zgffRDY   ║
║ 2018-11-01 20:53:30 ║ e322ff96 ║ zgffRDY   ║
║ 2018-11-02 21:33:30 ║ g133gf42 ║ zgffRDY   ║
║ 2018-11-15 18:52:30 ║ d123ff96 ║ awfdsct   ║
║ 2018-11-17 08:25:30 ║ d123ff96 ║ awfdsct   ║
╚═════════════════════╩══════════╩═══════════╝
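To make the example reproducible, the table above could be loaded into a DataFrame roughly as follows. This is only a sketch: it assumes the `timestamp` column should be parsed as datetimes, and the variable name `df` is chosen to match the code later in the question.

```python
import pandas as pd

# Sample data copied from the table above; timestamps are parsed to datetime64
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-10-29 18:52:30",
        "2018-11-01 18:52:30",
        "2018-11-01 20:53:30",
        "2018-11-02 21:33:30",
        "2018-11-15 18:52:30",
        "2018-11-17 08:25:30",
    ]),
    "device": ["d1ed6e6", "d123ff96", "e322ff96",
               "g133gf42", "d123ff96", "d123ff96"],
    "accountid": ["DhHUXPw", "zgffRDY", "zgffRDY",
                  "zgffRDY", "awfdsct", "awfdsct"],
})
```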

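I expect the output to look like this. Essentially, for account zgffRDY at 2018-11-02 21:33:30 we have seen 3 unique devices in the last 48 hours, whereas at 2018-11-01 18:52:30 we have seen only 1 device (the current one).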

╔═════════════════════╦══════════╦═══════════╦═══════════════════════════╗
║ timestamp           ║ device   ║ accountid ║ last_48hour_device_count  ║
╠═════════════════════╬══════════╬═══════════╬═══════════════════════════╣
║ 2018-10-29 18:52:30 ║ d1ed6e6  ║ DhHUXPw   ║ 1                         ║
║ 2018-11-01 18:52:30 ║ d123ff96 ║ zgffRDY   ║ 1                         ║
║ 2018-11-01 20:53:30 ║ e322ff96 ║ zgffRDY   ║ 2                         ║
║ 2018-11-02 21:33:30 ║ g133gf42 ║ zgffRDY   ║ 3                         ║
║ 2018-11-15 18:52:30 ║ d123ff96 ║ awfdsct   ║ 1                         ║
║ 2018-11-16 08:25:30 ║ d123ff96 ║ awfdsct   ║ 1                         ║
╚═════════════════════╩══════════╩═══════════╩═══════════════════════════╝

My current code looks like this:

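    import pandas as pd

    count_list = []
    for idx, row in df.iterrows():
        account = row['accountid']
        earliest = row['timestamp'] - pd.to_timedelta('48 hours')
        current_time = row['timestamp']
        # Same account, within the 48 hours up to and including this row
        # (inclusive upper bound so the current device is counted, as in
        # the expected output above)
        filtered_data = df.query('timestamp >= @earliest and '
                                 'timestamp <= @current_time and '
                                 'accountid == @account')
        # Number of distinct devices in that window
        device_cnt = len(set(filtered_data['device']))
        count_list.append(device_cnt)
    df['last_48hour_device_count'] = count_list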

I get the correct output, but my code runs far too slowly, and my dataset contains a large number of observations.

Do you know a better way to solve this?

Answer 1

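The exact logic you want to apply is not entirely clear from the description, but the pandas groupby method should give you what you need.

The call would look like this: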

    df.groupby(['timestamp', 'accountid']).cumcount()
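For reference, here is a small sketch of what `cumcount` returns, run on the first four rows of the question's data. It groups by `accountid` alone, which is an assumption made here so that a group contains more than one row. The result is a running row number within each group, which is not by itself restricted to a 48-hour window or to distinct devices.

```python
import pandas as pd

# First four rows of the question's sample data
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-10-29 18:52:30", "2018-11-01 18:52:30",
        "2018-11-01 20:53:30", "2018-11-02 21:33:30",
    ]),
    "device": ["d1ed6e6", "d123ff96", "e322ff96", "g133gf42"],
    "accountid": ["DhHUXPw", "zgffRDY", "zgffRDY", "zgffRDY"],
})

# cumcount numbers the rows inside each group in row order, starting at 0
row_number = df.groupby("accountid").cumcount()
print(row_number.tolist())  # [0, 0, 1, 2]
```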

Answer 2

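Your focus is on the account ID, so my suggestion is to groupby on that static field first.

With the device ID field added, this problem is very similar to this SO question. So I think you would end up with something like the following: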

accountid
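One way to sketch the group-first idea is to process each account separately and slide a 48-hour window over its rows in time order, keeping a tally of the devices currently inside the window. The helper names `count_recent_devices` and `add_recent_device_counts` are made up for illustration, and the window is assumed to be inclusive at both ends, [t - 48h, t], so that the current row's device is counted. This is a sketch under those assumptions, not necessarily the approach taken in the linked question.

```python
from collections import Counter

import pandas as pd


def count_recent_devices(group: pd.DataFrame, window: pd.Timedelta) -> pd.Series:
    """For each row of ONE account, count distinct devices in [t - window, t]."""
    group = group.sort_values("timestamp")
    times = group["timestamp"].to_numpy()
    devices = group["device"].to_numpy()
    span = window.to_timedelta64()

    seen = Counter()  # device -> number of rows currently inside the window
    left = 0          # index of the oldest row still inside the window
    counts = []
    for right in range(len(group)):
        seen[devices[right]] += 1
        # Drop rows that are now older than the start of the window
        while times[left] < times[right] - span:
            old = devices[left]
            seen[old] -= 1
            if seen[old] == 0:
                del seen[old]
            left += 1
        counts.append(len(seen))
    # Keep the original row labels so the result aligns back onto df
    return pd.Series(counts, index=group.index)


def add_recent_device_counts(df: pd.DataFrame, hours: int = 48) -> pd.DataFrame:
    """Assumes a non-empty frame with a unique index and the columns
    timestamp (datetime), device and accountid."""
    window = pd.Timedelta(hours=hours)
    parts = [count_recent_devices(g, window) for _, g in df.groupby("accountid")]
    out = df.copy()
    out["last_48hour_device_count"] = pd.concat(parts)  # aligned by index
    return out
```

Calling `add_recent_device_counts(df)` on the question's data returns a copy with the extra column. Each row enters and leaves the window once, so the work per account is roughly linear after sorting, instead of one full-table `query` per row. Rows of the same account that share an identical timestamp are counted in sort order, which may differ slightly from the loop in the question.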

Pandas: count the number of devices seen per account in the last 48 hours
