Question

I want to add a column to a DataFrame in Python/pandas, as shown below:

| MarketID  | SelectionID | Time     | SelectNumber |
|-----------|-------------|----------|--------------|
| 112337406 | 3819251.0   | 13:38:32 |            4 |
| 112337406 | 3819251.0   | 13:39:03 |            4 |
| 112337406 | 4979206.0   | 11:29:34 |            1 |
| 112337406 | 4979206.0   | 11:37:34 |            1 |
| 112337406 | 5117439.0   | 13:36:32 |            3 |
| 112337406 | 5117439.0   | 13:37:03 |            3 |
| 112337406 | 5696467.0   | 13:23:03 |            2 |
| 112337406 | 5696467.0   | 13:23:33 |            2 |
| 112337407 | 3819254.0   | 13:39:12 |            4 |
| 112337407 | 4979206.0   | 11:29:56 |            1 |
| 112337407 | 4979206.0   | 16:27:34 |            1 |
| 112337407 | 5117441.0   | 13:36:54 |            3 |
| 112337407 | 5117441.0   | 17:47:11 |            3 |
| 112337407 | 5696485.0   | 13:23:04 |            2 |
| 112337407 | 5696485.0   | 18:23:59 |            2 |

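I currently have the MarketID, SelectionID and Time columns, and I want to generate the SelectNumber column. It represents the chronological order in which a particular SelectionID first appears within a particular MarketID. Once a SelectionID has been numbered, every other occurrence of that SelectionID within the same MarketID must get the same number. A MarketID will always be unique, but the same SelectionID can appear in more than one MarketID.

This has me stumped. Any ideas?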

Answer 1

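First, you need the combinations of 'MarketID' and 'SelectionID' in their order of occurrence, so we sort by time. Then, for each 'MarketID', take the unique 'SelectionID' values and number them in order of appearance (they are already ordered, because df has been sorted by the Time column). Second, the resulting mapping from each ('MarketID', 'SelectionID') combination to its order number is used later to fill in the column.

Here are two solutions for the first part: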

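# Sort chronologically and index by market so each market can be selected with .loc
dfnewindex = df.sort_values('Time').set_index('MarketID')
valuesetter = {}
for indx in dfnewindex.index.unique():
    # .loc[[indx]] always returns a DataFrame, even if the market has only one row
    selectionid_per_marketid = dfnewindex.loc[[indx]].sort_values('Time')['SelectionID'].drop_duplicates().values
    # Map (MarketID, SelectionID) -> 1-based order of first appearance
    valuesetter.update(dict(zip(zip(len(selectionid_per_marketid)*[indx], selectionid_per_marketid), range(1, 1+len(selectionid_per_marketid)))))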

100 loops, best of 3: 3.22 ms per loop

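# Sort chronologically so drop_duplicates keeps each SelectionID's first appearance
df_sorted = df.sort_values('Time')
valuesetter = {}
for mrktid in df_sorted['MarketID'].unique():
    # Boolean mask selects the rows of one market; order of first appearance is preserved
    sltnids = df_sorted[df_sorted['MarketID']==mrktid]['SelectionID'].drop_duplicates(keep='first').values
    # Map (MarketID, SelectionID) -> 1-based order of first appearance
    valuesetter.update(dict(zip(zip(len(sltnids)*[mrktid], sltnids), range(1, 1+len(sltnids)))))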

100 loops, best of 3: 2.59 ms per loop

In this case, the boolean-slicing solution is slightly faster.

Output:

valuesetter

{(112337406, 3819251.0): 4,
 (112337406, 4979206.0): 1,
 (112337406, 5117439.0): 3,
 (112337406, 5696467.0): 2,
 (112337407, 3819254.0): 4,
 (112337407, 4979206.0): 1,
 (112337407, 5117441.0): 3,
 (112337407, 5696485.0): 2}
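
To check the dictionary above, the table from the question can be rebuilt as a DataFrame. The following is a minimal sketch, not the answer's own code: it assumes Time is stored as zero-padded HH:MM:SS strings (so that sorting the strings gives chronological order), and the names `rows`, `market_id` and `selection_ids` are introduced here only for illustration.

```python
import pandas as pd

# Sample data from the question (MarketID, SelectionID, Time).
rows = [
    (112337406, 3819251.0, "13:38:32"), (112337406, 3819251.0, "13:39:03"),
    (112337406, 4979206.0, "11:29:34"), (112337406, 4979206.0, "11:37:34"),
    (112337406, 5117439.0, "13:36:32"), (112337406, 5117439.0, "13:37:03"),
    (112337406, 5696467.0, "13:23:03"), (112337406, 5696467.0, "13:23:33"),
    (112337407, 3819254.0, "13:39:12"),
    (112337407, 4979206.0, "11:29:56"), (112337407, 4979206.0, "16:27:34"),
    (112337407, 5117441.0, "13:36:54"), (112337407, 5117441.0, "17:47:11"),
    (112337407, 5696485.0, "13:23:04"), (112337407, 5696485.0, "18:23:59"),
]
df = pd.DataFrame(rows, columns=["MarketID", "SelectionID", "Time"])

# Assumption: zero-padded HH:MM:SS strings sort chronologically.
df_sorted = df.sort_values("Time")

valuesetter = {}
for market_id in df_sorted["MarketID"].unique():
    # Rows of one market, in time order; keep each SelectionID's first appearance.
    in_market = df_sorted["MarketID"] == market_id
    selection_ids = df_sorted.loc[in_market, "SelectionID"].drop_duplicates()
    # Number the selections 1, 2, 3, ... in order of first appearance.
    for number, selection_id in enumerate(selection_ids, start=1):
        valuesetter[(int(market_id), float(selection_id))] = number

print(valuesetter)
```

The keys are converted to plain `int` and `float` only to make the printed dictionary easier to read; the lookup works the same with NumPy scalars.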

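For the second part, this dict is used to generate the new column, SelectNumber. Again there are two solutions: the first uses a MultiIndex, the second uses groupby: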

map(lambda x: valuesetter[x], df.set_index(['MarketID', 'SelectionID']).index.values)

1000 loops, best of 3: 1.23 ms per loop

map(lambda x: valuesetter[x], df.groupby(['MarketID', 'SelectionID']).count().index.values)

1000 loops, best of 3: 1.59 ms per loop

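The MultiIndex approach appears to be the faster of the two. Note also that the groupby version returns one value per unique (MarketID, SelectionID) pair rather than one per row, so only the MultiIndex version can be assigned directly as a column.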
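
The difference between a per-row lookup and a per-group lookup can be seen on a small example. This is only a sketch: the four-row frame and its dict are hypothetical stand-ins for the question's data, and the list comprehension over `zip` is a plain-Python substitute for the `set_index(...).index.values` lookup, not the code used in the answer.

```python
import pandas as pd

# Hypothetical toy data: two markets, three distinct (MarketID, SelectionID) pairs.
df = pd.DataFrame({
    "MarketID": [1, 1, 1, 2],
    "SelectionID": [20.0, 10.0, 20.0, 10.0],
    "Time": ["09:05:00", "09:00:00", "09:10:00", "08:00:00"],
})

# Hypothetical lookup table: (MarketID, SelectionID) -> order of first appearance.
valuesetter = {(1, 10.0): 1, (1, 20.0): 2, (2, 10.0): 1}

# Grouping collapses repeated pairs, so it yields fewer entries than df has rows.
n_groups = len(df.groupby(["MarketID", "SelectionID"]).count().index)

# One lookup per row, in df's own row order, so the result aligns with df.
df["SelectNumber"] = [
    valuesetter[key] for key in zip(df["MarketID"], df["SelectionID"])
]

print(n_groups, len(df))
print(df)
```

Here the grouped index has three entries while the frame has four rows, which is why the per-row lookup is the one that can become a column.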

Finally, the fastest complete answer so far:

# Sort chronologically so drop_duplicates keeps each SelectionID's first appearance
df_sorted = df.sort_values('Time')
valuesetter2 = {}
for mrktid in df_sorted['MarketID'].unique():
    sltnids = df_sorted[df_sorted['MarketID']==mrktid]['SelectionID'].drop_duplicates(keep='first').values
    valuesetter2.update(dict(zip(zip(len(sltnids)*[mrktid], sltnids), range(1, 1+len(sltnids)))))
# Look up valuesetter2 and assign to df, whose row order matches the order of the keys
df['SelectNumber'] = list(map(lambda x: valuesetter2[x], df.set_index(['MarketID', 'SelectionID']).index.values))
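
As a loop-free alternative to the dictionary approach, the same numbering can probably be obtained with `groupby`, `transform` and a dense rank. This is a different technique from the one in the answer, shown only as a sketch. It assumes Time holds HH:MM:SS strings that `pd.to_timedelta` can parse; the names `seconds` and `first_seen` are introduced here for illustration. No timing is claimed for it.

```python
import pandas as pd

# Sample data from the question (MarketID, SelectionID, Time).
rows = [
    (112337406, 3819251.0, "13:38:32"), (112337406, 3819251.0, "13:39:03"),
    (112337406, 4979206.0, "11:29:34"), (112337406, 4979206.0, "11:37:34"),
    (112337406, 5117439.0, "13:36:32"), (112337406, 5117439.0, "13:37:03"),
    (112337406, 5696467.0, "13:23:03"), (112337406, 5696467.0, "13:23:33"),
    (112337407, 3819254.0, "13:39:12"),
    (112337407, 4979206.0, "11:29:56"), (112337407, 4979206.0, "16:27:34"),
    (112337407, 5117441.0, "13:36:54"), (112337407, 5117441.0, "17:47:11"),
    (112337407, 5696485.0, "13:23:04"), (112337407, 5696485.0, "18:23:59"),
]
df = pd.DataFrame(rows, columns=["MarketID", "SelectionID", "Time"])

# Convert the time strings to seconds since midnight so they compare numerically.
seconds = pd.to_timedelta(df["Time"]).dt.total_seconds()

# For every row, the earliest time at which its (MarketID, SelectionID) pair appears.
first_seen = seconds.groupby([df["MarketID"], df["SelectionID"]]).transform("min")

# Within each market, rank those first-appearance times: 1 = earliest selection.
# method="dense" gives all rows of the same selection the same number, with no gaps.
df["SelectNumber"] = first_seen.groupby(df["MarketID"]).rank(method="dense").astype(int)

print(df)
```

One design consequence: two selections that first appear at exactly the same time in the same market would share a number under a dense rank, whereas the dictionary approach would give them consecutive numbers.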

Grouping and numbering items in a pandas DataFrame
