Question

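I need to create a unique "ID" field for my pandas rows based on certain conditions relating to the previous rows.

Below you will see a sample of my data: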

    current_driver  customer_id        pu_actual_dt service
0              167         1214 2018-06-28 13:24:00     DED
1              167         1214 2018-06-28 13:25:00     DED
2              167         1214 2018-06-28 14:43:00     DED
3              243         1214 2018-06-28 19:41:00     DED
4              243         1214 2018-06-28 19:41:00     DED
5              250         1214 2018-06-28 17:19:00     DED
6              250         1214 2018-06-28 18:00:00     DED
7              250         1214 2018-06-28 18:18:00     DED
8              259         1214 2018-06-28 19:40:00     DED
9              259         1214 2018-06-28 19:40:00     DED
10             259         1214 2018-06-28 20:14:00     DED
11             260         1214 2018-06-28 17:39:00     DED
12             260         1214 2018-06-28 17:39:00     DED
13             260         1214 2018-06-28 17:39:00     DED
14             260         1214 2018-06-28 17:39:00     DED
15             263         1214 2018-06-28 18:34:00     DED
16             263         1214 2018-06-28 18:43:00     DED
17             263         1214 2018-06-28 18:43:00     DED

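What I need to do is create another column using the following logic: if current_driver is the same as the previous row's current_driver, and customer_id is the same as the previous row's customer_id, and pu_actual_dt is within a half-hour of the previous row's pu_actual_dt, then the rows should all have the same ID. So the first two rows would start with "1", but since the third row's pu_actual_dt is more than a half-hour off, its ID would be "2". Then, the fourth row has a different driver, so its ID would be "3", as would the fifth row's, because it has the same driver/customer_id/pu_actual_dt as the fourth row.

Before I had to account for the minor variance in pu_actual_dt (see the first two rows), I was able to solve this by concatenating the fields and restarting the ID each time a row did not match the previous concatenation. So, for example, I was using this to create the ID: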

df = df.assign(id=(df['route_concate']).astype('category').cat.codes)

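However, this concatenation logic no longer works once there is minor variance in pu_actual_dt.

So I tried to account for the minor time variance in the following way: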
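To illustrate the limitation, here is a minimal sketch on four of the sample rows. The way route_concate is built below is an assumption (the original construction is not shown); it simply joins the three fields as strings. Because the timestamp is part of the key, a one-minute difference yields a different category. Note also that cat.codes numbers the categories in sorted order and starts at 0, not 1.

```python
import pandas as pd

df = pd.DataFrame({
    "current_driver": [167, 167, 243, 243],
    "customer_id": [1214, 1214, 1214, 1214],
    "pu_actual_dt": pd.to_datetime([
        "2018-06-28 13:24:00", "2018-06-28 13:25:00",
        "2018-06-28 19:41:00", "2018-06-28 19:41:00",
    ]),
})

# Hypothetical construction of the concatenated key (not shown in the question).
df["route_concate"] = (
    df["current_driver"].astype(str) + "_"
    + df["customer_id"].astype(str) + "_"
    + df["pu_actual_dt"].astype(str)
)

# Exact-match grouping: every distinct string becomes its own category.
codes = df["route_concate"].astype("category").cat.codes

# Rows 0 and 1 differ by one minute, so they receive different codes,
# while rows 2 and 3 are identical and share one.
print(codes.tolist())
```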

df['id'] = np.where((df['current_driver'] == df['current_driver'].shift(1) ) 
& (df['customer_id'] == df['customer_id'].shift(1)) 
& (df['pu_actual_dt'] < df['pu_actual_dt'].shift(1) + pd.Timedelta(minutes=30)) 
& (df['pu_actual_dt'] > df['pu_actual_dt'].shift(1) - pd.Timedelta(minutes=30)) 
& (df['service'] == 'DED'), df['id'].shift(1), df['id'].shift(1) + 1)

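What I am trying to do here is, for each row: if current_driver equals current_driver in the previous row, and customer_id equals customer_id in the previous row, and pu_actual_dt is within 30 minutes before or after the previous row's pu_actual_dt, and service = 'DED', then use the previous row's ID. If not, add 1 to the previous row's ID.

I'm not sure what I'm doing wrong, but it returns some very unpredictable results. At one point it drops from ID 75 to 34, and then goes back up to 36?

What would be a better way to solve my problem? (And one where the ID would start at "1"?) Thanks for your help, as always!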

Answer 1

Your np.where was a good idea, with one small difference: assign 1 if the condition is not met and None if it is, such as:

df['id'] = np.where((df['current_driver'] == df['current_driver'].shift(1) ) 
& (df['customer_id'] == df['customer_id'].shift(1)) 
& (df['pu_actual_dt'] < df['pu_actual_dt'].shift(1) + pd.Timedelta(minutes=30)) 
& (df['pu_actual_dt'] > df['pu_actual_dt'].shift(1) - pd.Timedelta(minutes=30)) 
& (df['service'] == 'DED'), None, 1) # NOTE: the None and 1 here are explained above
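A likely reason the attempt in the question jumped around: np.where is evaluated once over whole columns, so df['id'].shift(1) reads whatever id column already existed (presumably one left over from the earlier concatenation step; that is an assumption), not the values being computed. The sketch below uses a made-up key column and a made-up stale id column to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "key" stands in for the driver/customer fields,
# and "id" holds stale values from some earlier step.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "id": [10, 20, 30, 40]})

same = df["key"] == df["key"].shift(1)

# Both branches are built from the OLD id column, shifted by one row.
# Nothing here carries a running counter from one row to the next.
result = np.where(same, df["id"].shift(1), df["id"].shift(1) + 1)

print(result)
```

The output depends entirely on the stale values, which is why the ids can drop and rise again instead of increasing steadily.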

Now you want the id to increase by one at each row that holds a 1, so use cumsum, ffill and astype (to get integers instead of floats), such as:

df['id'] = df['id'].cumsum().ffill().astype(int)

With your example, this gives:

    current_driver  customer_id        pu_actual_dt service  id
0              167         1214 2018-06-28 13:24:00     DED   1
1              167         1214 2018-06-28 13:25:00     DED   1
2              167         1214 2018-06-28 14:43:00     DED   2
3              243         1214 2018-06-28 19:41:00     DED   3
4              243         1214 2018-06-28 19:41:00     DED   3
5              250         1214 2018-06-28 17:19:00     DED   4
6              250         1214 2018-06-28 18:00:00     DED   5
7              250         1214 2018-06-28 18:18:00     DED   5
8              259         1214 2018-06-28 19:40:00     DED   6
9              259         1214 2018-06-28 19:40:00     DED   6
10             259         1214 2018-06-28 20:14:00     DED   7
11             260         1214 2018-06-28 17:39:00     DED   8
12             260         1214 2018-06-28 17:39:00     DED   8
13             260         1214 2018-06-28 17:39:00     DED   8
14             260         1214 2018-06-28 17:39:00     DED   8
15             263         1214 2018-06-28 18:34:00     DED   9
16             263         1214 2018-06-28 18:43:00     DED   9
17             263         1214 2018-06-28 18:43:00     DED   9
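For reference, the same ids can also be sketched without the intermediate None/1 column. This is a variant, not the code above: it builds a boolean mask of rows that continue the previous row's run, negates it, and takes the cumulative sum, since each True counts as 1. The variable names (same_run, gap) are illustrative, and the data is the sample from the question.

```python
import pandas as pd

drivers = [167] * 3 + [243] * 2 + [250] * 3 + [259] * 3 + [260] * 4 + [263] * 3
times = ["13:24", "13:25", "14:43", "19:41", "19:41", "17:19", "18:00", "18:18",
         "19:40", "19:40", "20:14", "17:39", "17:39", "17:39", "17:39",
         "18:34", "18:43", "18:43"]

df = pd.DataFrame({
    "current_driver": drivers,
    "customer_id": 1214,
    "pu_actual_dt": pd.to_datetime(["2018-06-28 " + t for t in times]),
    "service": "DED",
})

# Absolute time gap to the previous row (NaT for the first row).
gap = (df["pu_actual_dt"] - df["pu_actual_dt"].shift(1)).abs()

# True where a row continues the previous row's run.
# Comparisons against NaN/NaT are False, so the first row starts a new run.
same_run = (
    (df["current_driver"] == df["current_driver"].shift(1))
    & (df["customer_id"] == df["customer_id"].shift(1))
    & (gap < pd.Timedelta(minutes=30))
    & (df["service"] == "DED")
)

# Each row that starts a new run adds 1, so the ids start at 1.
df["id"] = (~same_run).cumsum()

print(df)
```

Like the np.where version, this compares each row only with the row immediately before it, so the frame must already be sorted in the order the runs should follow.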

Python Pandas: create a running "id" based on conditions
