Problems creating a new id column based on three conditions?

Time: 2018-11-11 11:30:09

Tags: python pandas dataframe

I have a dataframe with conversations and timestamps, like this:

timestamp   userID  textBlob    new_id
2018-10-05 23:07:02 01  a large text blob...
2018-10-05 23:07:13 01  a large text blob...
2018-10-05 23:07:23 01  a large text blob...
2018-10-05 23:07:36 01  a large text blob...
2018-10-05 23:08:02 01  a large text blob...
2018-10-05 23:09:16 01  a large text blob...
2018-10-05 23:09:21 01  a large text blob...
2018-10-05 23:09:39 01  a large text blob...
2018-10-05 23:09:47 01  a large text blob...
2018-10-05 23:10:01 01  a large text blob...
2018-10-05 23:10:11 01  a large text blob...
2018-10-05 23:10:23 01  restart             
2018-10-05 23:10:59 01  a large text blob...
2018-10-05 23:11:03 01  a large text blob...
2018-10-08 23:11:32 02  a large text blob...
2018-10-08 23:12:58 02  a large text blob...
2018-10-08 23:13:16 02  a large text blob...
2018-10-08 23:14:04 02  a large text blob...
2018-10-08 03:38:36 02  a large text blob...
2018-10-08 03:38:42 02  a large text blob...
2018-10-08 03:38:52 02  a large text blob...
2018-10-08 03:38:57 02  a large text blob...
2018-10-08 03:39:10 02  a large text blob...
2018-10-08 03:39:27 02  Restart             
2018-10-08 03:40:47 02  a large text blob...
2018-10-08 03:40:54 02  a large text blob...
2018-10-08 03:41:02 02  a large text blob...
2018-10-08 03:41:12 02  a large text blob...
2018-10-08 03:41:32 02  a large text blob...
2018-10-08 03:41:39 02  a large text blob...
2018-10-08 03:42:20 02  a large text blob...
2018-10-08 03:44:58 02  a large text blob...
2018-10-08 03:45:54 02  a large text blob...
2018-10-08 03:46:06 02  a large text blob...
2018-10-08 05:06:42 03  a large text blob...
2018-10-08 05:06:53 03  a large text blob...
2018-10-08 05:08:49 03  a large text blob...
2018-10-08 05:08:58 03  a large text blob...
2018-10-08 05:58:18 04  a large text blob...
2018-10-08 05:58:26 04  a large text blob...
2018-10-08 05:58:37 04  a large text blob...
2018-10-08 05:58:58 04  a large text blob...
2018-10-08 06:00:31 04  a large text blob...
2018-10-08 06:01:00 04  a large text blob...
2018-10-08 06:01:14 04  a large text blob...
2018-10-08 06:02:03 04  a large text blob...
2018-10-08 06:02:03 04  a large text blob...
2018-10-08 06:06:03 04  a large text blob...
2018-10-08 06:10:00 04  a large text blob...
2018-10-08 09:07:03 04  a large text blob...
2018-10-08 09:09:03 04  a large text blob...
2018-10-09 10:01:00 04  a large text blob...
2018-10-09 10:02:00 04  a large text blob...
2018-10-09 10:03:00 04  a large text blob...
2018-10-09 10:09:00 04  a large text blob...
2018-10-09 10:09:00 05  a large text blob...

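At the moment, I want to identify the conversations within the dataframe by an ID. The problem is that a user can have several conversations (i.e., a userID can have several textBlobs associated with it). Therefore, I would like to add a new_id so that I can identify the conversations in the dataframe above.

To do this, I want to create a new_id column based on three conditions:

  1. A 10-minute time window
  2. The appearance of a keyword
  3. The user has no more text blobs

The expected output is as follows (*):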

timestamp   userID  textBlob    new_id
2018-10-05 23:07:02 01  a large text blob...    001
2018-10-05 23:07:13 01  a large text blob...    001
2018-10-05 23:07:23 01  a large text blob...    001
2018-10-05 23:07:36 01  a large text blob...    001
2018-10-05 23:08:02 01  a large text blob...    001
2018-10-05 23:09:16 01  a large text blob...    001
2018-10-05 23:09:21 01  a large text blob...    001
2018-10-05 23:09:39 01  a large text blob...    001
2018-10-05 23:09:47 01  a large text blob...    001
2018-10-05 23:10:01 01  a large text blob...    001
2018-10-05 23:10:11 01  a large text blob...    001
2018-10-05 23:10:23 01  restart                 001   ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01  a large text blob...    002
2018-10-05 23:11:03 01  a large text blob...    002
2018-10-08 23:11:32 02  a large text blob...    002
2018-10-08 23:12:58 02  a large text blob...    002
2018-10-08 23:13:16 02  a large text blob...    002
2018-10-08 23:14:04 02  a large text blob...    002   --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02  a large text blob...    003
2018-10-08 03:38:42 02  a large text blob...    003
2018-10-08 03:38:52 02  a large text blob...    003
2018-10-08 03:38:57 02  a large text blob...    003
2018-10-08 03:39:10 02  a large text blob...    003
2018-10-08 03:39:27 02  Restart                 003   ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02  a large text blob...    004
2018-10-08 03:40:54 02  a large text blob...    004
2018-10-08 03:41:02 02  a large text blob...    004
2018-10-08 03:41:12 02  a large text blob...    004
2018-10-08 03:41:32 02  a large text blob...    004
2018-10-08 03:41:39 02  a large text blob...    004
2018-10-08 03:42:20 02  a large text blob...    004
2018-10-08 03:44:58 02  a large text blob...    004
2018-10-08 03:45:54 02  a large text blob...    004
2018-10-08 03:46:06 02  a large text blob...    004     ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 05:06:42 03  a large text blob...    005
2018-10-08 05:06:53 03  a large text blob...    005
2018-10-08 05:08:49 03  a large text blob...    005
2018-10-08 05:08:58 03  a large text blob...    005     ---- (no more conversations from user id 03, thus a new id is assigned)
2018-10-08 05:58:18 04  a large text blob...    006
2018-10-08 05:58:26 04  a large text blob...    006
2018-10-08 05:58:37 04  a large text blob...    006
2018-10-08 05:58:58 04  a large text blob...    006
2018-10-08 06:00:31 04  a large text blob...    006
2018-10-08 06:01:00 04  a large text blob...    006
2018-10-08 06:01:14 04  a large text blob...    006
2018-10-08 06:02:03 04  a large text blob...    006     ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 06:02:03 04  a large text blob...    007
2018-10-08 06:06:03 04  a large text blob...    007
2018-10-08 06:10:00 04  a large text blob...    007
2018-10-08 09:07:03 04  a large text blob...    007
2018-10-08 09:09:03 04  a large text blob...    007     ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-09 10:01:00 04  a large text blob...    008
2018-10-09 10:02:00 04  a large text blob...    008
2018-10-09 10:03:00 04  a large text blob...    008
2018-10-09 10:09:00 04  a large text blob...    008     ---- (no more conversations from user id 04, thus a new id is assigned)
2018-10-09 10:09:00 05  a large text blob...    010

So far, I have tried:

searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))

dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times

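However, I also need to take userID into account, and I ended up with several columns. Is there a way to satisfy all three conditions and obtain the expected output (*)?

2 Answers:

Answer 0: (score: 1)

You're almost there. To sum up: build a boolean mask for each condition, then convert the combined mask to int and take its cumulative sum: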

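# One boolean mask per condition; True marks the first row of a new conversation.
mask1 = df['timestamp'].diff() > pd.Timedelta(10, 'm')    # more than 10 minutes since the previous row
mask2 = df['userID'] != df['userID'].shift()              # userID changed (also works for string IDs)
mask3 = df['textBlob'].shift().str.lower() == 'restart'   # previous row was the keyword

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()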

# Result:
print(df.to_string(index=False))

timestamp  userID              textBlob  new_id
2018-10-05 23:07:02       1  a_large_text_blob...       1
2018-10-05 23:07:13       1  a_large_text_blob...       1
2018-10-05 23:07:23       1  a_large_text_blob...       1
2018-10-05 23:07:36       1  a_large_text_blob...       1
2018-10-05 23:08:02       1  a_large_text_blob...       1
2018-10-05 23:09:16       1  a_large_text_blob...       1
2018-10-05 23:09:21       1  a_large_text_blob...       1
2018-10-05 23:09:39       1  a_large_text_blob...       1
2018-10-05 23:09:47       1  a_large_text_blob...       1
2018-10-05 23:10:01       1  a_large_text_blob...       1
2018-10-05 23:10:11       1  a_large_text_blob...       1
2018-10-05 23:10:23       1               restart       1
2018-10-05 23:10:59       1  a_large_text_blob...       2
2018-10-05 23:11:03       1  a_large_text_blob...       2
2018-10-08 03:11:32       2  a_large_text_blob...       3
2018-10-08 03:12:58       2  a_large_text_blob...       3
2018-10-08 03:13:16       2  a_large_text_blob...       3
2018-10-08 03:14:04       2  a_large_text_blob...       3
2018-10-08 03:38:36       2  a_large_text_blob...       4
2018-10-08 03:38:42       2  a_large_text_blob...       4
2018-10-08 03:38:52       2  a_large_text_blob...       4
2018-10-08 03:38:57       2  a_large_text_blob...       4
2018-10-08 03:39:10       2  a_large_text_blob...       4
2018-10-08 03:39:27       2               Restart       4
2018-10-08 03:40:47       2  a_large_text_blob...       5
2018-10-08 03:40:54       2  a_large_text_blob...       5
2018-10-08 03:41:02       2  a_large_text_blob...       5
2018-10-08 03:41:12       2  a_large_text_blob...       5
2018-10-08 03:41:32       2  a_large_text_blob...       5
2018-10-08 03:41:39       2  a_large_text_blob...       5
2018-10-08 03:42:20       2  a_large_text_blob...       5
2018-10-08 03:44:58       2  a_large_text_blob...       5
2018-10-08 03:45:54       2  a_large_text_blob...       5
2018-10-08 03:46:06       2  a_large_text_blob...       5
2018-10-08 05:06:42       3  a_large_text_blob...       6
2018-10-08 05:06:53       3  a_large_text_blob...       6
2018-10-08 05:08:49       3  a_large_text_blob...       6
2018-10-08 05:08:58       3  a_large_text_blob...       6
2018-10-08 05:58:18       4  a_large_text_blob...       7
2018-10-08 05:58:26       4  a_large_text_blob...       7
2018-10-08 05:58:37       4  a_large_text_blob...       7
2018-10-08 05:58:58       4  a_large_text_blob...       7
2018-10-08 06:00:31       4  a_large_text_blob...       7
2018-10-08 06:01:00       4  a_large_text_blob...       7
2018-10-08 06:01:14       4  a_large_text_blob...       7
2018-10-08 06:02:03       4  a_large_text_blob...       7
2018-10-08 06:02:03       4  a_large_text_blob...       7
2018-10-08 06:06:03       4  a_large_text_blob...       7
2018-10-08 06:10:00       4  a_large_text_blob...       7
2018-10-08 09:07:03       4  a_large_text_blob...       8
2018-10-08 09:09:03       4  a_large_text_blob...       8
2018-10-09 10:01:00       4  a_large_text_blob...       9
2018-10-09 10:02:00       4  a_large_text_blob...       9
2018-10-09 10:03:00       4  a_large_text_blob...       9
2018-10-09 10:09:00       4  a_large_text_blob...       9
2018-10-09 10:09:00       5  a_large_text_blob...      10
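
To try the mask approach without the full data set, here is a minimal self-contained sketch. The six-row sample frame and the helper name `assign_conversation_ids` are made up for illustration and are not part of the answer above; the `str.strip()` call is an added assumption that the keyword may carry surrounding whitespace.

```python
import pandas as pd


def assign_conversation_ids(df, gap=pd.Timedelta(minutes=10), keyword="restart"):
    """Return a Series of conversation ids (hypothetical helper, for illustration)."""
    # Condition 1: more than `gap` elapsed since the previous message.
    time_gap = df["timestamp"].diff() > gap
    # Condition 2: the message comes from a different user than the previous one.
    new_user = df["userID"] != df["userID"].shift()
    # Condition 3: the previous message was the keyword (case-insensitive).
    after_keyword = df["textBlob"].shift().str.strip().str.lower() == keyword
    # Each True starts a new conversation; the running sum numbers them from 1.
    return (time_gap | new_user | after_keyword).astype(int).cumsum()


# Made-up sample: one restart, one user change, one gap longer than 10 minutes.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-10-05 23:07:02", "2018-10-05 23:10:23", "2018-10-05 23:10:59",
        "2018-10-08 03:11:32", "2018-10-08 03:14:04", "2018-10-08 03:38:36",
    ]),
    "userID": ["01", "01", "01", "02", "02", "02"],
    "textBlob": ["text", "restart", "text", "text", "text", "text"],
})
sample["new_id"] = assign_conversation_ids(sample)
print(sample)
```

The frame is assumed to be sorted by user and then by timestamp, as in the question; otherwise the row-to-row comparisons would mix unrelated messages.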

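Answer 1: (score: -1)

Well, I think the 10 minutes should be counted from the start of the conversation rather than from the immediately preceding message. In that case, you need to iterate over the rows: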

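df['timestamp'] = pd.to_datetime(df['timestamp'])
# Shifted by one row so the restart message itself closes the old conversation,
# matching the expected output in the question.
restart = df['textBlob'].shift().str.contains('restart|Restart', na=False)
same_user = df['userID'] == df['userID'].shift().bfill()
# Provisional id that changes after a restart or on a user change.
block = (restart | ~same_user).cumsum()

ids = []
current_id = 0
block_prev = 0
start_time = df['timestamp'].iloc[0]

for block_id, timestamp in zip(block, df['timestamp']):
    timedelta = timestamp - start_time

    # New conversation on a block change, or more than 10 minutes after the conversation started.
    if block_id != block_prev or timedelta > pd.Timedelta(10, unit='m'):
        current_id += 1
        start_time = timestamp

    block_prev = block_id
    ids.append(current_id)

# Assign once instead of df.new_id.iloc[i] = ..., which is chained assignment.
df['new_id'] = ids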
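
The difference between the two readings of the 10-minute rule is easiest to see on a small example. The sketch below is only an illustration: the function name `ids_from_conversation_start` and the five-row frame are made up, and it assumes every `textBlob` is a string. It numbers conversations from 1 and measures the window from the first message of each conversation.

```python
import pandas as pd


def ids_from_conversation_start(df, window=pd.Timedelta(minutes=10)):
    """Hypothetical helper: ids where the window is measured from the first message."""
    ids = []
    current_id = 0
    start_time = prev_user = None
    prev_text = ""
    for ts, user, text in zip(df["timestamp"], df["userID"], df["textBlob"]):
        boundary = (
            start_time is None                            # very first row
            or user != prev_user                          # userID changed
            or prev_text.strip().lower() == "restart"     # previous row was the keyword
            or ts - start_time > window                   # window since conversation start exceeded
        )
        if boundary:
            current_id += 1
            start_time = ts
        ids.append(current_id)
        prev_user, prev_text = user, text
    return ids


# Made-up sample: every consecutive gap is under 10 minutes,
# but the fourth message arrives more than 10 minutes after the first.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-10-08 05:58:18", "2018-10-08 06:02:03", "2018-10-08 06:06:03",
        "2018-10-08 06:10:00", "2018-10-08 06:12:00",
    ]),
    "userID": ["04"] * 5,
    "textBlob": ["text"] * 5,
})
demo["new_id"] = ids_from_conversation_start(demo)
print(demo)
```

With the consecutive-gap rule all five rows would share one id; here the fourth row opens a second conversation. Which behaviour is wanted depends on how the 10-minute condition in the question is meant.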