I have the following DataFrame in pandas:
+----------+-------------------+---------------------+----------+------------+
| UserName | MainOperationName | Submission_Ended | delta | new period |
+----------+-------------------+---------------------+----------+------------+
| User1 | Record submission | 2017-07-31 00:08:25 | 00:00:00 | False |
| User1 | Record submission | 2017-07-31 00:12:02 | 00:03:37 | False |
| User1 | Record submission | 2017-07-31 00:14:51 | 00:02:49 | False |
| User1 | Record submission | 2017-07-31 00:17:27 | 00:02:36 | False |
| User1 | Record submission | 2017-07-31 00:23:42 | 00:06:15 | False |
| User1 | Record submission | 2017-07-31 00:25:35 | 00:01:53 | False |
| User1 | Record submission | 2017-07-31 00:26:01 | 00:00:26 | False |
| User1 | Record submission | 2017-07-31 01:59:11 | 01:33:10 | True |
| User1 | Record submission | 2017-07-31 02:00:37 | 00:01:26 | False |
| User1 | Record submission | 2017-07-31 02:03:12 | 00:02:35 | False |
| User1 | Record submission | 2017-07-31 02:21:22 | 00:18:10 | False |
| User1 | Record submission | 2017-07-31 02:30:28 | 00:09:06 | False |
| User1 | Record submission | 2017-07-31 02:36:03 | 00:05:35 | False |
| User1 | Record submission | 2017-07-31 03:25:43 | 00:49:40 | True |
+----------+-------------------+---------------------+----------+------------+
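The delta column is just the difference between consecutive Submission_Ended rows. new period is True whenever that difference is greater than 20 minutes. I think I will also force the first row to True, since it marks the start of a new period. I assume that while delta stays below that threshold the user is working in the application, and otherwise he/she is taking a break. I would like to visualize this with a timeline/Gantt chart (as in the last section here). But for that I need the start and stop of each period.
Any idea how I could get that from a data structure like this? Just to mention, my real DataFrame has hundreds of users.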
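For context, here is a minimal sketch of how the delta and new period columns described above could be computed. The sample rows are made up to mirror the table, and grouping by UserName is an assumption about how several users would be handled, not something taken from the real data:

```python
import pandas as pd

# Hypothetical sample data with the same columns as the table above.
df = pd.DataFrame({
    "UserName": ["User1"] * 4,
    "Submission_Ended": pd.to_datetime([
        "2017-07-31 00:08:25",
        "2017-07-31 00:12:02",
        "2017-07-31 01:59:11",
        "2017-07-31 02:00:37",
    ]),
})

# Sort so that differences are taken between consecutive submissions per user.
df = df.sort_values(["UserName", "Submission_Ended"]).reset_index(drop=True)

# Time since the same user's previous submission; the first row of each
# user has no predecessor, so it is filled with a zero timedelta.
df["delta"] = (
    df.groupby("UserName")["Submission_Ended"].diff().fillna(pd.Timedelta(0))
)

# A new period starts after a gap of more than 20 minutes.
df["new period"] = df["delta"] > pd.Timedelta(minutes=20)
```

With this rule the first row of each user comes out as False, matching the table; forcing it to True would be a separate step.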
Answer 0 (score: 0)
Tyler's comment was very helpful. I did the following:
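import pandas as pd

session_id = 0

def get_session_id(is_new_period):
    # Bump the global counter whenever a row starts a new period.
    global session_id
    if is_new_period:
        session_id += 1
    return session_id

# Label every row with the id of the session it belongs to.
df_ben["session id"] = df_ben["new period"].apply(get_session_id)

# First and last submission time of each session. first()/last() keep
# "session id" as the index (nth() stopped doing so in pandas 2.0).
start_time = df_ben.groupby("session id")["Submission_Ended"].first()
stop_time = df_ben.groupby("session id")["Submission_Ended"].last()
df_final = pd.DataFrame({"start": start_time, "stop": stop_time})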
My final result is:
+------------+---------------------+---------------------+
| | start | stop |
+------------+---------------------+---------------------+
| session id | | |
| 1 | 2017-07-31 00:08:25 | 2017-07-31 00:26:01 |
| 2 | 2017-07-31 01:59:11 | 2017-07-31 02:36:03 |
| 3 | 2017-07-31 03:25:43 | 2017-07-31 03:48:40 |
| 4 | 2017-07-31 04:12:03 | 2017-07-31 04:12:03 |
| 5 | 2017-07-31 04:36:09 | 2017-07-31 05:23:26 |
| 6 | 2017-07-31 05:59:04 | 2017-07-31 06:24:34 |
+------------+---------------------+---------------------+
So I can work with that!
Now, I don't like the way session_id is assigned using the global statement. Any idea how to do this in a neater way?
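One possible way to avoid the global counter, offered as a sketch rather than a definitive answer: a cumulative sum over the boolean new period column yields the same running count. The two-user sample below is made up, and restarting the count per UserName is an assumption about how the hundreds of users should be handled:

```python
import pandas as pd

# Hypothetical sample in the shape of df_ben, with two users.
df_ben = pd.DataFrame({
    "UserName": ["User1", "User1", "User1", "User1", "User2", "User2"],
    "Submission_Ended": pd.to_datetime([
        "2017-07-31 00:08:25",
        "2017-07-31 00:26:01",
        "2017-07-31 01:59:11",
        "2017-07-31 02:36:03",
        "2017-07-31 00:10:00",
        "2017-07-31 00:15:00",
    ]),
    "new period": [False, False, True, False, False, False],
})

# Running count of True flags within each user. Numbering starts at 0
# here because the first row of each user is False; it would start at 1
# if the first row were forced to True.
df_ben["session id"] = (
    df_ben["new period"].astype(int).groupby(df_ben["UserName"]).cumsum()
)

# First and last submission of every (user, session) pair.
df_final = (
    df_ben.groupby(["UserName", "session id"])["Submission_Ended"]
    .agg(["first", "last"])
    .rename(columns={"first": "start", "last": "stop"})
)
```

Because the count restarts for every user, the session ids are only unique in combination with UserName, which is why both are used as grouping keys.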