Start and stop of time periods in pandas

Time: 2018-01-09 14:46:26

Tags: python python-3.x pandas

I have the following dataframe in pandas:

+----------+-------------------+---------------------+----------+------------+
| UserName | MainOperationName |  Submission_Ended   |  delta   | new period |
+----------+-------------------+---------------------+----------+------------+
| User1    | Record submission | 2017-07-31 00:08:25 | 00:00:00 | False      |
| User1    | Record submission | 2017-07-31 00:12:02 | 00:03:37 | False      |
| User1    | Record submission | 2017-07-31 00:14:51 | 00:02:49 | False      |
| User1    | Record submission | 2017-07-31 00:17:27 | 00:02:36 | False      |
| User1    | Record submission | 2017-07-31 00:23:42 | 00:06:15 | False      |
| User1    | Record submission | 2017-07-31 00:25:35 | 00:01:53 | False      |
| User1    | Record submission | 2017-07-31 00:26:01 | 00:00:26 | False      |
| User1    | Record submission | 2017-07-31 01:59:11 | 01:33:10 | True       |
| User1    | Record submission | 2017-07-31 02:00:37 | 00:01:26 | False      |
| User1    | Record submission | 2017-07-31 02:03:12 | 00:02:35 | False      |
| User1    | Record submission | 2017-07-31 02:21:22 | 00:18:10 | False      |
| User1    | Record submission | 2017-07-31 02:30:28 | 00:09:06 | False      |
| User1    | Record submission | 2017-07-31 02:36:03 | 00:05:35 | False      |
| User1    | Record submission | 2017-07-31 03:25:43 | 00:49:40 | True       |
+----------+-------------------+---------------------+----------+------------+

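The delta column is simply the difference between consecutive Submission_Ended rows. When that difference is greater than 20 minutes, new period is True. I think I will also force the first row to True, since it is the start of a new period. I assume that when delta is below the threshold the user is using the application; otherwise he/she is taking a break. I would like to visualize this with a timeline/Gantt chart (as in the last section here). For that, though, I need the start and stop of each period, which in this case would be: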

  • Start: 2017-07-31 00:08:25; stop: 2017-07-31 00:26:01
  • Start: 2017-07-31 01:59:11; stop: 2017-07-31 02:36:03
  • Start: 2017-07-31 03:25:43; stop: ...

Any idea how I can get this from a data structure like that? I should mention that my real dataframe contains hundreds of users.
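For reference, the delta and new period columns described above could be derived roughly as follows. This is a minimal sketch, not the code actually used: the sample rows and the second user are made up, and `GAP` is a hypothetical name for the 20-minute threshold. Computing the difference per user keeps one user's last submission from being compared with the next user's first one.

```python
import pandas as pd

# Hypothetical sample: two users, column names taken from the table above.
df = pd.DataFrame({
    "UserName": ["User1", "User1", "User1", "User2", "User2"],
    "Submission_Ended": pd.to_datetime([
        "2017-07-31 00:08:25",
        "2017-07-31 00:12:02",
        "2017-07-31 01:59:11",
        "2017-07-31 00:30:00",
        "2017-07-31 00:35:00",
    ]),
})

# Assumed threshold: a gap longer than 20 minutes starts a new period.
GAP = pd.Timedelta(minutes=20)

df = df.sort_values(["UserName", "Submission_Ended"]).reset_index(drop=True)

# Time since the same user's previous submission; NaT on each user's first row.
df["delta"] = df.groupby("UserName")["Submission_Ended"].diff()

# A period starts on a user's first row (NaT delta) or after a long gap.
df["new period"] = df["delta"].isna() | (df["delta"] > GAP)
```

Unlike the table above, this leaves delta as NaT on each user's first row and flags that row as True, which matches the idea of forcing the first row to start a period.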

1 answer:

Answer 0: (score: 0)

Tyler's comment was very useful. I did the following:

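import pandas as pd

# Running counter shared across calls to get_session_id.
session_id = 0

def get_session_id(is_new_period):
    """Return the current session id, incrementing it when a new period starts."""
    global session_id

    if is_new_period:
        session_id += 1
    return session_id

# df_ben is the dataframe from the question, sorted by Submission_Ended.
df_ben["session id"] = df_ben["new period"].apply(get_session_id)

# first()/last() keep "session id" as the index; in pandas 2.0+,
# nth(0)/nth(-1) return the original row index instead.
start_time = df_ben.groupby("session id")["Submission_Ended"].first()
stop_time = df_ben.groupby("session id")["Submission_Ended"].last()
df_final = pd.DataFrame({"start": start_time, "stop": stop_time})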

My final result is:

+------------+---------------------+---------------------+
|            |        start        |        stop         |
+------------+---------------------+---------------------+
| session id |                     |                     |
| 1          | 2017-07-31 00:08:25 | 2017-07-31 00:26:01 |
| 2          | 2017-07-31 01:59:11 | 2017-07-31 02:36:03 |
| 3          | 2017-07-31 03:25:43 | 2017-07-31 03:48:40 |
| 4          | 2017-07-31 04:12:03 | 2017-07-31 04:12:03 |
| 5          | 2017-07-31 04:36:09 | 2017-07-31 05:23:26 |
| 6          | 2017-07-31 05:59:04 | 2017-07-31 06:24:34 |
+------------+---------------------+---------------------+

So I can work with that!

That said, I don't like assigning session_id through the global statement. Any idea how to do this in a neater way?
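One possible way to avoid the global counter is a cumulative sum over the new period flags: each True increments the running total, so all rows of one period end up with the same id. The following is only a sketch under assumptions: the sample rows are made up, the rows are assumed to be sorted by time within each user, and grouping by UserName is added here to cover the multi-user case mentioned in the question.

```python
import pandas as pd

# Hypothetical sample using the column names from the question.
df_ben = pd.DataFrame({
    "UserName": ["User1"] * 5,
    "Submission_Ended": pd.to_datetime([
        "2017-07-31 00:08:25",
        "2017-07-31 00:12:02",
        "2017-07-31 00:26:01",
        "2017-07-31 01:59:11",
        "2017-07-31 02:36:03",
    ]),
    "new period": [True, False, False, True, False],
})

# Each True bumps the running total, so rows in the same period share an id.
# The count restarts for every user.
df_ben["session id"] = (
    df_ben["new period"].astype(int).groupby(df_ben["UserName"]).cumsum()
)

# One row per (user, session) with the earliest and latest submission time.
df_final = (
    df_ben.groupby(["UserName", "session id"])["Submission_Ended"]
    .agg(start="min", stop="max")
    .reset_index()
)
```

If a user's first row is False rather than True, the ids for that user simply start at 0 instead of 1; the grouping into periods is unaffected.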