Assign values based on the previous row's output

Time: 2019-03-08 00:11:37

Tags: python pandas dataframe

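I am using pandas to analyze an application's output log and want to assign each entry to a session. A session is defined as the 60 minutes counted from its start.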

Here is a small example:

import numpy as np
import pandas as pd
from datetime import timedelta

> df = pd.DataFrame({
    'time': [
        pd.Timestamp(2019, 1, 1, 1, 10),
        pd.Timestamp(2019, 1, 1, 1, 15),
        pd.Timestamp(2019, 1, 1, 1, 20),
        pd.Timestamp(2019, 1, 1, 2, 20),
        pd.Timestamp(2019, 1, 1, 5, 0),
        pd.Timestamp(2019, 1, 1, 5, 15)
    ]
})

> df
                   time
0   2019-01-01 01:10:00
1   2019-01-01 01:15:00
2   2019-01-01 01:20:00
3   2019-01-01 02:20:00
4   2019-01-01 05:00:00
5   2019-01-01 05:15:00

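For the first row, start_time equals time. For each subsequent row, if its time is within 1 hour of the previous row's time, it is considered part of the same session. Otherwise, it starts a new session with start_time = time. I am currently using a loop: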

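# Initialise with NaT so the column has a datetime dtype from the start
df['start_time'] = pd.NaT

# Assumes the default integer index 0..n-1, so index - 1 is the previous row
for index in df.index:
    if index == 0:
        start_time = df['time'][index]
    else:
        delta = df['time'][index] - df['time'][index - 1]
        start_time = df['start_time'][index - 1] if delta.total_seconds() <= 3600 else df['time'][index]

    # .loc avoids chained assignment, which may silently write to a copy
    df.loc[index, 'start_time'] = start_time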

Output:

                   time          start_time
0   2019-01-01 01:10:00 2019-01-01 01:10:00
1   2019-01-01 01:15:00 2019-01-01 01:10:00
2   2019-01-01 01:20:00 2019-01-01 01:10:00
3   2019-01-01 02:20:00 2019-01-01 01:10:00
4   2019-01-01 05:00:00 2019-01-01 05:00:00 # new session
5   2019-01-01 05:15:00 2019-01-01 05:00:00

It works, but it is slow. Is there a vectorized way to do this?

2 answers:

Answer 0 (score: 2)

Use diff together with cumsum to create a group key, then use that key to take the first value of each group:

s=(df.time.diff()/np.timedelta64(1, 's')).gt(3600).cumsum()
df.groupby(s)['time'].transform('first')
Out[833]: 
0   2019-01-01 01:10:00
1   2019-01-01 01:10:00
2   2019-01-01 01:10:00
3   2019-01-01 01:10:00
4   2019-01-01 05:00:00
5   2019-01-01 05:00:00
Name: time, dtype: datetime64[ns]
df['start_time']=df.groupby(s)['time'].transform('first')
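
To make the one-liner above easier to follow, here is a minimal sketch that splits it into named steps on the question's sample data. The intermediate names (gap_seconds, is_new_session, session_key) are not from the answer; they are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'time': pd.to_datetime([
        '2019-01-01 01:10', '2019-01-01 01:15', '2019-01-01 01:20',
        '2019-01-01 02:20', '2019-01-01 05:00', '2019-01-01 05:15',
    ])
})

# Gap to the previous row in seconds; the first row has no predecessor (NaN)
gap_seconds = df['time'].diff() / np.timedelta64(1, 's')

# True where the gap exceeds one hour; NaN compares as False,
# so the first row does not count as a break
is_new_session = gap_seconds.gt(3600)

# The running count of breaks labels each session: 0, 0, 0, 0, 1, 1
session_key = is_new_session.cumsum()

# Broadcast the first timestamp of each session to all of its rows
df['start_time'] = df.groupby(session_key)['time'].transform('first')
print(df)
```

Because gt is a strict comparison, a gap of exactly 3600 seconds (row 3) stays in the same session, which matches the `<= 3600` test in the question's loop.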

Answer 1 (score: 1)

I use np.where, shift, and cumsum to create a session ID. Then I use transform with min to get the start time.
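
A minimal sketch of what this description suggests, run on the question's sample data. This is a reconstruction under assumptions, not the answerer's exact code: the column name session_id and the variable names prev_time and new_session are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'time': pd.to_datetime([
        '2019-01-01 01:10', '2019-01-01 01:15', '2019-01-01 01:20',
        '2019-01-01 02:20', '2019-01-01 05:00', '2019-01-01 05:15',
    ])
})

# Previous row's timestamp; NaT for the first row
prev_time = df['time'].shift()

# 1 where the gap to the previous row exceeds one hour, else 0
# (the comparison with NaT is False, so the first row gets 0)
new_session = np.where(df['time'] - prev_time > pd.Timedelta(hours=1), 1, 0)

# Cumulative sum of the break flags gives a session ID per row
df['session_id'] = new_session.cumsum()

# The earliest timestamp in each session is its start time
df['start_time'] = df.groupby('session_id')['time'].transform('min')
print(df)
```

Taking min gives the same result as taking the first row only when the log is sorted by time, as it is in the sample data.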
