Question

I have the following pandas DataFrame (edit: it is sorted by created_date):

    created_date    incoming_message
0   11/13/2014 18:06    1
1   11/13/2014 21:56    0
2   11/14/2014 3:40     1
3   11/14/2014 3:55     1
4   11/14/2014 5:09     0
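
For anyone who wants to reproduce the frame above, here is a minimal sketch. It assumes the timestamps arrive as month/day/year strings and parses them explicitly, so that subtracting two rows yields a timedelta.

```python
import pandas as pd

# Sample rows copied from the table above.
df = pd.DataFrame({
    "created_date": [
        "11/13/2014 18:06",
        "11/13/2014 21:56",
        "11/14/2014 3:40",
        "11/14/2014 3:55",
        "11/14/2014 5:09",
    ],
    "incoming_message": [1, 0, 1, 1, 0],
})

# Parse the strings into datetime64 values; the format string is an
# assumption based on how the sample rows are written.
df["created_date"] = pd.to_datetime(df["created_date"], format="%m/%d/%Y %H:%M")
```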

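incoming_message indicates the direction of the message (1 = inbound, 0 = outbound). I am trying to find the average time between message exchanges, meaning 1) how long it takes to respond (send an outgoing message) and 2) how long it takes to hear back (receive an incoming message). If there are several consecutive incoming messages, I want to compute the duration from the first one.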

For example, given

2   11/14/2014 3:40     1
3   11/14/2014 3:55     1
4   11/14/2014 5:09     0

I should calculate the duration between

2   11/14/2014 3:40     1
4   11/14/2014 5:09     0

Here is my attempt:

import numpy as np

def responseTime(df):
    ttr = []  # time to respond, in hours
    tth = []  # time to hear back, in hours
    n = len(df)
    i = 0
    j = i + 1
    while j <= n - 1:
        # advance j to the first message whose direction differs from message i
        while j < n and df.iloc[i]['incoming_message'] == df.iloc[j]['incoming_message']:
            j += 1
        if j == n:
            break  # the trailing run has no reply yet
        fd = df.iloc[i]
        nd = df.iloc[j]
        # total_seconds() keeps whole days; .seconds would drop them
        hours = (nd['created_date'] - fd['created_date']).total_seconds() / 3600.0
        if fd['incoming_message'] == 1:
            ttr.append(hours)
        else:
            tth.append(hours)
        i = j
        j = i + 1
    return np.mean(ttr), np.mean(tth)

Although this function works, I feel there is a more efficient way to solve this. Any feedback and suggestions would be much appreciated!

Answer 1

I'm not sure exactly what output you want (e.g. whether you want a transform). This is a groupby.

In [91]: df
Out[91]: 
                 date  value
0 2014-11-13 18:06:00      1
1 2014-11-13 21:56:00      0
2 2014-11-14 03:40:00      1
3 2014-11-14 03:55:00      1
4 2014-11-14 05:09:00      0

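Create the grouper. It is a segmenter: it finds the breakpoints where value switches from 0 to 1 (where the diff equals 1), and the cumulative sum numbers the groups accordingly.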

In [92]: grouper = (df.value.diff(1)==1).cumsum()

In [93]: grouper
Out[93]: 
0    0
1    0
2    1
3    1
4    1
Name: value, dtype: int64

In [94]: g = df.groupby(grouper)
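
To see why the grouper comes out as 0, 0, 1, 1, 1, it can help to lay the intermediate steps side by side. This is only an illustrative sketch; the `steps` frame and its column names are made up for the example.

```python
import pandas as pd

value = pd.Series([1, 0, 1, 1, 0], name="value")

steps = pd.DataFrame({
    "value": value,
    # NaN on the first row, +1 on a 0 -> 1 switch, -1 on a 1 -> 0 switch
    "diff": value.diff(1),
    # True only where an inbound message follows an outbound one
    "starts_group": value.diff(1) == 1,
})

# Each True bumps the running total, so every row gets its group number.
steps["grouper"] = steps["starts_group"].cumsum()
print(steps)
```

Each group therefore holds one run of inbound messages followed by the run of outbound messages that answers it.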

Compute the last value of the date column minus the first, which gives a timedelta. These are per GROUP (i.e. that is what the index represents).

In [95]: g['date'].last()-g['date'].first()
Out[95]: 
value
0       03:50:00
1       01:29:00
Name: date, dtype: timedelta64[ns]
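
If an average in hours is what you are after, the per-group timedeltas can be converted and averaged. A sketch under the same column names (`date`, `value`); the `spans` frame is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-11-13 18:06", "2014-11-13 21:56", "2014-11-14 03:40",
        "2014-11-14 03:55", "2014-11-14 05:09",
    ]),
    "value": [1, 0, 1, 1, 0],
})

grouper = (df["value"].diff(1) == 1).cumsum()

# First and last timestamp of every group, side by side.
spans = df.groupby(grouper)["date"].agg(["first", "last"])

# total_seconds() keeps whole days, unlike the .seconds attribute.
spans["hours"] = (spans["last"] - spans["first"]).dt.total_seconds() / 3600.0
mean_hours = spans["hours"].mean()
```

Note that `last` is the final row of the group, so if several outbound messages follow each other the span runs to the last of them rather than the first.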

If you want to keep the positions from the original data, this is a transform-type operation.

In [105]: result = g['date'].transform('last')-g['date'].transform('first')

In [106]: result
Out[106]: 
0   03:50:00
1   03:50:00
2   01:29:00
3   01:29:00
4   01:29:00
dtype: timedelta64[ns]

Then you need to select the indices where the original breakpoints occurred.

In [108]: result.iloc[grouper.drop_duplicates(keep='last').index]
Out[108]: 
1   03:50:00
4   01:29:00
dtype: timedelta64[ns]

These are all very efficient, as they are all vectorized operations.
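
The groupby above measures each inbound-to-outbound span. To get both averages the question asks for (time to respond and time to hear back), one possible vectorized sketch, not part of the answer above, is to keep only the first message of each run and diff the timestamps. It uses the question's column names and assumes the frame is already sorted by created_date.

```python
import pandas as pd

df = pd.DataFrame({
    "created_date": pd.to_datetime([
        "2014-11-13 18:06", "2014-11-13 21:56", "2014-11-14 03:40",
        "2014-11-14 03:55", "2014-11-14 05:09",
    ]),
    "incoming_message": [1, 0, 1, 1, 0],
})

# A row starts a run when its direction differs from the previous row's.
is_run_start = df["incoming_message"] != df["incoming_message"].shift()
starts = df[is_run_start]

# Hours between the first message of each run and the first message of the
# previous run; the very first run has no predecessor and gets NaN.
gap_hours = starts["created_date"].diff().dt.total_seconds() / 3600.0

# A gap that ends on an outbound message is a response time; a gap that
# ends on an inbound message is a time to hear back. mean() skips NaN.
ttr = gap_hours[starts["incoming_message"] == 0].mean()
tth = gap_hours[starts["incoming_message"] == 1].mean()
```

On the sample data this gives roughly 2.66 hours to respond and 5.73 hours to hear back.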

Calculating the time difference between alternating timestamps, with a wrinkle (Python Pandas)
