Question

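Over the course of a few months, I have collected a dataset of text-field interactions from a few dozen users of my app. I am trying to calculate the average time between keystrokes using pandas. The data looks like this: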

timestamp                before_text     after_text
1453481138188                  NULL               a
1453481138600                     a              ab 
1453481138900                    ab             abc
1453481139400                   abc            abcd
1453484000000    Enter some numbers               1
1453484000100                     1              12
1453484000600                    12             123

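timestamp contains the Unix time at which the user pressed the key, before_text is the text the field contained before the user hit the key, and after_text is what the field looked like after the keystroke.

What is the best way to do this? I know it is not as simple as doing: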

(df["timestamp"] - df["timestamp"].shift()).mean()

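because that would include the very large time differences at the boundary between two interactions. It seems like the best approach would be to pass some function of each row to df.groupby, so that I can apply the snippet above to each interaction separately. If I had this magic_function I could do something like: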

df.groupby(magic_function).apply(lambda x: x["timestamp"] - x["timestamp"].shift()).mean()

What would be a good way to implement magic_function, or am I thinking about this all wrong?

Answer 1

I would do this by calculating the text difference between 'before' and 'after'. If the difference is greater than some threshold, then it is a new session.

This needs the python-levenshtein package, which I installed with pip:

pip install python-levenshtein

Then:

from Levenshtein import distance as ld
import pandas as pd

# Take just these two columns, transpose, and back fill.
# I back fill for one reason: to fill that pesky NA with the after text.
before_after = df[['before_text', 'after_text']].T.bfill()

distances = before_after.apply(lambda x: ld(*x))

# threshold should be how much distance constitutes an obvious break in sessions.
threshold = 2
magic_function = (distances > threshold).cumsum()

df.groupby(magic_function) \
  .apply(lambda x: x["timestamp"] - x["timestamp"].shift()) \
  .mean()

362.4
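If installing python-levenshtein is not an option, the same session split can be sketched with a pure-Python edit distance. The edit_distance helper and the hard-coded pairs below are illustrative assumptions, not part of the answer above; the first pair stands in for the NULL row after back filling.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programming table."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# (before_text, after_text) pairs from the question's sample data.
# The NULL in the first row is replaced by its after_text, as bfill would do.
pairs = [
    ("a", "a"), ("a", "ab"), ("ab", "abc"), ("abc", "abcd"),
    ("Enter some numbers", "1"), ("1", "12"), ("12", "123"),
]

threshold = 2  # assumed cut-off, same value as in the answer
labels = []
session = 0
for before, after in pairs:
    # A jump larger than the threshold marks the start of a new session.
    if edit_distance(before, after) > threshold:
        session += 1
    labels.append(session)

print(labels)
```

The resulting labels play the role of magic_function: one integer per row, constant within a session.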

Answer 2

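Your problem is mainly one of determining when a given interaction stops and when the next one begins. Perhaps calculate the difference between consecutive timestamps and, if it is greater than a threshold, set a flag that you can group on.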

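thresh = 1e5  # gap (in timestamp units) that marks a new interaction
# True where the gap from the previous keystroke exceeds the threshold
ts = (df['timestamp'] - df['timestamp'].shift()) > thresh
grp = [0]  # the first row always belongs to group 0
# Start at 1 so that grp ends up with exactly one label per row
for i in range(1, len(ts)):
    if ts.iloc[i]:
        grp.append(grp[-1] + 1)
    else:
        grp.append(grp[-1])
df['grouper'] = grp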

Now you can simply group: grouped = df.groupby('grouper'), then subtract the timestamps within each group and calculate the mean difference.
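A minimal sketch of that last step, assuming the sample rows from the question and a grouper column whose labels are written out by hand purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    # Sample timestamps from the question.
    "timestamp": [1453481138188, 1453481138600, 1453481138900, 1453481139400,
                  1453484000000, 1453484000100, 1453484000600],
    # Assumed labels: one value per interaction, typed in by hand here.
    "grouper": [0, 0, 0, 0, 1, 1, 1],
})

# diff() within each group leaves NaN on the first row of every group,
# so the large gap between interactions never enters the average.
gaps = df.groupby("grouper")["timestamp"].diff()

# mean() skips NaN by default, so only within-interaction gaps are averaged.
mean_gap = gaps.mean()
print(mean_gap)
```

On these seven rows the five remaining gaps are 412, 300, 500, 100 and 500, which average to 362.4.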

I am trying to think of a way to avoid the loop, but in the meantime give this a try and let me know how it goes.
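One loop-free possibility, offered as a sketch rather than as the answerer's own follow-up, is to take the cumulative sum of the boolean "gap exceeds threshold" flags. The sample timestamps and the thresh value are carried over from above; the new_interaction name is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": [1453481138188, 1453481138600, 1453481138900, 1453481139400,
                  1453484000000, 1453484000100, 1453484000600],
})
thresh = 1e5  # same threshold as in the loop version

# True on rows whose gap from the previous keystroke exceeds thresh.
# The first row's gap is NaN, and NaN > thresh evaluates to False.
new_interaction = df["timestamp"].diff() > thresh

# Each True bumps the running count, giving one integer label per interaction.
df["grouper"] = new_interaction.cumsum()
print(df["grouper"].tolist())
```

This should produce the same labels as the loop, with one label per row and no manual list bookkeeping.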

Calculating the time between text field interactions
