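I'm trying to apply a fairly complex function to a pandas DataFrame, and I'm wondering whether there is a faster way to do it. A simplified version of my data looks like this: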
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
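What I want to do is, for each combination of UID and UID2, check whether there is both a row with EventType = A and a row with EventType = B, compute the time difference between them in minutes, and add it back as a new column. The new dataset would then be: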
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
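To make the example reproducible, the sample input can be loaded roughly as follows. This is only a sketch: the variable name `raw` and the choice to parse `Time` with `pd.to_datetime` and an `%H:%M` format are assumptions, since the real data may already hold datetimes or timedeltas.

```python
import io

import pandas as pd

# Sample input copied from the table above.
raw = """UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
"""

df = pd.read_csv(io.StringIO(raw))

# Assumption: Time is a clock time in HH:MM form. Parsing it as a datetime
# means that subtracting two values yields a Timedelta.
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M")

print(df)
```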
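Here is my current implementation. I group the records by UID and UID2, so that only a small subset of rows has to be searched to check whether both EventTypes are present. I haven't been able to find a faster approach, and profiling in PyCharm hasn't helped me see where the bottleneck is.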
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a UID, UID2 combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        # minutes component only, so this assumes the gap is under an hour
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min
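For comparison, the same result can probably be reached without a Python-level loop. The sketch below is one possible vectorized formulation, not a benchmarked solution. It takes the first A and first B time per (UID, UID2) pair, pivots them into columns, subtracts, and merges the result back. The names `keys`, `first`, `diff_min` and `out` are made up for illustration, and the sample data and `Time` parsing are assumptions. Unlike `components.minutes`, it reports the total difference in minutes as a float.

```python
import io

import pandas as pd

raw = """UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
"""
df = pd.read_csv(io.StringIO(raw))
# Assumption: Time is a clock time in HH:MM form.
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M")

keys = ["UID", "UID2"]

# First A time and first B time per (UID, UID2), one column per event type.
# Pairs that lack one of the two event types get NaT in that column.
first = (
    df[df["EventType"].isin(["A", "B"])]
    .groupby(keys + ["EventType"])["Time"]
    .first()
    .unstack("EventType")
    .reindex(columns=["A", "B"])  # guarantees both columns exist
)

# Total difference in minutes; NaN wherever A or B is missing.
diff_min = ((first["B"] - first["A"]).dt.total_seconds() / 60).rename("TimeDiff")

# Attach the per-pair value to every row of that pair.
out = df.merge(diff_min.reset_index(), on=keys, how="left")
print(out)
```

Whether this is actually faster depends on the number of groups. The loop above rescans the whole frame with a boolean mask once per group, which is the part this version avoids.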