Question

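Given activity records in a pandas DataFrame, consisting of 'id's that go through a series of 'action's at specified 'timestamp's, I want to keep only the rows that correspond to a specified sequence of actions.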

Example input data:

import pandas as pd 
# Create a sample data-frame from a dictionary 
id = ['A123', 'A123', 'A123', 'A123', 'A123', 'A123', 'A234', 'A234', 'A234', 'A234', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341']
action = ['A', 'B', 'C', 'D', 'B', 'A', 'B', 'A', 'C', 'D', 'D', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'B', 'C']
timestamp = ['1', '2', '3', '4', '5', '6', '1', '2', '3', '4', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
the_dict = {'id': id, 'action': action, 'timestamp': timestamp}
# This is the sample data-frame with columns:
# id    action    timestamp
# Each id when ordered by timestamp then action gives the sequence of actions taken by the id
dataFrame = pd.DataFrame(the_dict)
######################################
# Input data
######################################
#      id action timestamp
#0   A123      A         1
#1   A123      B         2
#2   A123      C         3
#3   A123      D         4
#4   A123      B         5
#5   A123      A         6
#6   A234      B         1
#7   A234      A         2
#8   A234      C         3
#9   A234      D         4
#10  A341      D         1
#11  A341      B         2
#12  A341      C         3
#13  A341      D         4
#14  A341      A         5
#15  A341      B         6
#16  A341      C         7
#17  A341      D         8
#18  A341      B         9
#19  A341      C        10

# The sequence of interest
the_sequence = ['B', 'C', 'D']

# Desired output: Group by id, order by timestamp, return all rows which match the given sequence of actions
######################################
# The output data-frame:
######################################
#      id action timestamp
#1   A123      B         2
#2   A123      C         3
#3   A123      D         4
#11  A341      B         2
#12  A341      C         3
#13  A341      D         4
#15  A341      B         6
#16  A341      C         7
#17  A341      D         8

Answer 1

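You can use .shift together with & and | logic. Basically, you check for B rows that have C and D in the next two rows within the same id; this returns the B rows. Then you follow a similar protocol for the C and D rows.

Code: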

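# df is the question's dataFrame, already ordered by id and timestamp.
# group_keys=False keeps the original row index so the mask aligns with df.
mask = (df.groupby('id', group_keys=False)['action']
          .apply(lambda x:
                 (x == 'B') & (x.shift(-1) == 'C') & (x.shift(-2) == 'D') |
                 (x == 'C') & (x.shift(1) == 'B') & (x.shift(-1) == 'D') |
                 (x == 'D') & (x.shift(2) == 'B') & (x.shift(1) == 'C')))
df = df[mask]
df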
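
The hard-coded conditions above can be generalized to a sequence of any length. The following is only a sketch of that idea, not part of the original answer: `shift_mask` and its parameter names are hypothetical, and it assumes the frame is already ordered by timestamp within each id.

```python
import pandas as pd

def shift_mask(df, seq, seq_col="action", gp_col="id"):
    """Boolean mask of rows that are part of a contiguous run of `seq` within a group."""
    grouped = df.groupby(gp_col)[seq_col]
    n = len(seq)
    mask = pd.Series(False, index=df.index)
    # Case k: the current row holds seq[k]; every other element seq[j]
    # must then sit at row offset (j - k) inside the same group.
    for k in range(n):
        hit = pd.Series(True, index=df.index)
        for j in range(n):
            # shift(k - j) moves the row at offset (j - k) onto the current row
            col = df[seq_col] if j == k else grouped.shift(k - j)
            hit &= col.eq(seq[j])
        mask |= hit
    return mask
```

With the question's data, `dataFrame[shift_mask(dataFrame, the_sequence)]` would select the rows of each B, C, D run.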

Answer 2

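We can use cumsum + str.contains. Note that the mask stays True for every row from the point where B, C, D first completes within an id, so the result below keeps those later rows as well.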

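# group_keys=False keeps the original row index so the mask aligns with df
m = (df.groupby('id', group_keys=False).action
       .apply(lambda x: (x + ',').cumsum())
       .str.contains('B,C,D'))
nedf = df[m]
nedf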
      id action timestamp
3   A123      D         4
4   A123      B         5
5   A123      A         6
13  A341      D         4
14  A341      A         5
15  A341      B         6
16  A341      C         7
17  A341      D         8
18  A341      B         9
19  A341      C        10
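
A string-based variant that keeps only the rows of each match might look like the sketch below. It is not from the original answer: `rows_matching` is a hypothetical helper, and it relies on the assumption that every action label is exactly one character, so that a character offset in the joined string equals a row offset within the group.

```python
import re
import pandas as pd

def rows_matching(df, seq, seq_col="action", gp_col="id"):
    # Assumption: each action label is a single character.
    # The lookahead makes overlapping matches visible to finditer.
    pattern = re.compile("(?=" + re.escape("".join(seq)) + ")")
    keep = []
    for _, grp in df.groupby(gp_col, sort=False):
        text = "".join(grp[seq_col])
        for match in pattern.finditer(text):
            start = match.start()
            # Map the character offsets back to the group's index labels
            keep.extend(grp.index[start:start + len(seq)])
    # isin() drops duplicates from overlapping matches and keeps row order
    return df.loc[df.index.isin(keep)]
```

For labels longer than one character, the offset mapping no longer holds and a shift-based mask is the safer choice.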

Answer 3

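If you need to find a sequence, you can use np.logical_and.reduce + shift inside a list comprehension, similar to an earlier answer of mine. In this case you also need to take the grouping into account, but given that the data is sorted, shift can handle that check as well.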

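The idea here is to find all rows that equal one element of the sequence, then use shift to check that the neighbouring row equals the adjacent element (and make sure it is in the same group). Because the sequence is reversed first, m gives us all indices where the sequence ends, so we can use it to form a mask that slices the original DataFrame.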

import numpy as np

def find_seq_within_group(df, seq, seq_col, gp_col):
    seq = seq[::-1]  # to get last index
    m = np.logical_and.reduce([df[seq_col].shift(i).eq(seq[i]) & df[gp_col].shift(i).eq(df[gp_col]) 
                               for i in range(len(seq))])
    
    # Return entire sequence
    m = np.logical_or.reduce([np.roll(m, -i) for i in range(len(seq))])
    
    return df.loc[m]

# df = df.sort_values(['id', 'timestamp'])
find_seq_within_group(df=df, seq=['B', 'C', 'D'], seq_col='action', gp_col='id')

      id action timestamp
1   A123      B         2
2   A123      C         3
3   A123      D         4
11  A341      B         2
12  A341      C         3
13  A341      D         4
15  A341      B         6
16  A341      C         7
17  A341      D         8
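
The commented-out sort_values line matters when the rows are not already in order. The question's sample stores timestamp as strings, so a plain sort would place '10' before '2'. A minimal sketch of one way to handle this, assuming the timestamps are numeric strings (`sort_for_sequences` is a hypothetical helper, not part of the answer):

```python
import pandas as pd

def sort_for_sequences(df, gp_col="id", ts_col="timestamp"):
    """Return a copy ordered by group and numeric timestamp."""
    out = df.copy()
    # Convert '1', '2', ..., '10' to numbers so they sort numerically
    out[ts_col] = pd.to_numeric(out[ts_col])
    return out.sort_values([gp_col, ts_col])
```

The sorted copy keeps its original index labels, so the rows returned by find_seq_within_group can still be traced back to the input.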

Answer 4

This might also help:

sequence = ['A', 'B', 'C', 'D']
n = len(sequence)
# Slide a window of length n over the action column (id boundaries are ignored)
for i in range(dataFrame.shape[0] - n + 1):
    if list(dataFrame['action'].iloc[i:i + n]) == sequence:
        print("the sequence starts at", i)
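
The loop above scans the whole column, so a window can straddle two ids. A group-aware version of the same sliding-window idea could be sketched as follows; `window_starts` and its parameters are hypothetical names, and rows are assumed to be ordered by timestamp within each id.

```python
import pandas as pd

def window_starts(df, seq, seq_col="action", gp_col="id"):
    """Return the index labels at which `seq` starts, scanning each group separately."""
    seq = list(seq)
    n = len(seq)
    starts = []
    for _, grp in df.groupby(gp_col, sort=False):
        actions = grp[seq_col].tolist()
        # Compare every window of length n inside this group only
        for i in range(len(actions) - n + 1):
            if actions[i:i + n] == seq:
                starts.append(grp.index[i])
    return starts
```

Each returned label marks the first row of a match; the following len(seq) - 1 rows of the same id complete it.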

Pandas: keep only the specified subsequence (groupby, order, keep subsequence)
