I have event data in the following format:
event A A A A A C B C D A A A B
timestamp 0 3 4 4 5 5 6 7 7 8 8 9 10
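Given a list of sequences S and the events E, how can I efficiently find the non-overlapping occurrences of the sequences in S within E that fall inside a time window W, where each event in an occurrence lies within an interval L of the previous event?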
For example, with:
S = {A, AA, AAA, AAB, BB, CA}, W=3, L=2
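Note that the occurrences do not have to be contiguous (i.e. the elements of a sequence need not appear consecutively in the event series). The timestamps are shown as integers for illustration only.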
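To make the two limits concrete, here is a minimal sketch of a check on the timestamps of a single candidate occurrence. The function name `is_valid_occurrence` and its parameters are hypothetical, the timestamps are assumed to be in non-decreasing order, and both bounds are assumed to be inclusive.

```python
def is_valid_occurrence(times, window, max_gap):
    """Check one candidate occurrence against both time limits.

    times   -- timestamps of the matched events, in non-decreasing order
    window  -- W: maximum time from the first matched event to the last
    max_gap -- L: maximum time between consecutive matched events
    Both bounds are treated as inclusive, which is an assumption.
    """
    if not times:
        return False
    within_window = times[-1] - times[0] <= window
    gaps_ok = all(b - a <= max_gap for a, b in zip(times, times[1:]))
    return within_window and gaps_ok


print(is_valid_occurrence([3, 4, 4], window=3, max_gap=2))  # True
print(is_valid_occurrence([0, 3], window=3, max_gap=2))     # False: gap of 3 exceeds L
print(is_valid_occurrence([4, 5, 8], window=3, max_gap=2))  # False: span 4 and gap 3
```

This only validates a candidate that has already been picked out; finding the candidates and keeping them non-overlapping is the actual problem.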
Answer 0 (score: 3)
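This can be done in a single pass over the data if you keep track of the incomplete subsequences that are still valid so far, and forget them as soon as they are completed or can no longer be completed. To that end I wrote a Sequence class that keeps track of each candidate's state.
Code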
events = 'AAAAACBCDAAAB'
timestamps = [0, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 10]
SEQUENCES = {'A', 'AA', 'AAA', 'AAB', 'BB', 'CA'}
WINDOW = 3
LENGTH = 2

class Sequence:
    def __init__(self, seq, starting_index, starting_time):
        self.sequence = seq
        self.pos = 0
        self.indices = [starting_index]
        self.times = [starting_time]
        self.has_expired = False

    def is_next_event_acceptable(self, event, time):
        if self.sequence[self.pos+1] != event:
            return False
        else:
            if time - self.times[0] > WINDOW or time - self.times[-1] > LENGTH:
                self.has_expired = True
                return False
            return True

    def add_event_if_acceptable(self, event, index, time):
        if self.is_next_event_acceptable(event, time):
            self.pos += 1
            self.indices.append(index)
            self.times.append(time)

    def is_complete(self):
        return len(self.sequence) == self.pos + 1

    def __repr__(self):
        seq = list(self.sequence)
        seq.insert(self.pos, '[')
        seq.insert(self.pos + 2, ']')
        return ''.join(seq)

def find_non_overlapping_subsequences(events, timestamps):
    working_sequences = []
    results = {s: {'seq': [], 'last_index': -1} for s in SEQUENCES}

    for index, (event, time) in enumerate(zip(events, timestamps)):
        # First work with any present sequences in the queue
        # and then introduce any new ones
        for Seq in working_sequences:
            Seq.add_event_if_acceptable(event, index, time)
        for seq in SEQUENCES:
            if seq.startswith(event):
                working_sequences.append(Sequence(seq, index, time))
        # Any successfully completed sequences, or sequences
        # that can't be completed anymore are to be removed
        seq_idx_to_remove = []
        for i, Seq in enumerate(working_sequences):
            if Seq.has_expired:
                seq_idx_to_remove.append(i)
            elif Seq.is_complete():
                seq_idx_to_remove.append(i)
                # Only add the sequence to the results if the indices
                # aren't overlapping with the previous one
                sequence, times, indices = Seq.sequence, Seq.times, Seq.indices
                if results[sequence]['last_index'] < indices[0]:
                    results[sequence]['seq'].append(times)
                    results[sequence]['last_index'] = indices[-1]
        # We must remove the items in reverse order so that
        # we don't disturb the 'forward' ordering
        for i in seq_idx_to_remove[::-1]:
            del working_sequences[i]

    return results

results = find_non_overlapping_subsequences(events, timestamps)
for key, value in sorted(results.items()):
    print(key, value['seq'])
Output
A [[0], [3], [4], [4], [5], [8], [8], [9]]
AA [[3, 4], [4, 5], [8, 8]]
AAA [[3, 4, 4], [8, 8, 9]]
AAB [[4, 5, 6], [8, 8, 10]]
BB []
CA [[7, 8]]
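For a longer series of events this can take a long time, depending on how many sequences have to be considered at each step. The longer-lived your sequences are, the more of them must be checked on every iteration, and the longer SEQUENCES is, the more new sequences are introduced at each step.
While the factors above ultimately determine how long each iteration step takes, a few optimisations are possible. At every step we go through all the currently incomplete sequences in working_sequences and check what effect the new event has on them. However, if we rework the Sequence class, then every time a sequence is updated we can compute what its next event is. At each step we can then bin the sequences by that fact. This way, if the next event is 'A', we only check the sequences that accept that event. This also conveniently splits off the sequences that have completed or expired.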
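The binning idea can be illustrated in isolation with a minimal sketch. The helper `advance` and the `(sequence, pos)` tuples are hypothetical simplifications, and the time checks are deliberately left out so that only the bucketing is visible.

```python
def advance(buckets, event):
    """Advance only the candidates waiting for `event`; return the completed ones.

    buckets maps an expected next event to a list of (sequence, pos) pairs,
    where pos is the index of the last matched character of the sequence.
    Time limits are ignored here (a simplification for illustration).
    """
    completed = []
    waiting = buckets.pop(event, [])  # candidates in other buckets are never touched
    for sequence, pos in waiting:
        pos += 1
        if pos == len(sequence) - 1:
            completed.append(sequence)
        else:
            # Re-file the candidate under the event it now expects
            buckets.setdefault(sequence[pos + 1], []).append((sequence, pos))
    return completed


buckets = {
    'A': [('AAB', 0), ('CA', 0)],  # 'AAB' matched 'A', 'CA' matched 'C'; both wait for 'A'
    'B': [('AAB', 1)],             # matched 'AA', waits for 'B'
}
print(advance(buckets, 'A'))  # ['CA']
print(buckets)                # {'B': [('AAB', 1), ('AAB', 1)]}
```

When 'A' arrives, only the two candidates filed under 'A' are examined; the one waiting for 'B' is left alone, which is where the saving comes from.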
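A second, less impactful optimisation is to precompute all the sequences that start with a particular event, so that we do not have to iterate through SEQUENCES every time.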
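A minimal sketch of that precomputation, assuming single-character events as in the example data. The name `group_by_first_event` is hypothetical, and the sequences are sorted only to make the printed order deterministic.

```python
from collections import defaultdict

SEQUENCES = {'A', 'AA', 'AAA', 'AAB', 'BB', 'CA'}


def group_by_first_event(sequences):
    """Map each starting event to the sequences that begin with it."""
    groups = defaultdict(list)
    for seq in sorted(sequences):  # sorted only for a reproducible order
        groups[seq[0]].append(seq)
    return groups


starts = group_by_first_event(SEQUENCES)
print(starts['A'])          # ['A', 'AA', 'AAA', 'AAB']
print(starts['C'])          # ['CA']
print(starts.get('D', []))  # []: no sequence starts with 'D'
```

With this lookup, an incoming event costs one dictionary access plus one new candidate per matching sequence, instead of a `startswith` test against every entry of SEQUENCES.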
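This should avoid any unnecessary checks and improve overall performance. However, the worst case is still the same as for the simple version above. For example, if 90% of the events are 'A' and 90% of the starting or next events of the sequences are 'A', this will still take about 90% of the time it took before.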
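The following changes to the code reflect these optimisations. I have also assumed that the timestamps are strictly increasing, so anything that relied on the indices attribute can be simplified.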
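Since the version below relies on that assumption, a quick check of the input may be worthwhile. This is a minimal sketch; `is_strictly_increasing` is a hypothetical helper name.

```python
def is_strictly_increasing(timestamps):
    """Return True if every timestamp is larger than the one before it."""
    return all(a < b for a, b in zip(timestamps, timestamps[1:]))


print(is_strictly_increasing([0, 3, 4, 5, 7]))  # True
# The sample timestamps from the question contain repeats (4, 4 and 8, 8)
print(is_strictly_increasing([0, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 10]))  # False
```

The sample data from the question fails this check, so on that input the timestamp-based overlap test below may reject occurrences that the index-based version accepts.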
EXPIRED = '#'
COMPLETED = '='

class Sequence:
    def __init__(self, seq, starting_time):
        self.sequence = seq
        self.pos = 0
        self.times = [starting_time]
        self.has_expired = False
        self.next_event = self.next_event_query()

    def is_next_event_acceptable(self, event, time):
        if self.next_event != event:
            return False
        if time - self.times[0] > WINDOW or time - self.times[-1] > LENGTH:
            self.has_expired = True
            return False
        return True

    def update_sequence(self, event, time):
        if self.is_next_event_acceptable(event, time):
            self.pos += 1
            self.times.append(time)
        # Refresh next_event even when the event was rejected, so that a
        # sequence that has just expired is filed under EXPIRED and dropped
        self.next_event = self.next_event_query()

    def next_event_query(self):
        if self.has_expired:
            return EXPIRED
        return COMPLETED if len(self.sequence) == self.pos + 1 else self.sequence[self.pos+1]

    def __repr__(self):
        seq = list(self.sequence)
        seq.insert(self.pos, '[')
        seq.insert(self.pos + 2, ']')
        return ''.join(seq)

def find_non_overlapping_subsequences(events, timestamps):
    unique_events = set(events)
    starting_events = {}
    for seq in SEQUENCES:
        unique_events.update(seq)
        first_event = seq[0]
        if first_event not in starting_events:
            starting_events[first_event] = []
        starting_events[first_event].append(seq)
    for e in unique_events:
        if e not in starting_events:
            starting_events[e] = []

    all_symbols = ''.join(unique_events) + EXPIRED + COMPLETED
    working_sequences = {event: [] for event in all_symbols}
    next_event_lists = {event: [] for event in all_symbols}
    results = {s: {'seq': [], 'last_time': timestamps[0]-1} for s in SEQUENCES}

    for event, time in zip(events, timestamps):
        next_event_lists[event] = []
        for S in working_sequences[event]:
            S.update_sequence(event, time)
            next_event_lists[S.next_event].append(S)
        for seq in starting_events[event]:
            S = Sequence(seq, time)
            next_event_lists[S.next_event].append(S)
        for S in next_event_lists[COMPLETED]:
            # Only add the sequence to the results if the timestamps
            # don't overlap with the previous one
            sequence, times = S.sequence, S.times
            if results[sequence]['last_time'] < times[0]:
                results[sequence]['seq'].append(times)
                results[sequence]['last_time'] = times[-1]
        next_event_lists[EXPIRED] = []
        next_event_lists[COMPLETED] = []
        working_sequences = next_event_lists.copy()

    return results