Finding occurrences of subsequences in event data with time constraints

Time: 2019-01-31 15:16:21

Tags: python

I have event data in the following format:

event     A A A A A C B C D A A A B
timestamp 0 3 4 4 5 5 6 7 7 8 8 9 10

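Given a list of sequences S and a series of events E, how can I efficiently find the non-overlapping occurrences of each sequence in S within E, such that every occurrence fits inside a time window W and each event in the occurrence lies within an interval L of the previous event?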

The example results are for:

S = {A, AA, AAA, AAB, BB, CA}, W=3, L=2

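As you can see, an occurrence does not have to be contiguous (i.e. the elements of a sequence need not appear back to back in the event series). The timestamps are shown as integers for illustration only.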
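
To make the two constraints concrete, here is a minimal sketch of a check for a single candidate occurrence. The function name `satisfies_constraints` and its parameters are hypothetical, and treating both bounds as inclusive is an assumption (it matches the comparisons used in the answer's code).

```python
def satisfies_constraints(times, window, length):
    """Return True if one candidate occurrence meets both time constraints.

    times  -- timestamps of the matched events, in order (hypothetical input)
    window -- W: maximum allowed span from the first to the last event
    length -- L: maximum allowed gap between consecutive events
    """
    if not times:
        return False
    # The whole occurrence must fit inside the window W (assumed inclusive).
    within_window = times[-1] - times[0] <= window
    # Every event must follow the previous one within L (assumed inclusive).
    gaps_ok = all(b - a <= length for a, b in zip(times, times[1:]))
    return within_window and gaps_ok


print(satisfies_constraints([4, 5, 6], window=3, length=2))  # True
print(satisfies_constraints([0, 3], window=3, length=2))     # False: gap of 3 exceeds L
print(satisfies_constraints([3, 5, 7], window=3, length=2))  # False: span of 4 exceeds W
```

This only validates one occurrence; it says nothing about how to enumerate candidates or avoid overlaps.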

1 answer:

Answer 0: (score: 3)

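This can be done in a single pass over the data if you keep track of the subsequences that are valid but still incomplete, and forget them as soon as they are completed or can no longer be completed. To that end I wrote a Sequence class that keeps track of

  • the name of the sequence,
  • the indices at which its events occur, to figure out whether it overlaps with a previously completed sequence,
  • the times of its events, because these are our output and we need them to check the constraints,
  • the current position in the sequence name, so that we know what the next event should be and when the sequence is complete, and
  • a flag to forget the sequence if it goes beyond our window/length constraints.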

Code

events = 'AAAAACBCDAAAB'
timestamps = [0, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 10]

SEQUENCES = {'A', 'AA', 'AAA', 'AAB', 'BB', 'CA'}
WINDOW = 3
LENGTH = 2

class Sequence:
    def __init__(self, seq, starting_index, starting_time):
        self.sequence = seq
        self.pos = 0
        self.indices = [starting_index]
        self.times = [starting_time]
        self.has_expired = False

    def is_next_event_acceptable(self, event, time):
        if self.sequence[self.pos+1] != event:
            return False
        else:
            if time - self.times[0] > WINDOW or time - self.times[-1] > LENGTH:
                self.has_expired = True
                return False
            return True

    def add_event_if_acceptable(self, event, index, time):
        if self.is_next_event_acceptable(event, time):
            self.pos += 1
            self.indices.append(index)
            self.times.append(time)

    def is_complete(self):
        return len(self.sequence) == self.pos + 1

    def __repr__(self):
        seq = list(self.sequence)
        seq.insert(self.pos, '[')
        seq.insert(self.pos + 2, ']')
        return ''.join(seq)


def find_non_overlapping_subsequences(events, timestamps):
    working_sequences = []
    results = {s: {'seq': [], 'last_index': -1} for s in SEQUENCES}

    for index, (event, time) in enumerate(zip(events, timestamps)):
        # First work with any present sequences in the queue
        # and then introduce any new ones
        for Seq in working_sequences:
            Seq.add_event_if_acceptable(event, index, time)
        for seq in SEQUENCES:
            if seq.startswith(event):
                working_sequences.append(Sequence(seq, index, time))
        # Any successfully completed sequences, or sequences
        # that can't be completed anymore are to be removed
        seq_idx_to_remove = []
        for i, Seq in enumerate(working_sequences):
            if Seq.has_expired:
                seq_idx_to_remove.append(i)
            elif Seq.is_complete():
                seq_idx_to_remove.append(i)
                # Only add the sequence to the results if the indices
                # aren't overlapping with the previous one
                sequence, times, indices = Seq.sequence, Seq.times, Seq.indices
                if results[sequence]['last_index'] < indices[0]:
                    results[sequence]['seq'].append(times)
                    results[sequence]['last_index'] = indices[-1]
        # We must remove the items in reverse order so that
        # we don't disturb the 'forward' ordering
        for i in seq_idx_to_remove[::-1]:
            del working_sequences[i]

    return results

results = find_non_overlapping_subsequences(events, timestamps)
for key, value in sorted(results.items()):
    print(key, value['seq'])

Output

A [[0], [3], [4], [4], [5], [8], [8], [9]]
AA [[3, 4], [4, 5], [8, 8]]
AAA [[3, 4, 4], [8, 8, 9]]
AAB [[4, 5, 6], [8, 8, 10]]
BB []
CA [[7, 8]]

Update

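This can take a long time for a long event series, depending on how many sequences you have to consider at each step. In other words, the longer your sequences stay alive, the more of them you have to examine at every iteration.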

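  • If a sequence requires more events to complete, it takes more iterations to do so.
  • The longer SEQUENCES is, the more new sequences are introduced at each step.
  • If the window or length durations are longer, sequences survive longer before expiring.
  • The more unique events you have (if they are uniformly present in the series), the longer it takes to complete a given sequence. For example, if at each iteration you only encounter As and Bs rather than any letter of the alphabet, the sequence 'AB' will be completed sooner.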

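While the factors above ultimately determine how long each iteration step may take, a few optimisations are possible. At each step we go through all currently incomplete sequences in working_sequences and check what effect the new event has on them. However, if we rework the Sequence class, we can compute what the next expected event is every time a sequence is updated. Then, at each step, we can bin the sequences by that next event. This way, if the incoming event is 'A', we only check the sequences that are waiting for an 'A'. The same binning also conveniently splits off the sequences that have completed or expired.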

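A second, less impactful optimisation is to precompute all the sequences that start with a specific event, so that we don't have to iterate over SEQUENCES every time.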

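This should avoid any unnecessary checks and improve overall performance. However, the worst case is still the same as for the simple version above. For example, if 90% of your events are 'A' and 90% of the starting or next events of your sequences are 'A', this version will still take roughly 90% of the time of the previous one.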
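
As a rough illustration of the second optimisation (precomputing which sequences start with each event), here is a minimal sketch using `collections.defaultdict`. The helper name `group_by_first_event` is hypothetical, and the sorting is only there to make the printed order stable.

```python
from collections import defaultdict

SEQUENCES = {'A', 'AA', 'AAA', 'AAB', 'BB', 'CA'}


def group_by_first_event(sequences):
    """Map each starting event to the sequences that begin with it."""
    starting_events = defaultdict(list)
    for seq in sorted(sequences):  # sorted only for a reproducible order
        starting_events[seq[0]].append(seq)
    return starting_events


starting_events = group_by_first_event(SEQUENCES)
print(starting_events['A'])          # ['A', 'AA', 'AAA', 'AAB']
print(starting_events['C'])          # ['CA']
# .get avoids inserting a new key for events that start no sequence
print(starting_events.get('D', []))  # []
```

With such a mapping, introducing new sequences for an incoming event becomes a single dictionary lookup instead of a scan over all of SEQUENCES.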

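The following changes to the code reflect these optimisations. I have also assumed that the timestamps are strictly increasing, so anything that relied on the indices attribute can be simplified.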

EXPIRED = '#'
COMPLETED = '='

class Sequence:
    def __init__(self, seq, starting_time):
        self.sequence = seq
        self.pos = 0
        self.times = [starting_time]
        self.has_expired = False
        self.next_event = self.next_event_query()

    def is_next_event_acceptable(self, event, time):
        if self.next_event != event:
            return False
        if time - self.times[0] > WINDOW or time - self.times[-1] > LENGTH:
            self.has_expired = True
            return False
        return True

    def update_sequence(self, event, time):
        if self.is_next_event_acceptable(event, time):
            self.pos += 1
            self.times.append(time)
        self.next_event = self.next_event_query()

    def next_event_query(self):
        if self.has_expired:
            return EXPIRED
        return COMPLETED if len(self.sequence) == self.pos + 1 else self.sequence[self.pos+1]

    def __repr__(self):
        seq = list(self.sequence)
        seq.insert(self.pos, '[')
        seq.insert(self.pos + 2, ']')
        return ''.join(seq)


def find_non_overlapping_subsequences(events, timestamps):
    unique_events = set(events)
    starting_events = {}
    for seq in SEQUENCES:
        unique_events.update(seq)
        first_event = seq[0]
        if first_event not in starting_events:
            starting_events[first_event] = []
        starting_events[first_event].append(seq)
    for e in unique_events:
        if e not in starting_events:
            starting_events[e] = []

    all_symbols = ''.join(unique_events) + EXPIRED + COMPLETED
    working_sequences = {event: [] for event in all_symbols}
    next_event_lists = {event: [] for event in all_symbols}
    results = {s: {'seq': [], 'last_time': timestamps[0]-1} for s in SEQUENCES}

    for event, time in zip(events, timestamps):
        next_event_lists[event] = []
        for S in working_sequences[event]:
            S.update_sequence(event, time)
            next_event_lists[S.next_event].append(S)
        for seq in starting_events[event]:
            S = Sequence(seq, time)
            next_event_lists[S.next_event].append(S)
        for S in next_event_lists[COMPLETED]:
            # Only add the sequence to the results if the timestamps
            # don't overlap with the previous one
            sequence, times = S.sequence, S.times
            if results[sequence]['last_time'] < times[0]:
                results[sequence]['seq'].append(times)
                results[sequence]['last_time'] = times[-1]
        next_event_lists[EXPIRED] = []
        next_event_lists[COMPLETED] = []
        working_sequences = next_event_lists.copy()

    return results
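
Because the optimised version compares timestamps rather than indices when checking for overlap, it relies on the timestamps being strictly increasing. A small guard along these lines could be run first; the helper name `is_strictly_increasing` is hypothetical and is not part of the code above.

```python
def is_strictly_increasing(timestamps):
    """Return True if every timestamp is greater than the one before it."""
    return all(a < b for a, b in zip(timestamps, timestamps[1:]))


# The example data from the question contains repeated timestamps.
timestamps = [0, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 10]
print(is_strictly_increasing(timestamps))       # False
print(is_strictly_increasing([0, 3, 4, 6, 9]))  # True
```

Note that the question's example data fails this check, so on that data the optimised version can report different results from the output listed for the first version (for instance, two 'A' events sharing a timestamp are treated as overlapping).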