
Question

I have event data in the following format:

event     A A A A A C B C D A A A B
timestamp 0 3 4 4 5 5 6 7 7 8 8 9 10

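Given a list of sequences S and an event series E, how can I efficiently find the non-overlapping occurrences of each sequence in S within E, such that every occurrence fits inside a time window W and each event in the occurrence is within an interval L of the previous event?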

For example, with:

S = {A, AA, AAA, AAB, BB, CA}, W=3, L=2

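Note that an occurrence does not have to be contiguous (i.e. the elements of a sequence need not appear consecutively in the series). The timestamps are shown as integers only for illustration.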
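To make the two constraints concrete, here is a minimal sketch of the check described above. The helper name satisfies_constraints and the inclusive bounds are assumptions for illustration, and the times are assumed to be sorted in ascending order.

```python
def satisfies_constraints(times, window, gap):
    """Check one candidate occurrence, given the times of its events.

    Assumption: both bounds are inclusive and `times` is sorted ascending.
    """
    if not times:
        return False
    # The whole occurrence must fit inside the window W
    within_window = times[-1] - times[0] <= window
    # Each event must be within the interval L of the previous event
    within_gap = all(b - a <= gap for a, b in zip(times, times[1:]))
    return within_window and within_gap


print(satisfies_constraints([4, 5, 6], window=3, gap=2))  # True
print(satisfies_constraints([0, 3], window=3, gap=2))     # False: gap of 3 exceeds L=2
print(satisfies_constraints([5, 7, 9], window=3, gap=2))  # False: span of 4 exceeds W=3
```

A single-event occurrence such as [8] trivially satisfies both constraints.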

Answer 1

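This can be done in a single pass over the data if you keep track of the subsequences that are valid so far but still incomplete, and forget them as soon as they are completed or can no longer be completed. To that end, I wrote a Sequence class that keeps track of

the name of the sequence,
the indices at which its events occur, to figure out whether it overlaps with the previously completed sequence,
the times of its events, because they are our output and we need them to check the constraints,
the position we are currently at in the sequence name, so that we know what the next event should be and when the sequence is complete, and
a flag to forget the sequence if it exceeds our window/length constraints.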

Code

events = 'AAAAACBCDAAAB'
timestamps = [0, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 10]

SEQUENCES = {'A', 'AA', 'AAA', 'AAB', 'BB', 'CA'}
WINDOW = 3
LENGTH = 2

class Sequence:
    def __init__(self, seq, starting_index, starting_time):
        self.sequence = seq
        self.pos = 0
        self.indices = [starting_index]
        self.times = [starting_time]
        self.has_expired = False

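    def is_next_event_acceptable(self, event, time):
        # Ignore events that are not the next one this sequence is waiting for
        if self.sequence[self.pos+1] != event:
            return False
        # WINDOW bounds the time since the first event of the occurrence;
        # LENGTH bounds the gap to the previously matched event
        if time - self.times[0] > WINDOW or time - self.times[-1] > LENGTH:
            self.has_expired = True
            return False
        return True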

    def add_event_if_acceptable(self, event, index, time):
        if self.is_next_event_acceptable(event, time):
            self.pos += 1
            self.indices.append(index)
            self.times.append(time)

    def is_complete(self):
        return len(self.sequence) == self.pos + 1

    def __repr__(self):
        seq = list(self.sequence)
        seq.insert(self.pos, '[')
        seq.insert(self.pos + 2, ']')
        return ''.join(seq)


def find_non_overlapping_subsequences(events, timestamps):
    working_sequences = []
    results = {s: {'seq': [], 'last_index': -1} for s in SEQUENCES}

    for index, (event, time) in enumerate(zip(events, timestamps)):
        # First work with any present sequences in the queue
        # and then introduce any new ones
        for Seq in working_sequences:
            Seq.add_event_if_acceptable(event, index, time)
        for seq in SEQUENCES:
            if seq.startswith(event):
                working_sequences.append(Sequence(seq, index, time))
        # Any successfully completed sequences, or sequences
        # that can't be completed anymore are to be removed
        seq_idx_to_remove = []
        for i, Seq in enumerate(working_sequences):
            if Seq.has_expired:
                seq_idx_to_remove.append(i)
            elif Seq.is_complete():
                seq_idx_to_remove.append(i)
                # Only add the sequence to the results if the indices
                # aren't overlapping with the previous one
                sequence, times, indices = Seq.sequence, Seq.times, Seq.indices
                if results[sequence]['last_index'] < indices[0]:
                    results[sequence]['seq'].append(times)
                    results[sequence]['last_index'] = indices[-1]
        # We must remove the items in reverse order so that
        # we don't disturb the 'forward' ordering
        for i in seq_idx_to_remove[::-1]:
            del working_sequences[i]

    return results

results = find_non_overlapping_subsequences(events, timestamps)
for key, value in sorted(results.items()):
    print(key, value['seq'])

Output

A [[0], [3], [4], [4], [5], [8], [8], [9]]
AA [[3, 4], [4, 5], [8, 8]]
AAA [[3, 4, 4], [8, 8, 9]]
AAB [[4, 5, 6], [8, 8, 10]]
BB []
CA [[7, 8]]
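The AA line above contains [3, 4] and [4, 5] but not [4, 4], which is the effect of the non-overlap rule. The sketch below isolates that rule; the function name keep_non_overlapping is hypothetical, and the candidates are the index pairs at which 'AA' completes over the first five events of the sample data.

```python
def keep_non_overlapping(candidates):
    """Keep a candidate only if it starts after the last kept one ended.

    `candidates` is a list of index lists, in order of completion.
    """
    kept = []
    last_index = -1
    for indices in candidates:
        if last_index < indices[0]:
            kept.append(indices)
            last_index = indices[-1]
    return kept


# Indices 1..4 carry the timestamps 3, 4, 4, 5
aa_candidates = [[1, 2], [2, 3], [3, 4]]
print(keep_non_overlapping(aa_candidates))  # [[1, 2], [3, 4]]
```

The middle candidate shares index 2 with the first one, so it is dropped, while the third starts at index 3 and is kept.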

Update

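This can take a long time for a long event series, depending on how many sequences you have to consider at each step. The longer-lived your sequences are, the more of them you have to examine at each iteration.

If a sequence requires more events to complete, it will take more iterations to do so.
The more sequences there are in SEQUENCES, the more new sequences you will introduce at each step.
If the window or length durations are longer, a sequence will survive longer before expiring.
The more unique events you can have (if they are uniformly present in the series), the longer it will take to complete a given sequence. For example, if at each iteration you only encounter As and Bs rather than, say, any letter of the alphabet, the sequence 'AB' will be completed sooner.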

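While the factors above ultimately determine how long each iteration step may take, a few optimisations are possible. At each step we go through all the currently incomplete sequences in working_sequences and check what effect the new event has on them. However, if we rework the Sequence class, we can work out what the next event is every time a sequence is updated. Then, at each step, we can bin the sequences by their next expected event. This way, if the incoming event is 'A', we only check the sequences that accept that event. It also conveniently separates out the sequences that have been completed or have expired.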
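As a rough illustration of the binning idea, the sketch below groups partial matches by the event they are waiting for. The tuple representation (sequence, position of the last matched event) and the name bin_by_next_event are assumptions made for this example only, not part of the class used in the answer.

```python
from collections import defaultdict


def bin_by_next_event(partials):
    """Group incomplete matches by the next event each one expects."""
    bins = defaultdict(list)
    for seq, pos in partials:
        bins[seq[pos + 1]].append((seq, pos))
    return dict(bins)


# Hypothetical partial matches: (sequence, position of last matched event)
partials = [('AAB', 0), ('AAB', 1), ('CA', 0), ('BB', 0)]
bins = bin_by_next_event(partials)
print(bins['A'])  # [('AAB', 0), ('CA', 0)]
print(bins['B'])  # [('AAB', 1), ('BB', 0)]
```

When an 'A' arrives, only the entries under bins['A'] need to be examined; the ones waiting for 'B' are left alone.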

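The second, less impactful optimisation is to precompute all the sequences that start with a specific event, so that we do not have to iterate through SEQUENCES every time.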
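One possible way to do this precomputation is with collections.defaultdict, sketched here. The function name group_by_first_event is hypothetical, and sorting the sequences is only there to make the printed order deterministic.

```python
from collections import defaultdict

SEQUENCES = {'A', 'AA', 'AAA', 'AAB', 'BB', 'CA'}


def group_by_first_event(sequences):
    """Map each starting event to the sequences that begin with it."""
    groups = defaultdict(list)
    for seq in sorted(sequences):
        groups[seq[0]].append(seq)
    return groups


starting_events = group_by_first_event(SEQUENCES)
print(starting_events['A'])  # ['A', 'AA', 'AAA', 'AAB']
print(starting_events['D'])  # [] since no sequence starts with 'D'
```

Because a defaultdict returns an empty list for a missing key, events that start no sequence need no special handling in this variant.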

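This should avoid any unnecessary checks and improve the overall performance. However, the worst case is still the same as for the simple version above. For example, if 90% of your events are 'A' and 90% of the starting events or next events of your sequences are 'A', this will still take 90% of the time it took before.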

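The following changes to the code reflect these optimisations. I have also assumed that the timestamps are strictly increasing, so anything that relied on the indices attribute can be simplified.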
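Since this version compares timestamps rather than indices, it may be worth confirming the assumption before relying on it. The check below is a small sketch (the name is_strictly_increasing is hypothetical). Note that the sample data contains repeated timestamps, so the check fails for it, and the timestamp-based overlap test can reject occurrences that the index-based version keeps.

```python
def is_strictly_increasing(values):
    """Return True if every value is greater than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


timestamps = [0, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 10]
print(is_strictly_increasing(timestamps))    # False: 4, 5, 7 and 8 are repeated
print(is_strictly_increasing([0, 3, 4, 5]))  # True
```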

EXPIRED = '#'
COMPLETED = '='

class Sequence:
    def __init__(self, seq, starting_time):
        self.sequence = seq
        self.pos = 0
        self.times = [starting_time]
        self.has_expired = False
        self.next_event = self.next_event_query()

    def is_next_event_acceptable(self, event, time):
        if self.next_event != event:
            return False
        if time - self.times[0] > WINDOW or time - self.times[-1] > LENGTH:
            self.has_expired = True
            return False
        return True

    def update_sequence(self, event, time):
        if self.is_next_event_acceptable(event, time):
            self.pos += 1
            self.times.append(time)
        self.next_event = self.next_event_query()

    def next_event_query(self):
        if self.has_expired:
            return EXPIRED
        return COMPLETED if len(self.sequence) == self.pos + 1 else self.sequence[self.pos+1]

    def __repr__(self):
        seq = list(self.sequence)
        seq.insert(self.pos, '[')
        seq.insert(self.pos + 2, ']')
        return ''.join(seq)


def find_non_overlapping_subsequences(events, timestamps):
    # Precompute which sequences start with each event so that
    # SEQUENCES is not scanned on every iteration
    unique_events = set(events)
    starting_events = {}
    for seq in SEQUENCES:
        unique_events.update(seq)
        first_event = seq[0]
        if first_event not in starting_events:
            starting_events[first_event] = []
        starting_events[first_event].append(seq)
    # Events that start no sequence still need an (empty) entry
    for e in unique_events:
        if e not in starting_events:
            starting_events[e] = []

    all_symbols = ''.join(unique_events) + EXPIRED + COMPLETED
    working_sequences = {event: [] for event in all_symbols}
    next_event_lists = {event: [] for event in all_symbols}
    results = {s: {'seq': [], 'last_time': timestamps[0]-1} for s in SEQUENCES}

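    for event, time in zip(events, timestamps):
        # Only the sequences waiting for this event need to be examined;
        # each one is re-binned under whatever it expects next
        next_event_lists[event] = []
        for S in working_sequences[event]:
            S.update_sequence(event, time)
            next_event_lists[S.next_event].append(S)
        # Introduce the sequences that start with this event
        for seq in starting_events[event]:
            S = Sequence(seq, time)
            next_event_lists[S.next_event].append(S)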
        for S in next_event_lists[COMPLETED]:
            # Only add the sequence to the results if the timestamps
            # don't overlap with the previous one
            sequence, times = S.sequence, S.times
            if results[sequence]['last_time'] < times[0]:
                results[sequence]['seq'].append(times)
                results[sequence]['last_time'] = times[-1]
        next_event_lists[EXPIRED] = []
        next_event_lists[COMPLETED] = []
        working_sequences = next_event_lists.copy()

    return results

Finding occurrences of subsequences in event data with time constraints
