Question

I have a list of elements like this:

new_element = {'start':start, 'end':end, 'category':cat, 'value': val}

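Now I want to append it to the list only if no existing element already contains the new one (checked by start, end and category).

Also, if the new element contains elements that are already in the list, I want to add it and remove the old ones.

In short, I don't want nested elements; I only want to keep the larger ones.

What I have so far (elements are matched by category):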

nested = False
for ir in irs[:]:  # iterate over a copy, because irs is modified in the loop
    if ir['category'] != ir_new['category']:
        continue
    # is the new element nested in (or equal to) an existing one?
    if ir['start'] <= ir_new['start'] and ir['end'] >= ir_new['end']:
        nested = True
        break
    # is an existing element nested in the new one?
    if ir['start'] >= ir_new['start'] and ir['end'] <= ir_new['end']:
        irs.remove(ir)
if not nested:
    # append to the list
    irs.append(ir_new)
    found += 1  # counter initialised elsewhere

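This works, but I think it is O(n * n), since every new element is compared against the whole list. Maybe there is a more efficient way, using dicts or pandas.

Some thoughts:

Should I check before every append, or append everything first and filter afterwards?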
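For the second option (append everything, then filter once), a minimal sketch could look like the following. The function name drop_nested is hypothetical, and the sketch assumes that start and end are numbers and that categories can be sorted. Sorting by start ascending and end descending puts any element that can contain another one before it, so a single pass that tracks the largest end seen so far is enough.

```python
from itertools import groupby
from operator import itemgetter


def drop_nested(elements):
    """Return the elements that are not contained in another element
    of the same category (hypothetical helper, not from the question)."""
    # Within a category: start ascending, end descending, so a possible
    # container always comes before the elements it contains.
    ordered = sorted(
        elements,
        key=lambda el: (el['category'], el['start'], -el['end']),
    )
    kept = []
    for _, group in groupby(ordered, key=itemgetter('category')):
        max_end = None  # largest end among the kept elements of this category
        for el in group:
            if max_end is not None and el['end'] <= max_end:
                continue  # contained in an element that was already kept
            kept.append(el)
            max_end = el['end']
    return kept


data = [
    {'start': 5, 'end': 10, 'category': 'a', 'value': 1},
    {'start': 6, 'end': 9, 'category': 'a', 'value': 2},
    {'start': 1, 'end': 12, 'category': 'a', 'value': 3},
    {'start': 6, 'end': 9, 'category': 'b', 'value': 4},
    {'start': 11, 'end': 20, 'category': 'a', 'value': 5},
]
result = drop_nested(data)
```

The sort dominates the cost, so the whole filter is O(n log n) instead of one linear scan per append. Exact duplicates are reduced to a single copy.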

Update 1: this library has an interval tree implementation; the only problem is that intervals cannot be removed once they have been added.

http://bx-python.readthedocs.io/en/latest/lib/bx.intervals.intersection.html#bx.intervals.intersection.IntervalTree

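Update 2: https://github.com/chaimleib/intervaltree is interesting, but I could not get the result I need when removing partial overlaps. I only need to handle full overlap (nesting).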

Answer 1

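Draft:

Define a class for the "ir" items with __lt__ comparing on 'start'.

Keep a main dict keyed by category.

For each category, store a sorted list (bisect) of the items with that category.

Once you have found the insertion position based on start, you can compare end values from there until you reach an "ir" item that should not be removed.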
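One possible reading of this draft is sketched below. It is not the answerer's code: instead of a class with __lt__, it keeps parallel lists of start and end values per category so that plain bisect calls work, and the names NonNestedIntervals, add and in_category are hypothetical. It relies on the fact that in a list without nested elements, sorting by start also sorts by end, so the elements swallowed by a new one form one contiguous slice.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class NonNestedIntervals:
    """Per-category lists of elements in which no element contains another."""

    def __init__(self):
        # category -> parallel lists, all kept in order of start
        self._starts = defaultdict(list)
        self._ends = defaultdict(list)
        self._items = defaultdict(list)

    def add(self, item):
        """Insert item unless an existing element covers it.

        Returns True if the item was added, False if it was rejected.
        """
        cat, s, e = item['category'], item['start'], item['end']
        starts = self._starts[cat]
        ends = self._ends[cat]
        items = self._items[cat]

        pos = bisect_left(starts, s)
        # The left neighbour starts earlier and has the largest end of all
        # earlier elements; it covers the new item if it ends at or after e.
        if pos > 0 and ends[pos - 1] >= e:
            return False
        # An element with the same start covers the new item if it ends
        # at or after e (this also rejects exact duplicates).
        if pos < len(starts) and starts[pos] == s and ends[pos] >= e:
            return False

        # Elements from pos onward start at or after s; the ones ending at
        # or before e are nested in the new item and are removed.
        hi = bisect_right(ends, e, pos)
        del starts[pos:hi], ends[pos:hi], items[pos:hi]

        starts.insert(pos, s)
        ends.insert(pos, e)
        items.insert(pos, item)
        return True

    def in_category(self, category):
        """Return a copy of the elements stored for one category."""
        return list(self._items.get(category, []))


store = NonNestedIntervals()
store.add({'start': 5, 'end': 10, 'category': 'a', 'value': 1})
store.add({'start': 6, 'end': 9, 'category': 'a', 'value': 2})   # rejected
store.add({'start': 1, 'end': 12, 'category': 'a', 'value': 3})  # replaces (5, 10)
```

Finding the position and the slice to delete takes O(log n), but inserting into and deleting from a Python list still shifts elements, so a single add remains O(n) in the worst case. The gain over the original loop is that no element-by-element comparison is needed.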

Answer 2

Using the pandas library and a bit of coding, I got a decent solution.

Initialization ...

import pandas as pd

df = pd.DataFrame(columns=['start','end','seq','record','len','ir_1','ir_2'])

Adding ...

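with l_lock:  # l_lock is presumably a lock shared by the worker threads
    new_element = [ir_start, ir_end, ir_seq, record.id, ir_len, seq_q, seq_q_prime]
    # append as a new row; this assumes the index is still 0..len(df)-1
    df.loc[len(df)] = new_element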

Removing duplicates (nested rows) ...

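for idx, row in df.iterrows():
    if idx not in df.index:  # this row was already dropped as nested
        continue
    # drop every other row that lies entirely within this row's interval
    res = df[(df.index != idx) & (df.start >= row.start) & (df.end <= row.end)]
    df.drop(res.index, inplace=True)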

As suggested in some comments, an interval tree is also a possible solution, but I could not get it to work.

Algorithm: adding vectors to a list while avoiding nesting
