Question

I have lists with the following structure: [(id, start, end), (id, start, end), (id, start, end)]

For example, they might look like this:

List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]

I need to find the overlapping regions between them.

I tried a brute-force approach with the following code:

for a1, s1, e1 in List1:
    for a2, s2, e2 in List2:
        sgroup = [s1, s2]
        egroup = [e1, e2]
        mstart = max(sgroup)
        mend = min(egroup)
        if a1 == a2 and e2>=s1 and s2<=e1:
            t = (mstart, mend)
            print(t)

Can someone help me speed this up? I need an algorithm that runs faster than this brute-force approach.

Answer 1

The same check can be written without building the intermediate lists:

for a1, s1, e1 in List1:
    for a2, s2, e2 in List2:
        if a1 == a2 and s2 <= e1 and e2 >= s1:
            print(max(s1, s2), min(e1, e2))

[Edit]: Measuring the time:

import time 

def group1():
    res = []
    for a1, s1, e1 in List1:
        for a2, s2, e2 in List2:
            sgroup = [s1, s2]
            egroup = [e1, e2]    
            mstart = max(sgroup)
            mend = min(egroup)
            if a1 == a2 and e2>=s1 and s2<=e1:
                t = (mstart, mend)
                res.append(t)
    return res

def group2():
    res = []
    for a1, s1, e1 in List1:
        for a2, s2, e2 in List2:
            if a1 == a2 and s2 <= e1 and e2 >= s1:
                res.append((max(s1, s2), min(e1, e2)))
    return res

List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]

for func in [group1, group2]:
    start = time.time()
    func()
    end = time.time()
    print(f'{func.__name__}: {end - start}')
    print(func())
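
A single time.time() measurement around one call is quite noisy for a function that finishes in microseconds. As a rough alternative, the standard-library timeit module can repeat the call many times. The sketch below is only an illustration: the name overlaps_loop and the repeat counts are arbitrary choices, not part of the original answer.

```python
import timeit

List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]

def overlaps_loop():
    # Same logic as group2 above (hypothetical name for this sketch)
    res = []
    for a1, s1, e1 in List1:
        for a2, s2, e2 in List2:
            if a1 == a2 and s2 <= e1 and e2 >= s1:
                res.append((max(s1, s2), min(e1, e2)))
    return res

# Call the function 10,000 times per measurement, take 5 measurements,
# and keep the fastest one as the least-disturbed estimate.
number = 10_000
best = min(timeit.repeat(overlaps_loop, number=number, repeat=5))
per_call_us = best / number * 1e6
print(f'overlaps_loop: {per_call_us:.2f} µs per call')
```

The absolute numbers depend on the machine, so they are only meaningful when both variants are measured the same way.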

Answer 2

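It looks like you are already guaranteed that the lists are sorted by ID and start time, and that the intervals within a list do not overlap. Essentially, each ID is a separate list for overlap detection.

You walk through the pair of lists as if you were doing a list merge (which, essentially, this is). Keep an index into each list (index_a, index_b); on each iteration, work with the smaller of the two referenced values (the start value at each index).

To process one interval, the element at list_a[index_a]:

Get its end value.
Set check_b = index_b
- Compare the end value against the start value of the item in the other list.
- While end[index_a] > start[check_b], there is an overlap; report it.
- Increment check_b.
Increment index_a.
Check the new list_a item against the current list_b item; pick the one with the lower start value, and return to the top of this process (the first bullet).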
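
The merge idea can be sketched roughly as below. This is a two-pointer variant, not necessarily the exact procedure described above, and it rests on assumptions: both lists are sorted by (id, start), intervals are closed (touching endpoints count as an overlap, as in the question's `>=` test), and within one list the intervals of an ID are disjoint and do not even share an endpoint. The name merge_overlaps is hypothetical.

```python
def merge_overlaps(list_a, list_b):
    """Yield (id, start, end) for every overlap between two sorted lists."""
    index_a = index_b = 0
    while index_a < len(list_a) and index_b < len(list_b):
        id_a, start_a, end_a = list_a[index_a]
        id_b, start_b, end_b = list_b[index_b]

        if id_a < id_b:
            # list_a is still on an earlier ID; catch up
            index_a += 1
        elif id_b < id_a:
            # list_b is still on an earlier ID; catch up
            index_b += 1
        else:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start <= end:
                yield (id_a, start, end)
            # The interval that ends first cannot overlap anything later
            # in the other list (under the assumptions above), so drop it.
            if end_a < end_b:
                index_a += 1
            else:
                index_b += 1


List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]

result = list(merge_overlaps(List1, List2))
print(result)
```

Each step advances one of the two indices, so the loop runs at most len(list_a) + len(list_b) times instead of len(list_a) * len(list_b).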

Answer 3

If you want to use a list comprehension, you can do this:

List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]

output = [(max(s1, s2), min(e1, e2)) for id1, s1, e1 in List1
                                     for id2, s2, e2 in List2
                                     if id1 == id2 and e2 >= s1 and s2 <= e1]

print(output)

The output is:

[(50, 56), (61, 64), (70, 81), (84, 87), (90, 90), (99, 99), (110, 117), (119, 126), (5, 9), (11, 15), (3, 9)]

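This appears to be the same as the original result. In general, a list comprehension will be faster than the equivalent standard loop.

List comprehension: 6.51 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Original loop: 25.2 µs ± 239 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(The comparison code for the "original loop" was: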

    output = []
    for a1, s1, e1 in List1:
        for a2, s2, e2 in List2:
            sgroup = [s1, s2]
            egroup = [e1, e2]
            mstart = max(sgroup)
            mend = min(egroup)
            if a1 == a2 and e2>=s1 and s2<=e1:
                output.append((mstart, mend))

）
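
The comprehension above still compares every item of List1 with every item of List2. One possible refinement, shown here only as a sketch, is to index List2 by ID first so that only same-ID pairs are compared. This is a different technique from the comprehension above; the name by_id is a hypothetical helper, and no timing is claimed for it.

```python
from collections import defaultdict

List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]

# Build the index once: ID -> list of (start, end) pairs from List2
by_id = defaultdict(list)
for id2, s2, e2 in List2:
    by_id[id2].append((s2, e2))

# .get() avoids creating empty entries for IDs that only occur in List1
output = [(max(s1, s2), min(e1, e2))
          for id1, s1, e1 in List1
          for s2, e2 in by_id.get(id1, ())
          if e2 >= s1 and s2 <= e1]

print(output)
```

The gain grows with the number of distinct IDs; with a single ID this does the same number of comparisons as the plain comprehension.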

Answer 4

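OK... this is extreme overkill (I was having fun, leave me alone!), but it runs about three times faster than the naive algorithm.

The real advantage of this approach is that it can handle very large lists without slowing down and, being a generator, it does not accumulate the results in memory. So try it with lists of 100 or 1000 items and you should see a bigger improvement.

I don't assume the lists are sorted, so, assuming Python's sort algorithm is good, the running time should be dominated by the O(n log n) sort.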

from itertools import groupby
from operator import itemgetter


def get_list_overlaps(list_a, list_b):
    for range_id, (a_ranges, b_ranges) in align_lists(list_a, list_b):
        a_range = next(a_ranges)
        b_range = next(b_ranges)

        try:
            while a_range and b_range:
                overlap = get_overlap(a_range, b_range)
                if overlap:
                    yield overlap

                    # If we overlap, discard the one which ends earliest
                    if a_range[2] < b_range[2]:
                        a_range = next(a_ranges)
                    else:
                        b_range = next(b_ranges)

                else:
                    # If not, discard the one which starts earliest
                    if a_range[1] < b_range[1]:
                        a_range = next(a_ranges)
                    else:
                        b_range = next(b_ranges)

        except StopIteration:
            continue

def align_lists(list_a, list_b):
    b_grouped = groupby(sorted(list_b), key=itemgetter(0))
    try:
        b_id, b_intervals = next(b_grouped)
    except StopIteration:
        # list_b is empty, so nothing can overlap
        return

    for a_id, a_intervals in groupby(sorted(list_a), key=itemgetter(0)):
        # Work until our lists line up
        if a_id < b_id:
            continue

        try:
            while a_id > b_id:
                b_id, b_intervals = next(b_grouped)
        except StopIteration:
            break

        # list_b may have skipped past a_id; only pair up matching IDs
        if a_id != b_id:
            continue

        yield a_id, (a_intervals, b_intervals)


def get_overlap(a_range, b_range):
    _, a_start, a_end = a_range
    _, b_start, b_end = b_range

    # If either ends before the other starts, no overlap
    if b_end < a_start or a_end < b_start:
        return

    return max(a_start, b_start), min(a_end, b_end)

# -------------------------------------------------------------------- #

List1 = [(1, 50, 56), (1, 61, 69), (1, 70, 87), (1, 90, 99), (1, 110, 117),
     (1, 119, 126), (2, 3, 9), (2, 11, 17), (3, 2, 9)]
List2 = [(1, 44, 56), (1, 59, 64), (1, 70, 81), (1, 84, 90), (1, 99, 155),
     (2, 5, 15), (3, 3, 9)]

for overlap in get_list_overlaps(List1, List2):
    print(overlap)

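The idea is to group by ID first, so we never have to bother comparing things with different IDs, and then to walk through the items within each ID, discarding them as they do or don't overlap.

You could probably optimise this further by inlining some of the functions, and so on.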

Answer 5

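A few observations here:

Ranges with different IDs cannot overlap. This means we can partition the problem by ID and then solve each partition separately, ignoring the ID.
If the lists are sorted, there is no need to check every range against every range; instead you can, on seeing e1

What you are looking for here is a sliding window algorithm, not unlike the one used in TCP.

So, putting it all together: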

from itertools import groupby

List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]

def partition(lst):
    part = groupby(lst, lambda el: el[0])
    return {id: list(el) for id, el in part}

# you may be able to skip sorted() here if you know your input is already sorted
List1_part = partition(sorted(List1))
List2_part = partition(sorted(List2))

for id in set(List1_part) & set(List2_part):
    window_size = max((e-s) for _, s, e in List2_part[id])
    window = 0
    for r1 in List1_part[id]:
        for r2 in List2_part[id][window:]:
            _, s1, e1 = r1
            _, s2, e2 = r2
            if e1 < s2:
                break
            elif e2 >= s1:
                print(id, max(s1, s2), min(e1, e2))
            elif s2 + window_size < s1: 
                window += 1
