Common elements in Python lists

Time: 2019-06-07 18:00:07

Tags: python list

I have lists with the following structure: [(id, start, end), (id, start, end), (id, start, end)]

For example, they might look like this:

List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]

I need to find the overlapping regions between them.

I tried a brute-force approach with the following code:

for a1, s1, e1 in List1:
    for a2, s2, e2 in List2:
        sgroup = [s1, s2]
        egroup = [e1, e2]
        mstart = max(sgroup)
        mend = min(egroup)
        if a1 == a2 and e2 >= s1 and s2 <= e1:
            t = (mstart, mend)
            print(t)

Can someone help me speed this up? I need an algorithm that works faster than this brute-force approach.

5 answers:

Answer 0: (score: 1)

A more compact version of the same brute-force loop:

for a1, s1, e1 in List1:
    for a2, s2, e2 in List2:
        if a1 == a2 and s2 <= e1 and e2 >= s1:
            print(max(s1, s2), min(e1, e2))

[Edit]: Measuring the time:

import time 

def group1():
    res = []
    for a1, s1, e1 in List1:
        for a2, s2, e2 in List2:
            sgroup = [s1, s2]
            egroup = [e1, e2]    
            mstart = max(sgroup)
            mend = min(egroup)
            if a1 == a2 and e2>=s1 and s2<=e1:
                t = (mstart, mend)
                res.append(t)
    return res

def group2():
    res = []
    for a1, s1, e1 in List1:
        for a2, s2, e2 in List2:
            if a1 == a2 and s2 <= e1 and e2 >= s1:
                res.append((max(s1, s2), min(e1, e2)))
    return res

List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]

for func in [group1, group2]:
    # perf_counter() is a monotonic, high-resolution clock suited to timing
    start = time.perf_counter()
    result = func()
    end = time.perf_counter()
    print(f'{func.__name__}: {end - start}')
    print(result)

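Answer 1: (score: 0)

It appears that you can already guarantee that the lists are sorted by ID and start time, and that there are no overlaps within a list. In essence, each ID is a separate list for overlap detection.

You walk through the pair of lists as you would when merging lists (which is essentially what this is). Keep an index into each list (index_a, index_b); on each iteration, work with the smaller of the two referenced values (the start values at those indices).

To process one interval, the element at list_a[index_a]:

  • Get its end value.
  • Set check_b = index_b
    • Compare the end value with the start value of the item in the other list.
    • While end[index_a] > start[check_b], there is an overlap; report it.
    • Increment check_b
  • Increment index_a.
  • Check the new list_a item against the current list_b item; pick the one with the lower start value and return to the top of this procedure (the first bullet).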
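
The steps above can be sketched as a two-pointer sweep. This is only a minimal sketch, not the answerer's code: the names group_by_id and sweep_overlaps are made up for illustration, it assumes that intervals sharing an ID within one list neither overlap nor touch each other, and it treats touching endpoints as overlapping (as the question's >= / <= test does) rather than using the strict > from the steps.

```python
from collections import defaultdict


def group_by_id(intervals):
    """Map each ID to its (start, end) pairs, sorted by start."""
    groups = defaultdict(list)
    for id_, start, end in intervals:
        groups[id_].append((start, end))
    for spans in groups.values():
        spans.sort()
    return groups


def sweep_overlaps(list_a, list_b):
    """Return (id, start, end) overlaps using one merge-style pass per ID."""
    a_groups = group_by_id(list_a)
    b_groups = group_by_id(list_b)
    result = []
    # Only IDs present in both lists can produce overlaps.
    for id_ in sorted(a_groups.keys() & b_groups.keys()):
        a, b = a_groups[id_], b_groups[id_]
        i = j = 0
        while i < len(a) and j < len(b):
            start = max(a[i][0], b[j][0])
            end = min(a[i][1], b[j][1])
            if start <= end:  # inclusive endpoints, as in the question
                result.append((id_, start, end))
            # Drop whichever interval ends first; it cannot overlap anything later.
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1
    return result


List1 = [(1, 50, 56), (1, 61, 69), (1, 70, 87), (1, 90, 99), (1, 110, 117),
         (1, 119, 126), (2, 3, 9), (2, 11, 17), (3, 2, 9)]
List2 = [(1, 44, 56), (1, 59, 64), (1, 70, 81), (1, 84, 90), (1, 99, 155),
         (2, 5, 15), (3, 3, 9)]

for overlap in sweep_overlaps(List1, List2):
    print(overlap)
```

After the initial sort, each ID is handled in a single pass over both of its interval lists, instead of comparing every pair.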

Answer 2: (score: 0)

If you want to use a list comprehension, you can do this:

List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]                                                
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]                                                                      

output = [(max(s1, s2), min(e1, e2)) for id1, s1, e1 in List1 
                                     for id2, s2, e2 in List2                                                               
                                     if id1 == id2 and e2 >= s1 and s2 <= e1]

print(output)

The output is:

[(50, 56), (61, 64), (70, 81), (84, 87), (90, 90), (99, 99), (110, 117), (119, 126), (5, 9), (11, 15), (3, 9)]                                       

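This appears to be the same as the original result. In general, a list comprehension is faster than the equivalent standard loop.

List comprehension: 6.51 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Original loop: 25.2 µs ± 239 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The comparison code for the "original loop" was: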

    output = []
    for a1, s1, e1 in List1:
        for a2, s2, e2 in List2:
            sgroup = [s1, s2]
            egroup = [e1, e2]
            mstart = max(sgroup)
            mend = min(egroup)
            if a1 == a2 and e2>=s1 and s2<=e1:
                output.append((mstart, mend))
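
The figures above look like IPython %timeit output. As a rough sketch of how a similar comparison could be run in a plain script with the standard timeit module (the function names with_loop and with_comprehension are made up here, and the numbers will differ from machine to machine):

```python
import timeit

List1 = [(1, 50, 56), (1, 61, 69), (1, 70, 87), (1, 90, 99), (1, 110, 117),
         (1, 119, 126), (2, 3, 9), (2, 11, 17), (3, 2, 9)]
List2 = [(1, 44, 56), (1, 59, 64), (1, 70, 81), (1, 84, 90), (1, 99, 155),
         (2, 5, 15), (3, 3, 9)]


def with_loop():
    # Mirrors the "original loop", including its two temporary lists.
    output = []
    for a1, s1, e1 in List1:
        for a2, s2, e2 in List2:
            sgroup = [s1, s2]
            egroup = [e1, e2]
            mstart = max(sgroup)
            mend = min(egroup)
            if a1 == a2 and e2 >= s1 and s2 <= e1:
                output.append((mstart, mend))
    return output


def with_comprehension():
    return [(max(s1, s2), min(e1, e2))
            for id1, s1, e1 in List1
            for id2, s2, e2 in List2
            if id1 == id2 and e2 >= s1 and s2 <= e1]


CALLS = 1000  # calls per timing run; chosen arbitrarily for this sketch
for func in (with_loop, with_comprehension):
    # Take the best of several runs to reduce noise from other processes.
    best = min(timeit.repeat(func, number=CALLS, repeat=5))
    print(f"{func.__name__}: {best / CALLS * 1e6:.2f} µs per call")
```

Note that the loop version also builds two throwaway lists on every iteration, so part of the gap may come from that extra work rather than from the comprehension itself.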

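Answer 3: (score: 0)

OK... this is extreme overkill (I was having fun, leave me alone!), but it runs about three times faster than the naive algorithm.

The real advantage of this approach is that it can handle very large lists without slowing down much and, being a generator, it yields overlaps one at a time rather than keeping them in memory. So try it with lists of 100 or 1000 items and you should see a bigger improvement.

I don't assume the lists are sorted, so, assuming Python's sort algorithm is good, the running time should be dominated by the O(n log n) sort.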

from itertools import groupby
from operator import itemgetter


def get_list_overlaps(list_a, list_b):
    for range_id, (a_ranges, b_ranges) in align_lists(list_a, list_b):
        a_range = next(a_ranges)
        b_range = next(b_ranges)

        try:
            while a_range and b_range:
                overlap = get_overlap(a_range, b_range)
                if overlap:
                    yield overlap

                    # If we overlap, discard the one which ends earliest
                    if a_range[2] < b_range[2]:
                        a_range = next(a_ranges)
                    else:
                        b_range = next(b_ranges)

                else:
                    # If not, discard the one which starts earliest
                    if a_range[1] < b_range[1]:
                        a_range = next(a_ranges)
                    else:
                        b_range = next(b_ranges)

        except StopIteration:
            continue

def align_lists(list_a, list_b):
    b_grouped = groupby(sorted(list_b), key=itemgetter(0))
    try:
        b_id, b_intervals = next(b_grouped)
    except StopIteration:
        return  # list_b is empty, so nothing can overlap

    for a_id, a_intervals in groupby(sorted(list_a), key=itemgetter(0)):
        # Advance list_b's groups until they catch up with a_id
        try:
            while a_id > b_id:
                b_id, b_intervals = next(b_grouped)
        except StopIteration:
            break

        # Only yield when both lists contain this ID; otherwise skip a_id
        if a_id == b_id:
            yield a_id, (a_intervals, b_intervals)


def get_overlap(a_range, b_range):
    _, a_start, a_end = a_range
    _, b_start, b_end = b_range

    # If either ends before the other starts, no overlap
    if b_end < a_start or a_end < b_start:
        return

    return max(a_start, b_start), min(a_end, b_end)

# -------------------------------------------------------------------- #

List1 = [(1, 50, 56), (1, 61, 69), (1, 70, 87), (1, 90, 99), (1, 110, 117),
     (1, 119, 126), (2, 3, 9), (2, 11, 17), (3, 2, 9)]
List2 = [(1, 44, 56), (1, 59, 64), (1, 70, 81), (1, 84, 90), (1, 99, 155),
     (2, 5, 15), (3, 3, 9)]

for overlap in get_list_overlaps(List1, List2):
    print(overlap)

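The idea is to group by ID first, so we never bother comparing things with different IDs, and then to walk through the ranges within each ID, throwing them away as they do or do not overlap.

You could probably optimize this further by inlining some of the functions, etc.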
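
The grouping step depends on itertools.groupby only merging adjacent items with equal keys, which is why the input is sorted first. A small illustrative sketch (the sample data and variable names are made up for this example):

```python
from itertools import groupby
from operator import itemgetter

# Deliberately unsorted sample data in the question's (id, start, end) format.
ranges = [(2, 11, 17), (1, 50, 56), (2, 3, 9), (1, 61, 69)]

# Without sorting, each run of equal IDs becomes its own group.
unsorted_keys = [key for key, _ in groupby(ranges, key=itemgetter(0))]

# After sorting, every ID appears exactly once, with its ranges in start order.
by_id = {key: [(start, end) for _, start, end in group]
         for key, group in groupby(sorted(ranges), key=itemgetter(0))}

print(unsorted_keys)  # [2, 1, 2, 1]
print(by_id)          # {1: [(50, 56), (61, 69)], 2: [(3, 9), (11, 17)]}
```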

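Answer 4: (score: 0)

A few observations:

  1. Ranges with different IDs cannot overlap. This means we can partition the problem by ID and then solve each partition separately, ignoring the ID.

  2. If the lists are sorted, there is no need to check every range against every other range; you can stop scanning once you see e1 < s2.

What you are looking for here is a sliding-window algorithm, not too dissimilar from the one you see in TCP.

So, putting it all together: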

from itertools import groupby

List1 = [(1,50,56),(1,61,69),(1,70,87),(1,90,99),(1,110,117),(1,119,126),(2,3,9), (2,11,17), (3,2,9)]
List2 = [(1,44,56),(1,59,64),(1,70,81),(1,84,90),(1,99,155), (2,5,15), (3,3,9)]

def partition(lst):
    part = groupby(lst, lambda el: el[0])
    return {id: list(el) for id, el in part}

# you may be able to skip sorted() here if you know your input is already sorted
List1_part = partition(sorted(List1))
List2_part = partition(sorted(List2))

for id in set(List1_part) & set(List2_part):
    window_size = max((e-s) for _, s, e in List2_part[id])
    window = 0
    for r1 in List1_part[id]:
        for r2 in List2_part[id][window:]:
            _, s1, e1 = r1
            _, s2, e2 = r2
            if e1 < s2:
                break
            elif e2 >= s1:
                print(id, max(s1, s2), min(e1, e2))
            elif s2 + window_size < s1: 
                window += 1