Question

I currently have a list of tuples.

overlap_list = [(10001656, 10001717), (700, 60000), (10001657, 10001718), (10001657, 10001716), (10031548, 10031643), (10031556, 10031656)]

I want the following output:

new_list=[(10001656, 10001717),(10001657, 10001718),(10001657, 10001716),(10031548, 10031643), (10031556, 10031656)]

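The numbers in each tuple are start and end boundaries. I want to find every tuple whose range overlaps with that of another tuple.

I have tried this code, which I found in a similar question: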

import itertools as it

def pairwise(iterable):
    a, b = it.tee(iterable)
    next(b, None)
    return zip(a, b)

overlap_list = [(10001656, 10001717), (700, 60000), (10001657, 10001718), (10001657, 10001716), (10031548, 10031643), (10031556, 10031656)]
print([list(p) for k, p in it.groupby(pairwise(overlap_list), lambda x: x[0][0] < x[1][0] < x[0][1]) if k])

But this gives:

[[((10031548, 10031643), (10031556, 10031656))]]

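I have looked into different solutions, but the problem I am facing is that indexing the previous position doesn't seem to work.

How can I get the desired output? Any help would be appreciated.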

Answer 1

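To be honest, I don't fully understand your code and the idea behind it, so I can't tell you why the result contains only a subset of the desired tuples.

However, I have a different approach that you might find interesting.
The main idea is to have a function that tests whether two tuples overlap. This function is applied to all combinations of two tuples from overlap_list. If two tuples overlap, both are added to a result list. That list will then contain duplicates, so list(set(result)) is applied at the end. You could also skip the conversion back to a list, since a set would do just as well.

The idea of the test function is simply to sort the 4 values of the two tuples being tested and to look at the resulting sort order (see numpy.argsort). If the first two indices are 0/1 or 2/3, the two smallest values come from the same tuple and the two tuples do not overlap.
In other words: after testing the first two indices for > 1, the two results must be unequal, i.e. they must not both be True or both be False: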

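import numpy as np

def overlap_test(tpl1, tpl2):
    # argsort gives the positions (0/1 from tpl1, 2/3 from tpl2) of the four
    # values in ascending order; check which tuple the two smallest come from.
    a, b = np.argsort(tpl1 + tpl2)[:2] > 1
    # If both come from the same tuple, the ranges do not overlap.
    return a != b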
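For comparison, here is a sketch of the same kind of pairwise test without NumPy, comparing the endpoints directly. It assumes each tuple is (start, end) with start <= end, and that ranges which merely touch at an endpoint do not count as overlapping. `overlap_test_plain` is a hypothetical name used only for this illustration.

```python
def overlap_test_plain(tpl1, tpl2):
    """Return True if the two (start, end) ranges share more than a boundary point."""
    # Two ranges overlap exactly when each one starts before the other ends.
    return tpl1[0] < tpl2[1] and tpl2[0] < tpl1[1]


print(overlap_test_plain((10001656, 10001717), (10001657, 10001718)))  # True
print(overlap_test_plain((700, 60000), (10001656, 10001717)))          # False
```

If touching ranges should count as overlapping, both `<` comparisons would become `<=`.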

Here is the loop that uses the function:

import itertools as it
result = []
for test_tpl, sec_tpl in list(it.combinations(overlap_list, 2)):
    if overlap_test(test_tpl, sec_tpl):
        result.extend([test_tpl, sec_tpl])
result = list(set(result))

# [(10001657, 10001718),
#  (10031556, 10031656),
#  (10031548, 10031643),
#  (10001657, 10001716),
#  (10001656, 10001717)]

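I still wonder whether a plain loop wouldn't be more efficient, and whether the need for a set couldn't also be avoided that way. Well, maybe you'll find a better loop.

Edit:

So far I haven't found anything really different, but here is a small improvement:

The same approach, but using a set from the start: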
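One way a plain loop can do without the set is to flag positions instead of collecting tuples. The following is a minimal sketch under that assumption, not a tuned implementation. `find_overlap_tuples_ordered` is a hypothetical name, a direct endpoint comparison (touching ranges not counted as overlapping) stands in for overlap_test, and the work is still quadratic in the list length.

```python
from itertools import combinations


def find_overlap_tuples_ordered(tpl_list):
    """Return the tuples that overlap at least one other tuple, in input order."""
    flagged = [False] * len(tpl_list)
    # Compare every pair of positions once; mark both members of an overlapping pair.
    for (i, a), (j, b) in combinations(enumerate(tpl_list), 2):
        if a[0] < b[1] and b[0] < a[1]:
            flagged[i] = flagged[j] = True
    return [tpl for tpl, hit in zip(tpl_list, flagged) if hit]


overlap_list = [(10001656, 10001717), (700, 60000), (10001657, 10001718),
                (10001657, 10001716), (10031548, 10031643), (10031556, 10031656)]
print(find_overlap_tuples_ordered(overlap_list))
# [(10001656, 10001717), (10001657, 10001718), (10001657, 10001716),
#  (10031548, 10031643), (10031556, 10031656)]
```

A side effect of flagging by position is that the result keeps the order of the input list, which matches the new_list given in the question.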

def find_overlap_tuples_0(tpl_list):
    result = set()
    for test_tpl, sec_tpl in list(it.combinations(tpl_list, 2)):
        if overlap_test(test_tpl, sec_tpl):
            result.add(test_tpl)
            result.add(sec_tpl)
    return list(result)

# %timeit find_overlap_tuples_0(overlap_list)
# 178 µs ± 4.87 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Something different, based only on permutations and grouping (it seems to be a bit faster):

def find_overlap_tuples_1(tpl_list):
    result = set()
    no_ovl = set()
    for a, grp in it.groupby(it.permutations(tpl_list, 2), lambda x: x[0]):
        for b in grp:
            if (a not in result) and (b[1] not in no_ovl):
                if overlap_test(*b):
                    result.add(b[0])
                    result.add(b[1])
                    break
                no_ovl.add(b[0])
    return list(result)

# %timeit find_overlap_tuples_1(overlap_list)
# 139 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Answer 2

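It seems you could sort the list so that any overlapping starts and ends end up adjacent, and then compare only neighbours to decide whether a tuple needs to be filtered out as non-overlapping. (The sort at the end of the code is not required; it just makes the overlapping neighbours easier to see in the printed output.)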

l = [(10001656, 10001717), (700, 60000), (10001657, 10001718), (10001657, 10001716), (10031548, 10031643), (10031556, 10031656)]

l.sort()
overlap = set()
for a, b in zip(l, l[1:]):
    # After sorting, a starts no later than b, so the two overlap
    # whenever b starts before a ends (this also covers b lying inside a).
    if b[0] <= a[1]:
        overlap.add(a)
        overlap.add(b)

overlap = sorted(overlap)
print(overlap)
# [(10001656, 10001717), (10001657, 10001716), (10001657, 10001718), (10031548, 10031643), (10031556, 10031656)]
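Comparing only neighbours can miss a tuple that overlaps an earlier, longer range: with (1, 100), (2, 3), (50, 60), the last tuple overlaps the first but not its sorted neighbour. A possible variation, sketched here under the assumption that ranges touching at an endpoint count as overlapping, keeps track of the widest-reaching tuple seen so far. `find_overlaps_sorted` is a hypothetical name and not part of the answer above.

```python
def find_overlaps_sorted(tpl_list):
    """Return the set of tuples that overlap at least one other tuple."""
    ordered = sorted(tpl_list)
    overlap = set()
    if not ordered:
        return overlap
    # `widest` is the tuple seen so far with the largest end value.
    widest = ordered[0]
    for cur in ordered[1:]:
        if cur[0] <= widest[1]:
            # cur starts before an earlier range ends, so the two overlap.
            overlap.add(widest)
            overlap.add(cur)
        if cur[1] > widest[1]:
            widest = cur
    return overlap


l = [(10001656, 10001717), (700, 60000), (10001657, 10001718),
     (10001657, 10001716), (10031548, 10031643), (10031556, 10031656)]
print(sorted(find_overlaps_sorted(l)))
# [(10001656, 10001717), (10001657, 10001716), (10001657, 10001718),
#  (10031548, 10031643), (10031556, 10031656)]
```

After the sort, this needs only a single pass over the list, so the sort dominates the running time.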
