Question

I've run into a problem and would appreciate a hint on how to get past it.

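I have a 2D Python list (83 rows and 3 columns). The first two columns are the start and end positions of an interval. The third column is a numeric index (e.g. 9.68). The list is sorted by the third column in descending order. I want to get all the non-overlapping intervals with the highest index.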

Here is a sample of the sorted list:

504 789 9.68
503 784 9.14
505 791 8.78
499 798 8.73
1024 1257 7.52
1027 1305 7.33
507 847 5.86

Here is my attempt:

# Define a function that test if 2 intervals overlap
def overlap(start1, end1, start2, end2):
        return not (end1 < start2 or end2 < start1)

best_list = [] # Create a list that will store the best intervals
best_list.append([sort[0][0],sort[0][1]]) # Append the first interval of the sorted list
# Loop through the sorted list
for line in sort:
    local_start, local_end = line.rsplit("\s",1)[0].split()
    for i in range(len(best_list)):
        best_start = best_list[i][0]
        best_end = best_list[i][1]
        test = overlap(int(best_start), int(best_end), int(local_start), int(local_end))
        if test is False:
            best_list.append([local_start, local_end])

I got:

best_list = [(504, 789),(1024, 1257),(1027, 1305)]

But I want:

best_list = [(504, 789),(1024, 1257)]

Thanks!

Answer 1

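Well, I have some doubts about your code. Since sort contains strings, does the line append([sort[0][0],sort[0][1]]) really do what you expect? Indexing a string returns single characters, not the start and end values.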

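Anyway, as for the main part, your problem is that when the list already holds more than one element, it is enough for the candidate to pass the overlap test against just one of them to be added (which is not what you want). For example, when both (504, 789) and (1024, 1257) are present, (1027, 1305) is inserted into the list because it passes the test when compared with (504, 789).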

So I made a few changes, and it now seems to work as expected:

best_list = [] # Create a list that will store the best intervals
best_list.append(sort[0].rsplit(" ", 1)[0].split()) # Append the first interval of the sorted list
# Loop through the sorted list
for line in sort:
    # Drop the trailing index, then split into start and end.
    # str.rsplit takes a literal separator, so "\s" would never match here.
    local_start, local_end = line.rsplit(" ", 1)[0].split()
    flag = False # <- flag to check the overall overlapping
    for i in range(len(best_list)):
        best_start = best_list[i][0]
        best_end = best_list[i][1]
        test = overlap(int(best_start), int(best_end), int(local_start), int(local_end))
        if test:
            # One overlap is enough to reject the candidate
            flag = False
            break
        flag = True
    # Append only after the candidate has been compared with every kept interval
    if flag:
        best_list.append([local_start, local_end])

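The main idea is to check the candidate against every element and add it only if it passes all the overlap tests (the last line of my code), not before.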
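The same idea can be written more compactly with any(): a candidate is kept only if it overlaps none of the intervals chosen so far. This is a minimal sketch, assuming each entry of the sorted list is a whitespace-separated string such as "504 789 9.68". The names rows and pick_best are illustrative and do not come from the question.

```python
def overlap(start1, end1, start2, end2):
    """Return True if the closed intervals [start1, end1] and [start2, end2] share a point."""
    return not (end1 < start2 or end2 < start1)


def pick_best(rows):
    """Greedily keep intervals that overlap nothing chosen so far.

    `rows` is assumed to be sorted by index, highest first, with each
    row formatted as "start end index" (an assumption about the input).
    """
    best = []
    for row in rows:
        start_text, end_text, _index = row.split()
        start, end = int(start_text), int(end_text)
        # Decide only after comparing against every kept interval.
        if not any(overlap(s, e, start, end) for s, e in best):
            best.append((start, end))
    return best


rows = [
    "504 789 9.68",
    "503 784 9.14",
    "505 791 8.78",
    "499 798 8.73",
    "1024 1257 7.52",
    "1027 1305 7.33",
    "507 847 5.86",
]
best_list = pick_best(rows)
print(best_list)  # [(504, 789), (1024, 1257)]
```

Starting from an empty list means the first row needs no special handling: it overlaps nothing, so it is always kept.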

Answer 2

Assuming you have parsed the CSV and already have a list [(start, stop, index), ...] of the form [(int, int, float), ...], you can sort it with the following:

from operator import itemgetter
data = sorted(data, key=itemgetter(2), reverse=True)

This means you sort by the third position and in reverse order, from the largest value down to the smallest.
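For instance, the parsing and sorting steps might look like the sketch below. It assumes the input is whitespace-separated text with one interval per line; the raw string is a hypothetical stand-in for the real file contents, using three rows from the question in shuffled order.

```python
from operator import itemgetter

# Hypothetical input: three rows from the question, deliberately unsorted.
raw = """1024 1257 7.52
504 789 9.68
507 847 5.86"""

data = []
for line in raw.splitlines():
    start, stop, index = line.split()
    # Convert to (int, int, float) so the values compare numerically.
    data.append((int(start), int(stop), float(index)))

# Sort by the third field, highest index first.
data = sorted(data, key=itemgetter(2), reverse=True)
print(data)  # [(504, 789, 9.68), (1024, 1257, 7.52), (507, 847, 5.86)]
```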

def nonoverlap(data):
    result = [data[0]]
    for cand in data[1:]:
        start, stop, _ = cand
        current_span = range(start, stop+1)
        for item in result:
            i, j, _ = item
            span = range(i, j+1)
            if (start in span) or (stop in span):
                break
            elif (i in current_span) or (j in current_span):
                break
        else:
            result.append(cand)
    return result

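Then, using the function above, you get the desired result. For the snippet provided, you get [(504, 789, 9.68), (1024, 1257, 7.52)]. I am relying on the fact that 1 in range(0, 10) returns True. This is a naive implementation, but you can use it as a starting point. If you only want to return start and stop, replace the return line with return [i[:2] for i in result].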
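The range-membership test can also be replaced by a direct comparison of the endpoints, which avoids building range objects. This is a sketch under the same assumptions (both endpoints inclusive, data sorted by index in descending order). The name nonoverlap_cmp is introduced here for illustration only.

```python
def nonoverlap_cmp(data):
    """Keep each interval that overlaps none of those already kept.

    Assumes `data` is a list of (start, stop, index) tuples sorted by
    index, highest first, and that both endpoints are inclusive.
    """
    result = []
    for start, stop, index in data:
        # Two intervals are disjoint when one ends before the other begins.
        if all(stop < i or j < start for i, j, _ in result):
            result.append((start, stop, index))
    return result


data = [
    (504, 789, 9.68),
    (503, 784, 9.14),
    (505, 791, 8.78),
    (499, 798, 8.73),
    (1024, 1257, 7.52),
    (1027, 1305, 7.33),
    (507, 847, 5.86),
]
kept = nonoverlap_cmp(data)
print(kept)  # [(504, 789, 9.68), (1024, 1257, 7.52)]
```

Because all() is true for an empty sequence, the first interval is accepted without a separate initialisation step.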

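Note: I would also add that your code has a logic error. You make the decision after each individual comparison, but it has to be made after comparing against all the elements already in your best_list. That is why the pair (504, 789) and (1027, 1305) passes the test and (1027, 1305) gets added, although it should not. I hope this note helps.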
