Question

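Currently, my algorithm takes (an estimated) ten-plus hours to complete. It is still running right now, just so I can get a better estimate of how plain awful it is.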

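Say I have a set of people P, each with a sorted list of occurrences of differing length, where i is an indexing variable. I want to create a graph G such that G_{P_i, P_j} = n, where n is the weight of the edge between P_i and P_j and represents the number of times they co-occur within a certain static range r.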

My current algorithm is mindless and implemented in Python (to be both readable and explicit) as follows (adapted, for brevity, from its repository on GitHub):

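from itertools import combinations

print('>Generating combinations...', end=' ')
pairs = combinations(people, 2)           # every unordered pair of people
print('Done')

print('Finding co-occurrences')
radius = 5
for A, B in pairs:
    for oA in A.occurances:
        for oB in B.occurances:           # brute force: every oB is checked against every oA
            if oB in range(oA - radius, oA + radius):  # note: the upper bound is exclusive
                try:
                    network.edge[A.common_name][B.common_name]['weight'] += 1
                except KeyError:          # no edge between this pair yet
                    network.add_edge(A.common_name, B.common_name, weight=1)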

I have considered changing this algorithm so that, once oB goes past the range of the current oA, the loop simply moves on to the next oA.

Is there a better way of achieving this, considering that the lists are sorted?

Answer 1

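Your idea of moving on to the next oA once you pass the upper boundary is a good one. Also, if the ranges of A.occurances and B.occurances are large compared to "radius", it is much more efficient not to start from the beginning of B.occurances every time: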

from itertools import combinations

print('>Generating combinations...', end=' ')
pairs = combinations(people, 2)
print('Done')

print('Finding co-occurrences')
radius = 5
for A, B in pairs:
    b = B.occurances
    if not b:                              # nothing to match; also avoids b[0] on an empty list
        continue
    i = 0
    maxi = len(b) - 1
    for oA in A.occurances:
        lo = oA - radius
        hi = oA + radius
        while (i > 0) and (b[i] >= lo):    # while we're at or above the low end of the range
            i = i - 1                      #   go back towards the low end of the range
        while (i < maxi) and (b[i] < lo):  # while we're below the low end of the range
            i = i + 1                      #   go towards the low end of the range
        if b[i] >= lo:
            while b[i] <= hi:              # while we're at or below the high end of the range
                try:                       #   increase edge weight
                    network.edge[A.common_name][B.common_name]['weight'] += 1
                except KeyError:
                    network.add_edge(A.common_name, B.common_name, weight=1)

                if i < maxi:               #   and go towards the high end of the range
                    i = i + 1
                else:
                    break

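Note that I have not debugged this, so it may contain bugs, but hopefully you get the general idea of what I am trying to do. You can of course optimize this idea further, but it should already be much more efficient than the brute-force method.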
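One possible further optimization, sketched here under assumptions rather than taken from the asker's repository: because both lists are sorted, two indices into B's list can each move forward only, so a pair of lists is processed in a single pass. The function name `count_cooccurrences` is hypothetical, the bounds are treated as inclusive (as in the loop above), and `radius` is assumed to be non-negative.

```python
def count_cooccurrences(a, b, radius):
    """Count pairs (x, y), x from a and y from b, with |x - y| <= radius.

    Both lists must be sorted in ascending order.
    """
    count = 0
    lo = 0  # first index with b[lo] >= x - radius
    hi = 0  # first index with b[hi] > x + radius
    for x in a:
        # Both indices only ever move forward because a is sorted.
        while lo < len(b) and b[lo] < x - radius:
            lo += 1
        while hi < len(b) and b[hi] <= x + radius:
            hi += 1
        count += hi - lo  # everything in b[lo:hi] is within range of x
    return count
```

The returned total could then be written to the graph once per pair, for example as the weight of a single add_edge call when it is non-zero, instead of incrementing the edge inside the inner loop.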

Answer 2

One option is to put B.occurances in an interval tree so that you can quickly query all oB in the range (oA - radius, oA + radius).
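Since the occurrences appear to be single points rather than intervals, and the lists are already sorted, a binary search can answer the same range query. The sketch below is that simpler stand-in, not an interval tree; it uses the standard-library `bisect` module, the helper name `occurrences_in_range` is hypothetical, and the bounds are assumed to be inclusive.

```python
from bisect import bisect_left, bisect_right


def occurrences_in_range(sorted_b, lo, hi):
    """Return the items y of sorted_b with lo <= y <= hi."""
    start = bisect_left(sorted_b, lo)   # first position with value >= lo
    end = bisect_right(sorted_b, hi)    # first position with value > hi
    return sorted_b[start:end]
```

For each oA the query would be occurrences_in_range(B.occurances, oA - radius, oA + radius), and the length of the result is the number of co-occurrences for that oA.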

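Another option is to index B.occurances in buckets, e.g. [0, 1], [1, 2], and so on. You can then quickly find all oB in the range (oA - radius, oA + radius) by selecting the buckets with indices (oA - radius) through (oA + radius). The buckets are only approximate, so you still need to iterate over the first and last selected buckets and verify every oB in them.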
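A minimal sketch of the bucket idea, assuming integer occurrence positions and inclusive bounds. The names `build_buckets` and `query_buckets` and the `width` parameter are hypothetical, introduced only for illustration; with width 1 the buckets match the ones described above.

```python
from collections import defaultdict


def build_buckets(occurrences, width):
    """Map bucket index k to the occurrences in [k * width, (k + 1) * width)."""
    buckets = defaultdict(list)
    for y in occurrences:
        buckets[y // width].append(y)
    return buckets


def query_buckets(buckets, width, lo, hi):
    """Return all indexed occurrences y with lo <= y <= hi."""
    first, last = lo // width, hi // width
    found = []
    for k in range(first, last + 1):
        items = buckets.get(k, [])
        if k == first or k == last:
            # Edge buckets can extend past the range, so verify each item.
            items = [y for y in items if lo <= y <= hi]
        found.extend(items)
    return found
```

A wider bucket means fewer buckets to visit per query but more items to verify in the two edge buckets, so a width close to the radius is a reasonable starting point to experiment with.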

Algorithm optimization for finding co-occurrences between two sorted lists
