Question

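I have a dictionary with about 150,000 keys. There are no duplicate keys. Each key is 127 characters long, and the keys differ from one another in 1-11 positions (most of the differences occur toward the end of the key). The value for each key is a unique ID and an empty list []. For a given key, I want to find all other keys that differ from it by exactly 1 character and append their IDs to the given key's list. In the end I want each key together with its value: its own ID and a list of the IDs of all keys that differ from it by one character.

My code works, but the problem is that it is far too slow. The double loop is 150,000^2 = ~22.5 billion iterations. On my computer I can get through about 2 million iterations per minute (calling the match1 function each time), so it would take about 8 days to finish. The loop without the match1 call runs ~7x faster, so it would finish in ~1 day.

Does anyone know a way to speed this up?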

# example dictionary
dict1 = {'key1' : ['1', []], 'key2' : ['2', []], ... , 'key150000' : ['150000', []]}


def match1(s1, s2, dict1):
    s = 0
    # walk both keys from the end, where most differences occur
    for c1, c2 in zip(reversed(s1), reversed(s2)):
        if s < 2:
            if c1 != c2:
                s = s + 1
        else:
            break
    if s == 1:
        dict1[s1][1].append(dict1[s2][0])


for s1 in dict1:
    for s2 in dict1:
        match1(s1, s2, dict1)

Answer 1

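Currently you check every key against every other key, for a total of O(n^2) comparisons. The insight is that we only need to check against a very small fraction of the other keys.

Suppose the alphabet that the characters of each key are drawn from has k distinct values. For example, if your keys are simple ASCII strings consisting of a-z and 0-9, then k = 26 + 10 = 36.

Given any key, we can generate all possible keys that differ from it in exactly one character: there are 127 * (k - 1) such strings. Whereas before you were comparing each key against ~150,000 other keys, now we only need to check 127 * (k - 1) candidates, which is 4445 for k = 36. This reduces the overall time complexity from O(n^2) to O(n * k), where k is a constant independent of n (as is the key length of 127). This is where the real speedup comes from as n grows.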
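As a rough check of that arithmetic, the sketch below counts the candidate neighbors per key and compares the total number of set lookups with the number of pairwise comparisons. The alphabet (lowercase letters plus digits) and the names `KEY_LENGTH` and `NUM_KEYS` are assumptions based on the figures in the question, not measurements.

```python
import string

# Assumed alphabet: lowercase ASCII letters plus digits (illustrative only).
alphabet = string.ascii_lowercase + string.digits
k = len(alphabet)

KEY_LENGTH = 127      # key length stated in the question
NUM_KEYS = 150_000    # approximate number of keys stated in the question

# Each position can be replaced by any of the other k - 1 characters.
neighbors_per_key = KEY_LENGTH * (k - 1)

pairwise_comparisons = NUM_KEYS ** 2
neighbor_lookups = NUM_KEYS * neighbors_per_key

print(k, neighbors_per_key)
print(pairwise_comparisons // neighbor_lookups)
```

The ratio only compares counts. Each lookup also has to build and hash a 127-character string, so the real speedup would need to be measured.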

Here is some code to generate all possible neighbors of a key:

def generate_neighbors(key, alphabet):
    for i in range(len(key)):
        left, right = key[:i], key[i+1:]
        for char in alphabet:
            if char != key[i]:
                yield left + char + right

So, for example:

>>> set(generate_neighbors('ab', {'a', 'b', 'c', 'd'}))
{'aa', 'ac', 'ad', 'bb', 'cb', 'db'}

Now we compute the neighborhood of each key:

def compute_neighborhoods(data, alphabet):
    keyset = set(data.keys())
    for key in data:
        possible_neighbors = set(generate_neighbors(key, alphabet))
        neighbors = possible_neighbors & keyset

        identifier = data[key][0]

        for neighbor in neighbors:
            data[neighbor][1].append(identifier)

Now for an example. Suppose

data = {
 '0a': [4, []],
 '1f': [9, []],
 '27': [3, []],
 '32': [8, []],
 '3f': [6, []],
 '47': [1, []],
 '7c': [2, []],
 'a1': [0, []],
 'c8': [7, []],
 'e2': [5, []]
}

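Then (the order of the IDs within each list depends on the dict's iteration order, so it may differ from what is shown here):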

>>> alphabet = set('abcdef01234567890')
>>> compute_neighborhoods(data, alphabet)
>>> data
{'0a': [4, []],
 '1f': [9, [6]],
 '27': [3, [1]],
 '32': [8, [5, 6]],
 '3f': [6, [8, 9]],
 '47': [1, [3]],
 '7c': [2, []],
 'a1': [0, []],
 'c8': [7, []],
 'e2': [5, [8]]}

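There are a few optimizations that I haven't implemented here. You say that the keys mostly differ in their later characters, and that they differ in at most 11 positions. This means we can be smarter about computing the intersection possible_neighbors & keyset and about generating the neighborhood. First, we change generate_neighbors so that it modifies the trailing characters of the key first. Then, instead of generating the whole set of neighbors at once, we generate them one at a time and check whether each one is in the data dict. We keep track of how many we have found, and if we find 11, we break.

The reason I haven't implemented this in the answer is that I'm not sure it would give a significant speedup, and it might actually be slower, because it means replacing an optimized Python builtin (set intersection) with a pure-Python loop.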
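For illustration only, here is a minimal sketch of what that unimplemented variant might look like. The function names and the `max_hits` parameter are hypothetical; the default of 11 simply mirrors the cutoff described above, and passing `None` disables it. Whether this beats the set-intersection version is untested.

```python
def generate_neighbors_from_end(key, alphabet):
    """Yield single-character variants of key, changing trailing positions first."""
    for i in reversed(range(len(key))):
        left, right = key[:i], key[i + 1:]
        for char in alphabet:
            if char != key[i]:
                yield left + char + right


def compute_neighborhoods_lazy(data, alphabet, max_hits=11):
    """Check candidates one at a time and stop after max_hits matches per key.

    max_hits is a hypothetical parameter; pass None to disable the cutoff.
    """
    for key, (identifier, _) in data.items():
        hits = 0
        for candidate in generate_neighbors_from_end(key, alphabet):
            if candidate in data:
                data[candidate][1].append(identifier)
                hits += 1
                if max_hits is not None and hits >= max_hits:
                    break


# Small subset of the example data, with empty neighbor lists.
data = {
    '1f': [9, []],
    '32': [8, []],
    '3f': [6, []],
    'e2': [5, []],
}
compute_neighborhoods_lazy(data, set('abcdef0123456789'))
print(data)
```

Only the values' lists are mutated while iterating, so the dict's keys stay unchanged during the loop.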

Answer 2

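This is untested, so it may be just idle speculation, but... you can reduce the number of dictionary lookups and (more importantly) eliminate half of the comparisons by building the dict into a list and comparing each item only against the items remaining in the list.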

_dict = {'key1' : ['1', []], 'key2' : ['2', []], ... , 'key150000' : ['150000', []]}

# assuming python 3
itemlist = list(_dict.items())

while itemlist:
    key1, value1 = itemlist.pop()
    for key2, value2 in itemlist:
        # doesn't have early short circuit but may have fewer lookups to compensate
        # count the positions that differ; exactly one means the keys are neighbors
        if sum(c1 != c2 for c1, c2 in zip(key1, key2)) == 1:
            # append the other key's ID (value[0]), as the question asks
            value1[1].append(value2[0])
            value2[1].append(value1[0])
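The loop above notes that it has no early short circuit. One way to add it, sketched here under the assumption that all keys have the same length, is a helper that scans from the end of the keys and stops at the second mismatch. The helper name `differs_by_exactly_one` and the toy data are hypothetical, and `itertools.combinations` is swapped in for the pop() loop to visit each unordered pair once.

```python
from itertools import combinations


def differs_by_exactly_one(s1, s2):
    """Return True if equal-length strings differ in exactly one position.

    Scans from the end, since the question says most differences are there,
    and stops at the second mismatch.
    """
    mismatches = 0
    for i in range(len(s1) - 1, -1, -1):
        if s1[i] != s2[i]:
            mismatches += 1
            if mismatches > 1:
                return False
    return mismatches == 1


# Toy data in the question's layout: key -> [id, list of neighbor ids].
toy = {'abcd': ['1', []], 'abcf': ['2', []], 'xbcf': ['3', []], 'zzzz': ['4', []]}

# combinations() yields each unordered pair of items exactly once.
for (key1, value1), (key2, value2) in combinations(toy.items(), 2):
    if differs_by_exactly_one(key1, key2):
        value1[1].append(value2[0])
        value2[1].append(value1[0])

print(toy)
```

This is still O(n^2) pairs, so it only trims the cost of each comparison, not the number of comparisons.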

Answer 3

Try this code:

# example dictionary
dict1 = {'key1' : ['1', []], 'key2' : ['2', []], ... , 'key150000' : ['150000', []]}


def match1(s1, s2, dict1):
    s = 0
    # reverse and zip computations are avoided
    index = len(s1) - 1
    # scan from the end and stop as soon as a second difference is found
    while index >= 0 and s < 2:
        if s1[index] != s2[index]:
            s = s + 1
        index = index - 1

    if s == 1:
        # we are modifying both s1 and s2 instead of only s1 to improve performance
        dict1[s1][1].append(dict1[s2][0])
        dict1[s2][1].append(dict1[s1][0])

# materialize the keys once so they can be indexed (dict.keys() is a view in Python 3)
keys = list(dict1.keys())
# number of times match1 will be invoked is (n-1)*n/2 instead of n*n
for i in range(0, len(keys)):
    for j in range(i+1, len(keys)):
        # match1(s1, s2, dict1) updates both entries, so match1(s2, s1, dict1) never needs to be called
        match1(keys[i], keys[j], dict1)

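Optimizations:

The reverse and zip computations are avoided.
Each key is compared only with the keys that appear after it in the key list.
match1() updates both s1 and s2 instead of only s1. This works because the comparison is symmetric: comparing s1 with s2 gives the same result as comparing s2 with s1.
The key list is stored in a variable and accessed by index, so that Python does not create a new temporary list during execution.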

Answer 4

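For the key-matching part, use Levenshtein matching for extremely fast comparison. Python-Levenshtein is a C-extension-based implementation. Use its hamming() function to determine the number of differing characters.

Install it using the Git link: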

pip install git+git://github.com/ztane/python-Levenshtein.git

Now use it as follows, by plugging it into @tdelaney's answer:

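import Levenshtein as lv

itemlist = list(_dict.items())

while itemlist:
    key1, value1 = itemlist.pop()
    for key2, value2 in itemlist:
        # hamming() returns the number of positions at which the keys differ
        if lv.hamming(key1, key2) == 1:
            # append the other key's ID (value[0]), as the question asks
            value1[1].append(value2[0])
            value2[1].append(value1[0])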
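For reference, the quantity a Hamming distance function computes can be written in a few lines of pure Python. This is only a stand-in to show the calculation when the C extension is not installed, not the library's implementation, and `hamming_distance` is a hypothetical name.

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming_distance("karolin", "kathrin"))  # 3
```

A pure-Python version like this will be much slower than a compiled one, so it is useful mainly for checking results on small inputs.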
