Question

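I'm learning Python and have run into a performance problem. For a single dictionary, I want to remove a key if

the key is a substring of another key.

I don't want to remove a key if

the only key it is a substring of is itself.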

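My keys are unique strings, mostly between 3 and 50 characters long. The dictionary I'm working with has 100,000 or more items, which means billions of comparisons. Since this is an O(n^2) problem, should I stop trying to optimize this code, or is there room to make progress here?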

A dictionary is preferable, but I'm open to other types.

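For example, 'hello' contains 'he' and a second, shorter key. I want to remove both of those shorter keys while keeping 'hello'. I want to remove prefixes, suffixes, and keys that occur in the middle of other keys.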

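The keys are generated one at a time and added to the dictionary. Then reduce_dict(dictionary) is run. My assumption is that testing the keys as they are added to the dictionary would be just as slow as testing them afterwards with a function, as in the code below.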

def reduce_dict(dictionary):
    reduced = dictionary.copy()
    for key in dictionary:
        for key2 in dictionary:
            if key != key2:
                if key2 in key:
                    reduced.pop(key2, 0)
    return reduced

Answer 1

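Given that your strings are fairly short, you could store a hash set of all possible substrings for each key. That way, for a given substring, you can find all keys that contain it in O(N) time. The tradeoff is a higher time complexity for insertion, since you have to build a set of substrings for every new key.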
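A minimal sketch of this idea, assuming non-empty string keys. The helper names proper_substrings and reduce_dict_with_substring_set are illustrative and not part of the answer. It collects every proper substring of every key in a single set, then keeps only the keys that do not appear in that set. A 50-character key contributes up to roughly 1,275 substrings, so memory use can become large with 100,000 keys.

```python
def proper_substrings(s):
    """Yield every non-empty substring of s that is shorter than s itself."""
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if j - i < n:  # skip s itself
                yield s[i:j]


def reduce_dict_with_substring_set(dictionary):
    """Return a copy of dictionary without keys that occur inside another key."""
    seen = set()
    for key in dictionary:
        seen.update(proper_substrings(key))
    # A key found in `seen` is a proper substring of some (longer) key.
    return {k: v for k, v in dictionary.items() if k not in seen}


example = {'hello': 1, 'he': 2, 'llo': 3, 'world': 4}
reduced = reduce_dict_with_substring_set(example)
```

The work per key depends only on the key's length, not on the number of keys, so the overall cost grows linearly with the dictionary size at the price of the extra set.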

Answer 2

I think you can build the set of "good" keys (= those that are not substrings of other keys) in a slightly optimized way:

# keys = yourDict.keys(), e.g.
keys = ['low', 'el', 'helloworld', 'something', 'ellow', 'thing', 'blah', 'thingy']

# flt is [[key, is_substring],...] sorted by key length reversed
flt = [[x, 0] for x in sorted(keys, key=len, reverse=True)]

for i in range(len(flt)):
    p = flt[i]
    if p[1]:  # already removed
        continue
    for j in range(i + 1, len(flt)): # iterate over shorter strings
        q = flt[j]
        if not q[1] and q[0] in p[0]: # if not already removed and is substring
            q[1] = 1  # remove

goodkeys = set(x[0] for x in flt if not x[1])
print(goodkeys)  # e.g. {'helloworld', 'something', 'thingy', 'blah'}

Now the deletion is trivial:

newdict = {k:olddict[k] for k in goodkeys}

Answer 3

If, instead of key2 in key (i.e. "key2 is a substring of key"), you change your requirement to "key2 is a prefix of key" (as in your example), you can use a trie for efficient prefix checks. See this answer.

First, define make_trie as in the linked answer:

_end = '_end_'

def make_trie(*words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict = current_dict.setdefault(_end, _end)
    return root

Then define a function similar to in_trie from the linked answer, but one that checks whether a key is a strict prefix of another key:

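def is_strict_prefix_of_word_in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if letter in current_dict:
            current_dict = current_dict[letter]
        else:
            return False  # word does not occur in the trie at all
    # Every dictionary key is itself in the trie, so finding _end here says
    # nothing. word is a strict prefix only if the trie continues past it,
    # i.e. this node has a child other than the _end marker.
    return any(child != _end for child in current_dict)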

Finally, perform the deletion:

def reduce_dict(dictionary):
    # make_trie takes the words as separate arguments, so unpack the keys
    trie = make_trie(*dictionary)
    reduced = dictionary.copy()
    for key in dictionary:
        if is_strict_prefix_of_word_in_trie(trie, key):
            reduced.pop(key, 0)
    return reduced

Or you can use a dictionary comprehension:

def reduce_dict(dictionary):
    # make_trie takes the words as separate arguments, so unpack the keys
    trie = make_trie(*dictionary)
    return {key: value for (key, value) in dictionary.items()
            if not is_strict_prefix_of_word_in_trie(trie, key)}
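To make the limits of the prefix-only approach concrete, here is a self-contained sketch that restates the trie helpers in compact form and runs them on a small dictionary. The names is_strict_prefix and reduce_dict_prefixes and the sample data are illustrative assumptions. Note that only prefix keys are removed: a suffix key such as 'llo' survives, so this does not cover the full substring requirement from the question.

```python
_end = '_end_'


def make_trie(*words):
    """Build a nested-dict trie; _end marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[_end] = _end
    return root


def is_strict_prefix(trie, word):
    """True if some longer word in the trie starts with word."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    # The trie continues past word if the node has a non-marker child.
    return any(child != _end for child in node)


def reduce_dict_prefixes(dictionary):
    trie = make_trie(*dictionary)
    return {k: v for k, v in dictionary.items()
            if not is_strict_prefix(trie, k)}


d = {'hello': 1, 'he': 2, 'llo': 3, 'world': 4}
result = reduce_dict_prefixes(d)
print(result)  # 'he' is dropped; 'llo' stays because it is a suffix, not a prefix
```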

Answer 4

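If the dictionary is static, IMHO optimizing the operation is pointless: it will run only once, and that takes less time than you would need to carefully write and test the optimization.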

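If the dictionary is dynamic, you could try storing a timestamp in the values or, if it makes sense, keeping a list of the keys that have already been cleaned. Then, when you run the cleaning process again, you have two sets of keys: the already processed keys (of size n1) and the new keys (of size n2). You only need to check: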

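whether a new key is a substring of an old key
whether an old key is a substring of a new key
whether a new key is a substring of another new key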

So you have n2 * (n2 + 2 * n1) comparisons. If n >> n2 (with n the total number of keys), that is O(n * n2 * 2).
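The three checks above can be sketched as follows. This assumes the caller keeps a set of keys that survived the previous cleaning pass; the function name clean_incremental and that bookkeeping set are hypothetical, not something the answer specifies.

```python
def clean_incremental(dictionary, cleaned_keys):
    """Remove substring keys, comparing only pairs that involve a new key.

    cleaned_keys is the set of keys left over from the previous pass, so no
    old key is a substring of another old key. The dictionary is modified in
    place and the new set of cleaned keys is returned.
    """
    old_keys = [k for k in dictionary if k in cleaned_keys]
    new_keys = [k for k in dictionary if k not in cleaned_keys]
    to_remove = set()
    for new in new_keys:
        for old in old_keys:
            if new in old:        # new key is a substring of an old key
                to_remove.add(new)
            if old in new:        # old key is a substring of a new key
                to_remove.add(old)
        for other in new_keys:
            if other != new and new in other:  # new key inside another new key
                to_remove.add(new)
    for key in to_remove:
        del dictionary[key]
    return set(dictionary)


d = {'hello': 1, 'world': 2}
cleaned = {'hello', 'world'}          # result of an earlier full pass
d.update({'he': 3, 'helloworld': 4, 'xyz': 5})
cleaned = clean_incremental(d, cleaned)
```

Old keys are never compared with each other, which is where the saving over a full O(n²) pass comes from.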

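Alternatively, if adding an element to the dictionary is not a time-critical operation (and not an interactive one either), you can run the test on every addition in O(2n), without storing anything extra (neither a list of cleaned keys nor timestamps).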

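Indeed, if you clean your dictionary once with a simple O(n²) algorithm and then check keys as new elements are generated, you can safely assume that no existing key is a substring of another. You only need to test: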

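whether the new key is a substring of an existing key: n operations in the worst case (which is probably also the most common case)
whether an existing key is a substring of the new key: n operations in all cases.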

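The only requirement is that you must never try to add a key before the previous one has been fully processed. That is probably a given if only one thread in one process adds to the dictionary; if not, you will need synchronization.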
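A minimal single-threaded sketch of this check-on-insert idea. The function name add_key and its return convention are assumptions for illustration. It relies on the invariant described above: before each call, no key in the dictionary is a substring of another.

```python
def add_key(dictionary, new_key, value):
    """Insert new_key while keeping the invariant that no key is a substring
    of another key. Returns True if the key was stored, False if rejected."""
    if new_key in dictionary:
        dictionary[new_key] = value   # same key again: just update the value
        return True
    # Test 1: is the new key a substring of an existing key?
    if any(new_key in existing for existing in dictionary):
        return False
    # Test 2: which existing keys are substrings of the new key?
    shorter = [k for k in dictionary if k in new_key]
    for k in shorter:
        del dictionary[k]
    dictionary[new_key] = value
    return True


d = {}
results = [
    add_key(d, 'he', 1),
    add_key(d, 'llo', 2),
    add_key(d, 'hello', 3),   # replaces 'he' and 'llo'
    add_key(d, 'ell', 4),     # rejected: already contained in 'hello'
    add_key(d, 'world', 5),
]
```

If several threads add keys, the whole function body would need to run under a lock so that one insertion finishes before the next begins.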

Answer 5

Since the keys are strings, you can use the find method to detect keys that are substrings of other keys and delete them.

If d is the dictionary,

d = {'hello': 1, 'he': 2, 'llo': 3, 'world': 4, 'wor': 5, 'ld': 6, 'python': 2.7}

keys = list(d)  # snapshot, so deleting from d while looping is safe
for key in keys:
    for sub in keys:
        if key != sub and key.find(sub) >= 0:
            d.pop(sub, None)  # sub may already have been removed

d will then be

{'python': 2.7, 'world': 4, 'hello': 1}

Delete dictionary keys if they are substrings of any other key
