Question

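I am curious why removing one line from my code results in a significant performance increase. The function itself takes a dictionary and removes all keys that are substrings of other keys.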

The line that slows my code down is:

if sub in reduced_dict and sub2 in reduced_dict:

Here is my function:

from collections import defaultdict
import time

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub in reduced_dict and sub2 in reduced_dict: # Removing this line gives a significant performance boost
                            if sub in sub2:
                                reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict

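The function checks whether sub is in sub2 many times. I assumed that checking whether this comparison had already been made would save me time. That does not seem to be the case. Why does a constant-time dictionary lookup slow me down?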

I am a beginner, so I am interested in the concepts.

When I tested whether the line in question ever returns False, it appears that it does. I tested this with the following:

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            print 'not present' # This line prints many thousands of times
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict

For the 14,805 keys in the function's input dictionary:

19.6360001564 seconds without the line
33.1449999809 seconds with the line

Here are 3 sample dictionaries: the biggest sample dictionary with 14805 keys, a medium sample dictionary and a smaller sample dictionary.

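I plotted run time in seconds (Y) against input size in number of keys (X) for the first 14,000 keys of the biggest sample dictionary. All of these functions appear to have exponential complexity.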

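John Zwinck answer: answer to this question
My algorithm for this problem without the dict comparison
Matt exponential: my first attempt at this problem. It took 76 s
Matt compare: the algorithm in this question, with the dict comparison line
tdelaney solution: solution to this question. Algorithm 1 &
georg solution: from a related question I asked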


The accepted answer executes in apparently linear time.

Marcelo Cantos

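I was surprised to find that there is a magic ratio of input size at which the run time of a dict lookup == the run time of a string search.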
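A rough way to compare the two costs directly is a timeit sketch like the one below. The data is made up (hypothetical keys, not the sample dictionaries above), the function names are illustrative, and it uses Python 3 syntax, so the numbers only hint at relative cost on one machine.

```python
import timeit

# Hypothetical data: 15,000 short keys, roughly the size of the biggest sample.
reduced_dict = {"key%d" % i: i for i in range(15000)}
sub, sub2 = "key12", "key1234"


def time_dict_check(number=200000):
    # Two hash lookups, as in the guard line from the question.
    return timeit.timeit(
        lambda: sub in reduced_dict and sub2 in reduced_dict, number=number)


def time_substring_check(number=200000):
    # One substring search on two short strings.
    return timeit.timeit(lambda: sub in sub2, number=number)


print("dict guard:     %.4f s" % time_dict_check())
print("substring test: %.4f s" % time_substring_check())
```

For short keys the substring test is itself very cheap, so a guard made of two dict lookups may cost about as much as the work it is meant to avoid.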

Answer 1

For the sample corpus, or any corpus in which most keys are small, it is much faster to test all possible subkeys:

def reduced(dictionary):
    keys = set(dictionary.iterkeys())
    subkeys = set()
    for key in keys:
        for n in range(1, len(key)):
            for i in range(len(key) + 1 - n):
                subkey = key[i:i+n]
                if subkey in keys:
                    subkeys.add(subkey)

    return {k: v
            for (k, v) in dictionary.iteritems()
            if k not in subkeys}

This takes about 0.2 s on my system (i7-3720QM 2.6GHz).
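The code above is Python 2 (iterkeys, iteritems). A minimal sketch of the same idea in Python 3 might look like this; the name reduced_py3 and the sample data are made up for illustration.

```python
def reduced_py3(dictionary):
    """Drop every key that is a proper substring of another key."""
    keys = set(dictionary)
    subkeys = set()
    for key in keys:
        # n is the substring length; range() stops before len(key), so a
        # key is never counted as a substring of itself.
        for n in range(1, len(key)):
            for i in range(len(key) + 1 - n):
                subkey = key[i:i + n]
                if subkey in keys:
                    subkeys.add(subkey)
    return {k: v for k, v in dictionary.items() if k not in subkeys}


# Hypothetical sample data.
sample = {"ab": 1, "abc": 2, "bc": 3, "xyz": 4}
print(reduced_py3(sample))  # {'abc': 2, 'xyz': 4}
```

The work per key grows with the square of the key length rather than with the number of keys, which is why this approach suits corpora of short keys.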

Answer 2

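You create len_dict, but even though it groups keys of equal size, you still have to traverse everything multiple times to compare. Your basic plan is right: sort by size and only compare against keys of the same or larger size. But there are other ways to do that. Below, I just created a regular list sorted by key size and then iterated backwards so that I could trim the dictionary as I went. I'm curious how its execution time compares with yours. It processed your small sample dictionary in .049 seconds.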

(I hope it actually worked!)

def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2,-1,-1):
        k = items[i][0]
        for k_fwd,v_fwd in items[i+1:]:
            if k in k_fwd:
                del items[i]
                break
    return dict(items)

Edit

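Got a notable speed increase by not unpacking k_fwd, v_fwd (after running both versions a few times, this wasn't really a speed-up; something else must have been taking time on my PC for a while).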

def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2,-1,-1):
        k = items[i][0]
        for kv_fwd in items[i+1:]:
            if k in kv_fwd[0]:
                del items[i]
                break
    return dict(items)
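Note that d.items() returns a list only in Python 2. In Python 3 it returns a view with no sort method, so a port would need sorted(). A minimal sketch, with the name myfilter_py3 chosen here for illustration:

```python
def myfilter_py3(d):
    # sorted() builds a new list of (key, value) pairs, shortest keys first.
    items = sorted(d.items(), key=lambda kv: len(kv[0]))
    # Walk backwards from the second-to-last item so that deleting
    # items[i] never shifts an index that is still to be visited.
    for i in range(len(items) - 2, -1, -1):
        k = items[i][0]
        # Only keys later in the list (same length or longer) can contain k.
        if any(k in longer for longer, _ in items[i + 1:]):
            del items[i]
    return dict(items)


# Hypothetical sample data.
print(myfilter_py3({"ab": 1, "abc": 2, "bc": 3, "xyz": 4}))  # {'abc': 2, 'xyz': 4}
```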

Answer 3

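I would do it a bit differently. Here is a generator function that gives you only the "good" keys. This avoids creating a dict which may be largely destroyed key by key. I also have just two levels of "for" loops and some simple optimizations to try to find matches more quickly and avoid searching for impossible matches.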

def reduced_keys(dictionary):
    keys = dictionary.keys()
    keys.sort(key=len, reverse=True) # longest first for max hit chance
    for key1 in keys:
        found_in_key2 = False
        for key2 in keys:
            if len(key2) <= len(key1): # no more keys are long enough to match
                break
            if key1 in key2:
                found_in_key2 = True
                break
        if not found_in_key2:
            yield key1

If you want to use this to make an actual dict, you can:

{ key: d[key] for key in reduced_keys(d) }
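For reference, here is a self-contained sketch of the generator together with that dict comprehension, written for Python 3 (where dictionary.keys() cannot be sorted in place). The name reduced_keys_py3 and the sample dictionary are assumptions for illustration.

```python
def reduced_keys_py3(dictionary):
    # Longest keys first, so likely containers are tried early.
    keys = sorted(dictionary, key=len, reverse=True)
    for key1 in keys:
        found_in_key2 = False
        for key2 in keys:
            if len(key2) <= len(key1):
                break  # remaining keys are too short to contain key1
            if key1 in key2:
                found_in_key2 = True
                break
        if not found_in_key2:
            yield key1


# Hypothetical sample data.
d = {"ab": 1, "abc": 2, "bc": 3, "xyz": 4}
reduced_d = {key: d[key] for key in reduced_keys_py3(d)}
print(reduced_d)  # {'abc': 2, 'xyz': 4}
```

Because the function is a generator, the surviving keys can also be consumed one at a time without building the reduced dict at all.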

Performance of my function that removes dictionary keys which are substrings of other keys
