Question

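Suppose I have a dictionary with strings as keys and integers as values. The keys are the distinct strings encountered, and the values are the number of times each one was encountered.

For example, "word word word" would produce: {"word" : 3}

Given the variables: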

item -> our dictionary
string -> word encountered

if string in item:
    # increase existing keys' value by 1
    item.update({string: item.get(string) + 1})

else:
    # create the key and initialize value to 1
    item.update({string : 1})

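This algorithm is slow because it hashes twice, once for string in item and once for the call to update. It would be faster if Python hashed once to check whether the string exists in item, and then either increased the value by 1 if the key exists or created the key and set the value to 1.

In Java, the corresponding method is: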

item.merge(string, 1, Integer::sum)

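This reduces the code from an if-else statement to a single line and, again, skips the extra hashing. I'm just wondering whether such a method exists in Python 3.

Thanks in advance!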

Answer 1

I did some timing analysis using different ways of filling the dictionary. First, the setup:

import collections, re    
lorem = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
words = re.findall(r"\w+", lorem.lower())

Now the different approaches: your method using update, a plain += after an in check, get with a default, and finally defaultdict and Counter:

def f1():
    d = {}
    for w in words:
        if w in d:
            d.update({w: d[w] + 1})
        else:
            d.update({w: 1})
    return d

def f2():
    d = {}
    for w in words:
        if w in d:
            d[w] += 1
        else:
            d[w] = 1
    return d

def f3():
    d = {}
    for w in words:
        d[w] = d.get(w, 0) + 1
    return d

def f4():
    d = collections.defaultdict(int)
    for w in words:
        d[w] += 1
    return d

def f5():
    return collections.Counter(words)

They all produce the same result, although the last two use subclasses of dict:

In [41]: f1() == f2() == f3() == f4() == f5()
Out[41]: True

Using update here is quite wasteful; the in check with += is the fastest, while defaultdict and Counter are shorter but also slower.

In [42]: %timeit f1()
10000 loops, best of 3: 81.8 us per loop

In [43]: %timeit f2()
10000 loops, best of 3: 24.8 us per loop

In [44]: %timeit f3()
10000 loops, best of 3: 40.8 us per loop

In [45]: %timeit f4()
10000 loops, best of 3: 52.6 us per loop

In [46]: %timeit f5()
10000 loops, best of 3: 104 us per loop

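Note, however, that in this sample text most words occur only once, which may skew the test. Using words = words * 100, we get the following, with Counter still on the slow side and defaultdict the fastest.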

In [2]: %timeit f1()
100 loops, best of 3: 8.21 ms per loop

In [3]: %timeit f2()
100 loops, best of 3: 2.76 ms per loop

In [4]: %timeit f3()
100 loops, best of 3: 3.58 ms per loop

In [5]: %timeit f4()
100 loops, best of 3: 2.13 ms per loop

In [6]: %timeit f5()
100 loops, best of 3: 6.11 ms per loop
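Outside IPython, a comparison like this can be run with the standard timeit module. The following is only a minimal sketch: the word list and the function names are stand-ins introduced here for illustration, not the data or names used above, and the absolute numbers depend heavily on the machine and the Python version.

```python
import timeit
from collections import Counter, defaultdict

# Assumed stand-in data; substitute the `words` list from the setup above.
words = "the quick brown fox jumps over the lazy dog the end".split() * 100

def count_in_check():
    # `in` check followed by += (corresponds to f2)
    d = {}
    for w in words:
        if w in d:
            d[w] += 1
        else:
            d[w] = 1
    return d

def count_defaultdict():
    # defaultdict(int) supplies 0 for missing keys (corresponds to f4)
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return d

def count_counter():
    # Counter does the whole loop internally (corresponds to f5)
    return Counter(words)

results = {}
for fn in (count_in_check, count_defaultdict, count_counter):
    # Best of 3 repeats, 100 calls each; report the time per call.
    best = min(timeit.repeat(fn, number=100, repeat=3))
    results[fn.__name__] = best / 100
    print(f"{fn.__name__}: {results[fn.__name__] * 1e6:.1f} us per call")
```

Taking the minimum of several repeats mirrors the "best of 3" figure that %timeit prints above. The ranking may well differ from the numbers quoted here on other interpreters, so it is worth measuring on your own data.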

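Personally, though, I'd use Counter: the difference in running time is probably not a big deal, it is the shortest, the intent is immediately clear, and it also provides some useful helper methods, such as getting the most common entries.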
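As a small illustration of those helper methods, here is a sketch using a made-up sentence (the text is an assumption chosen for the example, not data from the question):

```python
from collections import Counter

text = "the cat and the dog and the bird"
counts = Counter(text.split())

print(counts.most_common(2))   # [('the', 3), ('and', 2)]
print(counts["fish"])          # 0: a missing key counts as zero, no KeyError

# update() adds to the existing counts instead of replacing them
counts.update("the fish".split())
print(counts["the"], counts["fish"])  # 4 1
```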

Answer 2

The idiomatic way would be:

from collections import defaultdict
d = defaultdict(int)
for word in "word word word".split():
    d[word] += 1
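If something shaped more like the Java merge call from the question is wanted, a small helper can be sketched on top of a plain dict. Note that merge below is a hypothetical function written for this example, not a built-in dict method, and it only loosely mirrors Java's semantics (for instance, it never removes a key). It shortens the call site but still does an in check followed by an assignment.

```python
import operator

def merge(d, key, value, func):
    """Hypothetical helper: combine `value` into d[key] using `func`,
    or store `value` as-is when the key is not present yet."""
    if key in d:
        d[key] = func(d[key], value)
    else:
        d[key] = value
    return d[key]

counts = {}
for word in "word word other".split():
    merge(counts, word, 1, operator.add)

print(counts)  # {'word': 2, 'other': 1}
```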

Python dictionary, looking for a specific method