Question

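I am trying to count how many words in a list start with each letter of the alphabet. I have tried many things and nothing seems to work. The end result should look like this: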

list = ['the', 'big', 'bad', 'dog']
a: 0
b: 2
c: 0
d: 1

I think I should be using a dictionary for this, right?

Answer 1

from collections import Counter
print(Counter(s[0] for s in ['the', 'big', 'bad', 'dog']))
# Counter({'b': 2, 't': 1, 'd': 1})

If you want the zeros as well, you can do this:

import string

# Start every ASCII letter (lowercase and uppercase) at zero.
di = dict.fromkeys(string.ascii_letters, 0)
for word in ['the', 'big', 'bad', 'dog']:
    di[word[0]] += 1

print(di)

If you want 'A' to be counted the same as 'a':

di = dict.fromkeys(string.ascii_lowercase, 0)
for word in ['the', 'big', 'bad', 'dog']:
    di[word[0].lower()] += 1
print(di)
# Output on Python 3.7+, where dicts keep insertion order:
# {'a': 0, 'b': 2, 'c': 0, 'd': 1, 'e': 0, 'f': 0, 'g': 0, 'h': 0, 'i': 0, 'j': 0, 'k': 0, 'l': 0, 'm': 0, 'n': 0, 'o': 0, 'p': 0, 'q': 0, 'r': 0, 's': 0, 't': 1, 'u': 0, 'v': 0, 'w': 0, 'x': 0, 'y': 0, 'z': 0}

You can combine the two:

c = Counter(dict.fromkeys(string.ascii_lowercase, 0))
c.update(s[0].lower() for s in ['the', 'big', 'bad', 'dog'])
print(c)
# Output on Python 3.7+:
# Counter({'b': 2, 'd': 1, 't': 1, 'a': 0, 'c': 0, 'e': 0, 'f': 0, 'g': 0, 'h': 0, 'i': 0, 'j': 0, 'k': 0, 'l': 0, 'm': 0, 'n': 0, 'o': 0, 'p': 0, 'q': 0, 'r': 0, 's': 0, 'u': 0, 'v': 0, 'w': 0, 'x': 0, 'y': 0, 'z': 0})
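
To print the counts in the layout shown in the question (one `letter: count` line per letter), something like the following sketch could work. The name `format_counts` is made up for illustration, and the sketch assumes every word is a non-empty string; words that start with anything other than an ASCII letter are simply not reported.

```python
from collections import Counter
from string import ascii_lowercase


def format_counts(words):
    """Return one 'letter: count' line per lowercase ASCII letter."""
    # Count first letters case-insensitively; assumes no empty strings.
    counts = Counter(word[0].lower() for word in words)
    # A Counter returns 0 for missing keys, so no pre-filling is needed.
    return ["{}: {}".format(letter, counts[letter]) for letter in ascii_lowercase]


lines = format_counts(['the', 'big', 'bad', 'dog'])
print("\n".join(lines[:4]))
# a: 0
# b: 2
# c: 0
# d: 1
```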

Answer 2

from string import ascii_lowercase

myList = ["the", "big", "bad", "dog"]
d = dict.fromkeys(ascii_lowercase, 0)
for item in myList:
    d[item[0]] += 1
print(d)

Output (Python 3.7+, where dicts keep insertion order):

{'a': 0, 'b': 2, 'c': 0, 'd': 1, 'e': 0, 'f': 0, 'g': 0, 'h': 0, 'i': 0, 'j': 0, 'k': 0, 'l': 0, 'm': 0, 'n': 0, 'o': 0, 'p': 0, 'q': 0, 'r': 0, 's': 0, 't': 1, 'u': 0, 'v': 0, 'w': 0, 'x': 0, 'y': 0, 'z': 0}
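
Note that the loop in this answer raises `KeyError` if a word starts with an uppercase letter, a digit, or punctuation, and `IndexError` on an empty string. A more defensive sketch follows; the helper name `count_initials` is hypothetical, and it assumes such words should simply be skipped rather than reported.

```python
from string import ascii_lowercase


def count_initials(words):
    """Count words by first letter, ignoring case (hypothetical helper)."""
    counts = dict.fromkeys(ascii_lowercase, 0)
    for word in words:
        if not word:
            continue  # skip empty strings (word[0] would raise IndexError)
        first = word[0].lower()
        if first in counts:  # skip digits, punctuation, non-ASCII letters
            counts[first] += 1
    return counts


result = count_initials(['The', 'big', '', 'Bad', 'dog', '42nd'])
print(result['t'], result['b'], result['d'], result['a'])
# 1 2 1 0
```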

Answer 3

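Just as a note, I want to show that there are options in the Python universe beyond the standard collections or itertools built-ins, for example third-party libraries such as pandas. I consider this a secondary, complementary answer, not the primary one.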

The pandas website is here:

http://pandas.pydata.org/

pandas is easy to install with pip:

$ pip install pandas

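The purpose of pandas is to make data analysis fast and syntactically sweet, as you would expect from R or a spreadsheet program such as Microsoft Excel. It is under continual development by Wes McKinney and a small team of other contributors, and is released under a BSD license, which means it is generally free to use in your own projects, commercial or otherwise, as long as you attribute it correctly.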

One advantage of pandas is that its syntax for this case is very clear (value_counts) and its implementation is very fast, considerably faster than plain Python:

from pandas import Series

sample_list = ['the', 'big', 'bad', 'dog']
s = Series([word[0] for word in sample_list])
s.value_counts()

Returns:

b    2
d    1
t    1

Let's compare timings on a larger word list:

In [19]: len(big_words)
Out[19]: 229779

A pandas implementation:

def count_first(words):
    s = Series([word[0] for word in words])
    return s.value_counts()

In [15]: %timeit count_first(big_words)
10 loops, best of 3: 29.6 ms per loop

The accepted answer above:

from collections import Counter

def counter_first(words):
    return Counter(s[0] for s in words)

%timeit counter_first(big_words)
10 loops, best of 3: 105 ms per loop

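This is much faster, even with the list conversion done inside the function. Forcing that conversion is arguably unfair to pandas, so let's assume we start the problem with a Series already in hand.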

In [20]: s = Series([word[0] for word in words])

In [21]: %timeit s.value_counts()
1000 loops, best of 3: 406 µs per loop

That is a 258.6x speedup.
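
The two comparisons measure different things, so it may help to work out both ratios from the timings quoted above. The snippet below only redoes the arithmetic with those figures; it does not rerun any benchmark, and the variable names are chosen here for illustration.

```python
# Timings quoted above, converted to seconds.
counter_time = 105e-3        # Counter over the raw word list
pandas_with_list = 29.6e-3   # Series construction plus value_counts
pandas_series_only = 406e-6  # value_counts on an existing Series

ratio_with_list = counter_time / pandas_with_list
ratio_series_only = counter_time / pandas_series_only

print(round(ratio_with_list, 1))    # 3.5
print(round(ratio_series_only, 1))  # 258.6
```

The larger figure leaves out the cost of building the Series, while the Counter timing still includes iterating over the raw list.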

When would I consider using pandas instead of Counter?

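A good example would be a spam classifier. If you are approaching a natural language processing problem and need to analyze word choice through the relative incidence of words starting with each letter, and you have millions of words, then using pandas to speed up the processing of thousands of emails and/or websites could matter a lot.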

The bottom line is that pandas is the higher-performance library, but it takes some package management (Python-based or OS-based) to obtain.

Answer 4

In [63]: %%timeit
....: from collections import defaultdict
....: fq = defaultdict( int )
....: for word in words:
....:        fq[word[0].lower()] += 1
....:
10 loops, best of 3: 138 ms per loop


In [64]: %%timeit
....: from collections import Counter
....: r = Counter(word[0].lower() for word in words)
....:
1 loops, best of 3: 287 ms per loop

In [65]: len(words)
Out[65]: 235886

The word list comes from /usr/share/dict/words. The IPython timeit function was used for the demonstration above.

In [68]: fq
Out[68]:defaultdict(<type 'int'>, {'a': 17096, 'c': 19901, 'b': 11070, 'e': 8736, 'd': 10896, 'g': 6861, 'f': 6860, 'i': 8799, 'h': 9027, 'k': 2281, 'j': 1642, 'm': 12616, 'l': 6284, 'o': 7849, 'n': 6780, 'q': 1152, 'p': 24461, 's': 25162, 'r': 9671, 'u': 16387, 't': 12966, 'w': 3944, 'v': 3440, 'y': 671, 'x': 385, 'z': 949})

I would suggest using defaultdict, since it is the straightforward approach and also faster.

In [69]: %%timeit
....: d = {}
....: for word in words:
....:        key = word[0].lower()
....:        if key in d:
....:                d[key] += 1
....:        else:
....:                d[key] = 1
....:
1 loops, best of 3: 177 ms per loop

The plain dict approach also appears to be faster than Counter, at the cost of a few extra lines of code.
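
Outside IPython, a similar comparison can be approximated with the standard-library `timeit` module. This is a rough sketch only: the word list is synthetic (a stand-in for /usr/share/dict/words), the function names are invented here, and absolute timings will vary by machine and Python version, so no timing output is shown.

```python
import timeit
from collections import Counter, defaultdict

# Synthetic stand-in for a real word list; an assumption for illustration.
words = ['the', 'Big', 'bad', 'dog', 'Apple', 'zebra'] * 10000


def with_defaultdict(words):
    fq = defaultdict(int)
    for word in words:
        fq[word[0].lower()] += 1
    return fq


def with_counter(words):
    return Counter(word[0].lower() for word in words)


def with_plain_dict(words):
    # dict.get variant of the if/else loop shown above
    d = {}
    for word in words:
        key = word[0].lower()
        d[key] = d.get(key, 0) + 1
    return d


# All three approaches should agree on the counts.
assert with_defaultdict(words) == with_counter(words) == with_plain_dict(words)

for func in (with_defaultdict, with_counter, with_plain_dict):
    seconds = timeit.timeit(lambda: func(words), number=5)
    print("{}: {:.3f} s for 5 runs".format(func.__name__, seconds))
```

Checking that the three functions return equal counts before timing them guards against comparing code paths that do different work.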
