I am using collections.Counter to count the words in a string:
s = """Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."""
from collections import Counter

lorem = s.lower().split()
Note that this is smaller than the actual string I tried, but the conclusion generalizes.
%%timeit
dcomp = Counter(lorem)
# 8 µs ± 329 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
If I use this instead (the same logic as the source code in cpython/Lib/collections/__init__.py):
%%timeit
d = dict()
get = d.get
for w in lorem:
    d[w] = get(w, 0) + 1
# 15.4 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Edit: using a function:
def count():
    d = dict()
    get = d.get
    for w in lorem:
        d[w] = get(w, 0) + 1
    return d
%%timeit
count()
# Still significantly slower. function definition not in timeit loop.
# 14 µs ± 763 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
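For larger strings the results are similar: the second approach takes roughly 1.8 to 2 times as long as the first.
The relevant part of the source code is here: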
def _count_elements(mapping, iterable):
    'Tally elements from the iterable.'
    mapping_get = mapping.get
    for elem in iterable:
        mapping[elem] = mapping_get(elem, 0) + 1
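Here mapping is the Counter instance itself, which is set up by super(Counter, self).__init__() and so behaves like a dict(). Even after I put all of the second approach into a function and called that function, the speed stayed the same. I don't understand where this speed difference comes from. Does the Python standard library get special treatment, or is there some caveat I am overlooking?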
Answer 0 (score: 2)
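Look more closely at the code in collections/__init__.py. As you showed, it does define _count_elements, but it then tries to execute from _collections import _count_elements. This indicates that the function is imported from a C library, which is more heavily optimized and therefore faster. The Python implementation is used only if the C version cannot be found.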
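One way to check this on a given interpreter is sketched below. It assumes CPython, where the private name collections._count_elements is normally rebound to the compiled helper. The names py_count_elements and uses_compiled_helper, and the sample word list, are made up for illustration. The sketch only reports what it finds, and the timings will vary by machine.

```python
import collections
import timeit
import types
from collections import Counter


def py_count_elements(mapping, iterable):
    """Pure-Python tally, mirroring the fallback quoted in the question."""
    mapping_get = mapping.get
    for elem in iterable:
        mapping[elem] = mapping_get(elem, 0) + 1


def uses_compiled_helper():
    """Return True if collections._count_elements is a compiled builtin."""
    # _count_elements is a private name, so guard against it being absent.
    helper = getattr(collections, "_count_elements", None)
    return isinstance(helper, types.BuiltinFunctionType)


# Hypothetical sample input, much shorter than the text in the question.
words = "lorem ipsum dolor sit amet lorem ipsum lorem".split()

tally = {}
py_count_elements(tally, words)

if __name__ == "__main__":
    print("compiled helper in use:", uses_compiled_helper())
    print("same result:", tally == dict(Counter(words)))
    print("Counter:    ", timeit.timeit(lambda: Counter(words), number=10_000))
    print("pure Python:", timeit.timeit(lambda: py_count_elements({}, words), number=10_000))
```

If the first line prints True, Counter is tallying in C while the hand-written loop runs as interpreted bytecode, which would account for a gap like the one measured in the question.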