Question

I have the following function to count the words in a string and extract the top "n":

Function:

from collections import Counter

def count_words(s, n):
    """Return the n most frequently occurring words in s."""

    #Split words into list
    wordlist = s.split()

    #Count words
    counts = Counter(wordlist)

    #Get top n words
    top_n = counts.most_common(n)

    #Sort by first element, if tie by second
    top_n.sort(key=lambda x: (-x[1], x[0]))

    return top_n

So it sorts by number of occurrences and, if there is a tie, alphabetically. Some examples:

print count_words("cat bat mat cat cat mat mat mat bat bat cat", 3)

works (shows [('cat', 4), ('mat', 4), ('bat', 3)])

print count_words("betty bought a bit of butter but the butter was bitter", 3)

does not work (shows [('butter', 2), ('a', 1), ('bitter', 1)], but it should contain betty rather than bitter, since they are tied and be... comes before bi...)

print count_words("betty bought a bit of butter but the butter was bitter", 6)

works (shows [('butter', 2), ('a', 1), ('betty', 1), ('bitter', 1), ('but', 1), ('of', 1)], with betty before bitter)

What could be causing this (word length, maybe?) and how can I fix it?

Answer 1

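The problem is not the sort call, it is most_common. Counter is implemented as a hash table, so the order it uses is arbitrary. When you ask for most_common(n), it returns the n most common words, and if there are ties, it simply picks arbitrarily which ones to return! Your sort then only reorders the n entries that survived, so a tied word that was dropped can never come back.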
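The effect can be reproduced without depending on Counter's internal ordering. The following is a minimal sketch, assuming a hand-written list of (word, count) pairs (`pairs` is hypothetical and just stands in for whatever order the counter happens to iterate in). It compares truncating before the tie-break with truncating after it.

```python
# Hypothetical (word, count) pairs in an arbitrary order; this stands in
# for the iteration order of a Counter, which the caller does not control.
pairs = [("butter", 2), ("bitter", 1), ("a", 1), ("betty", 1), ("of", 1)]

# Sort key used throughout the thread: highest count first, then alphabetical.
key = lambda kv: (-kv[1], kv[0])

# Order by count only (ties keep their incoming order), cut to 3,
# and only then apply the alphabetical tie-break.
by_count_only = sorted(pairs, key=lambda kv: -kv[1])
truncate_then_sort = sorted(by_count_only[:3], key=key)

# Apply the full key to every pair first, and cut to 3 afterwards.
sort_then_truncate = sorted(pairs, key=key)[:3]

print(truncate_then_sort)   # 'betty' was cut before the tie-break could see it
print(sort_then_truncate)   # 'betty' is kept
```

In the first variant the alphabetical rule is applied to only three survivors, so it cannot bring back a word that was already dropped.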

The simplest way to fix this is to avoid most_common and sort the full list of items directly:

top_n = sorted(counts.items(), key=lambda x: (-x[1], x[0]))[:n]
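For reference, here is how that one-liner might look when dropped into the original function. This is a sketch that assumes the function name and signature from the question, and it checks the three inputs the question used.

```python
from collections import Counter

def count_words(s, n):
    """Return the n most frequently occurring words in s."""
    counts = Counter(s.split())
    # Sort every (word, count) pair by descending count, then by word,
    # and only afterwards keep the first n entries.
    return sorted(counts.items(), key=lambda x: (-x[1], x[0]))[:n]

sentence = "betty bought a bit of butter but the butter was bitter"
print(count_words("cat bat mat cat cat mat mat mat bat bat cat", 3))
print(count_words(sentence, 3))
print(count_words(sentence, 6))
```

Sorting all items costs more than selecting only the top n, but for inputs of this size the difference is unlikely to matter.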

Answer 2

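You are asking for the top 3, and so you cut the data before the items have been picked in your specific sort order.

Rather than have most_common() pre-sort and then re-sort, use heapq to sort by your custom criteria (provided n is smaller than the actual number of buckets):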

import heapq
from collections import Counter

def count_words(s, n):
    """Return the n most frequently occurring words in s."""
    counts = Counter(s.split())
    key = lambda kv: (-kv[1], kv[0])
    if n >= len(counts):
        return sorted(counts.items(), key=key)
    return heapq.nsmallest(n, counts.items(), key=key)

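On Python 2, you may want to use iteritems() rather than items() for the calls above.

This re-creates the Counter.most_common() method, but with the updated key. Like the original, using heapq makes sure this is bound to O(NlogK) performance rather than O(NlogN) (with N the number of buckets, and K the number of top elements you want to see).

Demo: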

>>> count_words("cat bat mat cat cat mat mat mat bat bat cat", 3)
[('cat', 4), ('mat', 4), ('bat', 3)]
>>> count_words("betty bought a bit of butter but the butter was bitter", 3)
[('butter', 2), ('a', 1), ('betty', 1)]
>>> count_words("betty bought a bit of butter but the butter was bitter", 6)
[('butter', 2), ('a', 1), ('betty', 1), ('bit', 1), ('bitter', 1), ('bought', 1)]

A quick performance comparison (on Python 3.6.0b1):

>>> from collections import Counter
>>> from heapq import nsmallest
>>> from random import choice, randrange
>>> from timeit import timeit
>>> from string import ascii_letters
>>> sentence = ' '.join([''.join([choice(ascii_letters) for _ in range(randrange(3, 15))]) for _ in range(1000)])
>>> counts = Counter(sentence)  # count letters
>>> len(counts)
53
>>> key = lambda kv: (-kv[1], kv[0])
>>> timeit('sorted(counts.items(), key=key)[:3]', 'from __main__ import counts, key', number=100000)
2.119404911005404
>>> timeit('nsmallest(3, counts.items(), key=key)', 'from __main__ import counts, nsmallest, key', number=100000)
1.9657367869949667
>>> counts = Counter(sentence.split())  # count words
>>> len(counts)
1000
>>> timeit('sorted(counts.items(), key=key)[:3]', 'from __main__ import counts, key', number=10000)  # note, 10 times fewer
6.689963405995513
>>> timeit('nsmallest(3, counts.items(), key=key)', 'from __main__ import counts, nsmallest, key', number=10000)
2.902360848005628
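The timings are only comparable if both expressions return the same result. A small check along the following lines can confirm that. The helper names `top_sorted` and `top_heap`, the seed, and the generated words are all assumptions made for this sketch.

```python
import heapq
import random
import string
from collections import Counter

random.seed(0)  # arbitrary seed, only so the run is repeatable

# 500 random two-letter words, so that many counts are tied.
words = ["".join(random.choice(string.ascii_lowercase) for _ in range(2))
         for _ in range(500)]
counts = Counter(words)

# Highest count first, then alphabetical; words are unique, so keys are too.
key = lambda kv: (-kv[1], kv[0])

def top_sorted(counts, n):
    # Full sort, then slice.
    return sorted(counts.items(), key=key)[:n]

def top_heap(counts, n):
    # Heap-based selection of the n smallest keys.
    return heapq.nsmallest(n, counts.items(), key=key)

# Compare both approaches for every n from 0 up to past the number of buckets.
agree = all(top_sorted(counts, n) == top_heap(counts, n)
            for n in range(len(counts) + 2))
print(agree)
```

Because each word appears once in the counter, no two keys compare equal, so the two approaches have no room to disagree on tie order.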

Answer 3

You can fix it by calling .most_common() with no argument, then sorting, then slicing the result, rather than passing n to most_common:

from collections import Counter

def count_words(s, n):
    """Return the n most frequently occurring words in s."""

    #Split words into list
    wordlist = s.split()

    #Count words
    counts = Counter(wordlist)

    #Sort by frequency
    top = counts.most_common()

    #Sort by first element, if tie by second
    top.sort(key=lambda x: (-x[1], x[0]))

    return top[:n]

Inconsistent sorting with sort()
