The n most frequently occurring words in a string

Date: 2016-09-13 09:36:53

Tags: python

I have the following problem:

Problem

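Implement a function count_words() in Python that takes a string s and a number n as input and returns the n most frequently occurring words in s. The return value should be a list of tuples, the top n words paired with their respective counts [(<word>, <count>), (<word>, <count>), ...], sorted in descending order of count.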

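You can assume that all input is lowercase and contains no punctuation or other characters (only letters and single separating spaces). In case of a tie (equal counts), order the tied words alphabetically.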

E.g:

print(count_words("betty bought a bit of butter but the butter was bitter", 3)) Output:

[('butter',2),('a',1),('betty',1)]

Here is my solution:

"""Count words."""

from operator import itemgetter
from collections import Counter

def count_words(s, n):
    """Return the n most frequently occurring words in s."""

    # TODO: Count the number of occurrences of each word in s
    words = s.split(" ")
    words = Counter(words)
    # TODO: Sort the occurrences in descending order (alphabetically in case of ties)
    print(words)
    # TODO: Return the top n words as a list of tuples (<word>, <count>)
    top_n = words.most_common(n)
    return top_n

def test_run():
    """Test count_words() with some inputs."""
    print(count_words("cat bat mat cat bat cat", 3))
    print(count_words("betty bought a bit of butter but the butter was bitter", 3))


if __name__ == '__main__':
    test_run()

The problem is that elements with equal counts are ordered arbitrarily. How can I order those elements alphabetically?
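For reference, a small sketch of the behaviour described above. It assumes Python 3.7 or later, where the documentation for Counter.most_common says that elements with equal counts come out in the order they were first encountered; older versions made no such promise. The variable names are illustrative only.

```python
from collections import Counter

sentence = "betty bought a bit of butter but the butter was bitter"
counts = Counter(sentence.split())

# most_common() ranks by count only. Ties keep first-seen order (Python 3.7+),
# so 'betty' and 'bought' are returned ahead of 'a', which would sort first
# alphabetically.
top_three = counts.most_common(3)
print(top_three)
```

So the tie order is tied to the input order, not to the alphabet, which is why an explicit sort key is needed.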

3 Answers:

Answer 0: (score: 3)

You can sort the items by occurrence count (in reverse order) and then by lexicographic order:

>>> lst = [('meat', 2), ('butter', 2), ('a', 1), ('betty', 1)]
>>> 
>>> sorted(lst, key=lambda x: (-x[1], x[0]))
#                              ^ reverse order 
[('butter', 2), ('meat', 2), ('a', 1), ('betty', 1)]

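The occurrence count takes precedence over the lexicographic order.

In your case, use words.items() in place of the list of tuples I used. You no longer need most_common, since sorted already orders the items by count; slice the result with [:n] to keep the top n.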
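Put together, the advice above might look like the following sketch. It is one possible reading of this answer applied to the question's function, not the answerer's own code; the names counts and ranked are illustrative.

```python
from collections import Counter


def count_words(s, n):
    """Return the n most frequent words in s as (word, count) tuples."""
    counts = Counter(s.split())
    # Negating the count sorts it in descending order, while the word itself
    # is compared ascending, so ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


print(count_words("cat bat mat cat bat cat", 3))
print(count_words("betty bought a bit of butter but the butter was bitter", 3))
```

For the second input this returns [('butter', 2), ('a', 1), ('betty', 1)], the output requested in the question.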

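Answer 1: (score: 0)

Python's sorted function is stable, which means that in case of a tie, the tied items keep their original relative order. Because of this, you can first sort by the words to get them in alphabetical order: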

alphabetical_sort = sorted(words.items(), key=lambda x: x[0])

And then by count:

final_sort = sorted(alphabetical_sort, key=lambda x: x[1], reverse=True)

Edit: I hadn't seen Moses's better answer. Of course, the fewer sorts the better.
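As a quick check of the two-pass idea, the sketch below runs both sorts on the question's sample sentence and compares the result with a single sort on a compound key. It relies on the documented guarantee that reverse=True still preserves stability; the name single_pass is illustrative.

```python
from collections import Counter

words = Counter("betty bought a bit of butter but the butter was bitter".split())

# Pass 1: order by the secondary key (the word, alphabetically).
alphabetical_sort = sorted(words.items(), key=lambda x: x[0])
# Pass 2: order by the primary key (the count, descending). The sort is
# stable even with reverse=True, so tied counts keep their pass-1 order.
final_sort = sorted(alphabetical_sort, key=lambda x: x[1], reverse=True)

# The same ordering from one sort with a compound key.
single_pass = sorted(words.items(), key=lambda x: (-x[1], x[0]))

print(final_sort[:3])
print(final_sort == single_pass)
```

The general pattern is to sort by the least significant key first and the most significant key last.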

Answer 2: (score: 0)

Here is another way to conceptualize the problem:

def count_words(s, n):
    words = s.split(" ")
    # TODO: Count the number of occurrences of each word in s
    counters = {}
    for word in words:
        if word in counters:
            counters[word] += 1
        else:
            counters[word] = 1
    # TODO: Sort the occurrences in descending order (alphabetically in case of ties)
    # items() is the Python 3 spelling; Python 2 used iteritems()
    top = sorted(counters.items(), key=lambda d: (-d[1], d[0]))

    # TODO: Return the top n words as a list of tuples (<word>, <count>)
    top_n = top[:n]
    return top_n


def test_run():
    print(count_words("cat bat mat cat bat cat", 3))
    print(count_words("betty bought a bit of butter but the butter was bitter", 3))


if __name__ == '__main__':
    test_run()
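If only a few top entries are needed from a large vocabulary, the full sort followed by a slice can be replaced with a bounded selection. The sketch below is a variation on this answer, not something proposed in the thread: it uses heapq.nsmallest from the standard library with the same compound key, and the function name count_words_heap is hypothetical.

```python
import heapq
from collections import Counter


def count_words_heap(s, n):
    """Return the n most frequent words in s, ties broken alphabetically."""
    counts = Counter(s.split())
    # nsmallest(n, iterable, key=...) gives the same result as
    # sorted(iterable, key=...)[:n]. The negated count makes the most
    # frequent words the "smallest" keys.
    return heapq.nsmallest(n, counts.items(), key=lambda d: (-d[1], d[0]))


print(count_words_heap("cat bat mat cat bat cat", 3))
print(count_words_heap("betty bought a bit of butter but the butter was bitter", 3))
```

For inputs as short as the ones here the difference is negligible, so the plain sorted version is likely the clearer choice.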