Question

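I'm trying to write a small script that looks at a string of text, removes the stop words, and then returns the 10 most common words in that string as a list.

Here is my code: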

from collections import Counter as c
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
description = ("This is some place holder text for a shop that sells shoes, coats and jumpers.  We sell lots of shoes but never sell t-shirts.  Please come to our shop if you want some jumpers")
description = ([word for word in description.lower().split() if word not in stop])
common_list = c(description)
top_ten = (common_list[:9])

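However, this gives me the error message unhashable type: slice. I think this is because common_list may not actually be a list. I'm new to Python, so please forgive me if this is a silly question.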

Answer 1

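common_list is a dictionary (a Counter is a dict subclass) and cannot be sliced, so common_list[:9] does not work. You will probably have to convert common_list into an actual list and sort it by number of occurrences.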
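As a rough sketch of that suggestion (the word list here is made up for illustration and is not the asker's data), the conversion and sort could look like this:

```python
from collections import Counter

# Illustrative input, not the asker's actual description.
words = ["shoes", "coats", "shoes", "jumpers", "shoes", "jumpers"]
common_list = Counter(words)

# A Counter is a dict subclass, so first turn its (word, count) pairs into a real list.
pairs = list(common_list.items())

# Sort by count, highest first; a list can then be sliced normally.
pairs.sort(key=lambda pair: pair[1], reverse=True)
top_ten = pairs[:10]

print(top_ten)  # [('shoes', 3), ('jumpers', 2), ('coats', 1)]
```

Note that the slice is [:10] rather than [:9], since [:9] would return only nine items.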

Answer 2

You can use the following one-liner:

top_ten = sorted(c(description).items(), key=lambda p:p[1])[::-1][:10]

Why?

You basically have a list of words:

description = ["cat", "fish", "cat", "cat", "dog", "dog"]

You can then use c() (that is, Counter) to get the count of each element, so c(description) gives:

Counter({'cat': 3, 'dog': 2, 'fish': 1})

We then need to sort this, which is done by sorting the (word, count) tuples on their second element using key=lambda p:p[1]. In our case this gives:

[('fish', 1), ('dog', 2), ('cat', 3)]

We then need to reverse it with [::-1] and take the first 10 elements with [:10], which leaves us with:

[('cat', 3), ('dog', 2), ('fish', 1)]

If you only want the words, just take the first element of each tuple in the top_ten list:

[i[0] for i in top_ten]
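A possible variation on the one-liner, shown only as a sketch: sorted accepts reverse=True, which avoids the separate [::-1] step. The names top_ten and words_only are just illustrative.

```python
from collections import Counter

description = ["cat", "fish", "cat", "cat", "dog", "dog"]

# Sort the (word, count) pairs by count, highest first, then keep at most 10.
top_ten = sorted(Counter(description).items(), key=lambda p: p[1], reverse=True)[:10]

# Keep only the words, dropping the counts.
words_only = [word for word, count in top_ten]

print(top_ten)     # [('cat', 3), ('dog', 2), ('fish', 1)]
print(words_only)  # ['cat', 'dog', 'fish']
```

One difference to be aware of: with reverse=True, words with equal counts keep their original relative order, whereas reversing afterwards with [::-1] flips the order of ties.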

Answer 3

This can be done with the most_common method of the Counter object, which makes it very simple:

top_ten = c(description).most_common(10)

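The documentation states:

Return a list of the n most common elements and their counts from the most common to the least.

So, since it returns the elements together with their counts and we only want the elements, we still need a list comprehension: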

top_ten = [i[0] for i in c(description).most_common(10)]
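Putting it together with the description from the question, a minimal end-to-end sketch might look like the following. To keep it runnable without downloading the NLTK corpus, it uses a small hand-written stop set, which is an assumption for illustration and not NLTK's actual English list. It also strips trailing commas and full stops, an extra step that is not in the original code, so that "shoes," and "shoes" are counted together.

```python
from collections import Counter

# Hypothetical stand-in for set(stopwords.words('english')); not the real NLTK list.
stop = {"this", "is", "some", "for", "a", "that", "and", "we",
        "of", "but", "to", "our", "if", "you"}

description = ("This is some place holder text for a shop that sells shoes, "
               "coats and jumpers.  We sell lots of shoes but never sell t-shirts.  "
               "Please come to our shop if you want some jumpers")

# Lowercase, split on whitespace, strip trailing punctuation (added assumption),
# and drop stop words.
words = [w.strip(".,") for w in description.lower().split()]
words = [w for w in words if w and w not in stop]

# most_common(10) returns (word, count) pairs; keep only the words.
top_ten = [word for word, count in Counter(words).most_common(10)]

print(top_ten)
```

With the real NLTK stop list the exact result may differ, since that list contains many more words than the stand-in used here.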

Python unhashable type: slice on a list
