Question

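I know there are several very similar answers on this exact problem, but none of them really answers my question.

I'm trying to remove a set of stopwords and punctuation from a list of words in order to do some basic natural language processing.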

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

text = "Hello there. I am currently typing Python. "
custom_stopwords = set(stopwords.words('english')+list(punctuation))

# split the text into a list of sentences
sentences = sent_tokenize(text)

# tokenize each sentence into a list of words (one list per sentence)
words = [word_tokenize(sentence) for sentence in sentences]
filtered_words = [word for word in words if word not in custom_stopwords]
print(filtered_words)

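This raises TypeError: unhashable type: 'list' on the filtered_words line. Why is this error thrown? I'm not supplying a list at all - I'm supplying a set?

Note: I have read the SO post on this exact error, but I still have the same question. The accepted answer gives this explanation:

Sets require their items to be hashable. Out of types predefined by Python only the immutable ones, such as strings, numbers, and tuples, are hashable. Mutable types, such as lists and dicts, are not hashable because a change of their contents would change the hash and break the lookup code.

I'm providing a set of strings here, so why is Python still complaining?

Edit: After reading the SO post in more detail, which suggests using tuples, I edited my collection object: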

custom_stopwords = tuple(stopwords.words('english'))

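I also realized that I had to flatten my list, because word_tokenize(sentence) applied to each sentence creates a list of lists, and punctuation is not filtered out properly (each element being tested is a list object, and no list is in custom_stopwords, which holds only strings).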
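The flattening step described above could be sketched roughly as follows. To keep the sketch runnable without NLTK data, words is a hand-written stand-in for the tokenizer output and custom_stopwords is a tiny illustrative set; both are assumptions, not the real NLTK results.

```python
from string import punctuation

# Hypothetical stand-in for [word_tokenize(s) for s in sentences],
# so the sketch runs without NLTK or its data files.
words = [["Hello", "there", "."],
         ["I", "am", "currently", "typing", "Python", "."]]

# Small illustrative stopword set (an assumption, not NLTK's English list).
custom_stopwords = set(["i", "am", "there"] + list(punctuation))

# Flatten the list of lists into a single list of word strings.
flat_words = [word for sentence in words for word in sentence]

# Every element is now a str, so the set membership test can hash it.
filtered_words = [w for w in flat_words if w.lower() not in custom_stopwords]
print(filtered_words)
```

Lowercasing before the lookup is a design choice of this sketch, made because stopword lists are usually stored in lowercase.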

However, this still begs the question - why are tuples considered hashable by Python, but a set of strings isn't? And why does the TypeError say list?

Answer 1

words is a list of lists, because word_tokenize() returns a list of words for each sentence.

When you run [word for word in words if word not in custom_stopwords], each word is actually of type list. To evaluate the condition word not in custom_stopwords, Python has to check whether word is "in the set", which means word needs to be hashed. That fails because lists are mutable containers and are not hashable in Python.
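A minimal sketch of that behavior, without NLTK: the names and sample values below are made up for illustration. It also shows why switching custom_stopwords to a tuple makes the error disappear: membership in a tuple is tested by comparing elements with ==, so nothing is hashed and the check simply returns False.

```python
# Illustrative values only; the real stopword collection would come from NLTK.
stopword_set = {"the", "a", "."}
stopword_tuple = ("the", "a", ".")

# One element of the unflattened words list: a list, not a str.
word = ["Hello", "there", "."]

# Set membership hashes the left operand first, and a list cannot be hashed.
try:
    word in stopword_set
    message = None
except TypeError as exc:
    message = str(exc)
print(message)

# Tuple membership compares element by element with ==, so no hashing occurs.
# The list equals none of the strings, so the test is silently False.
in_tuple = word in stopword_tuple
print(in_tuple)
```

So the tuple version does not raise, but it does not filter anything either until the list of lists is flattened.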

The existing posts on what "hashable" means, and on why mutable containers are not hashable, may help in understanding this.
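As a rough illustration of what "hashable" means in practice, the built-in hash() can be called on a few values. The helper is_hashable below is a hypothetical name introduced only for this sketch.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper for this sketch)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

results = {
    "str": is_hashable("hello"),
    "tuple of str": is_hashable(("a", "b")),
    "list": is_hashable(["a", "b"]),
    "dict": is_hashable({"a": 1}),
    # A tuple is only hashable if everything inside it is hashable too.
    "tuple holding a list": is_hashable((["a"],)),
}

for name, ok in results.items():
    print(name, ok)
```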

TypeError: unhashable type: 'list' when using a Python set of strings
