Question

doc_clean = []
stopwords_corpus = UrduCorpusReader('./data', ['stopwords-ur.txt'])    
stopwords = stopwords_corpus.words()
# print(stopwords)
for infile in (wordlists.fileids()):
    words = wordlists.words(infile)
    print(infile)
    #print(words)
    finalized_words = remove_urdu_stopwords(stopwords, words)
    print("\n==== WITHOUT STOPWORDS ===========\n")
    print(finalized_words)
    doc_clean.append(finalized_words)
fdist1 = FreqDist(doc_clean) 
print(fdist1)

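I am trying to compute the frequency of each word in my vocabulary. I have 10 documents. I first tokenized them and then removed stopwords. I read about frequency distributions in NLTK and tried to use FreqDist to count each term across these documents, but I get the error TypeError: unhashable type: 'list'.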

Answer 1

I assume you intend to build one list containing all the (cleaned) words, but this line appends each document's list as a single element of doc_clean:

doc_clean.append(finalized_words)

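Essentially, FreqDist counts the distinct elements of the list it is given, and it has to hash each element to do so. If those elements are themselves lists, which are unhashable, you get this error. To build a single flat list of the words from all documents, replace append() with extend():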

doc_clean.extend(finalized_words)
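
The difference can be reproduced with a minimal sketch. The token lists below are hypothetical stand-ins for the output of remove_urdu_stopwords, and the fallback to collections.Counter is an assumption for environments without NLTK installed (FreqDist is a Counter subclass, so the hashing behaviour is the same):

```python
try:
    from nltk import FreqDist  # the class used in the question
except ImportError:
    # Assumption: NLTK is not installed. FreqDist subclasses Counter,
    # so Counter reproduces the same hashing behaviour.
    from collections import Counter as FreqDist

# Hypothetical cleaned token lists for two documents
docs = [["a", "b", "a"], ["b", "c"]]

nested = []  # built with append(): a list of lists
flat = []    # built with extend(): one flat list of words
for tokens in docs:
    nested.append(tokens)
    flat.extend(tokens)

# Counting a list of lists fails because lists cannot be hashed
try:
    FreqDist(nested)
    append_error = None
except TypeError as exc:
    append_error = str(exc)

# Counting the flat list works as expected
fdist = FreqDist(flat)

print(append_error)
print(fdist.most_common())
```

If a separate distribution per document is wanted instead, calling FreqDist on each finalized_words list inside the loop should also avoid the error, since each call then receives a flat list of strings.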
