Question

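I am trying to do the following in Python:

Split a text document into sentences.
Split those sentences into words.
Remove stop words from the resulting set of words.

When I do the second step, I get a result like [['Hello', 'World'], ...] and so on. I understand (if I'm not mistaken) that I have a list of lists, i.e. a nested list, which is probably the cause of the error, but I don't know how to fix it.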

import nltk
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
file = open('C:/temp1/1.txt','r')
text = file.read() # read the contents of the text file into a variable
result1 = nltk.sent_tokenize(text)#split para into sentences
print "Split sentences are " 
print result1
tokenizer=WhitespaceTokenizer()
result2 = [tokenizer.tokenize(sent) for sent in result1]#obtains the splitted sentences with contractions
print "Split words in each sentences are "
print result2
english_stops=set(stopwords.words('english'))
result3=[word for word in result2 if word not in english_stops]
print result3
Error:
Split sentences are 
['Hello World.', "It's good to see you.", 'Thanks for buying this book.', "Can't is a contraction."]
Split words in each sentences are 
[['Hello', 'World.'], ["It's", 'good', 'to', 'see', 'yTraceback (most recent call last):
File "D:\Learn NLTK\import nltk.py", line 34, in <module>
result3=[word for word in result2 if word not in english_stops]
TypeError: unhashable type: 'list'
ou.'], ['Thanks', 'for', 'buying', 'this', 'book.'], ["Can't", 'is', 'a',   'contraction.']]

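Do I need a nested for loop to filter out the stop words? I have looked at other questions about the same error, but I am new to Python and could not work out a solution from them. Any help would be appreciated.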

Answer 1

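You are trying to check whether a list is a member of a set. Each element of result2 is itself a list of words, and lists are not hashable, hence your error. I am no expert on list comprehensions and others may be able to give a better answer, but given your example I would try the following: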

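result3 = []  # must exist before appending to it
for tokens in result2:      # each element of result2 is one sentence's word list
    for word in tokens:     # each word in that sentence
        if word not in english_stops:
            result3.append(word)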

(So the answer is yes: you do need to iterate over the nested list.)
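The same nested iteration can also be written as a list comprehension. The following is a minimal sketch rather than a definitive solution: the small english_stops set is a hypothetical stand-in for set(stopwords.words('english')) so that it runs without the NLTK data, and result2 is copied from the output in the question.

```python
# Hypothetical stand-in for set(stopwords.words('english')),
# used only so this sketch runs without downloading NLTK data.
english_stops = {"to", "for", "this", "is", "a"}

# Nested list of tokens, as printed in the question.
result2 = [
    ['Hello', 'World.'],
    ["It's", 'good', 'to', 'see', 'you.'],
    ['Thanks', 'for', 'buying', 'this', 'book.'],
    ["Can't", 'is', 'a', 'contraction.'],
]

# Flat list: the outer "for" walks the sentences, the inner "for" walks the words.
result3 = [word for tokens in result2 for word in tokens
           if word not in english_stops]

# Alternative that keeps one filtered list per sentence.
result3_by_sentence = [[word for word in tokens if word not in english_stops]
                       for tokens in result2]
```

Note that the whitespace tokenizer leaves punctuation attached ('you.', 'book.'), so such tokens only match a stop word if the punctuation is stripped first.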

Answer 2

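Because lists are not hashable, you cannot have a set of lists, nor test a list for membership in a set. There are several existing answers on this. You can look at TypeError: unhashable type

Python error: unhashable type list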
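The hashability point can be reproduced in isolation. This is a small sketch under stated assumptions: english_stops is a hypothetical three-word stand-in for the NLTK stop-word set, and the variable names are illustrative only.

```python
# Hypothetical stand-in for the NLTK stop-word set.
english_stops = {"to", "is", "a"}

# Membership tests on a set hash the candidate first; a list cannot be hashed.
try:
    ['Hello', 'World.'] in english_stops
    error_message = None
except TypeError as exc:
    error_message = str(exc)  # mentions "unhashable type"

# A string is hashable, so testing a single word works.
word_is_stop = 'is' in english_stops

# A tuple of strings is hashable too: no error, it is simply not a member.
tuple_is_stop = ('Hello', 'World.') in english_stops
```

This is why the filter has to test individual words, not whole sentence lists, against the stop-word set.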
