Question

I am using NLTK to remove stop words from a list element. Here is my code snippet:

dict1 = {}
for ctr, row in enumerate(cur.fetchall()):
    list1 = [row[0], row[1], row[2], row[3], row[4]]
    dict1[row[0]] = list1
    print ctr+1, "\n", dict1[row[0]][2]
    list2 = [w for w in dict1[row[0]][3] if not w in stopwords.words('english')]
    print list2

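The problem is that this not only removes the stop words but also removes characters from inside other words. For example, the 'i' in the word 'orientation' is removed because 'i' is a stop word. On top of that, list2 ends up holding characters instead of words, i.e. ['O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', '3', ' ', 'r', 'e', 'r', 'e', ' ', 'p', 'n', '\n', '\n', '\n', 'O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', 'n', ' ', 'r', 'e', 'r', 'e', ' ', 'r', 'p', 'l' ..., whereas I want it stored as ['Orientation', '....................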

Answer 1

First, make sure that list1 is a list of words, not an array of characters. Here is a code snippet you can adapt:

from nltk import word_tokenize
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')    # get english stop words

# test document
document = '''A moody child and wildly wise
Pursued the game with joyful eyes
'''

# first tokenize your document to a list of words
words = word_tokenize(document)
print(words)

# then remove all stop words (compare in lower case so 'A' matches 'a')
content = [w for w in words if w.lower() not in english_stopwords]
print(content)

The output will be:

['A', 'moody', 'child', 'and', 'wildly', 'wise', 'Pursued', 'the', 'game', 'with', 'joyful', 'eyes']
['moody', 'child', 'wildly', 'wise', 'Pursued', 'game', 'joyful', 'eyes']

Answer 2

First, your construction of list1 looks a bit peculiar to me. I think there is a more Pythonic solution:

list1 = row[:5]

Next, is there a reason you access row[3] through dict1[row[0]][3] rather than reading row[3] directly?

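Finally, assuming row is a list of strings, constructing list2 from row[3] iterates over every character, not every word. That is probably why you are filtering out 'i' and 'a' (and a few other characters).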
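To make the character-by-character behaviour concrete, here is a minimal sketch. The STOP set is a small hand-picked stand-in, not the NLTK list, chosen only so the example runs without downloading any corpus data:

```python
# Hypothetical stand-in for stopwords.words('english'); illustration only.
STOP = {"i", "a", "t", "o", "the", "is", "this", "of"}

text = "Orientation"

# Iterating over a string yields one character at a time, so every
# single-letter stop word is stripped out of the middle of the word.
chars = [w for w in text if w not in STOP]
print(chars)  # ['O', 'r', 'e', 'n', 'n']
```

The result matches the start of the list shown in the question, which suggests row[3] is a plain string rather than a list of words.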

The correct comprehension is:

list2 = [w for w in row[3].split(' ') if w not in stopwords.words('english')]

You have to split your string somehow, probably on whitespace. This takes:

'Hello, this is row3'

to

['Hello,', 'this', 'is', 'row3']

Iterating over that gives you whole words rather than individual characters.
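Putting the split and the filter together might look like the sketch below. The helper name remove_stop_words and the STOP_WORDS set are assumptions made for illustration; in the asker's code the set would presumably be built from stopwords.words('english'). Calling split() with no argument is a deliberate swap for split(' '): it also breaks on the newlines that appear in the asker's data.

```python
def remove_stop_words(text, stop_words):
    """Split text on any whitespace and drop words found in stop_words."""
    # Compare in lower case so that 'This' is treated like 'this'.
    return [w for w in text.split() if w.lower() not in stop_words]


# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS = {"this", "is", "the", "of", "a", "i"}

result = remove_stop_words("Hello, this is row3", STOP_WORDS)
print(result)  # ['Hello,', 'row3']
```

Note that punctuation stays attached to the words ('Hello,'), which is one reason a tokenizer such as word_tokenize may be the better choice for real text.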
