I am importing a CSV file with Pandas; its contents have the following format:
test = [
('the beer was good.', 'pos'),
('I do not enjoy my job', 'neg'),
("I ain't feeling dandy today.", 'neg'),
("I feel amazing!", 'pos'),
('Gary is a friend of mine.', 'pos'),
("I can't believe I'm doing this.", 'neg')
]
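I want to check whether any word from the stopword list appears in the test set defined above and, if so, remove it. However, when I try this, I just get the full list back unchanged. Here is my current code: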
df = pd.read_csv('test.csv', delimiter=',')
tlist = [tuple(x) for x in df.values]
tlist = [(x.lower(), y.lower()) for x,y in tlist]
def remove_stopwords(train_list):
    new_list = []
    for word in train_list:
        if word not in stopwords.words('english'):
            new_list.append(word)
    print new_list
remove_stopwords(tlist)
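I am trying to use the stopwords provided by the NLTK corpus. As I said, when I test this code with print(new_list), all that happens is that I get tlist back exactly as it was.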
Answer 0 (score: 2)
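@Vardan's point is absolutely right. There have to be two loops, one over the tuples and another over the actual sentence. But instead of working on the raw data (character by character), we can convert the string into tokens and check those against the stopwords.
The following code should work: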
from nltk.corpus import stopwords  # needs the NLTK 'stopwords' corpus to be downloaded
from nltk.tokenize import word_tokenize  # needs the NLTK Punkt tokenizer data
import pandas as pd

df = pd.read_csv('test.csv', delimiter=',')
tlist = [tuple(x) for x in df.values]
tlist = [(x.lower(), y.lower()) for x, y in tlist]

def remove_stopwords(train_list):
    stop_words = set(stopwords.words('english'))  # build the lookup set once
    new_list = []
    for word in train_list:
        total = ''  # take an empty buffer string
        word_tokens = word_tokenize(word[0])  # convert the first string in the tuple into tokens
        for txt in word_tokens:
            if txt not in stop_words:  # check each token against the stopwords
                total = total + ' ' + txt  # append to the buffer
        new_list.append((total.strip(), word[1]))  # append the buffer along with pos/neg to the list
    print(new_list)

remove_stopwords(tlist)
print(tlist)  # the original list is left unchanged
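As a side note on the approach above, stopwords.words('english') returns a list, so looking it up for every token repeats work. Here is a minimal sketch of the same idea that builds the stopword collection once and returns the result instead of printing it. STOP_WORDS is a small hand-written stand-in (an assumption for illustration, not NLTK's actual list), and str.split stands in for word_tokenize so the snippet runs without any NLTK data.

```python
# Hypothetical stand-in for set(stopwords.words('english')); NLTK's real list is much longer.
STOP_WORDS = {"the", "was", "i", "do", "not", "my", "is", "a", "of"}


def remove_stopwords(train_list, stop_words=STOP_WORDS):
    """Return (sentence without stopwords, label) tuples."""
    cleaned = []
    for sentence, label in train_list:
        # str.split is a simplification; word_tokenize would also split off punctuation.
        kept = [tok for tok in sentence.lower().split() if tok not in stop_words]
        cleaned.append((" ".join(kept), label))
    return cleaned


sample = [("the beer was good.", "pos"), ("I do not enjoy my job", "neg")]
result = remove_stopwords(sample)
print(result)
```

Returning a new list rather than printing inside the function makes it easier to reuse the cleaned data later, for example as classifier training input.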
Answer 1 (score: 1)
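The `word` in your for loop is actually a tuple, because tlist has the form [(a1, b1), (a2, b2)] (a list of tuples). So each whole tuple is being compared against the individual stopwords, and it never matches. You can see this if you run: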
def remove_stopwords(train_list):
    new_list = []
    for word in train_list:
        print(word)  # prints a whole tuple, e.g. ('the beer was good.', 'pos')
        if word not in stopwords.words('english'):
            new_list.append(word)
    print(new_list)
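To make that comparison concrete without NLTK, here is a tiny sketch. The stop_words list is a hypothetical miniature list used only for illustration.

```python
# Hypothetical miniature stopword list, used only to illustrate the comparison.
stop_words = ["the", "was", "i", "not"]

item = ("the beer was good.", "pos")

# The whole tuple is compared against each string, so this is always False.
tuple_is_stopword = item in stop_words

# An individual word taken from the sentence can match.
first_word = item[0].split()[0]
word_is_stopword = first_word in stop_words

print(tuple_is_stopword, word_is_stopword)
```

Because the tuple is never found in the stopword list, every tuple passes the `not in` test and ends up in new_list unchanged.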
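If you want to remove those words, you need at least two loops: one to iterate over the list and another to iterate over the words of each sentence. Something like this would work: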
def remove_stopwords(train_list):
    new_list = []
    for tl in train_list:
        # tl would be ('the beer was good.', 'pos')
        words = tl[0].split()
        for word in words:  # word will be 'the', 'beer', 'was', 'good.'
            if word not in stopwords.words('english'):
                new_list.append(word)
    print(new_list)
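Two things to keep in mind with the split() version: punctuation stays attached to the words (so 'good.' is not the same string as 'good'), and the pos/neg labels are dropped because all words go into one flat list. A minimal sketch that strips surrounding punctuation and keeps the label with each sentence follows. STOP_WORDS and the function name are hypothetical, chosen for illustration in place of NLTK's list.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list.
STOP_WORDS = {"the", "was", "is", "a", "of"}


def remove_stopwords_keep_labels(train_list, stop_words=STOP_WORDS):
    """Return (list of kept words, label) pairs."""
    result = []
    for sentence, label in train_list:
        kept = []
        for raw in sentence.split():
            word = raw.strip(string.punctuation)  # "good." -> "good"
            if word and word.lower() not in stop_words:
                kept.append(word)
        result.append((kept, label))
    return result


cleaned = remove_stopwords_keep_labels(
    [("the beer was good.", "pos"), ("Gary is a friend of mine.", "pos")]
)
print(cleaned)
```

Stripping only leading and trailing punctuation leaves contractions such as "can't" intact; whether that is desirable depends on how the stopword list itself spells them.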
Answer 2 (score: 0)
def remove_stopwords(train_list):
    stop_words = set(stopwords.words('english'))
    new_list = []
    for line, gr in train_list:
        for word in line.split():
            if word in stop_words:
                continue  # skip stopwords
            new_list.append(word)
    return new_list  # flat list of the remaining words
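def remove_stopwords(train_list):
    stop_words = set(stopwords.words('english'))
    new_list = []
    for line, gr in train_list:
        line = " %s " % line  # pad so words at either end also match " word "
        for word in line.split():
            if word in stop_words:
                line = line.replace(" %s " % word, ' ')
        new_list.append((line.strip(), gr))  # keep the label with the cleaned sentence
    return new_list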
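Replacing " word " with a space depends on the word being surrounded by spaces. One alternative is a regular expression with word boundaries, sketched below. STOP_WORDS and strip_stopwords are hypothetical names for illustration, and the small set stands in for NLTK's list.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list.
STOP_WORDS = {"the", "was", "is", "a", "of"}

# \b anchors each alternative at word boundaries; \s* also removes the following whitespace.
STOPWORD_RE = re.compile(
    r"\b(?:%s)\b\s*" % "|".join(re.escape(w) for w in sorted(STOP_WORDS))
)


def strip_stopwords(sentence):
    """Remove stopwords from a lowercased sentence, keeping everything else."""
    return STOPWORD_RE.sub("", sentence).strip()


first = strip_stopwords("the beer was good.")
second = strip_stopwords("gary is a friend of mine.")
print(first)
print(second)
```

Note that \b treats an apostrophe as a boundary, so a stopword such as "i" would also match inside "i'm"; a token-based approach handles such cases more predictably.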