How to remove stop words from an object in Python

Time: 2018-08-19 12:27:05

Tags: python nltk

I am working with complaint data that I imported into Python from a CSV file.

import pandas as pd
df = pd.DataFrame(compalints, columns=['issue_detail'])
df.head()

I tokenized the data with the word tokenizer:

from nltk.tokenize import word_tokenize
issue = df.issue_detail.apply(word_tokenize)
issue.head()

After tokenization the data looks like this:

0    [I, have, outdated, information, on, my, credi...
1    [This, company, refuses, to, provide, me, veri...
2    [Need, to, move, into, a, XXXX, facility, ., C...
3    [I, wrote, Equifax, over, 6, weeks, ago, ., Th...
4    [I, received, a, inquiry, alert, from, Experia...
Name: issue_detail, dtype: object

Now I am trying to remove the stop words from this data with the following code:

from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))

filtered_sentence = [w for w in issue if not w in stop_words]

When I run the filtered_sentence line, it raises an error:

TypeError: unhashable type: 'list'

I have tried all the approaches I could find on Stack Overflow, but nothing has worked so far.

Can anyone suggest how to remove the stop words from this data?

1 Answer:

Answer 0: (score: 0)

Use apply with a list comprehension:

df['issue_detail'] = df.issue_detail.apply(word_tokenize)

f = lambda issue: [w for w in issue if w not in stop_words]
df['issue_detail'] = df.issue_detail.apply(f)

Or use a nested list comprehension:

df['issue_detail'] = df.issue_detail.apply(word_tokenize)

df['issue_detail'] = [[w for w in issue if w not in stop_words] for issue in df.issue_detail]
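
As a hedged aside on why the comprehension in the question fails: iterating over the tokenized Series yields one token list per row, so the membership test against the set has to hash a whole list rather than a single word. The sketch below reproduces this with plain Python lists standing in for the Series and a small made-up stop word set; both are assumptions for illustration only.

```python
# Stand-in for the tokenized Series: one token list per row (illustrative data).
issue = [["I", "have", "outdated"], ["This", "company", "refuses", "to"]]
# Small hypothetical stop word set, standing in for stopwords.words("english").
stop_words = {"I", "to"}

# Pattern from the question: w is a whole row (a list), and testing
# membership in a set requires hashing it, which lists do not support.
try:
    filtered = [w for w in issue if w not in stop_words]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Fix: loop over the rows first, then over the tokens inside each row.
filtered = [[w for w in row if w not in stop_words] for row in issue]

print(error_message)
print(filtered)
```

Both versions in this answer do the same thing as the last comprehension: the outer step visits each row and the inner step tests individual tokens.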

Sample:

d = {'issue_detail':[['I', 'have', 'outdated'],
['This', 'company', 'refuses', 'to']]}
df = pd.DataFrame(data=d)
print (df)
                   issue_detail
0           [I, have, outdated]
1  [This, company, refuses, to]

stop_words = set(['I','to'])
f = lambda issue: [w for w in issue if w not in stop_words]
df['issue_detail'] = df.issue_detail.apply(f)
print (df)
               issue_detail
0          [have, outdated]
1  [This, company, refuses]
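
One caveat when replacing the sample set with NLTK's English list: as far as I know its entries are all lowercase, so capitalized tokens such as 'I' or 'This' would not match and would stay in the output. A minimal sketch of a case-insensitive filter follows. `remove_stop_words` is a hypothetical helper name, and the small set below is only a stand-in for `set(stopwords.words("english"))`.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [w for w in tokens if w.lower() not in stop_words]


# Stand-in for set(stopwords.words("english")); lowercase entries (assumed subset).
stop_words = {"i", "have", "this", "to"}

# Token lists shaped like the rows of the sample DataFrame.
rows = [["I", "have", "outdated"], ["This", "company", "refuses", "to"]]

# Compare each token in lowercase, but keep the original casing in the result.
cleaned = [remove_stop_words(row, stop_words) for row in rows]
print(cleaned)
```

On the DataFrame from the question this could be applied with `df.issue_detail.apply(lambda tokens: remove_stop_words(tokens, stop_words))`.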