I'm working with complaint data imported into Python from a CSV.
import pandas as pd

df = pd.DataFrame(compalints, columns=['issue_detail'])
df.head()
I tokenized the data with a word tokenizer:
from nltk.tokenize import word_tokenize

issue = df.issue_detail.apply(word_tokenize)
issue.head()
The tokenized data looks like this:
0 [I, have, outdated, information, on, my, credi...
1 [This, company, refuses, to, provide, me, veri...
2 [Need, to, move, into, a, XXXX, facility, ., C...
3 [I, wrote, Equifax, over, 6, weeks, ago, ., Th...
4 [I, received, a, inquiry, alert, from, Experia...
Name: issue_detail, dtype: object
Now I'm trying to remove stopwords from this data using the following code:
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
filtered_sentence = [w for w in issue if not w in stop_words]
When I run the filtered_sentence part, it shows me this error:
TypeError: unhashable type: 'list'
I've already tried all the approaches listed on Stack Overflow, but so far nothing has worked.
Can anyone suggest how to remove the stopwords from this data?
Answer 0 (score: 0)
Use apply with a list comprehension. Your comprehension iterates over the Series itself, so each w is a whole list of tokens rather than a word, and a list is unhashable, which is why the set lookup raises the TypeError; the filtering has to happen inside each row:
df['issue_detail'] = df.issue_detail.apply(word_tokenize)
f = lambda issue: [w for w in issue if w not in stop_words]
df['issue_detail'] = df.issue_detail.apply(f)
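For illustration, here is a minimal sketch of the failure with a made-up two-row Series (the sample data is hypothetical; the behaviour is exactly the error from the question):

import pandas as pd

# Each element is a whole list of tokens, mirroring the tokenized column above.
issue = pd.Series([['I', 'have', 'outdated'], ['This', 'company', 'refuses', 'to']])
stop_words = {'I', 'to'}

try:
    # w is a list here, not a word, and a list cannot be hashed for the set lookup.
    filtered_sentence = [w for w in issue if w not in stop_words]
except TypeError as e:
    print(e)  # unhashable type: 'list'

Filtering inside apply (as above) keeps w as an individual token, which is what the set membership test expects.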
Or a nested list comprehension:
df['issue_detail'] = df.issue_detail.apply(word_tokenize)
df['issue_detail'] = [[w for w in issue if w not in stop_words] for issue in df.issue_detail]
Example:
d = {'issue_detail': [['I', 'have', 'outdated'],
                      ['This', 'company', 'refuses', 'to']]}
df = pd.DataFrame(data=d)
print(df)
issue_detail
0 [I, have, outdated]
1 [This, company, refuses, to]
stop_words = set(['I','to'])
f = lambda issue: [w for w in issue if w not in stop_words]
df['issue_detail'] = df.issue_detail.apply(f)
print(df)
issue_detail
0 [have, outdated]
1 [This, company, refuses]
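To tie this back to the question's setup with the real NLTK stopword list, here is a rough end-to-end sketch (the two complaint strings are made up, the nltk.download calls are only needed once, and since NLTK's English stopword list is all lowercase the lookup compares w.lower()):

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # tokenizer models ('punkt_tab' on newer NLTK releases)
nltk.download('stopwords')  # the English stopword list

# Made-up complaint texts standing in for the CSV rows.
complaints = ['I have outdated information on my credit report.',
              'This company refuses to provide me verification.']
df = pd.DataFrame(complaints, columns=['issue_detail'])

stop_words = set(stopwords.words('english'))

# Tokenize each row, then drop stopwords row by row;
# compare against w.lower() because the NLTK list is lowercase.
df['issue_detail'] = df.issue_detail.apply(word_tokenize)
df['issue_detail'] = df.issue_detail.apply(
    lambda issue: [w for w in issue if w.lower() not in stop_words])

print(df)

Skipping the lowercasing would leave capitalized tokens such as 'I' and 'This' in place, since they do not match the lowercase entries in the stopword set.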