Question

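I'm fairly new to Python and the programming community, so please excuse a relatively simple question: I want to filter out stop words before processing a CSV file. However, I need the stop words "this" / "these" to remain in the final text.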

After importing the NLTK stop words in Python and defining them as

stopwords = set(stopwords.words('english'))

...how can I modify this set so that "this" / "these" are kept in?

I know I could manually list every word except these two, but I'm looking for a more elegant solution.

Answer 1

If you want "this" and "these" to be kept rather than filtered out, just remove them from the default stop word list:

new_stopwords = set(stopwords.words('english')) - {'this', 'these'}

Alternatively,

to_remove = ['this', 'these']
new_stopwords = set(stopwords.words('english')).difference(to_remove)

set.difference() accepts any iterable.
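For illustration, here is a minimal sketch of the same idea applied to a list of tokens. The helper name `stopwords_without` and the short `sample` list are hypothetical stand-ins so the snippet runs without downloading NLTK data; with NLTK installed and the corpus downloaded, `stopwords.words('english')` from `nltk.corpus` would take the place of `sample`.

```python
def stopwords_without(stopword_list, keep):
    """Return the stop words as a set, minus the words in `keep`."""
    # set.difference accepts any iterable, so `keep` can be a list, tuple, or set.
    return set(stopword_list).difference(keep)


# Hypothetical stand-in for the NLTK list, so the sketch runs on its own.
# With NLTK: from nltk.corpus import stopwords; sample = stopwords.words('english')
sample = ['this', 'these', 'the', 'a', 'is']

new_stopwords = stopwords_without(sample, ['this', 'these'])

# Filter a token list: "this" survives because it is no longer a stop word.
tokens = ['this', 'is', 'the', 'report']
filtered = [t for t in tokens if t.lower() not in new_stopwords]
print(filtered)  # ['this', 'report']
```

Building the set once and reusing it keeps each membership check cheap, which matters when filtering many rows of a CSV file.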

How can I modify the NLTK stop word list in Python?
