Question

I am trying to filter the stop words out of my text like this:

clean = ' '.join([word for word in text.split() if word not in (stopwords)])

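The problem is that text.split() produces elements such as 'word.' that do not match the stop word 'word'.

I later pass clean to sent_tokenize(clean), so I don't want to get rid of the punctuation altogether.

How can I filter out stop words while keeping the punctuation, so that tokens like 'word.' are filtered as well?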

I thought I could pad the punctuation with spaces:

text = text.replace('.',' . ')

and then split and filter as before.

But is there a better way?
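For reference, the padding idea from the question can be generalized to several punctuation marks with a single regular expression. This is only a minimal sketch: the function name `filter_with_padding` and the chosen set of punctuation characters are assumptions made for illustration, not part of the original post.

```python
import re

def filter_with_padding(text, stopwords):
    # Surround each punctuation mark with spaces so that 'it.' splits
    # into the two tokens 'it' and '.'.
    padded = re.sub(r'([.,!?;:])', r' \1 ', text)
    # split() collapses the extra whitespace introduced by the padding.
    return ' '.join(w for w in padded.split() if w not in stopwords)

print(filter_with_padding('You have to work for it. Now quiet!', {'to', 'for'}))
```

Note that this also splits things like decimals ('3.5') and abbreviations at every period, which is one reason a real tokenizer tends to be more robust.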

Answer 1

Tokenize the text first, then remove the stop words. A tokenizer usually recognizes punctuation and emits it as separate tokens.

import nltk  # the tokenizers below require the NLTK Punkt tokenizer data

text = ('Son, if you really want something in this life, '
        'you have to work for it. Now quiet! They are about '
        'to announce the lottery numbers.')

stopwords = ['in', 'to', 'for', 'the']

sents = []

for sent in nltk.sent_tokenize(text):
    # word_tokenize splits punctuation off into its own tokens
    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join([w for w in tokens if w not in stopwords]))

print(sents)

['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
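Because ' '.join() puts a space between every token, the joined sentences carry a space before each punctuation mark. If that matters for later processing, the punctuation can be reattached afterwards. The following is a rough sketch; the helper name `reattach_punctuation` and the list of punctuation characters are assumptions for illustration.

```python
import re

def reattach_punctuation(sentence):
    # Drop the whitespace that ' '.join() leaves in front of
    # punctuation tokens such as ',' or '.'.
    return re.sub(r'\s+([.,!?;:])', r'\1', sentence)

joined = ['Son , if you really want something this life , you have work it .',
          'Now quiet !']
print([reattach_punctuation(s) for s in joined])
```

This handles only the simple marks listed in the pattern; quotes and brackets would need their own rules.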

Answer 2

You could use something like this:

import re

clean = ' '.join([word for word in text.split() if re.match('([a-z]|[A-Z])+', word).group().lower() not in (stopwords)])

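It keeps only the leading run of lowercase and uppercase ASCII letters in each word, so trailing punctuation is ignored, and matches that against the words in your stopwords set or list. It also assumes that all the words in stopwords are lowercase, which is why I convert each word to lowercase. Take that out if I assumed too much.

Also, I am not proficient with regular expressions, so apologies if there is a cleaner or more robust way.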
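One caveat with the one-liner: re.match returns None when a token does not start with an ASCII letter (for example '42' or '"quoted'), so calling .group() on it raises AttributeError. A guarded variant might look like the sketch below. The function name `strip_stopwords` and the choice to keep tokens without leading letters are assumptions, not something the answer specifies.

```python
import re

def strip_stopwords(text, stopwords):
    kept = []
    for word in text.split():
        match = re.match(r'[A-Za-z]+', word)
        # Tokens with no leading ASCII letters (e.g. '42') have no core
        # to compare, so they are kept unchanged.
        core = match.group().lower() if match else None
        if core not in stopwords:
            kept.append(word)
    return ' '.join(kept)

print(strip_stopwords('Work for it. To win, wait for 42 days!', {'to', 'for', 'it'}))
```

Because the whole token is dropped when its core is a stop word, a stop word at the end of a sentence takes its punctuation with it ('it.' disappears entirely in the example), which may affect later sentence splitting.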

Filtering stop words near punctuation