Question

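I have a set of stop words that I want to remove from the content I parse. The list is quite exhaustive and contains many pronouns and other common words such as was, being, our, etc., but unfortunately also i, {{1 }, a and others.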

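I want to remove all of these stop words, but only when they are surrounded by whitespace (including tabs and newlines).

I think a regular expression is needed here, but is it possible to build a regular expression that contains a variable?

Since I am doing this in Python, I would have something like: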

just

Is this feasible? If so, what would the regular expression be?

Answer 1

You can wrap the word between two \b anchors:

>>> import re
>>> txt = "this is a test and retest"
>>> re.sub(r'\btest\b', '****', txt)
'this is a **** and retest'

From the documentation for \b:

Matches the empty string, but only at the beginning or end of a word. ... This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
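To address the "regex with a variable" part of the question, here is a minimal sketch that builds one pattern from a whole list. The list contents and the names `stop_words` and `remove_stop_words` are made up for illustration. Note that \b also matches next to punctuation, so this is broader than a strict whitespace-only rule.

```python
import re

# Hypothetical stop-word list, for illustration only.
stop_words = ["was", "being", "our", "i", "a"]

# re.escape guards against regex metacharacters inside a word;
# wrapping the alternation in \b...\b restricts matches to whole words.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b")

def remove_stop_words(text):
    """Delete whole-word stop words; surrounding whitespace is left in place."""
    return pattern.sub("", text)

result = remove_stop_words("our cat was being a pest")
```

The removed words leave their surrounding spaces behind, so `" ".join(result.split())` can be used afterwards if collapsed whitespace is wanted.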

Answer 2

(?:^|\s)your_word(?:\s|$)

This should work for you. Use it with re.sub.

re.sub(r"(?:^|\s)word(?:\s|$)", " ", text)  # text is the input string
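One caveat: the pattern above consumes the whitespace on both sides of the word, so two occurrences separated by a single space cannot both match. A possible alternative, sketched here with a hypothetical helper `remove_word`, uses lookarounds that check for whitespace without consuming it.

```python
import re

def remove_word(text, word):
    # (?<!\S): not preceded by a non-whitespace character
    #          (so: start of string, or whitespace before the word)
    # (?!\S):  not followed by a non-whitespace character
    #          (so: end of string, or whitespace after the word)
    pattern = r"(?<!\S)" + re.escape(word) + r"(?!\S)"
    return re.sub(pattern, "", text)

cleaned = remove_word("word word\tend", "word")
```

Because the lookarounds are zero-width, tabs and newlines around the word are kept as they were, and a word followed by punctuation (for example "word,") is left alone.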

Answer 3

You can do it like this, without a regular expression:

[ x for x in "hello how are you".split() if x not in stop_words ]

where stop_words is your list of stop words.
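As a small sketch of how the comprehension above might be wrapped up, the version below stores the stop words in a set for fast membership tests and joins the result back into a string. The set contents and the name `strip_stop_words` are assumptions for illustration. Since `split()` with no arguments splits on any run of whitespace, tabs and newlines are handled too, but the original spacing is not preserved.

```python
# Hypothetical stop-word set, for illustration only.
stop_words = {"was", "being", "our", "i", "a"}

def strip_stop_words(text):
    # split() with no arguments splits on spaces, tabs and newlines alike.
    kept = [w for w in text.split() if w not in stop_words]
    # Rejoin with single spaces; the original whitespace is normalized.
    return " ".join(kept)

cleaned = strip_stop_words("our cat\twas\nbeing a pest")
```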

Also take a look at nltk:

>>> import nltk
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> text = "hello how are you, I am fine"
>>> words = nltk.word_tokenize(text)
>>> words 
['hello', 'how', 'are', 'you', ',', 'I', 'am', 'fine']
>>> [x for x in words if x not in stop]
['hello', ',', 'I', 'fine']
>>> " ".join([x for x in words if x not in stop])
'hello , I fine'

Answer 4

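I eventually realized that a regular expression was overkill for what I wanted to do, since there will usually be only a single space around the words I want to remove.

In the end, I just went with this: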

for word in commonWords:
    text = text.replace(' '+word+' ', ' ')

This replaces every occurrence of a word from the given set, but only when the word is not contained within another word.
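Two cases the loop above may miss are a stop word at the very start or end of the text, and two stop words in a row that share a single space. A possible variation, assuming single spaces as in this answer and using the hypothetical name `remove_common_words`, pads the text and repeats until nothing changes:

```python
def remove_common_words(text, common_words):
    # Pad with spaces so words at the start and end are also space-delimited.
    padded = " " + text + " "
    changed = True
    while changed:
        changed = False
        for word in common_words:
            target = " " + word + " "
            if target in padded:
                # Each replacement shortens the string, so the loop terminates.
                padded = padded.replace(target, " ")
                changed = True
    return padded.strip()

cleaned = remove_common_words("a a cat was", ["a", "was"])
```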
