Question

I have a list containing many sentences. I want to iterate through the list and remove words such as "and", "the", "a", and "are" from every sentence.

I tried this:

def removearticles(text):
    articles = {'a': '', 'an': '', 'and': '', 'the': ''}
    # str.replace matches substrings, so this also strips letters inside words.
    for i, j in articles.items():
        text = text.replace(i, j)
    return text

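However, as you can probably tell, this also removes "a" and "an" when they appear in the middle of a word. I need to remove only the instances that are standalone words delimited by spaces, not the ones inside other words. What is the most efficient way to do this?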

Answer 1

I would go with a regular expression, something like:

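import re

def removearticles(text):
    # Raw strings keep \1 and \3 as group references; both whitespace runs are kept.
    return re.sub(r'(\s+)(a|an|and|the)(\s+)', r'\1\3', text)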

Or, if you also want to remove the leading whitespace:

def removearticles(text):
    # Only the trailing whitespace (group 2) is kept.
    return re.sub(r'\s+(a|an|and|the)(\s+)', r'\2', text)
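
One caveat with whitespace-based patterns like these: each match consumes the whitespace on both sides, so a listed word at the very start or end of the string, or the second of two listed words in a row, is left in place. Below is a minimal sketch of one way around that using lookarounds. The name `remove_articles_ws` and the final whitespace-collapsing step are my own assumptions, not part of the answer above.

```python
import re

# Match a listed word only when it is delimited by whitespace or the string edges.
# (?<!\S) = not preceded by a non-space character; (?!\S) = not followed by one.
_ARTICLES = re.compile(r'(?<!\S)(?:a|an|and|the)(?!\S)')

def remove_articles_ws(text):
    """Remove whitespace-delimited articles, then tidy the leftover spaces."""
    stripped = _ARTICLES.sub('', text)
    # Collapse the runs of whitespace left behind and trim both ends.
    return ' '.join(stripped.split())
```

Because the lookarounds do not consume any characters, consecutive words such as "and a" are both removed. A word followed directly by punctuation (for example "the,") is not matched by this version.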

Answer 2

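This looks more like an NLP job than something for a plain regex. I would check out NLTK (http://www.nltk.org/); IIRC it comes with a corpus full of filler words like the ones you are trying to get rid of.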
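
A rough sketch of how that might look, assuming NLTK is installed and its `stopwords` corpus has been fetched once with `nltk.download('stopwords')`. The helper names `load_nltk_stopwords` and `remove_stopwords` are hypothetical, and the filtering itself is plain Python that works with any collection of words.

```python
def load_nltk_stopwords(language='english'):
    """Return NLTK's stop word list for `language` as a set.

    Assumes NLTK is installed and the 'stopwords' corpus is available locally.
    """
    # Imported lazily so remove_stopwords() can be used without NLTK.
    from nltk.corpus import stopwords
    return set(stopwords.words(language))

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated word found in stop_words (case-insensitive)."""
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)
```

Typical use would be `remove_stopwords(sentence, load_nltk_stopwords())`. NLTK's English list is much broader than just articles, so it may remove more words than you intend.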

Answer 3

Try something along the lines of:

def removearticles(text):
    articles = ['and', 'a']
    newText = ''
    for word in text.split(' '):
        if word not in articles:
            newText += word + ' '
    # Drop the trailing space added by the loop.
    return newText[:-1]

Answer 4

def removearticles(text):
    articles = {'a': '', 'an': '', 'and': '', 'the': ''}
    rest = []
    for word in text.split():
        if word not in articles:
            rest.append(word)
    return ' '.join(rest)

The in operator runs faster on a dict than on a list.
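
Since the dict values are never used, a set gives the same hash-based lookup and may read more clearly. The sketch below is my own variation, not the answer's code: `remove_articles_set` and `compare_lookup_speed` are hypothetical names, and the timing helper is only there so you can measure the difference on your own machine.

```python
import timeit

ARTICLES_LIST = ['a', 'an', 'and', 'the']
ARTICLES_SET = frozenset(ARTICLES_LIST)

def remove_articles_set(text, articles=ARTICLES_SET):
    """Keep only the whitespace-separated words that are not in `articles`."""
    return ' '.join(word for word in text.split() if word not in articles)

def compare_lookup_speed(number=100000):
    """Time a failed membership test on the list and on the set.

    Returns (list_seconds, set_seconds) for `number` lookups each.
    """
    list_time = timeit.timeit(lambda: 'zebra' in ARTICLES_LIST, number=number)
    set_time = timeit.timeit(lambda: 'zebra' in ARTICLES_SET, number=number)
    return list_time, set_time
```

With only four entries the difference is likely to be small; the hash-based lookup matters more as the word list grows.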

Answer 5

It can be done using a regex. Iterate through your strings (or ''.join the list and send it as one string) and apply the following regex.

>>> import re
>>> rx = re.compile(r'\ban\b|\bthe\b|\band\b|\ba\b')
>>> rx.sub(' ','a line with lots of an the and a baad')
'  line with lots of         baad'
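
To apply the same idea to a whole list of sentences and tidy up the leftover spaces, something like the following could work. This is a sketch under my own assumptions: the `ARTICLES` tuple and the `clean_sentences` name are illustrative, and the pattern is built from the tuple rather than written out by hand.

```python
import re

ARTICLES = ('a', 'an', 'and', 'the')
# \b restricts matches to whole words; re.escape guards against special characters.
_RX = re.compile(r'\b(?:' + '|'.join(map(re.escape, ARTICLES)) + r')\b')

def clean_sentences(sentences):
    """Remove the listed words from each sentence and collapse leftover spaces."""
    return [' '.join(_RX.sub(' ', s).split()) for s in sentences]
```

For example, `clean_sentences(['a line with lots of an the and a baad'])` gives `['line with lots of baad']`. Unlike the split-based answers, `\b` also matches a word that sits next to punctuation.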

Remove all articles, connector words, etc. from a string in Python
