Question

I have to match all the alphanumeric words in a text.

>>> import re
>>> text = "hello world!! how are you?"
>>> final_list = re.findall(r"[a-zA-Z0-9]+", text)
>>> final_list
['hello', 'world', 'how', 'are', 'you']
>>>

This works fine, but I also have a few words to negate, i.e. words that should not appear in my final list.

>>> negate_words = ['world', 'other', 'words']

A bad way to do it:

>>> negate_str = '|'.join(negate_words)
>>> list(filter(lambda x: not re.match(negate_str, x), final_list))
['hello', 'how', 'are', 'you']

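But I could save one loop if my very first regex pattern could be changed to take the negation of those words into account. I found ways to negate characters, but I have whole words to negate. I also found regex lookbehind in other questions, but that doesn't help either.

Can this be done with Python's re?

Update

My text can span several lines. Also, the negate_words list may be long.

Considering this, is using a regex for such a task the right approach in the first place? Any suggestions?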

Answer 1

I don't think there is a clean way to do this with regular expressions. The closest I could find is a bit ugly and not exactly what you wanted:

>>> re.findall(r"\b(?:world|other|words)|([a-zA-Z0-9]+)\b", text)
['hello', '', 'how', 'are', 'you']
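For comparison, a negative lookahead can fold the exclusion into the pattern itself. This is only a sketch: the helper name `find_words_excluding` is made up for illustration, and the pattern assumes a "word" is a maximal run of `[a-zA-Z0-9]`, as in the question. Whether it counts as clean is debatable, and with a long `negate_words` list the pattern grows accordingly.

```python
import re

def find_words_excluding(text, negate_words):
    """Return alphanumeric runs in text that are not in negate_words.

    Hypothetical helper for illustration; not part of the re module.
    """
    if not negate_words:
        return re.findall(r"[a-zA-Z0-9]+", text)
    # Escape each word so regex metacharacters are taken literally.
    alternation = "|".join(re.escape(w) for w in negate_words)
    pattern = (
        r"(?<![a-zA-Z0-9])"                              # start of a run
        r"(?!(?:" + alternation + r")(?![a-zA-Z0-9]))"   # run is not exactly a negated word
        r"[a-zA-Z0-9]+"                                  # the run itself
    )
    return re.findall(pattern, text)

text = "hello world!! how are you?"
print(find_words_excluding(text, ["world", "other", "words"]))
# ['hello', 'how', 'are', 'you']
```

Because the lookahead requires the negated word to be followed by a non-alphanumeric character (or the end of the text), a longer word such as "worldwide" is still kept.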

Why not use Python's sets instead? They are very fast:

>>> list(set(final_list) - set(negate_words))
['hello', 'how', 'are', 'you']

If order matters, see the reply from @glglgl below. His list comprehension version is very readable. Here is a fast but less readable equivalent using itertools:

>>> import itertools
>>> negate_words_set = set(negate_words)
>>> # Python 3 name; on Python 2 this function is itertools.ifilterfalse
>>> list(itertools.filterfalse(negate_words_set.__contains__, final_list))
['hello', 'how', 'are', 'you']
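The list comprehension approach mentioned above is not reproduced in this thread, so here is a minimal sketch of what such a version might look like (not necessarily identical to the one referred to). It also shows how it differs from the set difference: the comprehension keeps the original order and any duplicates, while the set difference does neither. The extra 'hello' in `final_list` is added here only to make that visible.

```python
final_list = ['hello', 'world', 'how', 'are', 'you', 'hello']
negate_words = ['world', 'other', 'words']

# Build the set once so each membership test is O(1) on average.
negate_words_set = set(negate_words)

# Order-preserving filter: keeps duplicates and the original sequence.
filtered = [w for w in final_list if w not in negate_words_set]
print(filtered)  # ['hello', 'how', 'are', 'you', 'hello']

# Set difference: duplicates collapse and iteration order is arbitrary,
# so the result is sorted here just to get a stable display.
unordered = set(final_list) - negate_words_set
print(sorted(unordered))  # ['are', 'hello', 'how', 'you']
```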

Another option is to build the word list in a single pass using re.finditer:

>>> negate_words_set = set(negate_words)
>>> result = []
>>> for mo in re.finditer(r"[a-zA-Z0-9]+", text):
...     word = mo.group()
...     if word not in negate_words_set:
...         result.append(word)
...
>>> result
['hello', 'how', 'are', 'you']

Answer 2

Maybe it's worth trying pyparsing for this:

>>> from pyparsing import *

>>> negate_words = ['world', 'other', 'words']
>>> parser = OneOrMore(Suppress(oneOf(negate_words)) ^ Word(alphanums)).ignore(CharsNotIn(alphanums))
>>> parser.parseString('hello world!! how are you?').asList()
['hello', 'how', 'are', 'you']

Note that oneOf(negate_words) must come before Word(alphanums) so that it gets matched first.

Edit: Just for fun, I repeated the exercise using lepl (also an interesting parsing library):

>>> from lepl import *

>>> negate_words = ['world', 'other', 'words']
>>> parser = OneOrMore(~Or(*negate_words) | Word(Letter() | Digit()) | ~Any())
>>> parser.parse('hello world!! how are you?')
['hello', 'how', 'are', 'you']

Answer 3

Don't ask more of regexes than they are good for. Instead, think of generators.

import re

unwanted = ('world', 'other', 'words')

text = "hello world!! how are you?"

gen = (m.group() for m in re.finditer("[a-zA-Z0-9]+",text))
li = [ w for w in gen if w not in unwanted ]

You could also create a generator instead of the list li.
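As a rough sketch of that generator variant, the filtering can be wrapped in a generator function so that words are produced lazily, which may suit text spanning many lines. The function name `iter_wanted_words` is made up for illustration, and converting `unwanted` to a frozenset is an added assumption to keep lookups cheap when the list of unwanted words is long.

```python
import re

def iter_wanted_words(text, unwanted):
    """Yield alphanumeric words from text that are not in unwanted."""
    # A frozenset gives O(1) average membership tests, unlike a tuple or list.
    unwanted = frozenset(unwanted)
    for m in re.finditer(r"[a-zA-Z0-9]+", text):
        word = m.group()
        if word not in unwanted:
            yield word

text = "hello world!!\nhow are you?"
for word in iter_wanted_words(text, ('world', 'other', 'words')):
    print(word)
```

Newlines need no special handling here, since the character class simply does not match them.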

Python regex: negating a list of words?
