Handling the '++' symbol in a Python regular expression

Time: 2011-11-28 12:25:44

Tags: python regex

I have a list of words, and I am building a list of compiled regex objects from it.

import re
text = 'This is word of spy++'
wl = ['spy++','cry','fpp']
# Compiling the unescaped word 'spy++' is what triggers the error described below.
regobjs = [re.compile(r"\b%s\b" % word.lower()) for word in wl]

for regobj in regobjs:
    print(re.search(regobj, text).group())

However, because of the ++ symbols, creating the regex object fails with an error (error: multiple repeat). How can I make the regex handle every word in the list?

Requirements:

The regex should detect the exact word in the given text, even if the word contains non-alphanumeric characters such as ++. The code above detects exact words, except those containing ++.
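For context, the error arises because `+` is a regex quantifier, so the second `+` in `spy++` is read as a repeat applied to another repeat. The sketch below (Python 3 syntax, variable names are illustrative) shows the unescaped word next to its `re.escape()` form. It assumes that interpreters before Python 3.11 raise `re.error` here, while 3.11 and later parse `++` as a possessive quantifier, which compiles but still does not mean two literal plus signs.

```python
import re

text = 'This is word of spy++'

# '+' is a quantifier, so the raw word is not a literal pattern.
# Before Python 3.11 this raises re.error ("multiple repeat"); from 3.11 on,
# '++' is parsed as a possessive quantifier, so it compiles but matches
# one or more 'y' rather than two literal plus signs.
try:
    unescaped = re.compile(r"spy++")
    unescaped_hit = unescaped.search(text).group()
except re.error as exc:
    unescaped_hit = None
    print("compile failed:", exc)

# re.escape() backslash-escapes the metacharacters so they match literally.
escaped = re.compile(re.escape("spy++"))
print(escaped.pattern)
print(escaped.search(text).group())
```

Either way, the unescaped pattern never matches the full word, which is why escaping is the first step in the answers.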

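3 Answers:

Answer 0: (score: 6)

Besides re.escape(), you also need to drop the \b word boundary before or after a non-alphanumeric character, otherwise the match will fail.

Something like this (not very elegant, but I hope it gets the point across):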

import re
words = 'This is word of spy++'
wl = ['spy++','cry','fpp']
regobjs = []

for word in wl:
    eword = re.escape(word.lower())
    # Add \b only next to a word character (letter, digit or underscore).
    if word[0].isalnum() or word[0] == "_":
        eword = r"\b" + eword
    if word[-1].isalnum() or word[-1] == "_":
        eword = eword + r"\b"
    regobjs.append(re.compile(eword))

for regobj in regobjs:
    match = regobj.search(words)
    # search() returns None when the word is absent, so guard before .group().
    if match:
        print(match.group())
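To see why the trailing \b has to go, here is a minimal sketch (Python 3 syntax, variable names are illustrative) comparing the escaped word with and without it:

```python
import re

text = 'This is word of spy++'

# \b matches only where a word character meets a non-word character or a
# string edge. After the final '+' there is just end-of-string, so no
# boundary exists there and the search fails.
with_boundary = re.search(r"\bspy\+\+\b", text)

# Dropping the trailing \b (the leading one is fine, since 's' is a word
# character) lets the same escaped word match.
without_boundary = re.search(r"\bspy\+\+", text)

print(with_boundary)
print(without_boundary.group())
```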

Answer 1: (score: 2)

You want \b where your word starts or ends with a letter, digit or underscore, and \B where it does not. This means you won't pick up spy++x, but you will pick up spy++. or even spy+++. If you want to exclude those last cases, things get more complicated.

>>> import re
>>> def match_word(word):
...     return re.compile("%s%s%s" % (
...         "\\b" if word[0].isalnum() or word[0]=='_' else "\\B",
...         re.escape(word.lower()),
...         "\\b" if word[-1].isalnum() or word[-1]=='_' else "\\B"))
...
>>> text = 'This is word of spy++'
>>> wl = ['spy++','cry','fpp', 'word']
>>> for word in wl:
...     match = re.search(match_word(word), text)
...     if match:
...         print(repr(match.group()))
...     else:
...         print("{} did not match".format(word))
...
'spy++'
cry did not match
fpp did not match
'word'
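The edge cases mentioned in this answer can be checked directly. The sketch below hard-codes the pattern that match_word('spy++') produces. The sample strings are illustrative and not part of the original answer.

```python
import re

# Same pattern that match_word('spy++') builds: \b before 's', \B after '+'.
pattern = re.compile(r"\bspy\+\+\B")

samples = ["spy++x", "spy++.", "spy+++", "a spy++ here"]
results = {sample: bool(pattern.search(sample)) for sample in samples}

for sample, matched in results.items():
    print(sample, matched)
```

Only spy++x is rejected, because 'x' is a word character and creates a boundary right after the last '+', which \B forbids.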

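Answer 2: (score: 1)

Sashi,

Your question is poorly worded and does not say what you actually want. People are then tempted to deduce what you want from your code, which leads to confusion.

I think you want to find occurrences of the words in the list wl only when they stand isolated in the string, that is, with no non-whitespace character directly before or after each occurrence.

If so, I propose the regex pattern in the code below: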

import re

ss = 'spy++ This !spy++ is spy++! word of spy++'
print(ss)
print([mat.start() for mat in re.finditer('spy', ss)])
print()

# Match only when preceded by whitespace or the string start,
# and followed by whitespace or the string end.
base = (r'(?:(?<=[ \f\n\r\t\v])|(?<=\A))'
        r'%s'
        r'(?=[ \f\n\r\t\v]|\Z)')

for x in ['spy++','cry','fpp']:
    print(x, [mat.start() for mat in re.finditer(base % re.escape(x), ss)])

Result

spy++ This !spy++ is spy++! word of spy++
[0, 12, 21, 36]

spy++ [0, 36]
cry []
fpp []
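The same "isolated word" idea can also be written with negative lookarounds. This is a swapped-in variant, not the answer's own pattern: the helper name isolated is hypothetical, and \S in a Python 3 str pattern is Unicode-aware, so it treats a slightly wider set of characters as whitespace than the explicit [ \f\n\r\t\v] class does.

```python
import re

def isolated(word):
    """Compile a pattern matching `word` only when it stands alone (hypothetical helper)."""
    # (?<!\S): not preceded by a non-whitespace character
    # (?!\S):  not followed by a non-whitespace character
    # Both also succeed at the string edges, so no \A or \Z branch is needed.
    return re.compile(r"(?<!\S)" + re.escape(word) + r"(?!\S)")

ss = 'spy++ This !spy++ is spy++! word of spy++'
positions = {x: [m.start() for m in isolated(x).finditer(ss)]
             for x in ['spy++', 'cry', 'fpp']}

for x, starts in positions.items():
    print(x, starts)
```

On the sample string this should report the same start positions as the result above.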