I have a list of words, and I am building a list of compiled regex objects from it:
import re

text = 'This is word of spy++'
wl = ['spy++', 'cry', 'fpp']
regobjs = [re.compile(r"\b%s\b" % word.lower()) for word in wl]
for regobj in regobjs:
    print(regobj.search(text).group())
But because of the ++ characters I get an error while building the regex objects (error: multiple repeat). How can I make the regex handle every word in the word list?
Requirements: the regex should detect the exact word in the given text, even when the word contains non-alphanumeric characters such as ++. The code above detects the exact words except the ones containing ++.
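For context, the failure can be reproduced in isolation: `+` is a regex repetition operator, so `spy++` parses as a repetition applied to another repetition, which the engine rejects. `re.escape()` neutralizes the metacharacters (a minimal sketch using the standard `re` module; the `\b` placement issue the answers below discuss still remains):

```python
import re

# Unescaped, the second '+' is a repetition applied to the
# repetition 'y+', which the regex engine rejects.
try:
    re.compile(r"\bspy++\b")
    compiled = True
except re.error as exc:
    compiled = False
    print(exc)  # the "multiple repeat" error from the question

# re.escape() backslash-escapes the metacharacters, so this compiles.
pattern = re.compile(r"\b" + re.escape("spy++") + r"\b")
print(pattern.pattern)
```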
Answer 0 (score: 6)
Besides re.escape(), you also need to drop the \b word boundary before/after non-alphanumeric characters, otherwise the match will fail. Something like this (not very elegant, but I hope it illustrates the point):
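To see why, a quick check (assuming the standard `re` module): a `\b` between an escaped `+` and whatever follows it is only a boundary if a word character is adjacent, so escaping alone is not enough:

```python
import re

escaped = re.escape("spy++")   # 'spy\\+\\+' -- compiles fine now
text = "This is word of spy++"

# The trailing \b sits between '+' and the end of the string; both
# sides are non-word positions, so there is no boundary and no match.
print(re.search(r"\b" + escaped + r"\b", text))       # None

# Dropping the trailing \b (the word ends in a non-word character)
# lets the search succeed.
print(re.search(r"\b" + escaped, text).group())       # spy++
```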
import re

words = 'This is word of spy++'
wl = ['spy++', 'cry', 'fpp']
regobjs = []

for word in wl:
    eword = re.escape(word.lower())
    # keep \b only next to a word character; a \b adjacent to a
    # non-word character can never match
    if word[0].isalnum() or word[0] == "_":
        eword = r"\b" + eword
    if word[-1].isalnum() or word[-1] == "_":
        eword = eword + r"\b"
    regobjs.append(re.compile(eword))

for regobj in regobjs:
    match = regobj.search(words)
    if match:  # 'cry' and 'fpp' do not occur in the text
        print(match.group())
Answer 1 (score: 2)
You want \b when your word starts or ends with a letter, digit, or underscore, and \B when it does not. This means you will not match spy++x, but you will match spy++. and even spy+++. Things get more complicated if you want to avoid that last case.
>>> import re
>>> def match_word(word):
...     return re.compile("%s%s%s" % (
...         "\\b" if word[0].isalnum() or word[0] == '_' else "\\B",
...         re.escape(word.lower()),
...         "\\b" if word[-1].isalnum() or word[-1] == '_' else "\\B"))
...
>>> text = 'This is word of spy++'
>>> wl = ['spy++', 'cry', 'fpp', 'word']
>>> for word in wl:
...     match = re.search(match_word(word), text)
...     if match:
...         print(repr(match.group()))
...     else:
...         print("{} did not match".format(word))
...
'spy++'
cry did not match
fpp did not match
'word'
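The spy++x / spy++. behaviour claimed above can be checked directly (a quick sketch restating match_word so the snippet is self-contained):

```python
import re

def match_word(word):
    # \b at an edge that is a word character, \B otherwise
    return re.compile("%s%s%s" % (
        "\\b" if word[0].isalnum() or word[0] == '_' else "\\B",
        re.escape(word.lower()),
        "\\b" if word[-1].isalnum() or word[-1] == '_' else "\\B"))

pat = match_word('spy++')
print(pat.search('see spy++x'))           # '+' to 'x' is a boundary, \B fails
print(pat.search('see spy++.').group())   # matches 'spy++'
print(pat.search('see spy+++').group())   # matches 'spy++'
```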
Answer 2 (score: 1)
Sashi,
Your question is badly worded: it does not express what you actually want. People are then tempted to deduce what you want from your code, and that leads to confusion.
I think you want to find the occurrences of the words in the list wl when they are fully isolated in the string, that is, when no non-whitespace character touches them on either side.
If so, I propose the regex pattern in the code below:
import re

ss = 'spy++ This !spy++ is spy++! word of spy++'
print(ss)
print([mat.start() for mat in re.finditer('spy', ss)])
print()

# match only when preceded by whitespace or the start of the string,
# and followed by whitespace or the end of the string
base = (r'(?:(?<=[ \f\n\r\t\v])|(?<=\A))'
        r'%s'
        r'(?=[ \f\n\r\t\v]|\Z)')

for x in ['spy++', 'cry', 'fpp']:
    print(x, [mat.start() for mat in re.finditer(base % re.escape(x), ss)])
Result:

spy++ This !spy++ is spy++! word of spy++
[0, 12, 21, 36]

spy++ [0, 36]
cry []
fpp []
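As a side note, the verbose alternation above has a common shorter equivalent: `(?<!\S)` and `(?!\S)` assert that the match is not touched by any non-whitespace character, covering both the whitespace and the start/end-of-string cases in a single assertion each (a sketch using the same test string):

```python
import re

ss = 'spy++ This !spy++ is spy++! word of spy++'

# (?<!\S) - not preceded by a non-space char (start of string or whitespace)
# (?!\S)  - not followed by a non-space char (end of string or whitespace)
base = r'(?<!\S)%s(?!\S)'

for x in ['spy++', 'cry', 'fpp']:
    print(x, [m.start() for m in re.finditer(base % re.escape(x), ss)])
```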