How do I use re.sub in Python to add tags around certain strings?

Time: 2010-11-19 02:07:27

Tags: python regex

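I'm trying to add tags around certain query strings, and the tags should wrap every matched string. For example, I want to wrap tags around all words of the query iphone games mac wherever they match in the sentence I love downloading iPhone games from my mac. The result should be I love downloading <em>iPhone games</em> from my <em>mac</em>.

So far I have tried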

import re

sentence = "I love downloading iPhone games from my mac."
query = r'((iphone|games|mac)\s*)+'
regex = re.compile(query, re.I)
sentence = regex.sub(r'<em>\1</em> ', sentence)

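The sentence then comes out as

I love downloading <em>games </em> from my <em>mac</em> .

Here \1 is replaced by only one word (games instead of iPhone games), and there are some unwanted spaces after the words. How should I write the regex to get the desired output? Thanks!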

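Edit: I just realized that both Fred's and Chris's solutions have a problem when a query term occurs inside a longer word. For example, if my query is game, the output becomes <em>game</em>s, whereas I want it left unhighlighted. Another example: the inside either should not be highlighted.

Edit 2: I adopted Chris's new solution and it works.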

2 answers:

Answer 0 (score: 5)

First, to get the spacing you want, replace \s* with \s*? so that it is non-greedy.

First fix:

>>> re.compile(r'(((iphone|games|mac)\s*?)+)', re.I).sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone</em> <em>games</em> from my <em>mac</em>.'

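Unfortunately, once \s* is non-greedy it splits the phrase, as you can see. With the greedy \s* (and without re.I, so the pattern spells iPhone exactly), the two words are grouped together, but the trailing space ends up inside the tag: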

>>> re.compile(r'(((iPhone|games|mac)\s*)+)').sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone games </em>from my <em>mac</em>.'

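I haven't figured out how to fix that yet.

Also note that in both of these I added an extra set of parentheses around the whole repeated group, including the +, so that \1 captures every repetition rather than only the last one. That is the difference from your pattern.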
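To make the point about the extra parentheses concrete, here is a minimal sketch that compares what group 1 holds with and without them. The variable names (inner, outer, m_inner, m_outer) are only illustrative and are not part of the original answer.

```python
import re

sentence = "I love downloading iPhone games from my mac."

# Group 1 is the repeated group itself: it keeps only its last iteration.
inner = re.compile(r'((iphone|games|mac)\s*)+', re.I)
# Group 1 wraps the whole repetition: it keeps the full run of words.
outer = re.compile(r'(((iphone|games|mac)\s*)+)', re.I)

m_inner = inner.search(sentence)
m_outer = outer.search(sentence)

print(repr(m_inner.group(0)))  # 'iPhone games ' (the whole match)
print(repr(m_inner.group(1)))  # 'games ' (last repetition only)
print(repr(m_outer.group(1)))  # 'iPhone games ' (the whole run)
```

Both patterns match the same text; only the content of \1 differs, which is why the original substitution dropped iPhone.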

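Further update: actually, I can think of a way around it: match one word, then zero or more repetitions of optional whitespace followed by another word. You decide whether you want this.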

>>> regex = re.compile(r'((iphone|games|mac)(\s*(iphone|games|mac))*)', re.I)
>>> regex.sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'

Update: given your point about word boundaries, we just need to add a few instances of \b, the word-boundary matcher.

>>> regex = re.compile(r'(\b(iphone|games|mac)\b(\s*(iphone|games|mac)\b)*)', re.I)
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone games from my mac')
'I love downloading <em>iPhone games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone gameses from my mac')
'I love downloading <em>iPhone</em> gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney games from my mac')
'I love downloading iPhoney <em>games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney gameses from my mac')
'I love downloading iPhoney gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone gameses from my mac')
'I love downloading miPhone gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone games from my mac')
'I love downloading miPhone <em>games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone igames from my mac')
'I love downloading <em>iPhone</em> igames from my <em>mac</em>'
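The patterns above hard-code the three words. As a rough sketch of how the same idea could be built from an arbitrary query string, the hypothetical helper below (highlight and its tag parameter are my own names, not from the question) escapes each query word with re.escape and reuses the word-boundary structure. It assumes the query words begin and end with word characters, since \b would not behave as intended for a term like c++.

```python
import re

def highlight(sentence, query, tag="em"):
    """Wrap each run of whole-word query terms in <tag>...</tag>.

    Hypothetical helper: assumes query terms are separated by whitespace
    and start and end with word characters (so that \\b applies).
    """
    words = [re.escape(w) for w in query.split()]
    if not words:
        return sentence
    alternation = "|".join(words)
    # One word, then zero or more (whitespace + word), all on word boundaries.
    pattern = re.compile(
        r'\b(?:{0})\b(?:\s*\b(?:{0})\b)*'.format(alternation),
        re.I,
    )
    # m.group(0) is the whole match, so no capturing groups are needed.
    return pattern.sub(
        lambda m: "<{0}>{1}</{0}>".format(tag, m.group(0)), sentence
    )

print(highlight("I love downloading iPhone games from my mac.",
                "iphone games mac"))
# I love downloading <em>iPhone games</em> from my <em>mac</em>.
```

With the query game, the word games is left alone, and with the query the, the word either is left alone, which covers the two cases raised in the question's edit.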

Answer 1 (score: 1)

>>> r = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I)
>>> r.sub(r'\1<em>\2</em>', sentence)
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'

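An extra group that wholly encloses the + repetition avoids losing words, while moving the whitespace in front of each word, and capturing the leading whitespace separately so it stays outside the tag, takes care of the spacing problem. The word-boundary assertions require each of the three words between them to match as a whole word. That said, NLP is hard, and there will still be cases where this does not work as expected.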
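As an illustration of that last caveat, here is a small sketch that runs the same pattern on a few inputs where the result may not be what a reader wants. The helper name tag and the sample sentences are made up for this example.

```python
import re

r = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I)

def tag(text):
    # Same substitution as above, wrapped in a function for reuse.
    return r.sub(r'\1<em>\2</em>', text)

# Only whitespace joins words into one run, so punctuation splits the phrase.
print(tag("iPhone, games and mac"))
# <em>iPhone</em>, <em>games</em> and <em>mac</em>

# A hyphen counts as a word boundary, so both halves are tagged separately.
print(tag("I like iPhone-games"))
# I like <em>iPhone</em>-<em>games</em>

# Plurals and other inflections of a query word are not matched at all.
print(tag("two macs"))
# two macs
```

Whether these results are acceptable depends on the application; handling them would need extra rules (allowing punctuation between words, stemming, and so on) that go beyond this pattern.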