How to use a regular expression in Python to replace a substring that is not already enclosed in tags

Time: 2016-03-06 18:10:25

Tags: python regex substring

I have a sentence:

text="The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"

I want to wrap the untagged "Obama" in <PERSON></PERSON> tags, so that the result looks like this:
The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012

I want to find a substring (for example, Obama) that is not preceded by the <PERSON> tag and not followed by the </PERSON> tag, but I don't know the correct regular expression syntax in Python. I'm new to Python.

Using a simple regular expression, re.sub(namedEntity, "<PERSON>"+namedEntity+"</PERSON>", text), gives this output, in which the already tagged name is wrapped a second time:
The president of America is <PERSON>Barack <PERSON>Obama</PERSON></PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012

Here is my code (using Python 2.7):

import re

result=re.sub(r"((?!<PERSON>).*"+namedEntity+".*(?!</PERSON>))","<PERSON>"+namedEntity+"</PERSON>",text)

print "result: "+result

Output:
result: <PERSON>Obama</PERSON>
And I don't know whether this is the first "Obama" or the second one.

Thanks for your help.

2 answers:

Answer 0: (score: 2)

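You're very close. In your new regex r"((?!<PERSON>).*"+namedEntity+".*(?!</PERSON>))", the .* before and after 'Obama' consume every character on either side of the name. Because the tags end up inside the matched group, the lookarounds have no effect. If you remove the .* parts, you get the result you're after.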

>>> import re
>>> text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"
>>> namedEntity = 'Obama'
>>> result = re.sub(r"((?!<PERSON>)"+namedEntity+"(?!</PERSON>))","<PERSON>"+namedEntity+"</PERSON>",text)
>>> print result
The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012

For future regex testing, regex101 is a good way to see how your patterns behave as you change them in real time. For your case, this shows what is happening.
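To make the explanation above concrete, here is a minimal sketch that compares the two patterns on the question's sentence. The variable names (named_entity, greedy, narrow, spans) are illustrative and not from the original post, and the code is written so that it should run on both Python 2.7 and Python 3. One further detail: (?!<PERSON>) placed before the name is a lookahead, so it is always satisfied at a position where the name begins. It is the trailing (?!</PERSON>) that excludes the already tagged occurrence.

```python
import re

text = ("The president of America is <PERSON>Barack Obama</PERSON>. "
        "He was born on August 4, 1961. "
        "Obama was reelected president in November 2012")
named_entity = "Obama"

# The question's pattern: the greedy .* on each side pulls the whole
# sentence into the match, so re.sub replaces the whole sentence.
greedy = r"((?!<PERSON>).*" + named_entity + r".*(?!</PERSON>))"
match = re.search(greedy, text)
whole_sentence_matched = match.group(0) == text
print(whole_sentence_matched)  # True

# Without .*, only the name itself is matched, and the trailing lookahead
# skips the occurrence that is directly followed by </PERSON>.
narrow = r"(?!<PERSON>)" + named_entity + r"(?!</PERSON>)"
spans = [m.span() for m in re.finditer(narrow, text)]
print(spans)
```

Only one span is reported, and it lies after the existing closing tag, which is why the second "Obama" is the one that gets wrapped.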

Answer 1: (score: 1)

Just remove the .* parts from your regex.

>>> import re
>>> text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"
>>> surname = re.search(r'<PERSON>(.*)</PERSON>', text).group(1).split()[1]
>>> print surname
Obama
>>> re.sub(r'(?<!<PERSON>)'+surname+'(?!</PERSON>)', '<PERSON>'+surname+'</PERSON>', text)
'The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012'
>>> 

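Note: you can also use a regex with a capture group, as I did for the surname variable, to extract the person's last name. Use (?<!regex) to assert a negative lookbehind and (?!regex) to assert a negative lookahead.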
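The lookarounds above only inspect the characters directly next to the name, so a name in the middle of a tagged span (for example "Barack" in this sentence) would still be wrapped. One possible variation, offered as a sketch and not as either answerer's method, is a lookahead that rejects a match whenever the next tag after it is the closing tag. The helper name tag_untagged and its tag parameter are hypothetical. The sketch assumes that tags are not nested and that the text contains no stray "<" characters outside of tags.

```python
import re

def tag_untagged(text, name, tag="PERSON"):
    """Wrap occurrences of `name` that are not already inside <tag>...</tag>.

    Assumes tags are not nested and '<' appears only as part of a tag.
    """
    opening = "<" + tag + ">"
    closing = "</" + tag + ">"
    # re.escape guards against regex metacharacters in the name or tag.
    # The negative lookahead fails when only non-'<' characters separate the
    # match from a closing tag, i.e. when the name sits inside a tagged span.
    pattern = re.escape(name) + r"(?![^<]*" + re.escape(closing) + r")"
    # A function replacement avoids backslash handling in the replacement text.
    return re.sub(pattern, lambda m: opening + m.group(0) + closing, text)

text = ("The president of America is <PERSON>Barack Obama</PERSON>. "
        "He was born on August 4, 1961. "
        "Obama was reelected president in November 2012")
result = tag_untagged(text, "Obama")
print(result)
```

Running the helper a second time on its own output should leave the text unchanged, since every occurrence is then followed by a closing tag.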