I have a sentence:
text="The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"
I want to put <PERSON></PERSON> tags around the untagged "Obama", so that the result looks like this:
The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012
I want to find a substring (for example, Obama) that is not preceded by the <PERSON> tag and not followed by the </PERSON> tag, but I don't know the correct regular expression syntax in Python.
I'm new to Python.
A plain substitution, re.sub(namedEntity, "<PERSON>"+namedEntity+"</PERSON>", text), gives this output, with the first "Obama" tagged twice:
The president of America is <PERSON>Barack <PERSON>Obama</PERSON></PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012
Here is my code (Python 2.7):
import re
result=re.sub(r"((?!<PERSON>).*"+namedEntity+".*(?!</PERSON>))","<PERSON>"+namedEntity+"</PERSON>",text)
print "result: "+result
Output:
result: <PERSON>Obama</PERSON>
I can't tell whether this is the first "Obama" or the second.
Thanks for your help.
Answer 0 (score: 2)
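You are very close. In your new regex, r"((?!<PERSON>).*"+namedEntity+".*(?!</PERSON>))", the .* before and after the name match any characters on either side of 'Obama'. Because the tags then fall inside the matched group, the lookarounds have no effect. If you remove the .* parts, you get the result you are after.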
>>> import re
>>> text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"
>>> namedEntity = 'Obama'
>>> result = re.sub(r"((?!<PERSON>)"+namedEntity+"(?!</PERSON>))","<PERSON>"+namedEntity+"</PERSON>",text)
>>> print result
The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012
For future regex testing, regex101 is a good way to check how a pattern behaves while you change it live. For your case, this shows what is happening.
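To see what the original pattern with .* actually matched, here is a minimal sketch using only the standard re module. The variable names and the extra group around the name are illustrative additions, not part of the question's code. Because .* is greedy and the lookaheads consume nothing, a single match covers the whole sentence, and the literal name inside that match lines up with the last "Obama".

```python
import re

text = ("The president of America is <PERSON>Barack Obama</PERSON>. "
        "He was born on August 4, 1961. "
        "Obama was reelected president in November 2012")
named_entity = "Obama"

# The pattern from the question, with a group added around the name
# so its position inside the match can be inspected.
pattern = r"((?!<PERSON>).*(" + named_entity + r").*(?!</PERSON>))"
match = re.search(pattern, text)

# The match runs from the first to the last character of the text,
# which is why re.sub() replaced the entire sentence.
whole_text_matched = match.span(0) == (0, len(text))

# Greedy .* backtracks from the end of the string, so the name is
# found at its last occurrence.
matched_last_occurrence = match.start(2) == text.rfind(named_entity)

print(whole_text_matched)
print(matched_last_occurrence)
```

Note that (?!...) is a lookahead wherever it appears. In the corrected pattern the leading (?!<PERSON>) inspects the text starting at the name itself, so for this sentence it is the trailing (?!</PERSON>) that excludes the already tagged "Obama".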
Answer 1 (score: 1)
Just remove the .* parts from your regex.
>>> import re
>>> text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"
>>> surname = re.search(r'<PERSON>(.*)</PERSON>', text).group(1).split()[1]
>>> print surname
Obama
>>> re.sub(r'(?<!<PERSON>)'+surname+'(?!</PERSON>)', '<PERSON>'+surname+'</PERSON>', text)
'The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012'
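Note: you can also extract the person's surname with a regex and a capturing group, as I did for the surname variable. Use (?<!regex) for a negative lookbehind assertion and (?!regex) for a negative lookahead assertion.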
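The two assertions from the note can be wrapped in a small helper. This is a minimal sketch rather than part of the answer above: tag_untagged and its parameters are hypothetical names, and re.escape is added on the assumption that a name might contain regex metacharacters. The code is written to run on both Python 2.7 and Python 3.

```python
import re


def tag_untagged(text, name, tag="PERSON"):
    """Wrap occurrences of `name` that are not directly adjacent to a tag."""
    open_tag = "<" + tag + ">"
    close_tag = "</" + tag + ">"
    pattern = (
        "(?<!" + re.escape(open_tag) + ")"    # not immediately preceded by the opening tag
        + re.escape(name)
        + "(?!" + re.escape(close_tag) + ")"  # not immediately followed by the closing tag
    )
    # A function replacement avoids backslash handling in the replacement string.
    return re.sub(pattern, lambda m: open_tag + m.group(0) + close_tag, text)


text = ("The president of America is <PERSON>Barack Obama</PERSON>. "
        "He was born on August 4, 1961. "
        "Obama was reelected president in November 2012")

tagged = tag_untagged(text, "Obama")
print(tagged)
```

Calling the helper again on its own output leaves the text unchanged, because the newly tagged name now sits directly after an opening tag. Both checks only look at the characters immediately next to the name, so a word in the middle of an existing tag would still be wrapped; calling it with "Barack" on this sentence would nest a tag.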