Extracting named entities, Python 2.7

Time: 2016-03-22 02:40:46

Tags: python regex nlp

My text looks like this:

"<ENAMEX TYPE="PERSON">Edward R. Kimmel</ENAMEX>, one of Admiral <ENAMEX TYPE="PERSON">Jack</ENAMEX>'s two surviving sons and..."

I want output like this:

PERSON Edward R. Kimmel

PERSON Jack

Any ideas on how to do this with a regex?

Thanks a lot.

2 answers:

Answer 0 (score: 2)

Have you tried BeautifulSoup?

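from bs4 import BeautifulSoup
txt = """<ENAMEX TYPE="PERSON">Edward R. Kimmel</ENAMEX>, one of Admiral <ENAMEX TYPE="PERSON">Jack</ENAMEX>'s two surviving sons and..."""
# html.parser lowercases attribute names, so TYPE is matched as 'type'
soup = BeautifulSoup(txt, "html.parser")
for tag in soup.find_all(attrs={'type': 'PERSON'}):
    # Print the entity type followed by the tagged text
    print(tag['type'] + " " + tag.text)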
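
If installing BeautifulSoup is not an option, roughly the same tag-based extraction can be sketched with the standard library's HTML parser instead. This is a different technique from the bs4 call above, not a drop-in equivalent. The class name EnamexCollector is made up for illustration, the import is the Python 3 one (on Python 2.7 the module is called HTMLParser), and the sketch assumes ENAMEX elements are never nested.

```python
from html.parser import HTMLParser  # Python 3; on Python 2.7: from HTMLParser import HTMLParser


class EnamexCollector(HTMLParser):
    """Hypothetical helper: collects (TYPE, text) pairs for each ENAMEX element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.entities = []
        self._type = None   # TYPE of the ENAMEX element currently open, if any
        self._chunks = []   # text pieces seen inside the open element

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names (values are left alone).
        if tag == "enamex":
            self._type = dict(attrs).get("type")
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside an open ENAMEX element.
        if self._type is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "enamex" and self._type is not None:
            self.entities.append((self._type, "".join(self._chunks)))
            self._type = None


txt = """<ENAMEX TYPE="PERSON">Edward R. Kimmel</ENAMEX>, one of Admiral <ENAMEX TYPE="PERSON">Jack</ENAMEX>'s two surviving sons and..."""

collector = EnamexCollector()
collector.feed(txt)
collector.close()

for entity_type, entity_text in collector.entities:
    print(entity_type + " " + entity_text)
```

Unlike the attribute filter above, this collects every ENAMEX element regardless of its TYPE, so filtering for PERSON would be an extra check on the first item of each pair.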

Answer 1 (score: 0)

Just use re.findall:

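import re
x = '"<ENAMEX TYPE="PERSON">Edward R. Kimmel</ENAMEX>, one of Admiral <ENAMEX TYPE="PERSON">Jack</ENAMEX>"'
# Non-greedy capture of the text between TYPE="PERSON"> and the next '<'
mac = re.findall(r'TYPE="PERSON">(.+?)<', x)

# Parenthesized print works on both Python 2.7 and Python 3
for i in mac:
    print("PERSON " + i)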
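
One possible generalization of the pattern above, as a sketch only: capture the TYPE value along with the enclosed text, so that entity types other than PERSON come out with the right label. The function name extract_entities and the pattern name ENAMEX_RE are made up here, and the sketch assumes ENAMEX tags are never nested and that TYPE is always double-quoted.

```python
import re

# Assumption: tags are not nested and TYPE is always written with double quotes.
# Group 1 is the entity type, group 2 is the tagged text (non-greedy).
ENAMEX_RE = re.compile(r'<ENAMEX\s+TYPE="([^"]+)">(.*?)</ENAMEX>', re.DOTALL)


def extract_entities(text):
    """Return a list of (entity_type, entity_text) pairs found in text."""
    return ENAMEX_RE.findall(text)


sample = ('<ENAMEX TYPE="PERSON">Edward R. Kimmel</ENAMEX>, one of Admiral '
          '<ENAMEX TYPE="PERSON">Jack</ENAMEX>\'s two surviving sons and...')

for entity_type, entity_text in extract_entities(sample):
    print(entity_type + " " + entity_text)
# PERSON Edward R. Kimmel
# PERSON Jack
```

Matching the closing </ENAMEX> explicitly, rather than stopping at the next '<', keeps the capture aligned with the element boundaries. For markup that is nested or otherwise irregular, a real parser is likely the safer choice.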