Named entity recognition news dataset (text)
Here is an example:
<LOC Qatar> and <LOC Japan>, who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups.
I am trying to extract the entities between < and >. The problem is with nested tags, where the output is:
['<LOC Qatar>',
'<LOC Japan>',
'<EVENT <S Asian>',
'<E Cup>',
'<DATE February>']
This is wrong because '<EVENT <S Asian>' and '<E Cup>' should be a single string, not two.
I tried a regular expression, but it does not handle the nested case.
import re
s = """<LOC Qatar> and <LOC Japan>,
who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups."""
re.findall(r'<.*?>', s)
Actual result:
['<LOC Qatar>',
'<LOC Japan>',
'<EVENT <S Asian>',
'<E Cup>',
'<DATE February>']
Expected result:
['<LOC Qatar>',
'<LOC Japan>',
'<EVENT <S Asian> <E Cup>>',
'<DATE February>']
Answer 0 (score: 2)
You want a recursive pattern, as mentioned in the comments. The regex module (not the re module) supports this.
Here is the code:
# Import the third-party regex module (pip install regex)
import regex as reg
# Your string
s = """<LOC Qatar> and <LOC Japan>,
who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups."""
# Match a <...> pair whose body is either non-bracket characters
# or, through (?R), a nested match of the whole pattern
my_list = reg.findall(r"<((?:[^<>]|(?R))*)>", s)
print(my_list)
# ['LOC Qatar', 'LOC Japan', 'EVENT <S Asian> <E Cup>', 'DATE February']
If you really want the enclosing < and > around each entity, you can add them back:
my_list = ['<' + elt + '>' for elt in my_list]
print(my_list)
# ['<LOC Qatar>', '<LOC Japan>', '<EVENT <S Asian> <E Cup>>', '<DATE February>']
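If installing the third-party regex module is not an option, the same outermost spans can probably be found with a simple depth counter that uses only the standard library. This is a minimal sketch, not part of the original answer. It assumes that < and > appear only as tag delimiters and never as ordinary text, and the function name extract_top_level_tags is hypothetical:

```python
def extract_top_level_tags(text):
    """Return every outermost <...> span in text, nested tags included."""
    spans = []
    depth = 0      # how many '<' are currently open
    start = None   # index where the current outermost tag opened
    for i, ch in enumerate(text):
        if ch == "<":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
            if depth == 0:
                # The outermost tag just closed: keep the whole span
                spans.append(text[start:i + 1])
    return spans


s = """<LOC Qatar> and <LOC Japan>,
who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups."""
print(extract_top_level_tags(s))
# ['<LOC Qatar>', '<LOC Japan>', '<EVENT <S Asian> <E Cup>>', '<DATE February>']
```

A stray > with no open tag is ignored, and a tag that is never closed is dropped silently. Stricter handling, such as raising an error, would need to be added if the data can be malformed.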