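Extract words from a sentence between the symbols "<>" and the nested case "<< >>"

Time: 2019-06-24 12:52:03

Tags: python regex

I am working with a news dataset (text) annotated for named entity recognition.

Here is an example: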

<LOC Qatar> and <LOC Japan>, who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups.

I am trying to extract the entities between <>. The problem is with nested tags, where the output is:

['<LOC Qatar>',
 '<LOC Japan>',
 '<EVENT <S Asian>',
 '<E Cup>',
 '<DATE February>']

This is wrong, because "<EVENT <S Asian>" and "<E Cup>" should be a single string, not two.

I tried a regular expression, but it does not handle the nested case.

import re
s = """<LOC Qatar> and <LOC Japan>, 
who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups."""
# Non-greedy match: stops at the first '>', so nested tags are split
re.findall(r'<.*?>', s)

Actual result:

['<LOC Qatar>',
 '<LOC Japan>',
 '<EVENT <S Asian>',
 '<E Cup>',
 '<DATE February>']

Expected result:

['<LOC Qatar>',
 '<LOC Japan>',
 '<EVENT <S Asian> <E Cup>>',
 '<DATE February>']

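1 Answer:

Answer 0: (score: 2)

You want a recursive pattern, as mentioned in the comments. The third-party regex module supports this; the standard re module does not.

Here is the code: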

# Import module
import regex as reg

# Your string
s = """<LOC Qatar> and <LOC Japan>, 
who met in the < EVENT < S Asian > < E Cup >> final in < DATE February > , are in third place in their groups. """

# Match pattern
my_list = reg.findall("<((?:[^<>]|(?R))*)>", s)
print(my_list)
# ['LOC Qatar', 'LOC Japan', ' EVENT < S Asian > < E Cup >', ' DATE February ']

If you really want the words surrounded by <>, you can add the brackets back:

my_list = ['<' + elt + '>' for elt in my_list]
print(my_list)
# ['<LOC Qatar>', '<LOC Japan>', '< EVENT < S Asian > < E Cup >>', '< DATE February >']
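
If installing the regex module is not an option, the same top-level spans can be collected with the standard library alone by tracking bracket depth. The following is only a minimal sketch, not part of the original answer: the function name extract_top_level and its parameters are made up for illustration, and it assumes that stray closing brackets should be ignored and unclosed opening brackets dropped.

```python
def extract_top_level(text, open_ch="<", close_ch=">"):
    """Return every outermost open_ch...close_ch span in text, brackets included.

    Hypothetical helper for illustration; it assumes unbalanced brackets
    should simply be skipped rather than reported as errors.
    """
    spans = []
    depth = 0      # current nesting level
    start = None   # index where the current outermost bracket opened
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                # Back at the top level: the outermost span is complete
                spans.append(text[start:i + 1])
    return spans


s = ("<LOC Qatar> and <LOC Japan>, who met in the <EVENT <S Asian> <E Cup>> "
     "final in <DATE February>, are in third place in their groups.")
print(extract_top_level(s))
# ['<LOC Qatar>', '<LOC Japan>', '<EVENT <S Asian> <E Cup>>', '<DATE February>']
```

Unlike the recursive pattern, this keeps the surrounding brackets directly and works for any nesting depth, at the cost of a hand-written loop.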