<TEXT>John became the first writer to get the 1000 reviews. His writing drew praise from all over the globe.
Now he wishes to do even better.</TEXT>
<TAGS>
<PERSONAL spans="13~29" text="the first writer"/>
<PERSONAL spans="61~85" text="drew praise from all over"/>
</TAGS>
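This is an XML file. I have to parse it and generate a training file from it. I need to take the text in each PERSONAL tag and match it against the text in the TEXT tag. Wherever it matches, the words from the PERSONAL tag need to be labelled "PERSONAL".
The output needs to look like this: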
John
became
the PERSONAL
first PERSONAL
writer PERSONAL
to
get
the
1000
reviews
His
writing
drew PERSONAL
praise PERSONAL
from PERSONAL
all PERSONAL
over PERSONAL
the
globe
Code so far:
import re
import xml.etree.ElementTree as ET

# data holds the XML string shown above
root = ET.fromstring(data)
maintext = root.find('TEXT').text
# split the text into sentences on literal periods
sentences = re.split(r'\.', maintext)
for tags in list(root):
    for pers in tags.findall('PERSONAL'):
        personal = pers.get('text')
        span = pers.get('spans')
        # spans look like "13~29": start and end character offsets
        answer_start = (re.split('([^-]*)~', span))[1]
        answer_end = (re.split('([^-]*)~', span))[2]
        for sents in sentences:
            words = re.split(' ', sents)
            for word in words:
                if personal in sents:
                    # tag the word if it occurs inside the span text
                    if word in maintext[int(answer_start):int(answer_end)]:
                        print(word + ' PERSONAL')
                    else:
                        print(word)
                else:
                    print(word)
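Here, the extra 'the' in "John became the first writer to get the 1000 reviews" is also tagged as 'PERSONAL'. There is some logic problem that keeps it from being tagged correctly, and I can't seem to figure it out.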
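
One possible way around this, offered as a minimal sketch rather than a definitive fix. A check of the form `if word in (...)` only asks whether the word occurs somewhere inside the tagged phrase, so every 'the' in the sentence passes, wherever it sits. Comparing character offsets avoids that: a word is tagged only if its own position falls inside the range of a tagged phrase. The sketch assumes the XML is wrapped in a root element (a hypothetical `<doc>` here). The `spans` values in the sample do not appear to line up exactly with the text as pasted, so the sketch locates each phrase by searching for its `text` attribute and uses the span start only as a hint to pick the nearest occurrence. The names `tag_tokens`, `sample` and `doc` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Sample from the question, wrapped in an assumed <doc> root element.
sample = """<doc>
<TEXT>John became the first writer to get the 1000 reviews. His writing drew praise from all over the globe.
Now he wishes to do even better.</TEXT>
<TAGS>
<PERSONAL spans="13~29" text="the first writer"/>
<PERSONAL spans="61~85" text="drew praise from all over"/>
</TAGS>
</doc>"""


def tag_tokens(xml_string, label="PERSONAL"):
    """Return (word, label-or-None) pairs for the words in the TEXT element."""
    root = ET.fromstring(xml_string)
    text = root.find("TEXT").text

    # Work out the character range covered by each tagged phrase.
    ranges = []
    for elem in root.iter(label):
        phrase = elem.get("text")
        hint = int(elem.get("spans").split("~")[0])
        # Every occurrence of the phrase; keep the one nearest the span start.
        starts = [m.start() for m in re.finditer(re.escape(phrase), text)]
        if starts:
            start = min(starts, key=lambda s: abs(s - hint))
            ranges.append((start, start + len(phrase)))

    tagged = []
    # \w+ yields each word together with its offsets and skips punctuation.
    for m in re.finditer(r"\w+", text):
        inside = any(a <= m.start() and m.end() <= b for a, b in ranges)
        tagged.append((m.group(), label if inside else None))
    return tagged


tagged = tag_tokens(sample)
for word, tag in tagged:
    print(f"{word} {tag}" if tag else word)
```

With the sample above, only the 'the' in "the first writer" is labelled, while the 'the' before "1000" and the one before "globe" are printed without a label. If the `spans` offsets in the real files are reliable, the search step could be dropped and the offsets used directly as the ranges.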