Question

<TEXT>John became the first writer to get the 1000 reviews. His writing drew praise from all over the globe.
Now he wishes to do even better.</TEXT>
<TAGS>
<PERSONAL spans="13~29" text="the first writer"/>
<PERSONAL spans="61~85" text="drew praise from all over"/>
</TAGS>

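This is an XML file. I have to parse it and generate a training file from it. I need to look up the text of each PERSONAL tag and match it against the text inside the TEXT tag. Wherever there is a match, the words covered by the PERSONAL tag should be labeled "PERSONAL".

The OUTPUT needs to look like this: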

John 
became
the PERSONAL
first PERSONAL
writer PERSONAL 
to 
get
the
1000
reviews 

His 
writing 
drew PERSONAL
praise PERSONAL
from PERSONAL
all PERSONAL
over PERSONAL
the 
globe

My code so far:

    import re
    import xml.etree.ElementTree as ET

    # data: the XML above as a string, assumed to be wrapped in a single root element
    root = ET.fromstring(data)
    main_text = root.find('TEXT').text
    # Escape the dot: a bare '.' is a regex wildcard and matches every character
    sentences = re.split(r'\.', main_text)
    for tags in root.findall('TAGS'):
        for pers in tags.findall('PERSONAL'):
            personal = pers.get('text')
            answer_start, answer_end = pers.get('spans').split('~')
            for sents in sentences:
                for word in sents.split():
                    if personal in sents:
                        # Substring test against the span text, not a position check
                        if word in main_text[int(answer_start):int(answer_end)]:
                            print(word + ' PERSONAL')
                        else:
                            print(word)
                    else:
                        print(word)

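With this code, the extra 'the' in "John became the first writer to get the 1000 reviews" is also tagged as 'PERSONAL'. There is a logic problem in how the words get tagged, and I can't seem to figure it out.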
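
One possible way around this, as a minimal sketch rather than a definitive fix: instead of testing whether a word occurs somewhere in the span text, compare each token's own character offsets with the span. The sketch assumes the spans are 0-based start~end offsets into the TEXT string with an exclusive end, and that TEXT and TAGS sit under one root element. The function name tag_tokens, the DOC wrapper element, and the offsets in the sample are hypothetical and were recomputed for this example; they are not taken from the data above.

```python
import re
import xml.etree.ElementTree as ET


def tag_tokens(xml_string, label="PERSONAL"):
    """Return (token, label_or_empty_string) pairs based on character offsets.

    Assumption: spans are 0-based "start~end" offsets into the TEXT string,
    with an exclusive end.
    """
    root = ET.fromstring(xml_string)
    text = root.find("TEXT").text

    # Collect every (start, end) span carried by a tag with the given name.
    spans = []
    for element in root.iter(label):
        start, end = (int(n) for n in element.get("spans").split("~"))
        spans.append((start, end))

    tagged = []
    # \w+ skips punctuation; match.start()/match.end() are the token's offsets.
    for match in re.finditer(r"\w+", text):
        inside = any(s <= match.start() and match.end() <= e for s, e in spans)
        tagged.append((match.group(), label if inside else ""))
    return tagged


# Hypothetical sample: a <DOC> wrapper is added and the offsets are recomputed.
sample = (
    "<DOC><TEXT>John became the first writer to get the 1000 reviews.</TEXT>"
    '<TAGS><PERSONAL spans="12~28" text="the first writer"/></TAGS></DOC>'
)

for token, tag in tag_tokens(sample):
    print(f"{token} {tag}".rstrip())
```

Because the comparison is positional, the second "the" (before "1000") falls outside the span and stays untagged. The sketch does not emit the blank line between sentences, and it only works if the offsets in the file really line up with the TEXT string.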

Creating a specific data pattern using Python

0 answers: