I need to extract the TREC topics stored in a file like the following:
<top>
<num> Number: 381
<title> alternative medicine
<desc> Description:
What forms of ...?
<narr> Narrative:
A relevant document ...
</top>
<top>
<num> Number: 382
<title>
hydrogen fuel automobiles
<desc> Description:
Identify documents ....
<narr> Narrative:
A relevant document may ....
</top>
<top>
<num> Number: 655
<title>
ADD diagnosis treatment
<desc>
How is ..?
<narr>
Relevant documents ...
</top>
...
I have already tried the code below, but it does not work correctly for some queries (topics). I need to process the title of every query:
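import re
from os.path import join

# pathTop (directory) and f (file name) are defined earlier
f = open(join(pathTop, f), 'r')  # Reading file
l = f.readline()
# topics extraction
while l != "":
    if l != "":
        num = 0
        # skip ahead to the next <num> line
        while not l.startswith("<num>") and l != "":
            l = f.readline()
        # "<num> Number: 381" splits into ['<num>', 'Number:', '381\n']
        parts = re.split(r' ', l)
        num = parts[2].replace('\n', '')
        print("%s OK" % str(num))
        # skip ahead to the <title> line
        while not l.startswith("<title>") and l != "":
            l = f.readline()
        titre = ""
        # collect every line from <title> up to (not including) <desc>
        while not l.startswith("<desc>") and l != "":
            titre = titre + l.replace("<title>", "")
            l = f.readline()
        print("topic title : ", titre)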
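One thing that might explain the problem: in the sample above the title sometimes sits on the same line as `<title>` and sometimes on the line after it, so the collected string carries stray spaces and newlines. The loop also reads past the last `</top>` and then indexes into an empty split result. Below is a minimal sketch of a different approach, assuming the whole file fits in memory and every topic is wrapped in `<top>` ... `</top>`. The names `parse_trec_topics` and `sample` are hypothetical, chosen only for this illustration.

```python
import re

def parse_trec_topics(text):
    """Return a list of (number, title) pairs found in TREC topic text."""
    topics = []
    # Each topic is delimited by <top> ... </top>.
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        # The number follows <num>, with an optional "Number:" label.
        num = re.search(r"<num>\s*(?:Number:)?\s*(\d+)", block)
        # The title runs from <title> up to <desc>, possibly across lines.
        title = re.search(r"<title>(.*?)(?=<desc>|\Z)", block, flags=re.DOTALL)
        if num is None or title is None:
            continue  # skip blocks missing a number or a title
        # Collapse newlines and repeated spaces into single spaces.
        topics.append((num.group(1), " ".join(title.group(1).split())))
    return topics

sample = """<top>
<num> Number: 381
<title> alternative medicine
<desc> Description:
What forms of ...?
<narr> Narrative:
A relevant document ...
</top>
<top>
<num> Number: 382
<title>
hydrogen fuel automobiles
<desc> Description:
Identify documents ....
<narr> Narrative:
A relevant document may ....
</top>
"""

for number, title in parse_trec_topics(sample):
    print(number, title)
```

With the real file, the text would come from something like `open(join(pathTop, f)).read()`. Working per `<top>` block means the end of the file needs no special handling, and normalizing whitespace makes both title layouts yield the same kind of string.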