Question

I need to extract TREC topics that are stored in a file like the following:

<top>

<num> Number: 381
<title> alternative medicine 

<desc> Description: 
What forms of ...?

<narr> Narrative: 
A relevant document ...

</top>

<top>

<num> Number: 382
<title> 
hydrogen fuel automobiles 

<desc> Description: 
Identify documents ....

<narr> Narrative: 
A relevant document may .... 

</top>
<top>
<num> Number: 655

<title>
ADD diagnosis treatment

<desc>
How is ..?

<narr>
Relevant documents ...
</top>
...

I have already tried the code below, but it does not handle some queries (topics) correctly. I need to process the title of each query:

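from os.path import join

# pathTop (directory) and f (file name) are defined earlier in the script
with open(join(pathTop, f), 'r') as fh:   # Reading file
    l = fh.readline()
    # topics extraction
    while l != "":
        # skip ahead to the next <num> line
        while l != "" and not l.startswith("<num>"):
            l = fh.readline()
        if l == "":
            break   # end of file reached, no more topics
        # "<num> Number: 381" -> "381"
        num = l.split()[2]
        print("%s OK" % num)
        # skip ahead to the <title> line
        while l != "" and not l.startswith("<title>"):
            l = fh.readline()
        # the title may span several lines; collect text until <desc>
        titre = ""
        while l != "" and not l.startswith("<desc>"):
            titre = titre + " " + l.replace("<title>", "").strip()
            l = fh.readline()
        titre = titre.strip()
        print("topic title : ", titre)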
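
For comparison, the same extraction can be sketched with regular expressions over the whole file contents instead of a line-by-line loop. This is only a minimal sketch under assumptions: every topic is wrapped in <top> ... </top>, the number line has the form "<num> Number: N", and the title ends at the next <desc> or <narr> tag. The helper name parse_topics, the pattern names, and the sample string are hypothetical and not part of the original code.

```python
import re

# Hypothetical patterns (assumptions about the file layout, see the text above).
TOP_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
NUM_RE = re.compile(r"<num>\s*Number:\s*(\d+)")
TITLE_RE = re.compile(r"<title>(.*?)(?=<desc>|<narr>|\Z)", re.DOTALL)


def parse_topics(text):
    """Return a list of (number, title) pairs, one per <top> block."""
    topics = []
    for block in TOP_RE.findall(text):
        num = NUM_RE.search(block)
        title = TITLE_RE.search(block)
        if num is None or title is None:
            continue  # skip malformed blocks rather than guessing
        # Collapse newlines and repeated spaces in multi-line titles.
        topics.append((num.group(1), " ".join(title.group(1).split())))
    return topics


# Small sample mirroring the layout shown in the question.
sample = """<top>
<num> Number: 381
<title> alternative medicine

<desc> Description:
What forms of ...?
</top>
<top>
<num> Number: 382
<title>
hydrogen fuel automobiles

<desc> Description:
Identify documents ....
</top>
"""

for number, title in parse_topics(sample):
    print(number, title)
```

With the sample above this prints one line per topic, the number followed by the title with its line breaks collapsed. Reading the whole file into memory is assumed to be acceptable here, since topic files are typically small.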

Extracting TREC topics

0 Answers: