提取TREC主题

时间:2017-02-16 07:36:41

标签: python parsing

我需要提取存储在如下文件中的trec主题:

<top>

<num> Number: 381
<title> alternative medicine 

<desc> Description: 
What forms of ...?

<narr> Narrative: 
A relevant document ...

</top>

<top>

<num> Number: 382
<title> 
hydrogen fuel automobiles 

<desc> Description: 
Identify documents ....

<narr> Narrative: 
A relevant document may .... 

</top>
<top>
<num> Number: 655

<title>
ADD diagnosis treatment

<desc>
How is ..?

<narr>
Relevant documents ...
</top>
...

我已经尝试过这段代码,但它会对某些查询(主题)产生影响,我需要处理每个查询的标题:

f = open(join(pathTop,f), 'r')   # Reading file
    l = f.readline()
    # topics extraction
    while (l!=""):
        if (l!=""):
            num=0
            while((l.startswith("<num>")==False)and(l!="")) :
                l = f.readline()
            list=re.split(r' ', l)
            num=list[2].replace('\n','')
            print("%s OK" %str(num))
            while ((l.startswith("<title>")==False)and(l!="")) :
                l = f.readline()
            titre=""
            while((l.startswith("<desc>")==False)and(l!="")):
                titre=titre+l.replace("<title>","")
                l=f.readline()
            print("topic title : ",titre)

0 个答案:

没有答案