How do I convert the Social Book Search XML collection to a TREC collection?

Time: 2019-05-18 09:03:29

Tags: python xml python-2.7 beautifulsoup

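I am experimenting with the Social Book Search dataset on the Terrier IR platform. The dataset contains 2.8 million XML documents, each with over 67 metadata fields. Here is a sample XML file: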

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- version 1.0 / 2009-11-06T15:56:12+01:00 -->
<!DOCTYPE book SYSTEM "books.dtd">
<book>
<isbn>0373078005</isbn>
<title>Never Trust A Lady (Silhouette Intimate Moments, No 800) (Harlequin Intimate Moments, No 800)</title>
<ean>9780373078004</ean>
<binding>Paperback</binding>
<label>Silhouette</label>
<browseNodes>
<browseNode id="388186011">Refinements</browseNode>
<browseNode id="394174011">Binding (binding)</browseNode>
<browseNode id="400272011">Paperback</browseNode>
</browseNodes>
</book>

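However, before indexing, I want to convert the collection to the TREC collection format. All XML files in a given folder should be converted into a single TREC file, as in the following example: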

<book>
<isbn>0373078005</isbn>
<text>0373078005 Never Trust A Lady (Silhouette Intimate Moments, No 800 (Harlequin Intimate Moments, No 800) 9780373078004 Paperback Silhouette $3.99 Silhouette Silhouette 1997-07-01 Silhouette Refinements Binding (binding) Paperback </text>
</book>
<book>
<isbn>0373084005</isbn>
<text>0373084005 Written On The Wind (Silhouette Romance, No 400) 9780373084005 Paperback Silhouette $1.95 Silhouette Silhouette 1985-11-01 Silhouette 70 420 650 10 Rita Rainville Author Artificial intellingence Romance contemporary sr category Romance Subjects Contemporary Series Silhouette Romance Books General Refinements Binding (binding) Paperback Format (feature_browse-bin) Printed Books General AAS</text>
</book>
...
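
For illustration only, the mapping from one book document to one TREC record can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in parser, not the BeautifulSoup code used in this post. The name book_to_trec is hypothetical, and the sketch assumes each file holds a single <book> root with an <isbn> child, as in the sample above.

```python
import xml.etree.ElementTree as ET


def book_to_trec(xml_bytes):
    """Return one TREC-style <book> record for a single book XML document."""
    root = ET.fromstring(xml_bytes)
    # findtext returns None when there is no <isbn>, hence the fallback.
    isbn = (root.findtext("isbn") or "").strip()
    # itertext() yields every text node in document order, including the
    # whitespace-only fragments between elements, which are dropped here.
    parts = [t.strip() for t in root.itertext()]
    text = " ".join(p for p in parts if p)
    return "<book>\n<isbn>%s</isbn>\n<text>%s</text>\n</book>\n" % (isbn, text)
```

Calling it once per file and concatenating the results would give the single-file layout shown above.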

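I created C:\xmlfiles\python-trec with two folders inside it, data1 and data2, and put some XML files in each of them. I started from the Python script available at http://lab.hypotheses.org/1129 and modified it into the following code: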

import os, sys
from bs4 import BeautifulSoup
datadest="no collection path"
datdir = "C:\\xmlfiles\\python-trec\\"
for folds in os.listdir(datdir):
    os.mkdir(datadest+folds)
    trectxt=""
    for files in os.listdir(datdir+folds):
        if files.endswith(".xml"):
            content= open(datdir+"/"+folds+"/"+files,'r').read()
            soup = BeautifulSoup(content)
            texts = soup.findAll("book")
            for text in texts:
                isbn =texts[0].findAll("isbn")[0].getText()
                trectxt+="<book>\n<isbn>"+isbn+"</isbn>\n"
                trectxt+="<text>"+' '.join(texts[0].findAll(text=True))+"</text>\n</book>\n"
                f=open(datadest+folds+"/"+folds+".xml","w")
                f.write(trectxt)
                f.close()

I get the following error message:

C:\Python27>python C:\Python27\Scripts\trec-conversion.py
Traceback (most recent call last):
  File "C:\Python27\Scripts\trec-conversion.py", line 6, in <module>
   os.mkdir(datadest+folds)
 WindowsError: [Error 183] Cannot create a file when that file already exists: 'no collection pathdata1'

After changing the line datadest="no collection path" to datadest="C:\\xmlfiles\\python-trec\\", I got the following error message:

C:\Python27>python C:\Python27\Scripts\trec-conversion.py
Traceback (most recent call last):
  File "C:\Python27\Scripts\trec-conversion.py", line 6, in <module>
   os.mkdir(datadest+folds)
WindowsError: [Error 183] Cannot create a file when that file already exists: 'C:\\xmlfiles\\python-trec\\data1'
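
A note on this error: os.mkdir raises when the target already exists, and with datadest equal to datdir the path datadest+folds is the input folder itself, so the call cannot succeed. A minimal sketch of one way around it, assuming the output should live in its own folder; the helper names ensure_dir and input_folders are hypothetical:

```python
import os


def ensure_dir(path):
    """Create path (and any missing parents) unless it is already a directory."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


def input_folders(datdir, datadest):
    """List the sub-folders of datdir, skipping plain files and the output folder."""
    dest = os.path.normcase(os.path.abspath(datadest))
    folders = []
    for name in sorted(os.listdir(datdir)):
        full = os.path.join(datdir, name)
        # Skip the results folder if it was created inside the input folder.
        if os.path.isdir(full) and os.path.normcase(os.path.abspath(full)) != dest:
            folders.append(name)
    return folders
```

Building paths with os.path.join(datadest, folds) instead of datadest+folds also avoids depending on a trailing backslash in datadest.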

I then created a new folder, C:\xmlfiles\python-trec\python-trec-results, changed the line datadest="no collection path" to datadest="C:\\xmlfiles\\python-trec\\python-trec-results", and got the following warning and error:

C:\Python27\Scripts\trec-conversion.py:11: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 11 of the file 
C:\Python27\Scripts\trec-conversion.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.

soup = BeautifulSoup(content)
Traceback (most recent call last):
File "C:\Python27\Scripts\trec-conversion.py", line 18, in <module>
    f.write(trectxt)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 1141: ordinal not in range(128)

The code generated the required TREC file for the data1 folder, but it failed with the message above and did not generate one for the data2 folder.
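
On the UnicodeEncodeError: in Python 2, writing a unicode string to a file opened with the built-in open() implicitly encodes it as ASCII, which fails on characters such as u'\u2019'. Presumably a file in data2 contains such a character and none in data1 does. A minimal sketch of writing the output as UTF-8 explicitly, assuming UTF-8 is acceptable for the TREC file; write_trec is a hypothetical helper name:

```python
import io


def write_trec(records, out_path):
    """Write all records to out_path as UTF-8, opening the file only once."""
    # io.open accepts an encoding argument on both Python 2.7 and Python 3.
    with io.open(out_path, "w", encoding="utf-8") as out:
        for record in records:
            # Byte strings are assumed to be UTF-8 and decoded first, because
            # a text-mode file only accepts unicode text.
            if isinstance(record, bytes):
                record = record.decode("utf-8")
            out.write(record)
```

Collecting the records for a folder and writing them once at the end also avoids reopening the output file for every book.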

Please help.

-Rocky

1 answer:

Answer 0 (score: 0)

I made the following changes:

# encoding=utf8
import os, sys
reload(sys)
sys.setdefaultencoding('utf8')

from bs4 import BeautifulSoup

datadest="C:\\xmlfiles\\python-trec-results\\"
datdir = "C:\\xmlfiles\\python-trec\\"

for folds in os.listdir(datdir):
    os.mkdir(datadest+folds)
    trectxt=""
    for files in os.listdir(datdir+folds):
        if files.endswith(".xml"):
            content= open(datdir+"/"+folds+"/"+files,'r').read()
            soup = BeautifulSoup(content, 'lxml', from_encoding='utf-8')
            texts = soup.findAll("book")
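            # Use each matched <book> itself rather than always the first one.
            for text in texts:
                isbn = text.findAll("isbn")[0].getText()
                trectxt+="<book>\n<isbn>"+isbn+"</isbn>\n"
                trectxt+="<text>"+' '.join(text.findAll(text=True))+"</text>\n</book>\n"
    # Write the folder's TREC file once, after all its XML files are processed.
    f=open(datadest+folds+"/"+folds+".xml","w")
    f.write(trectxt)
    f.close()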

The program now runs! However, as shown below, it leaves too much whitespace inside the values of the <isbn> and <text> nodes:

<book>
<isbn>0268020000</isbn>
<text>
0268020000 
Aquinas On Matter and Form and the Elements: A Translation and Interpretation of the DE PRINCIPIIS NATURAE and the DE MIXTIONE ELEMENTORUM of St. Thomas Aquinas 
9780268020002 
Paperback 
University of Notre Dame Press 
$25.00 
University of Notre Dame Press 
University of Notre Dame Press 


1998-03-28 
University of Notre Dame Press 

2000-11-16 
Wonderful Exposition 
Bobick has done it again.  After reading Bobick's insightful translation and exposition of Aquinas' "De Ente et Esentia", I was pleased to find that his knack for explaining Aquinas' complex ideas in metaphysics and natural philospohy is repeated in this book.  For those who wish to understand Aquinas in depth, this book is a must. 
5 
0 
0 

Physics 
Cosmology 
Professional & Technical 



</text>
</book>
<book>
<isbn>0268037000</isbn>
<text>
0268037000
... 

I want to remove the unnecessary spaces and line breaks so that the output looks like the following:

<book>
<isbn>0268020000</isbn>
<text> ....text goes here....</text>
</book>
<book>
<isbn> 0268037000 </isbn>
<text>....text goes here.....</text>
</book>
...

I tried the available answers about removing whitespace, but they did not work for me... Please help.
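
One possible approach, as a sketch: the extra blank lines most likely come from the whitespace-only text nodes that sit between the XML elements, and str.split() with no arguments splits on any run of whitespace and discards the empty pieces. The helper name collapse_whitespace is hypothetical:

```python
def collapse_whitespace(fragments):
    """Join text fragments into one line with single spaces between words."""
    # The inner join glues the fragments together; split() with no arguments
    # then breaks on any run of spaces, tabs or newlines and drops empties.
    return " ".join(" ".join(fragments).split())
```

The list returned by findAll(text=True) could be passed through this helper before it is wrapped in <text>. BeautifulSoup's get_text(" ", strip=True) is another option that strips each fragment while joining, though it does not collapse runs of spaces inside a fragment.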