I am trying to parse the DBLP data (XML format). My code so far is:
# -*- coding: utf-8 -*-
from lxml import etree  # XML parsing library

parser = etree.XMLParser(load_dtd=True)  # load the DTD referenced by the document
tree = etree.parse("dblp.xml", parser)  # parse the XML into a tree structure
root = tree.getroot()
I ran the code and got the following error:
  tree = etree.parse("dblp.xml", parser)  # Parse the xml with tree structure
  File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
  File "dblp.xml", line 70
lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 70, column 27
How can I fix this error?
Note: the XML file and the DTD file are in the same location.
Answer 0 (score: 0)
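I recently ran into the same problem while parsing the DBLP XML database. In my case, I was missing the .dtd file that corresponds to dblp.xml. That file provides the information needed to resolve certain custom entities, including ouml. The top of the XML file should look like this: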
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
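The .dtd file named on the second line must be in the same directory as the dblp.xml file you are trying to parse, because the name in the DOCTYPE is a relative path that is resolved against the XML file's location. You can download the matching .dtd and XML files from: http://dblp.org/xml/release/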
$ ls
dblp-2017-08-29.dtd dblp-2018-11-01.xml
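To check that the DTD is really being picked up, it can help to reproduce the setup on a tiny scale. The following is a minimal sketch, not the real DBLP files: the names mini.xml and mini.dtd, the helper functions, and the one-entity DTD are all made up for illustration. It writes both files side by side and parses the XML with load_dtd=True, as in the question.

```python
import os
import tempfile

from lxml import etree


def parse_with_dtd(xml_path):
    # load_dtd=True makes the parser read the DTD named in the DOCTYPE,
    # which is where entities such as &ouml; are declared.
    parser = etree.XMLParser(load_dtd=True)
    return etree.parse(xml_path, parser)


def demo():
    # Hypothetical miniature stand-ins for dblp.xml and its .dtd file.
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, "mini.dtd"), "w", encoding="ascii") as f:
            f.write('<!ENTITY ouml "&#246;">\n')
        xml_path = os.path.join(folder, "mini.xml")
        with open(xml_path, "w", encoding="ascii") as f:
            f.write('<?xml version="1.0" encoding="ISO-8859-1"?>\n')
            f.write('<!DOCTYPE dblp SYSTEM "mini.dtd">\n')
            f.write("<dblp><article><author>Sch&ouml;n</author></article></dblp>\n")
        root = parse_with_dtd(xml_path).getroot()
        return root.findtext("article/author")


resolved = demo()
```

If the .dtd file is deleted or renamed in this sketch, the parse should fail with the same kind of "Entity 'ouml' not defined" error as in the question.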
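In addition, given the size of dblp.xml, you may want to use lxml.etree.iterparse to stream the file's contents instead of loading it all at once. Below is some code I used to fetch certain types of publication entries from the database.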
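import lxml.etree

fn = 'dblp.xml'
for event, elem in lxml.etree.iterparse(fn, load_dtd=True):
    # Child elements (title, author, ...) also arrive here; skip them without clearing
    if elem.tag not in ['article', 'inproceedings', 'proceedings']:
        continue
    # find() returns the first matching child element or None; the value is in .text
    title = elem.find('title')
    year = elem.find('year')
    authors = elem.find('author')  # first author only; findall() returns all of them
    venue = elem.find('venue')
    ...
    elem.clear()  # release the memory held by the processed entry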
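As a variation on the loop above, here is a small self-contained sketch that turns each entry into a plain dictionary. The function name iter_publications, the dictionary keys, and the SAMPLE data are assumptions for illustration, not part of DBLP or lxml. The sample has no DOCTYPE, so load_dtd defaults to False here; for the real dblp.xml you would pass the file name and load_dtd=True.

```python
import io

from lxml import etree

WANTED = {"article", "inproceedings", "proceedings"}


def iter_publications(source, load_dtd=False):
    """Yield one dict per wanted entry while streaming through the XML."""
    for _event, elem in etree.iterparse(source, events=("end",), load_dtd=load_dtd):
        if elem.tag not in WANTED:
            continue  # children such as <title> are kept until their parent ends
        year = elem.findtext("year")
        yield {
            "type": elem.tag,
            "title": elem.findtext("title"),
            "year": int(year) if year else None,
            "authors": [a.text for a in elem.findall("author")],
        }
        elem.clear()  # drop the entry's children once it has been consumed


# Hypothetical sample data standing in for a slice of dblp.xml.
SAMPLE = b"""<dblp>
<article><author>A. Author</author><author>B. Writer</author><title>First.</title><year>2017</year></article>
<www><title>Home Page</title></www>
<inproceedings><author>C. Coder</author><title>Second.</title></inproceedings>
</dblp>"""

records = list(iter_publications(io.BytesIO(SAMPLE)))
```

Using findtext() and findall() keeps the handling of missing fields in one place: a missing year becomes None and a missing author list becomes an empty list.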