Entity 'ouml' error when parsing DBLP data with lxml

Time: 2018-04-02 16:13:05

Tags: python json xml

I am trying to parse the DBLP data (in XML format). So far, my code is:

# -*- coding: utf-8 -*-
from lxml import etree  # import the lxml library
parser = etree.XMLParser(load_dtd=True)
tree = etree.parse("dblp.xml", parser)  # Parse the xml with tree structure
root = tree.getroot()

When I run the code, I get the following error:

tree = etree.parse("dblp.xml", parser)  # Parse the xml with tree structure
  File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
  File "dblp.xml", line 70

  lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 70, column 27

How can I fix this error?

Note: the XML and DTD files are in the same location.

1 answer:

Answer 0: (score: 0)

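I recently ran into the same problem while parsing DBLP's XML database. In my case, I was missing the .dtd file that goes with dblp.xml (it supplies the declarations needed to resolve certain custom entities, including ouml). The top of the XML file should look like this: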

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">

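The .dtd file named on the second line must be in the same directory as the dblp.xml file you are trying to parse. You can download the .dtd file that matches your XML file from: http://dblp.org/xml/release/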

$ ls
dblp-2017-08-29.dtd  dblp-2018-11-01.xml
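
To see the effect of having the DTD next to the XML file, here is a minimal sketch that builds a tiny stand-in for both files in a temporary directory and parses it with load_dtd=True. The file names mini.dtd and mini.xml, their contents, and the function name parse_with_dtd are made up for illustration; only XMLParser(load_dtd=True) and etree.parse come from the code above.

```python
import os
import tempfile

from lxml import etree

# Hypothetical miniature stand-ins for the real DBLP files.
# The DTD declares the 'ouml' entity; the XML refers to that DTD by a relative name.
MINI_DTD = '<!ELEMENT dblp (#PCDATA)>\n<!ENTITY ouml "&#246;">\n'
MINI_XML = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<!DOCTYPE dblp SYSTEM "mini.dtd">\n'
    '<dblp>Sch&ouml;n</dblp>\n'
)


def parse_with_dtd():
    """Write both files into one directory, parse the XML, return the root text."""
    with tempfile.TemporaryDirectory() as workdir:
        dtd_path = os.path.join(workdir, "mini.dtd")
        xml_path = os.path.join(workdir, "mini.xml")
        with open(dtd_path, "w", encoding="ascii") as f:
            f.write(MINI_DTD)
        with open(xml_path, "w", encoding="ascii") as f:
            f.write(MINI_XML)

        # The relative DTD name in the DOCTYPE is looked up next to the XML file.
        parser = etree.XMLParser(load_dtd=True)
        root = etree.parse(xml_path, parser).getroot()
        return root.text
```

Calling parse_with_dtd() should return the text with &ouml; replaced by the character ö. If mini.dtd were missing from the directory, the parser would have no declaration for the entity, which is the situation described in the question.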

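Also, given the size of dblp.xml, you may want to use lxml.etree.iterparse to stream the file's contents. Below is some code I used to pull out certain types of publication entries from the database.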

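import lxml.etree

fn = 'dblp.xml'
# Stream the file; load_dtd=True lets the parser resolve entities such as &ouml;.
for event, elem in lxml.etree.iterparse(fn, load_dtd=True):
    # Only 'end' events are reported by default, so each element is complete here.
    if elem.tag not in ['article', 'inproceedings', 'proceedings']:
        continue

    # find() returns the first matching child element, or None if there is none.
    title = elem.find('title')
    year = elem.find('year')
    authors = elem.findall('author')  # list of elements; a record can have several authors
    venue = elem.find('venue')

    ...

    # Drop this record's children and text before moving on to the next one.
    elem.clear()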
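
The elements returned by find() and findall() are element objects, not strings or integers, so they still need converting. One possible way to do that is sketched below. The helper name record_to_dict and the choice of fields are assumptions for illustration, not part of the original code, and it relies only on methods that lxml.etree shares with the standard library's xml.etree.ElementTree (find, findall, itertext).

```python
def record_to_dict(elem):
    """Convert one publication element into a plain dictionary (hypothetical helper)."""

    def text_of(child):
        # itertext() also collects text inside nested markup, in case a title has any.
        if child is None:
            return None
        return "".join(child.itertext()).strip()

    year_text = text_of(elem.find("year"))
    return {
        "type": elem.tag,
        "title": text_of(elem.find("title")),
        # Keep the year only when it is a plain number; otherwise store None.
        "year": int(year_text) if year_text and year_text.isdigit() else None,
        "authors": [text_of(author) for author in elem.findall("author")],
    }
```

Inside the loop above, this would have to be called before elem.clear(), since clear() removes the children that the helper reads.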