Getting the internal DTD with lxml

Time: 2014-01-01 08:45:40

Tags: python xml lxml dtd

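I am trying to use lxml to get the elements declared in an internal DTD, but I cannot get it to work. First, here is my XML file (http://validator.w3.org reports it as valid):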

<?xml
    version='1.1'
    encoding='utf-8'
?>
<!DOCTYPE root [
    <!ATTLIST test
        attr (A | B | C) 'B'
    >
    <!ELEMENT test (#PCDATA)>
    <!ELEMENT root (test)*>
]>
<root></root>

However, calling lxml.etree.DTD(file='test.xml') raises an exception:

Traceback (most recent call last):
  File "./test.py", line 6, in <module>
    lxml.etree.DTD(file = 'test.xml')
  File "dtd.pxi", line 285, in lxml.etree.DTD.__init__ (src/lxml/lxml.etree.c:152121)
lxml.etree.DTDParseError: Content error in the external subset, line 5, column 1

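Perhaps lxml.etree.DTD does not support internal DTDs, or I am doing something wrong. I also wanted to try lxml.etree.parse(), but I cannot figure out which methods the object it returns provides (I looked at the reference for parse(), but it does not link to the methods). The task is simple in theory, but I cannot find the information I need.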

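1 answer:

Answer 0: (score: 1)

I am not sure exactly what you are looking for, but you can discover it with an interactive Python interpreter that has tab completion, such as IPython. That is how I found this: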

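import io

import lxml.etree as ET

# Bytes literal: io.BytesIO() needs bytes, and lxml rejects str input
# that carries an encoding declaration.
content = b'''<?xml
    version='1.1'
    encoding='utf-8'
?>
<!DOCTYPE root [
    <!ATTLIST test
        attr (A | B | C) 'B'
    >
    <!ELEMENT test (#PCDATA)>
    <!ELEMENT root (test)*>
]>
<root></root>'''

# Parse the document; the internal DTD is exposed through the tree's docinfo.
tree = ET.parse(io.BytesIO(content))
info = tree.docinfo
dtd = info.internalDTD

# Print each element declaration together with its content model.
for elt in dtd.elements():
    print(elt)
    print(elt.content)
    print()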

# <lxml.etree._DTDElementDecl object name='test' prefix=None type='mixed' at 0xb73e044c>
# <lxml.etree._DTDElementContentDecl object name=None type='pcdata' occur='once' at 0xb73e04ac>

# <lxml.etree._DTDElementDecl object name='root' prefix=None type='element' at 0xb73e046c>
# <lxml.etree._DTDElementContentDecl object name='test' type='element' occur='mult' at 0xb73e04ac>
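
As a follow-up, here is a minimal sketch of how the attribute declarations (the ATTLIST in the question) might be read from the same objects, and of why lxml.etree.DTD(file='test.xml') failed: DTD() parses bare DTD declarations, not a whole XML document, which is why it complained about "the external subset". The helper name describe_internal_dtd and the constants DOCUMENT and STANDALONE_DTD are made up for illustration, and the sample uses XML version 1.0 rather than the 1.1 of the original file.

```python
import io

import lxml.etree as ET

# Simplified copy of the question's document (XML version set to 1.0 here).
DOCUMENT = b"""<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE root [
    <!ATTLIST test
        attr (A | B | C) 'B'
    >
    <!ELEMENT test (#PCDATA)>
    <!ELEMENT root (test)*>
]>
<root></root>"""

# The same declarations without the <!DOCTYPE root [ ... ]> wrapper.
# This is the kind of input that ET.DTD() parses.
STANDALONE_DTD = b"""<!ATTLIST test
    attr (A | B | C) 'B'
>
<!ELEMENT test (#PCDATA)>
<!ELEMENT root (test)*>
"""


def describe_internal_dtd(xml_bytes):
    """Map each declared element name to its attribute declarations.

    Hypothetical helper: each attribute is reported as a tuple of
    (name, type, default value, allowed values).
    """
    tree = ET.parse(io.BytesIO(xml_bytes))
    dtd = tree.docinfo.internalDTD
    if dtd is None:  # the document may not carry a DOCTYPE at all
        return {}
    summary = {}
    for element in dtd.elements():
        summary[element.name] = [
            (attr.name, attr.type, attr.default_value, attr.values())
            for attr in element.attributes()
        ]
    return summary


# Parsing the bare declarations directly, from a file-like object.
standalone = ET.DTD(io.BytesIO(STANDALONE_DTD))

if __name__ == "__main__":
    for name, attributes in describe_internal_dtd(DOCUMENT).items():
        print(name, attributes)
    print([element.name for element in standalone.elements()])
```

Going through tree.docinfo.internalDTD keeps everything in one file, whereas ET.DTD() is the better fit when the declarations live in a separate .dtd file.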