I'm trying to use lxml to get the elements of an internal DTD, but I can't get it to work. First, here is my XML file (http://validator.w3.org declares it valid):
<?xml
version='1.1'
encoding='utf-8'
?>
<!DOCTYPE root [
<!ATTLIST test
attr (A | B | C) 'B'
>
<!ELEMENT test (#PCDATA)>
<!ELEMENT root (test)*>
]>
<root></root>
But calling lxml.etree.DTD(file = 'test.xml') raises an exception:
Traceback (most recent call last):
  File "./test.py", line 6, in <module>
    lxml.etree.DTD(file = 'test.xml')
  File "dtd.pxi", line 285, in lxml.etree.DTD.__init__ (src/lxml/lxml.etree.c:152121)
lxml.etree.DTDParseError: Content error in the external subset, line 5, column 1
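Maybe lxml.etree.DTD doesn't support internal DTDs, or maybe I'm doing something wrong. I also wanted to try lxml.etree.parse(), but I can't work out which methods the object it returns provides (I looked at the reference for parse(), but it doesn't link to them). The task is simple in theory, but I can't find the information I need.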
Answer 0: (score: 1)
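I'm not sure exactly what you're looking for, but you can find it with an interactive Python interpreter that has tab completion, such as IPython. That's how I found this: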
import lxml.etree as ET
import io
content = '''<?xml
version='1.1'
encoding='utf-8'
?>
<!DOCTYPE root [
<!ATTLIST test
attr (A | B | C) 'B'
>
<!ELEMENT test (#PCDATA)>
<!ELEMENT root (test)*>
]>
<root></root>'''
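# BytesIO needs bytes, so encode the str literal before parsing (required on Python 3)
tree = ET.parse(io.BytesIO(content.encode('utf-8')))
info = tree.docinfo
# DTD object built from the document's internal subset
dtd = info.internalDTD
for elt in dtd.elements():
    print(elt)          # element declaration
    print(elt.content)  # its content model
    print()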
# <lxml.etree._DTDElementDecl object name='test' prefix=None type='mixed' at 0xb73e044c>
# <lxml.etree._DTDElementContentDecl object name=None type='pcdata' occur='once' at 0xb73e04ac>
# <lxml.etree._DTDElementDecl object name='root' prefix=None type='element' at 0xb73e046c>
# <lxml.etree._DTDElementContentDecl object name='test' type='element' occur='mult' at 0xb73e04ac>
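
The same approach should also reach the ATTLIST declarations and let you validate against the internal DTD. The following is only a sketch, not a definitive recipe: the names DOC, attrs, good, bad and standalone are made up for illustration, and the sample document leaves out the XML declaration for brevity. The last step rests on an assumption: the "external subset" wording in the error suggests that lxml.etree.DTD() expects input containing only declarations, as in a standalone .dtd file, rather than a whole XML document.

```python
import io
import lxml.etree as ET

# Illustrative document with the same internal DTD subset as in the question.
DOC = b"""<!DOCTYPE root [
<!ATTLIST test
attr (A | B | C) 'B'
>
<!ELEMENT test (#PCDATA)>
<!ELEMENT root (test)*>
]>
<root></root>"""

tree = ET.parse(io.BytesIO(DOC))
dtd = tree.docinfo.internalDTD  # None if the document has no internal subset

# Collect (type, default value, allowed values) for every declared attribute.
attrs = {}
for elt in dtd.iterelements():
    for attr in elt.iterattributes():
        attrs[(elt.name, attr.name)] = (attr.type, attr.default_value, attr.values())
print(attrs)

# The DTD object can also validate other trees.
good = ET.fromstring(b"<root><test attr='A'>text</test></root>")
bad = ET.fromstring(b"<root><test attr='Z'>text</test></root>")  # 'Z' is not in the enumeration
print(dtd.validate(good), dtd.validate(bad))

# Assumption: DTD() accepts a file-like object holding declarations only.
standalone = ET.DTD(io.BytesIO(
    b"<!ELEMENT root (test)*>"
    b"<!ELEMENT test (#PCDATA)>"
    b"<!ATTLIST test attr (A | B | C) 'B'>"
))
print(standalone.validate(good))
```

If validation fails, dtd.error_log should contain the reasons, which is usually more helpful than the bare boolean.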