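How do I parse the following doctype in XML?

Time: 2013-12-26 05:43:02

Tags: python xml lxml doctype

I have an XML string that contains the following doctype syntax. How can I parse it? I need to be able to get every file name that appears in a SYSTEM declaration.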

'''<xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE config SYSTEM "ncfg_config.dtd"
[
    <!ENTITY vlan_map_type     SYSTEM "types/a.xml">
    <!ENTITY oui_type            SYSTEM "types/b.xml">
    <!ENTITY provisioning_profile  SYSTEM "c.xml">
    <!ENTITY vlan_name_or_list  SYSTEM "types/d.xml">
    <!ENTITY vlan_name_or_num   SYSTEM "types/e.xml">
    <!ENTITY interface_list     SYSTEM "types/f.xml">
    <!ENTITY mac_limit_type     SYSTEM "types/g.xml">
]>'''

2 answers:

Answer 0: (score: 1)

If the input always follows the strict format of your example, a regular expression is the easier route:

import re

xml = '''<xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE config SYSTEM "ncfg_config.dtd"
[
    <!ENTITY vlan_map_type     SYSTEM "types/a.xml">
    <!ENTITY oui_type            SYSTEM "types/b.xml">
    <!ENTITY provisioning_profile  SYSTEM "c.xml">
    <!ENTITY vlan_name_or_list  SYSTEM "types/d.xml">
    <!ENTITY vlan_name_or_num   SYSTEM "types/e.xml">
    <!ENTITY interface_list     SYSTEM "types/f.xml">
    <!ENTITY mac_limit_type     SYSTEM "types/g.xml">
]>'''


# Capture the quoted system identifier of each <!ENTITY ... SYSTEM "..."> line
file_names = re.findall(r'<!ENTITY .* SYSTEM "(.*?)">', xml)
for name in file_names:
    print(name)

Output:

types/a.xml
types/b.xml
c.xml
types/d.xml
types/e.xml
types/f.xml
types/g.xml  
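If the format is not that strict (declarations split across lines, single quotes, internal entities mixed in), a parser-based approach may be more robust than the regular expression. The following is only a sketch using the standard library's `xml.parsers.expat` and its `EntityDeclHandler` callback. It assumes the input is a well-formed document, so the hypothetical `SAMPLE` below starts with a proper `<?xml ... ?>` declaration and ends with a `<config/>` root element, neither of which is present in the string as posted. The function name `external_entity_files` is made up for illustration.

```python
import xml.parsers.expat

# Hypothetical sample: a well-formed variant of the posted string
# (proper XML declaration, plus a root element after the DOCTYPE).
SAMPLE = b'''<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE config SYSTEM "ncfg_config.dtd"
[
    <!ENTITY vlan_map_type     SYSTEM "types/a.xml">
    <!ENTITY oui_type            SYSTEM "types/b.xml">
    <!ENTITY provisioning_profile  SYSTEM "c.xml">
]>
<config/>'''


def external_entity_files(document):
    """Return (entity name, system identifier) pairs declared in the DTD."""
    found = []

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # External entities carry a system identifier; internal ones do not.
        if system_id is not None:
            found.append((name, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(document, True)  # True marks this as the final chunk
    return found


for entity_name, file_name in external_entity_files(SAMPLE):
    print(entity_name, file_name)
```

The external DTD named in the DOCTYPE line itself (`ncfg_config.dtd`) is not reported by this handler, since it is not an entity declaration. If that name is needed too, the parser's `StartDoctypeDeclHandler` callback receives it as its system identifier argument.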

Answer 1: (score: 0)

Have you tried HTMLParser?

Take a look at the Python documentation for it.