Parsing child tags in XML that match a specific string using Python

Date: 2017-03-07 07:46:55

Tags: python xml

I want to parse an XML string that has the tag Topics as the parent tag and Topic1, Topic2 as child tags.

<?xml version="1.0" encoding="UTF-8"?><SignificantDevelopments Major="3" Minor="0" Revision="1" xmlns="urn:reuterscompanycontent:significantdevelopments03"><Topics><Topic1 Code="254">Regulatory / Company Investigation</Topic1><Topic2 Code="207">Mergers &amp; Acquisitions</Topic2><ParentTopic1 Code="6">Litigation / Regulatory</ParentTopic1><ParentTopic2 Code="4">Ownership / Control</ParentTopic2></Topics></SignificantDevelopments>

I just want to parse this XML so that I can get the attribute values of each Topic tag, and I want to do this in a for loop.

I have tried the following code:

    import xml.etree.cElementTree as ET
    tree = ET.ElementTree(file='sample.xml')

    #get the root element
    root = tree.getroot()
    namespace = {'xmlns': 'urn:reuterscompanycontent:significantdevelopments03'}

    for devs in root.findall('xmlns:Topics' ,namespace):
        for child_tags in devs.findall('xmlns:./', namespace):
            print 'child: ', child_tags.tag

I just want to add some kind of wildcard, like Topic\d, on the second-to-last line so that I can parse every tag whose name matches Topic.

1 answer:

Answer 0 (score: 1)

You can check whether each element's tag attribute starts with the namespace followed by the prefix Topic, for example:

    # cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
    from xml.etree import ElementTree as ET
    root = ET.fromstring('<?xml version="1.0" encoding="UTF-8"?><SignificantDevelopments Major="3" Minor="0" Revision="1" xmlns="urn:reuterscompanycontent:significantdevelopments03"><Topics><Topic1 Code="254">Regulatory / Company Investigation</Topic1><Topic2 Code="207">Mergers &amp; Acquisitions</Topic2><ParentTopic1 Code="6">Litigation / Regulatory</ParentTopic1><ParentTopic2 Code="4">Ownership / Control</ParentTopic2></Topics></SignificantDevelopments>')
    # '*/*' selects the grandchildren of the root, i.e. the children of Topics
    topics = [el for el in root.findall('*/*') if el.tag.startswith('{urn:reuterscompanycontent:significantdevelopments03}Topic')]
    for topic in topics:
        print(topic.text)

Or, shorter:

    # cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
    from xml.etree import ElementTree as ET
    root = ET.fromstring('<?xml version="1.0" encoding="UTF-8"?><SignificantDevelopments Major="3" Minor="0" Revision="1" xmlns="urn:reuterscompanycontent:significantdevelopments03"><Topics><Topic1 Code="254">Regulatory / Company Investigation</Topic1><Topic2 Code="207">Mergers &amp; Acquisitions</Topic2><ParentTopic1 Code="6">Litigation / Regulatory</ParentTopic1><ParentTopic2 Code="4">Ownership / Control</ParentTopic2></Topics></SignificantDevelopments>')

    # Filter and iterate in one statement
    for topic in [el for el in root.findall('*/*') if el.tag.startswith('{urn:reuterscompanycontent:significantdevelopments03}Topic')]:
        print(topic.text)

Or put the check in an if statement inside the for loop.
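As a rough sketch of that last variant, the check can sit inside the loop and read the Code attribute the question asks for. The helper name topic_codes, the constants XML, NS and TOPIC_TAG, and the use of a regular expression in place of startswith are illustrative assumptions, not part of the original answer; the regex is one way to get the Topic\d style wildcard from the question.

```python
import re
from xml.etree import ElementTree as ET

# Sample document from the question (XML declaration omitted for brevity).
XML = (
    '<SignificantDevelopments Major="3" Minor="0" Revision="1" '
    'xmlns="urn:reuterscompanycontent:significantdevelopments03">'
    '<Topics>'
    '<Topic1 Code="254">Regulatory / Company Investigation</Topic1>'
    '<Topic2 Code="207">Mergers &amp; Acquisitions</Topic2>'
    '<ParentTopic1 Code="6">Litigation / Regulatory</ParentTopic1>'
    '<ParentTopic2 Code="4">Ownership / Control</ParentTopic2>'
    '</Topics></SignificantDevelopments>'
)

NS = 'urn:reuterscompanycontent:significantdevelopments03'

# ElementTree stores namespaced tags as '{namespace}localname', so the
# pattern matches '{NS}Topic' followed by one or more digits and nothing else.
TOPIC_TAG = re.compile(r'\{' + re.escape(NS) + r'\}Topic\d+$')


def topic_codes(xml_text):
    """Return (Code attribute, text) pairs for Topic<N> elements (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    result = []
    # '*/*' selects the grandchildren of the root, i.e. the children of Topics.
    for el in root.findall('*/*'):
        if TOPIC_TAG.match(el.tag):
            result.append((el.get('Code'), el.text))
    return result


for code, text in topic_codes(XML):
    print(code, text)
```

With this pattern, ParentTopic1 and ParentTopic2 are skipped because their local names do not begin with Topic immediately after the namespace.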