lxml: get all leaf nodes?

Time: 2015-04-10 17:48:19

Tags: python xml lxml

Given an XML file, is there a way to use lxml to get the names and attributes of all leaf nodes?

Here is the XML file of interest:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study>
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <id_info>
    <org_study_id>3370-2(-4)</org_study_id>
    <nct_id>NCT00753818</nct_id>
    <nct_alias>NCT00222157</nct_alias>
  </id_info>
  <brief_title>Developmental Effects of Infant Formula Supplemented With LCPUFA</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>Mead Johnson Nutrition</agency>
      <agency_class>Industry</agency_class>
    </lead_sponsor>
  </sponsors>
  <source>Mead Johnson Nutrition</source>
  <oversight_info>
    <authority>United States: Institutional Review Board</authority>
  </oversight_info>
  <brief_summary>
    <textblock>
      The purpose of this study is to compare the effects on visual development, growth, cognitive
      development, tolerance, and blood chemistry parameters in term infants fed one of four study
      formulas containing various levels of DHA and ARA.
    </textblock>
  </brief_summary>
  <overall_status>Completed</overall_status>
  <phase>N/A</phase>
  <study_type>Interventional</study_type>
  <study_design>N/A</study_design>
  <primary_outcome>
    <measure>visual development</measure>
  </primary_outcome>
  <secondary_outcome>
    <measure>Cognitive development</measure>
  </secondary_outcome>
  <number_of_arms>4</number_of_arms>
  <condition>Cognitive Development</condition>
  <condition>Growth</condition>
  <arm_group>
    <arm_group_label>1</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>2</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>3</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>4</arm_group_label>
    <arm_group_type>Other</arm_group_type>
    <description>Control</description>
  </arm_group>
  <intervention>
    <intervention_type>Other</intervention_type>
    <intervention_name>DHA and ARA</intervention_name>
    <description>various levels of DHA and ARA</description>
    <arm_group_label>1</arm_group_label>
    <arm_group_label>2</arm_group_label>
    <arm_group_label>3</arm_group_label>
  </intervention>
  <intervention>
    <intervention_type>Other</intervention_type>
    <intervention_name>Control</intervention_name>
    <arm_group_label>4</arm_group_label>
  </intervention>
</clinical_study>

What I want is a dictionary like the following:

{
   'id_info_org_study_id': '3370-2(-4)', 
   'id_info_nct_id': 'NCT00753818', 
   'id_info_nct_alias': 'NCT00222157', 
   'brief_title': 'Developmental Effects...'
}

Is this possible with lxml, or with any other Python library?

Update:

I ended up doing this:

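import lxml.etree
import requests

# Excerpt from a method of a larger class; `url` points at the XML document.
response = requests.get(url)
tree = lxml.etree.fromstring(response.content)
mydict = self._recurse_over_nodes(tree, None, {})

def _recurse_over_nodes(self, tree, parent_key, data):
    for branch in tree:
        # Skip comments and processing instructions, whose .tag is not a string.
        if not isinstance(branch.tag, str):
            continue
        key = branch.tag
        if parent_key:
            key = '%s_%s' % (parent_key, key)
        if len(branch):
            # The element has children: recurse, carrying the accumulated key.
            data = self._recurse_over_nodes(branch, key, data)
        elif key in data:
            # Repeated leaf: append its text to the existing value.
            data[key] = data[key] + ', %s' % branch.text
        else:
            data[key] = branch.text
    return data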

3 answers:

Answer 0 (score: 5)

Use the iter method.

http://lxml.de/api/lxml.etree._Element-class.html#iter

Here is a working example.

#!/usr/bin/python
from lxml import etree

xml='''
<book>
    <chapter id="113">

        <sentence id="1" drums='Neil'>
            <word id="128160" bass='Geddy'>
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPV"/>
                <Number type="S"/>
            </word>
            <word id="128161">
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPF"/>
            </word>
        </sentence>

        <sentence id="2">
            <word id="128162">
                <POS Tag="P"/>
                <grammar type="PREFIX"/>
                <Tag Tag="bi+"/>
            </word>
        </sentence>

    </chapter>
</book>
'''

filename='/usr/share/sri/configurations/saved/test1.xml'

if __name__ == '__main__':
    root = etree.fromstring(xml)

    # iter('*') yields every element in the document
    # (comments and processing instructions are skipped)
    for node in root.iter('*'):

        # elements with no child elements are leaf nodes
        if len(node) == 0:
            print(node)

Here is the output:

$ ./verifyXmlWithDirs.py
<Element POS at 0x176dcf8>
<Element grammar at 0x176da70>
<Element Aspect at 0x176dc20>
<Element Number at 0x176dcf8>
<Element POS at 0x176dc20>
<Element grammar at 0x176dcf8>
<Element Aspect at 0x176da70>
<Element POS at 0x176da70>
<Element grammar at 0x176dc20>
<Element Tag at 0x176dcf8>
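
The question asks for the names and attributes of the leaves rather than the element objects themselves. A minimal sketch of that step follows. It is written against the standard library's xml.etree.ElementTree so it runs without extra dependencies; the same `iter('*')`, `tag` and `attrib` are available on lxml elements. The `SAMPLE` string and the `leaf_names_and_attributes` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, much smaller than the XML in the answer above.
SAMPLE = """
<book>
  <chapter id="113">
    <word id="128160"><POS Tag="V"/><grammar type="STEM"/></word>
    <title>Intro</title>
  </chapter>
</book>
"""

def leaf_names_and_attributes(root):
    """Return a (tag, attributes) pair for every element without children."""
    return [
        (node.tag, dict(node.attrib))
        for node in root.iter('*')
        if len(node) == 0
    ]

leaves = leaf_names_and_attributes(ET.fromstring(SAMPLE))
for tag, attrs in leaves:
    print(tag, attrs)
```

Copying `attrib` into a plain dict keeps the result independent of the parsed tree.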

Answer 1 (score: 2)

Assuming you have already called getroot(), something as simple as the following can build a dictionary with what you expect:

import lxml.etree

tree = lxml.etree.parse('sample_ctgov.xml')
root = tree.getroot()

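d = {}
for node in root:
    # Skip comments and processing instructions, whose .tag is not a string.
    if not isinstance(node.tag, str):
        continue
    if len(node):
        for child in node:
            # Build each key from the parent tag only, so that keys do not
            # accumulate the tags of earlier siblings.
            d[node.tag + '_' + child.tag] = child.text
    else:
        d[node.tag] = node.text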

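That should do the trick. It is not optimized and does not search all child nodes recursively, but it shows you where to start.

Answer 2 (score: 1)

Try this: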
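
For nesting deeper than two levels, one possible extension is to walk the tree with an explicit stack instead of recursion. This is only a sketch: `flatten_leaves` and the abbreviated `SAMPLE` are hypothetical names, and it uses the standard library's xml.etree.ElementTree, which drops comments while parsing. With lxml, comment nodes would have to be skipped first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated sample in the shape of the question's XML.
SAMPLE = """<clinical_study>
  <id_info>
    <org_study_id>3370-2(-4)</org_study_id>
    <nct_id>NCT00753818</nct_id>
  </id_info>
  <brief_title>Developmental Effects</brief_title>
</clinical_study>"""

def flatten_leaves(root):
    """Map underscore-joined tag paths (root tag excluded) to leaf text."""
    result = {}
    # Each stack entry pairs an element with the key prefix of its parent.
    stack = [(child, '') for child in reversed(list(root))]
    while stack:
        node, prefix = stack.pop()
        key = prefix + '_' + node.tag if prefix else node.tag
        if len(node):
            # Push children in reverse so they are popped in document order.
            stack.extend((child, key) for child in reversed(list(node)))
        else:
            # Leaf element; a repeated key overwrites the earlier value.
            result[key] = node.text
    return result

flat = flatten_leaves(ET.fromstring(SAMPLE))
print(flat)
```

Leaving the root tag out of the key matches the dictionary shape requested in the question.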

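from xml.etree import ElementTree

def crawl(root, prefix='', memo=None):
    # Use None as the default so results are not shared between calls.
    if memo is None:
        memo = {}
    new_prefix = root.tag
    if prefix:
        new_prefix = prefix + "_" + new_prefix
    # Iterate the element directly; getchildren() was removed in Python 3.9.
    for child in root:
        crawl(child, new_prefix, memo)
    if len(root) == 0:
        # Leaf element: store its text under the accumulated tag path.
        memo[new_prefix] = root.text
    return memo

e = ElementTree.parse("data.xml")
nodes = crawl(e.getroot())
for k, v in nodes.items():
    print(k, v)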

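crawl initially takes the root of the XML tree. It then walks all the children recursively, keeping track of every tag it passed through to get there (that is what the prefix is for). When it finally finds an element with no children, it saves that element's text in memo.

Partial output: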

clinical_study_intervention_intervention_name Control
clinical_study_phase N/A
clinical_study_arm_group_arm_group_type Other
clinical_study_id_info_nct_id NCT00753818
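
Note that crawl stores a single value per key, so tags that repeat under the same parent (for example the two condition elements in the question's XML) overwrite one another. A possible variant that keeps every value in a list is sketched below. `crawl_all` and the shortened `SAMPLE` are hypothetical and not part of the original answer.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Hypothetical excerpt with a repeated leaf tag.
SAMPLE = (
    "<clinical_study>"
    "<condition>Cognitive Development</condition>"
    "<condition>Growth</condition>"
    "<phase>N/A</phase>"
    "</clinical_study>"
)

def crawl_all(root, prefix='', memo=None):
    """Like crawl, but collect the text of repeated leaves in a list."""
    if memo is None:
        memo = defaultdict(list)
    path = prefix + '_' + root.tag if prefix else root.tag
    for child in root:
        crawl_all(child, path, memo)
    if len(root) == 0:
        # Append instead of assigning, so earlier values are kept.
        memo[path].append(root.text)
    return memo

values = crawl_all(ET.fromstring(SAMPLE))
for key, texts in values.items():
    print(key, texts)
```

Whether to join repeated values into one string, as the asker's update does, or keep them as a list depends on how the dictionary is consumed afterwards.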