Parsing complex nested XML in Python

Time: 2020-10-10 17:50:18

Tags: python xml-parsing elementtree

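I am trying to parse a very complex XML file and store its contents in a dataframe. I tried xml.etree.ElementTree and managed to retrieve some elements, but each one is somehow retrieved multiple times, as if there were more objects than there really are. I am trying to extract the following: category, created, last_updated, accession type, name type identifier, and name type synonym (as a list).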

<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">
  <accession-list>
    <accession type="primary">CVCL_B375</accession>
  </accession-list>
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
    <name type="synonym">Mab7</name>
  </name-list>
  <comment-list>
    <comment category="Monoclonal antibody target"> Cronartium ribicola antigens </comment>
    <comment category="Monoclonal antibody isotype"> IgM, kappa </comment>
  </comment-list>
  <species-list>
    <cv-term terminology="NCBI-Taxonomy" accession="10090">Mus musculus</cv-term>
  </species-list>
  <derived-from>
    <cv-term terminology="Cellosaurus" accession="CVCL_4032">P3X63Ag8.653</cv-term>
  </derived-from>
  <reference-list>
    <reference resource-internal-ref="Patent=US5616470"/>
  </reference-list>
  <xref-list>
    <xref database="CLO" category="Ontologies" accession="CLO_0001018">
      <url><![CDATA[https://www.ebi.ac.uk/ols/ontologies/clo/terms?iri=http://purl.obolibrary.org/obo/CLO_0001018]]></url>
    </xref>
    <xref database="ATCC" category="Cell line collections" accession="HB-12029">
      <url><![CDATA[https://www.atcc.org/Products/All/HB-12029.aspx]]></url>
    </xref>
    <xref database="Wikidata" category="Other" accession="Q54422073">
      <url><![CDATA[https://www.wikidata.org/wiki/Q54422073]]></url>
    </xref>
  </xref-list>
</cell-line>
</cellosaurus>

3 answers:

Answer 0 (score: 2)

IMHO, the simplest way to parse XML is to use lxml.

from lxml import etree
data = """[your xml above]"""
doc = etree.XML(data)
for att in doc.xpath('//cell-line'):
    print(att.attrib['category'])
    print(att.attrib['last_updated'])
    print(att.xpath('.//accession/@type')[0])
    print(att.xpath('.//name[@type="identifier"]/text()')[0])
    print(att.xpath('.//name[@type="synonym"]/text()'))

Output:

Hybridoma
2020-03-12
primary
#490
['490', 'Mab 7', 'Mab7']

You can then assign the output to variables, append it to a list, and so on.
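As a rough sketch of that last step, the values can be collected into one dict per cell-line instead of being printed. This version uses the standard-library xml.etree.ElementTree rather than lxml so that it runs without extra dependencies; the find, findall, findtext and get calls used here are also available on lxml elements. The helper name cell_lines_to_rows and the dict keys are made up for illustration, and the XML is a trimmed copy of the sample from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question.
XML = """<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">
  <accession-list>
    <accession type="primary">CVCL_B375</accession>
  </accession-list>
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
    <name type="synonym">Mab7</name>
  </name-list>
</cell-line>
</cellosaurus>"""


def cell_lines_to_rows(xml_text):
    """Return one dict per <cell-line> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for cell in root.iter("cell-line"):
        accession = cell.find(".//accession")
        rows.append({
            "category": cell.get("category"),
            "created": cell.get("created"),
            "last_updated": cell.get("last_updated"),
            # Guard against entries that have no <accession> element.
            "accession_type": accession.get("type") if accession is not None else None,
            "identifier": cell.findtext(".//name[@type='identifier']"),
            "synonyms": [n.text for n in cell.findall(".//name[@type='synonym']")],
        })
    return rows


rows = cell_lines_to_rows(XML)
print(rows)
```

A list of dicts like this can be passed straight to pandas.DataFrame(rows), which produces one column per key; the synonyms column then holds a Python list in each cell.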

Answer 1 (score: 1)

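Your question is not entirely clear, because in some cases you want to parse tag attributes and in others you want to parse tag values.

My understanding is as follows. You need these values:

  1. The value of the category attribute of the cell-line tag.
  2. The value of the created attribute of the cell-line tag.
  3. The value of the last_updated attribute of the cell-line tag.
  4. The value of the type attribute of the accession tag.
  5. The text of the name tag whose type attribute is identifier.
  6. The text of each name tag whose type attribute is synonym.

These values can be extracted from the XML file with the xml.etree.ElementTree module. Note in particular the findall and iter methods of the Element class.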
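To make the difference between those two methods concrete, here is a minimal sketch on a cut-down version of the question's XML (the SAMPLE string and variable names are only for illustration). findall with a bare tag name looks at direct children only, findall with a leading .// searches all descendants, and iter walks the element itself plus every descendant in document order.

```python
import xml.etree.ElementTree as ET

# Cut-down sample, just enough nesting to show the difference.
SAMPLE = """
<cellosaurus>
  <cell-line category="Hybridoma">
    <name-list>
      <name type="identifier">#490</name>
      <name type="synonym">490</name>
    </name-list>
  </cell-line>
</cellosaurus>
"""

root = ET.fromstring(SAMPLE)
cell_line = root.find("cell-line")

# A bare tag name matches direct children only; <name> is nested one
# level deeper inside <name-list>, so nothing is found here.
direct = cell_line.findall("name")

# A leading './/' searches every descendant of the current element.
nested = cell_line.findall(".//name")

# iter() with no argument yields the element itself and all descendants.
walked = [el.tag for el in cell_line.iter()]

print(direct)
print([n.text for n in nested])
print(walked)
```

Because iter() visits every descendant, code that appends a row inside such a loop will produce one row per visited element. That may be one cause of the duplicated results described in the question, although the asker's code is not shown, so this is only a guess.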

Assuming the XML is in a file named cellosaurus.xml, the following snippet does the job.

import xml.etree.ElementTree as et


def main():
    tree = et.parse('cellosaurus.xml')
    root = tree.getroot()

    results = []
    for element in root.findall('.//cell-line'):
        # Start with an empty synonym list so every entry has one.
        key_values = {'name type synonym': []}
        for key in ['category', 'created', 'last_updated']:
            # Element.get returns None instead of raising if the attribute is absent.
            key_values[key] = element.get(key)
        for child in element.iter():
            if child.tag == 'accession':
                key_values['accession type'] = child.get('type')
            elif child.tag == 'name' and child.get('type') == 'identifier':
                key_values['name type identifier'] = child.text
            elif child.tag == 'name' and child.get('type') == 'synonym':
                # An entry can have several synonyms, so collect them all
                # instead of overwriting the previous one.
                key_values['name type synonym'].append(child.text)
        results.append([
            # Using the get method of the dict object in case any particular
            # entry does not have all the required attributes.
            key_values.get('category'),
            key_values.get('created'),
            key_values.get('last_updated'),
            key_values.get('accession type'),
            key_values.get('name type identifier'),
            key_values.get('name type synonym'),
        ])

    print(results)


if __name__ == '__main__':
    main()

Answer 2 (score: 1)

Another approach. I recently compared several XML parsing libraries and found simplified_scrapy easy to use. I recommend it.

from simplified_scrapy import SimplifiedDoc, utils

xml = '''your xml above'''
# xml = utils.getFileContent('your file name.xml')

results = []
doc = SimplifiedDoc(xml)
for ele in doc.selects('cell-line'):
  key_values = {}
  for k in ele:
    if k not in ['tag','html']:
      key_values[k]=ele[k]
  key_values['name type identifier'] = ele.select('name@type="identifier">text()')
  key_values['name type synonym'] = ele.selects('name@type="synonym">text()')
  results.append(key_values)
print (results)

Result:

[{'category': 'Hybridoma', 'created': '2012-06-06', 'last_updated': '2020-03-12', 'entry_version': '6', 'name type identifier': '#490', 'name type synonym': ['490', 'Mab 7', 'Mab7']}]
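Since the goal in the question is a dataframe, note that the synonym list above would end up as a single list-valued cell. If one row per synonym is preferred, the records can be flattened first. The following is only a sketch in plain Python; the helper name explode_synonyms is invented for illustration, and the input is the result shown above.

```python
def explode_synonyms(records, key="name type synonym"):
    """Return one row per synonym, copying the other fields (hypothetical helper)."""
    long_rows = []
    for record in records:
        base = {k: v for k, v in record.items() if k != key}
        # Keep entries without synonyms as a single row with None.
        synonyms = record.get(key) or [None]
        for synonym in synonyms:
            long_rows.append({**base, key: synonym})
    return long_rows


records = [{'category': 'Hybridoma', 'created': '2012-06-06',
            'last_updated': '2020-03-12', 'entry_version': '6',
            'name type identifier': '#490',
            'name type synonym': ['490', 'Mab 7', 'Mab7']}]

long_rows = explode_synonyms(records)
for row in long_rows:
    print(row['name type identifier'], row['name type synonym'])
```

With pandas, the same reshaping is available after loading the records, via DataFrame.explode on the synonym column.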