Most efficient way to extract data from XML using lxml

Asked: 2013-08-16 02:02:32

Tags: python xml xpath lxml python-3.3

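I have the following fragment of a large XML file. I want to extract the elements in a specific namespace, for example xmlns:dc="http://purl.org/dc/elements/1.1/". At the moment I can do this: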

from lxml import etree

tree = etree.parse(file)
# iter() replaces the deprecated getiterator()
for element in tree.iter('{http://www.openarchives.org/OAI/2.0/}record'):
    for leaf in element.iter('{http://purl.org/dc/elements/1.1/}subject'):
        print(leaf)

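The problem is that I want to get the data for several tags in the {http://purl.org/dc/elements/1.1/} namespace. I would also like to simplify things and have been looking into XPath, but I can't seem to figure it out. Can I use XPath here and, more importantly, would it be better suited to my goal?

Here is the XML: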

<?xml version="1.0" encoding="UTF-8" ?>



<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
 http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2013-08-15T23:24:55Z</responseDate>
<request verb="ListRecords" resumptionToken="0/500/121403/nsdl_dc/null/null/null">http://nsdldev.org/oai</request>

<!-- Showing records 501 through 1000 out of 121403 total  -->

<ListRecords>


  <record>
    <header>
      <identifier>oai:nsdl.org:2200/20110926115158975T</identifier>
      <datestamp>2013-05-29T16:44:49Z</datestamp>
       <setSpec>ncs-NSDL-COLLECTION-000-003-112-056</setSpec>
      </header>
    <metadata>
    <nsdl_dc:nsdl_dc xmlns:nsdl_dc="http://ns.nsdl.org/nsdl_dc_v1.02/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/"
                 xmlns:dct="http://purl.org/dc/terms/"
                 xmlns:lar="http://ns.nsdl.org/schemas/dc/lar"
                 xmlns:ieee="http://www.ieee.org/xsd/LOMv1p0"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 schemaVersion="1.02.020"
                 xsi:schemaLocation="http://ns.nsdl.org/nsdl_dc_v1.02/ http://ns.nsdl.org/schemas/nsdl_dc/nsdl_dc_v1.02.xsd">
   <lar:readiness xsi:type="lar:Ready">Fully ready</lar:readiness>
   <dc:identifier xsi:type="dct:URI">http://www.exo.net/~emuller/activities/Hot%20Sauce%20Hot%20Spots.pdf</dc:identifier>
   <dc:relation xsi:type="nsdl_dc:NSDLPartnerURL">http://howtosmile.org/record/4427</dc:relation>
   <dc:title>Hot Sauce Hot Spots</dc:title>
   <dc:description>In this activity, learners model hot spot island formation, orientation and progression with condiments. Learners squirt a thick condiment sauce on a coarsely woven fabric to model how volcanic island hot spots form.</dc:description>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Oceanography</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Anthropology</dc:subject>
   <dc:subject>Physical science</dc:subject>
   <dc:subject>Physics</dc:subject>
   <dc:subject>General science</dc:subject>
   <dc:subject>hot spot island</dc:subject>
   <dc:subject>volcano</dc:subject>
   <dc:subject>tectonic plates</dc:subject>
   <dc:subject>Earth</dc:subject>
   <dc:subject>molten</dc:subject>
   <dc:subject>magma</dc:subject>
   <dc:subject>eruption</dc:subject>
   <dc:subject>undersea</dc:subject>
   <dc:subject>ocean</dc:subject>
   <dc:subject>island</dc:subject>
   <dc:subject>Earth Processes</dc:subject>
   <dc:subject>Volcanoes and Plate Tectonics</dc:subject>
   <dc:subject>Earth Structure</dc:subject>
   <dc:subject>Rocks and Minerals</dc:subject>
   <dc:subject>Oceans and Water</dc:subject>
   <dc:subject>Geologic Time</dc:subject>
   <dc:subject>Heat and Temperature</dc:subject>
   <dc:subject>Conducting Investigations</dc:subject>
   <dc:language>en-US</dc:language>
   <dc:format>application/pdf</dc:format>
   <lar:accessMode xsi:type="lar:ModeAcc">visual</lar:accessMode>
   <lar:accessMode xsi:type="lar:ModeAcc">tactile</lar:accessMode>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Upper Elementary</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Middle School</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">High School</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Informal Education</dct:educationLevel>
   <dct:audience xsi:type="nsdl_dc:NSDLAudience">Learner</dct:audience>
   <dc:type xsi:type="nsdl_dc:NSDLType">Activity</dc:type>
   <dc:type xsi:type="nsdl_dc:NSDLType">Model</dc:type>
   <dct:isPartOf>http://www.exo.net/~emuller/activities/index.html</dct:isPartOf>
   <dc:date xsi:type="dct:W3CDTF">2007</dc:date>
   <dc:creator>Eric Muller</dc:creator>
   <dc:contributor>The Exploratorium</dc:contributor>
   <dct:accessRights xsi:type="nsdl_dc:NSDLAccess">Free access</dct:accessRights>
   <dc:rights>Copyright 2007 Do Science</dc:rights>
   <dct:license>Owner license</dct:license>
   <lar:licenseProperty xsi:type="lar:LicProp">Terms of use unknown</lar:licenseProperty>
   <dct:rightsHolder>Do Science</dct:rightsHolder>
   <lar:metadataTerms>The following entity, University Corporation for Atmospheric Research (UCAR), has claims on the use of this metadata. This claim is as follows: The National Science Digital Library (NSDL), located at the University Corporation for Atmospheric Research (UCAR), provides these metadata terms: These data and metadata may not be reproduced, duplicated, copied, sold, or otherwise exploited for any commercial purpose that is not expressly permitted by NSDL. The entity provided more information at: http://nsdl.org/help/terms-of-use</lar:metadataTerms>
   <lar:metadataTerms>The National Science Digital Library (NSDL), located at the University Corporation for Atmospheric Research (UCAR), provides these metadata terms: These data and metadata may not be reproduced, duplicated, copied, sold, or otherwise exploited for any commercial purpose that is not expressly permitted by NSDL. More information is available at: http://nsdl.org/help/terms-of-use.</lar:metadataTerms>
</nsdl_dc:nsdl_dc>

    </metadata>
  </record>

2 answers:

Answer 0 (score: 2)

It's not clear exactly what you want to access, but try something like the following:

from lxml import etree
doc = etree.parse(xmlfile)
ns = {'dc': 'http://purl.org/dc/elements/1.1/',
      'oai': 'http://www.openarchives.org/OAI/2.0/'}
doc.xpath('//dc:subject', namespaces=ns)  # get all of the dc:subjects
doc.xpath('//dc:*', namespaces=ns)  # get all elements in the dc: namespace
# more specific path
doc.xpath('/oai:OAI-PMH/oai:ListRecords/oai:record/oai:metadata/*/dc:*', namespaces=ns)
# the prefix mapping is needed on every call that uses a prefix
x = doc.xpath('/oai:OAI-PMH/oai:ListRecords/oai:record/oai:metadata/*', namespaces=ns)
x[0].xpath('*[contains(.,"Geo")]')  # you can also call xpath from non-document nodes
x[0].xpath('dc:subject/text()', namespaces=ns)  # get the text of the dc:subjects

Read some documentation on XPath beyond the Python or lxml docs. Those tell you how to use XPath from Python, but they are not really an XPath tutorial.
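As a rough sketch of how the XPath calls above could be combined to collect several dc: tags per record: the sample document below is a trimmed-down, hypothetical stand-in for the XML in the question, and the helper name `extract_dc_fields` and its `wanted` parameter are illustrative assumptions, not part of lxml.

```python
from collections import defaultdict

from lxml import etree

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "oai": "http://www.openarchives.org/OAI/2.0/",
}

# Hypothetical, trimmed-down version of the XML in the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <nsdl_dc:nsdl_dc xmlns:nsdl_dc="http://ns.nsdl.org/nsdl_dc_v1.02/"
                         xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hot Sauce Hot Spots</dc:title>
          <dc:subject>Geoscience</dc:subject>
          <dc:subject>Oceanography</dc:subject>
          <dc:language>en-US</dc:language>
        </nsdl_dc:nsdl_dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_dc_fields(root, wanted=("title", "subject")):
    """Return one dict per record, mapping each wanted dc: tag to its texts."""
    records = []
    for record in root.xpath("//oai:record", namespaces=NS):
        fields = defaultdict(list)
        # A single XPath call selects every dc: element below this record.
        for el in record.xpath(".//dc:*", namespaces=NS):
            local = etree.QName(el).localname  # tag name without the namespace
            if local in wanted:
                fields[local].append(el.text)
        records.append(dict(fields))
    return records


result = extract_dc_fields(etree.fromstring(SAMPLE))
print(result)
```

Filtering on the local name in Python keeps the XPath expression short; an XPath union such as `.//dc:title | .//dc:subject` would be an alternative when only a few tags are needed.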

Note that the find() and findall() methods take ElementPath expressions, which are a limited subset of XPath-like expressions.
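To illustrate the ElementPath side, here is a small sketch using a hypothetical sample fragment. It assumes lxml is installed and otherwise falls back to the standard library's ElementTree, which accepts the same expressions for these two calls.

```python
try:
    from lxml import etree
except ImportError:  # assumption: the stdlib accepts the same ElementPath here
    import xml.etree.ElementTree as etree

NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical fragment modelled on the metadata block in the question.
SAMPLE = b"""<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Hot Sauce Hot Spots</dc:title>
  <dc:subject>Geoscience</dc:subject>
  <dc:subject>Oceanography</dc:subject>
</metadata>"""

root = etree.fromstring(SAMPLE)

# Prefix form: the namespaces mapping resolves "dc:" for ElementPath.
subjects = [el.text for el in root.findall(".//dc:subject", namespaces=NS)]

# Clark notation ({uri}tag) needs no mapping at all.
title = root.findtext("{http://purl.org/dc/elements/1.1/}title")

print(subjects, title)
```

ElementPath covers simple child and descendant lookups like these; expressions that need functions such as `contains()` or `text()` require `xpath()` instead.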

Answer 1 (score: 0)

for element in tree.findall(".//{http://purl.org/dc/elements/1.1/}subject"):
    print(element)
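Because the question mentions a large file, a streaming variant may also be worth considering. The following is only a sketch, not something proposed in the answers: it uses lxml's `iterparse` with its `tag` argument, and the sample data and the helper name `iter_subjects` are hypothetical.

```python
import io

from lxml import etree

DC = "http://purl.org/dc/elements/1.1/"
OAI = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical miniature of the OAI-PMH response in the question.
SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record><metadata><dc:subject>Geoscience</dc:subject></metadata></record>
    <record><metadata><dc:subject>Physics</dc:subject></metadata></record>
  </ListRecords>
</OAI-PMH>"""


def iter_subjects(source):
    """Yield the dc:subject texts of each record, one record at a time."""
    # tag= restricts the reported "end" events to <record> elements.
    for _event, record in etree.iterparse(
        source, events=("end",), tag="{%s}record" % OAI
    ):
        yield [el.text for el in record.iter("{%s}subject" % DC)]
        record.clear()  # drop the record's children once they have been read


subjects_per_record = list(iter_subjects(io.BytesIO(SAMPLE)))
print(subjects_per_record)
```

Clearing each record after use should keep memory lower than parsing the whole tree up front, though whether it is actually faster for a given file would need to be measured.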