How do I use lxml to get specific parts of an XML document?

Asked: 2010-12-16 00:38:44

Tags: python lxml

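I'm using Amazon's API to retrieve information about books. I'm trying to use lxml to extract the specific parts of the XML document that my application needs. However, I'm not quite sure how to use lxml. This is what I have so far: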

root = etree.XML(response)

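This creates an etree object for the XML document.

This is what the XML document looks like: http://pastebin.com/GziDkf1a There are actually multiple "Item" elements, but I pasted only one of them to give you a concrete example. For each item, I want to extract the title and the ISBN. How do I do that with the etree object I have?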

<ItemSearchResponse><Items><Item><ItemAttributes><Title>I want this info</Title></ItemAttributes></Item></Items></ItemSearchResponse>

<ItemSearchResponse><Items><Item><ItemAttributes><ISBN>And I want this info</ISBN></ItemAttributes></Item></Items></ItemSearchResponse>

Basically, I don't know how to traverse the tree using my etree object, and I'd like to learn how.

Edit 1: I'm trying the following code:

tree = etree.fromstring(response)
for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    print(item)
    print(item.items()) # Apparently, there is nothing in item.items()
    for key, value in item.items():
        print(key)
        print(value)

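But I get the following output: http://dpaste.com/287496/

I added print(item.items()), and it appears to be just an empty list. Each item is an Element, but for some reason they have no items.

Edit 2: I can get the information I want with the following code, but it seems lxml must offer a simpler way... (this approach doesn't look very efficient):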

for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    title_text = ""
    author_text = ""
    isbn_text = ""
    for isbn in item.iterfind(".//"+AMAZON_NS+"ISBN"):
        isbn_text = isbn.text
    for title in item.iterfind(".//"+AMAZON_NS+"Title"):
        title_text = title.text
    for author in item.iterfind(".//"+AMAZON_NS+"Author"):
        author_text = author.text
    print(title_text + " by " + author_text + " has ISBN: " + isbn_text)

4 Answers:

Answer 0 (score: 2)

Tested; this works with both lxml.etree and xml.etree.cElementTree on Python 2.7.1.

import lxml.etree as ET
# Also works with cElementTree (included in recent standard CPythons).
# Use this import:
# import xml.etree.cElementTree as ET
t = ET.fromstring(xmlstring) # your data -- with 2 missing tags added at the end :-)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
# Find all ItemAttributes elements.
for ia in t.iter(AMAZON_NS+'ItemAttributes'):
    # An ItemAttributes element has *children* named ISBN, Title, Author, etc.
    # NOTE WELL: *children* not *attributes*
    for tag in ('ISBN', 'Title'):
        # Find the first child with that name ...
        elem = ia.find(AMAZON_NS+tag)
        print("%s: %r" % (tag, elem.text))

Output:

ISBN: '0534950973'
Title: 'Introduction to the Theory of Computation'
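
The "children, not attributes" point in the comments above is what explains the empty item.items() list in the question. A minimal sketch of the difference follows. The sample XML is a made-up fragment without a namespace, and the Height element with its Units attribute is invented purely for illustration; the import falls back to the standard library when lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:  # fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as ET

# Hypothetical fragment; Height and its Units attribute are invented for this sketch.
SAMPLE = (
    '<ItemAttributes>'
    '<ISBN>0534950973</ISBN>'
    '<Height Units="inches">9.5</Height>'
    '</ItemAttributes>'
)

ia = ET.fromstring(SAMPLE)

# items() lists the XML attributes of the element itself; ItemAttributes has none.
attrs_of_parent = list(ia.items())

# Iterating over an element yields its child elements.
child_tags = [child.tag for child in ia]

# A child that does carry an attribute reports it through items().
height_attrs = list(ia.find('Height').items())

print(attrs_of_parent)
print(child_tags)
print(height_attrs)
```

So items() is the right call only when the data lives in attributes such as Units; for nested elements like ISBN or Title, iterate over the element or use find().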

If you want to build a dictionary of all the children of each ItemAttributes node, only a small change is needed:

import lxml.etree as ET
# Also works with cElementTree (included in recent standard CPythons).
# Use this import:
# import xml.etree.cElementTree as ET
from pprint import pprint as pp
t = ET.fromstring(xmlstring)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
TAGPOS = len(AMAZON_NS)
# Find all ItemAttributes elements.
for ia in t.iter(AMAZON_NS+'ItemAttributes'):
    item = {}
    # Iterate over all the children of the ItemAttributes node
    for elem in ia:
        # remove namespace stuff from key, remove extraneous whitespace from value
        # (elem.text is None for an empty element, hence the "or")
        item[elem.tag[TAGPOS:]] = (elem.text or "").strip()
    pp(item)

The output is:

{'Author': 'Michael Sipser',
 'Binding': 'Hardcover',
 'DeweyDecimalNumber': '511.35',
 'EAN': '9780534950972',
 'Edition': '2',
 'ISBN': '0534950973',
 'IsEligibleForTradeIn': '1',
 'Label': 'Course Technology',
 'Languages': '',
 'ListPrice': '',
 'Manufacturer': 'Course Technology',
 'NumberOfItems': '1',
 'NumberOfPages': '400',
 'PackageDimensions': '',
 'ProductGroup': 'Book',
 'ProductTypeName': 'ABIS_BOOK',
 'PublicationDate': '2005-02-15',
 'Publisher': 'Course Technology',
 'Studio': 'Course Technology',
 'Title': 'Introduction to the Theory of Computation',
 'TradeInValue': ''}
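
If only the title and ISBN are needed, as in the question, findtext() may be a shorter route than looping over matches. The following is a sketch under assumptions: the sample response is trimmed to the structure shown in the question, its second item is a made-up placeholder, and titles_and_isbns is a hypothetical helper name.

```python
try:
    from lxml import etree as ET
except ImportError:  # fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as ET

AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# Trimmed sample; the second item is a made-up placeholder with no ISBN.
SAMPLE = """<ItemSearchResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">
  <Items>
    <Item><ItemAttributes>
      <Author>Michael Sipser</Author>
      <ISBN>0534950973</ISBN>
      <Title>Introduction to the Theory of Computation</Title>
    </ItemAttributes></Item>
    <Item><ItemAttributes>
      <Title>Placeholder title without an ISBN</Title>
    </ItemAttributes></Item>
  </Items>
</ItemSearchResponse>"""


def titles_and_isbns(xml_text):
    """Return one (title, isbn) tuple per ItemAttributes element."""
    root = ET.fromstring(xml_text)
    results = []
    for ia in root.iter(AMAZON_NS + "ItemAttributes"):
        # findtext() returns the text of the first matching child,
        # or the default when no such child exists.
        title = ia.findtext(AMAZON_NS + "Title", default="")
        isbn = ia.findtext(AMAZON_NS + "ISBN", default="")
        results.append((title, isbn))
    return results


for title, isbn in titles_and_isbns(SAMPLE):
    print("%s has ISBN: %s" % (title, isbn or "(none)"))
```

Supplying a default keeps a missing child from turning into None, which would otherwise break string concatenation later on.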

Answer 1 (score: 1)

I'd suggest trying pyaws first; then you don't have to worry about XML parsing at all. Otherwise, you could use something like:

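from lxml import etree

# etree.parse() expects a filename or file-like object; use etree.fromstring() for a string
tree = etree.parse(xmlResponse)
# the response is namespaced, so the XPath needs a prefix mapped to Amazon's namespace
ns = {'a': 'http://webservices.amazon.com/AWSECommerceService/2009-10-01'}
tree.xpath('//a:ISBN', namespaces=ns)[0].text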
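
Amazon's response puts every element in a default namespace, so an unqualified path such as //ISBN matches nothing. One way to handle this is a prefix-to-URI mapping, which findall() accepts through its namespaces argument (lxml's xpath() takes the same kind of mapping). The sketch below uses findall() so that it also runs on the standard library; the prefix "aws" is an arbitrary choice and the sample XML is a trimmed, assumed version of the response.

```python
try:
    from lxml import etree as ET
except ImportError:  # fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as ET

# "aws" is an arbitrary prefix chosen for this sketch; only the URI must match the document.
NSMAP = {"aws": "http://webservices.amazon.com/AWSECommerceService/2009-10-01"}

# Trimmed sample modeled on the structure shown in the question.
SAMPLE = """<ItemSearchResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">
  <Items><Item><ItemAttributes>
    <ISBN>0534950973</ISBN>
    <Title>Introduction to the Theory of Computation</Title>
  </ItemAttributes></Item></Items>
</ItemSearchResponse>"""

root = ET.fromstring(SAMPLE)

# Qualified search: the prefix is resolved through NSMAP.
isbns = [elem.text for elem in root.findall(".//aws:ISBN", namespaces=NSMAP)]

# Unqualified search: finds nothing, because ISBN lives in the Amazon namespace.
unqualified = root.findall(".//ISBN")

print(isbns)
print(len(unqualified))
```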

Answer 2 (score: 1)

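from lxml import etree
root = etree.XML("YourXMLData")
# Amazon's response is namespaced, so the namespace must be part of each tag name
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
items = root.findall('.//' + AMAZON_NS + 'ItemAttributes')
for eachitem in items:
    # iterate over the child elements; eachitem.items() would list XML attributes, which are empty here
    for child in eachitem:
        if child.tag == AMAZON_NS + 'ISBN':
            pass  # Do your stuff
        if child.tag == AMAZON_NS + 'Title':
            pass  # Do your stuff

That is one way to do it. Instead of loading the XML as a string, you can also use the parse method. The key point is to use find and its friends to get to your specific node, and then iterate over that node's children.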

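Answer 3 (score: 1)

Since you receive the whole response as one large XML string, you can use lxml's fromstring function to parse it into an element tree (it returns the root Element). You can then use findall (or, since you want to iterate over the results, iterfind). There is one catch: Amazon's XML response is namespaced, so you have to take that into account for lxml to search it correctly. Something like this should do the trick: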

from lxml import etree

root=etree.fromstring(responseFromAmazon)

# this creates a constant with the namespace in the form that lxml can use it
AMAZON_NS="{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# this searches the tree and iterates over results, taking the namespace into account
for eachitem in root.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    # iterate over the child elements; eachitem.items() would list XML attributes, of which there are none
    for child in eachitem:
        if child.tag == AMAZON_NS+"ISBN":
            pass  # Do your stuff
        if child.tag == AMAZON_NS+"Title":
            pass  # Do your stuff
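
The namespace constant above hard-codes the API version date. If that date might change, one option is to read the namespace off the root element's tag, since both lxml and ElementTree store tags as "{uri}localname". This is only a sketch: namespace_of is a hypothetical helper, and the sample XML is a trimmed, assumed version of the response.

```python
try:
    from lxml import etree as ET
except ImportError:  # fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as ET

# Trimmed sample modeled on the structure shown in the question.
SAMPLE = """<ItemSearchResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">
  <Items><Item><ItemAttributes>
    <ISBN>0534950973</ISBN>
    <Title>Introduction to the Theory of Computation</Title>
  </ItemAttributes></Item></Items>
</ItemSearchResponse>"""


def namespace_of(element):
    """Return the '{uri}' part of element.tag, or '' if the tag has no namespace."""
    tag = element.tag
    if tag.startswith("{"):
        return tag[: tag.index("}") + 1]
    return ""


root = ET.fromstring(SAMPLE)
ns = namespace_of(root)

# Same search as with a hard-coded constant, but using the derived namespace.
titles = [elem.text for elem in root.iterfind(".//" + ns + "Title")]

print(ns)
print(titles)
```

This assumes the elements of interest share the root element's namespace, which holds for the sample here but is worth checking against a real response.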

Edit 1

See if this works better:

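root=etree.fromstring(responseFromAmazon)
AMAZON_NS="{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
item={}
for attr in root.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    # store every child (not just attr[0]) under its tag name with the namespace stripped;
    # with several items in the response, later ones overwrite earlier ones
    for child in attr:
        item[child.tag.replace(AMAZON_NS,"")]=child.text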

You can then access item["Title"], item["ISBN"], and so on as needed.