Question

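I'm using Amazon's API to retrieve information about books. I'm trying to use lxml to extract the specific parts of the XML document that my application needs, but I'm not really sure how to use lxml. This is as far as I've gotten: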

root = etree.XML(response)

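This creates an etree object for the XML document.

This is what the XML document looks like: http://pastebin.com/GziDkf1a There are actually multiple "Item" elements, but I only pasted one of them to give you a concrete example. For each item, I want to extract the title and the ISBN. How do I do that with the etree object I have?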

<ItemSearchResponse><Items><Item><ItemAttributes><Title>I want this info</Title></ItemAttributes></Item></Items></ItemSearchResponse>

<ItemSearchResponse><Items><Item><ItemAttributes><ISBN>And I want this info</ISBN></ItemAttributes></Item></Items></ItemSearchResponse>

Basically, I don't know how to traverse the tree using my etree object, and I'd like to learn how.

Edit 1: I'm trying the following code:

tree = etree.fromstring(response)
for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    print(item)
    print(item.items()) # Apparently, there is nothing in item.items()
    for key, value in item.items():
        print(key)
        print(value)

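But I get the following output: http://dpaste.com/287496/

I added print(item.items()), and it appears to be just an empty list. Each item is an Element, but for some reason it has no items.

Edit 2: I can get the information I want with the following code, but it seems like lxml must have a simpler way (this way doesn't look very efficient):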

for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    title_text = ""
    author_text = ""
    isbn_text = ""
    for isbn in item.iterfind(".//"+AMAZON_NS+"ISBN"):
        isbn_text = isbn.text
    for title in item.iterfind(".//"+AMAZON_NS+"Title"):
        title_text = title.text
    for author in item.iterfind(".//"+AMAZON_NS+"Author"):
        author_text = author.text
    print(title_text + " by " + author_text + " has ISBN: " + isbn_text)

Answer 1

Tested with both lxml.etree and xml.etree.cElementTree on Python 2.7.1.

import lxml.etree as ET
# Also works with cElementTree (included in recent standard CPythons).
# Use this import:
# import xml.etree.cElementTree as ET
t = ET.fromstring(xmlstring) # your data -- with 2 missing tags added at the end :-)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
# Find all ItemAttributes elements.
for ia in t.iter(AMAZON_NS+'ItemAttributes'):
    # An ItemAttributes element has *children* named ISBN, Title, Author, etc.
    # NOTE WELL: *children* not *attributes*
    for tag in ('ISBN', 'Title'):
        # Find the first child with that name ...
        elem = ia.find(AMAZON_NS+tag)
        print("%s: %r" % (tag, elem.text))

Output:

ISBN: '0534950973'
Title: 'Introduction to the Theory of Computation'
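
To make the "children, not attributes" point concrete, here is a minimal sketch. The SAMPLE string is a hypothetical, heavily trimmed stand-in for the real Amazon response, and the fallback import assumes the standard-library ElementTree is acceptable when lxml is not installed. It shows why item.items() in the question came back empty while iterating over the element finds the data.

```python
try:
    from lxml import etree as ET  # the library used in the question
except ImportError:
    import xml.etree.ElementTree as ET  # standard-library fallback (assumption)

NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# Hypothetical, trimmed-down stand-in for the Amazon response.
SAMPLE = (
    '<ItemSearchResponse '
    'xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">'
    '<Items><Item><ItemAttributes>'
    '<ISBN>0534950973</ISBN>'
    '<Title>Introduction to the Theory of Computation</Title>'
    '</ItemAttributes></Item></Items>'
    '</ItemSearchResponse>'
)

root = ET.fromstring(SAMPLE)
ia = root.find(".//" + NS + "ItemAttributes")

# .items() lists XML attributes (key="value" pairs inside the start tag).
# <ItemAttributes> has none, so this is empty.
attributes = list(ia.items())

# Iterating over the element yields its child elements instead.
children = [child.tag[len(NS):] for child in ia]

print(attributes)
print(children)
```

Despite its name, ItemAttributes stores ISBN, Title and so on as nested elements, which is why find() or plain iteration is the right tool here.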

If you want to build a dictionary of all the children of the ItemAttributes node, it only takes a small change:

import lxml.etree as ET
# Also works with cElementTree (included in recent standard CPythons).
# Use this import:
# import xml.etree.cElementTree as ET
from pprint import pprint as pp
t = ET.fromstring(xmlstring)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
TAGPOS = len(AMAZON_NS)
# Find all ItemAttributes elements.
for ia in t.iter(AMAZON_NS+'ItemAttributes'):
    item = {}
    # Iterate over all the children of the ItemAttributes node
    for elem in ia:
        # remove namespace stuff from key, remove extraneous whitespace from value
        # (elem.text is None for an empty element, hence the "or" fallback)
        item[elem.tag[TAGPOS:]] = (elem.text or "").strip()
    pp(item)

The output is:

{'Author': 'Michael Sipser',
 'Binding': 'Hardcover',
 'DeweyDecimalNumber': '511.35',
 'EAN': '9780534950972',
 'Edition': '2',
 'ISBN': '0534950973',
 'IsEligibleForTradeIn': '1',
 'Label': 'Course Technology',
 'Languages': '',
 'ListPrice': '',
 'Manufacturer': 'Course Technology',
 'NumberOfItems': '1',
 'NumberOfPages': '400',
 'PackageDimensions': '',
 'ProductGroup': 'Book',
 'ProductTypeName': 'ABIS_BOOK',
 'PublicationDate': '2005-02-15',
 'Publisher': 'Course Technology',
 'Studio': 'Course Technology',
 'Title': 'Introduction to the Theory of Computation',
 'TradeInValue': ''}
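
If only the title and ISBN are needed, as in the question, one possible shortcut is findtext() with a prefix-to-namespace mapping. This is a sketch rather than part of the answer above: the "a" prefix, the extract_books helper and the SAMPLE data are all made up for illustration, and the fallback import assumes the standard-library ElementTree is an acceptable substitute.

```python
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET  # standard-library fallback (assumption)

# The prefix "a" is arbitrary; only the URI has to match the document.
NS_MAP = {"a": "http://webservices.amazon.com/AWSECommerceService/2009-10-01"}

# Hypothetical two-item response; the second item has no ISBN on purpose.
SAMPLE = (
    '<ItemSearchResponse '
    'xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">'
    '<Items>'
    '<Item><ItemAttributes>'
    '<ISBN>0534950973</ISBN>'
    '<Title>Introduction to the Theory of Computation</Title>'
    '</ItemAttributes></Item>'
    '<Item><ItemAttributes>'
    '<Title>Untitled Draft</Title>'
    '</ItemAttributes></Item>'
    '</Items>'
    '</ItemSearchResponse>'
)


def extract_books(xml_text):
    """Return one {'title', 'isbn'} dict per ItemAttributes element."""
    root = ET.fromstring(xml_text)
    books = []
    for attrs in root.iterfind(".//a:ItemAttributes", namespaces=NS_MAP):
        books.append({
            # findtext returns the default when the child is missing
            "title": attrs.findtext("a:Title", default="", namespaces=NS_MAP),
            "isbn": attrs.findtext("a:ISBN", default="", namespaces=NS_MAP),
        })
    return books


for book in extract_books(SAMPLE):
    print(book["title"], book["isbn"])
```

Using a default avoids the AttributeError that elem.text would raise when find() returns None for a missing child.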

Answer 2

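I'd suggest trying pyaws first; then you don't have to worry about XML parsing at all. Otherwise, you could use something like this:

from lxml import etree

# etree.parse expects a filename or file-like object
tree = etree.parse(xmlResponse)
# the response is namespaced, so the XPath needs a prefix mapping
ns = {'a': 'http://webservices.amazon.com/AWSECommerceService/2009-10-01'}
tree.xpath('//a:ISBN', namespaces=ns)[0].text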

Answer 3

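from lxml import etree
root = etree.XML("YourXMLData")
# For a namespaced document such as Amazon's, the tag names below
# must be prefixed with the namespace in {uri} form
items = root.findall('.//ItemAttributes')
for eachitem in items:
    # ISBN and Title are child elements, so iterate over the children;
    # eachitem.items() would return XML attributes instead
    for child in eachitem:
        if child.tag == 'ISBN':
            pass  # Do your stuff with child.text
        if child.tag == 'Title':
            pass  # Do your stuff with child.text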

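This is one way of doing it. Instead of loading the XML as a string, you can also use the parse method. The key point is to use the find method and its friends to get to your specific node, and then iterate over that node.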
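
As a rough illustration of the "parse instead of a string" remark, the sketch below feeds the same bytes to parse() through a file-like object and to fromstring() directly. The xml_bytes content is a hypothetical trimmed response, and the fallback import assumes the standard-library ElementTree is acceptable when lxml is missing.

```python
import io

try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET  # standard-library fallback (assumption)

NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# Hypothetical trimmed response, as bytes (what an HTTP client usually returns).
xml_bytes = (
    b'<ItemSearchResponse '
    b'xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">'
    b'<Items><Item><ItemAttributes>'
    b'<Title>Introduction to the Theory of Computation</Title>'
    b'</ItemAttributes></Item></Items>'
    b'</ItemSearchResponse>'
)

# parse() takes a filename or file-like object and returns an ElementTree;
# getroot() gives the root Element.
tree = ET.parse(io.BytesIO(xml_bytes))
root_from_parse = tree.getroot()

# fromstring() / XML() take the data itself and return the root Element directly.
root_from_string = ET.fromstring(xml_bytes)

title_a = root_from_parse.findtext(".//" + NS + "Title")
title_b = root_from_string.findtext(".//" + NS + "Title")
print(title_a == title_b)
```

Either way you end up with the same root element, so the choice mostly depends on whether the response is already in memory or sitting in a file.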

Answer 4

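Since you get the whole response as one big XML string, you can use lxml's fromstring method to parse it into a full element tree. You can then use the findall function (or, since you want to iterate over the results, the iterfind function). There is one catch, though: Amazon's XML response is namespaced, so you have to take that into account for lxml to search it correctly. Something like this should do the trick:

from lxml import etree

root = etree.fromstring(responseFromAmazon)

# this creates a constant with the namespace in the form that lxml can use it
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# this searches the tree and iterates over results, taking the namespace into account
for eachitem in root.iterfind(".//" + AMAZON_NS + "ItemAttributes"):
    # ISBN and Title are child elements, so iterate over the children;
    # eachitem.items() would return XML attributes, of which there are none
    for child in eachitem:
        if child.tag == AMAZON_NS + 'ISBN':
            pass  # Do your stuff with child.text
        if child.tag == AMAZON_NS + 'Title':
            pass  # Do your stuff with child.text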
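
The namespace catch is easy to see on a small example. In this sketch (SAMPLE is a hypothetical trimmed response, and the fallback import assumes the standard-library ElementTree is acceptable), the same search is run with and without the namespace-qualified tag.

```python
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET  # standard-library fallback (assumption)

NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# Hypothetical trimmed response; the xmlns declaration on the root puts
# every element below it into the Amazon namespace.
SAMPLE = (
    '<ItemSearchResponse '
    'xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">'
    '<Items><Item><ItemAttributes>'
    '<ISBN>0534950973</ISBN>'
    '</ItemAttributes></Item></Items>'
    '</ItemSearchResponse>'
)

root = ET.fromstring(SAMPLE)

# The parsed tag carries the namespace in {uri}name form.
print(root.tag)

# A bare tag name only matches elements in no namespace: nothing is found.
plain = root.findall(".//ItemAttributes")

# The fully qualified tag matches.
qualified = root.findall(".//" + NS + "ItemAttributes")

print(len(plain), len(qualified))
```

Printing root.tag is a quick way to discover the exact namespace string a response uses before hard-coding it in a constant.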

Edit 1

See if this works better:

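root = etree.fromstring(responseFromAmazon)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
item = {}
for attr in root.iterfind(".//" + AMAZON_NS + "ItemAttributes"):
    # copy every child of ItemAttributes, not only the first one
    # (with several items in the response, later ones overwrite earlier keys)
    for child in attr:
        item[child.tag.replace(AMAZON_NS, "")] = child.text

You can then access item["Title"], item["ISBN"], and so on as needed.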

How do I use lxml to get specific parts of an XML document?
