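I am using Amazon's API to retrieve information about books, and I am trying to use lxml to extract the specific parts of the XML document that my application needs. However, I'm not really sure how to use lxml. This is as far as I've gotten: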
root = etree.XML(response)
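This creates an etree object for the XML document.
Here is what the XML document looks like: http://pastebin.com/GziDkf1a There are actually multiple "Item" elements, but I pasted only one of them to give you a concrete example. For each item, I want to extract the title and the ISBN. How do I do that with the etree object I have?
<ItemSearchResponse><Items><Item><ItemAttributes><Title>I want this info</Title></ItemAttributes></Item></Items></ItemSearchResponse>
<ItemSearchResponse><Items><Item><ItemAttributes><ISBN>And I want this info</ISBN></ItemAttributes></Item></Items></ItemSearchResponse>
Basically, I don't know how to traverse the tree using my etree object, and I would like to learn how.
Edit 1: I am trying the following code: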
tree = etree.fromstring(response)
for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    print(item)
    print(item.items()) # Apparently, there is nothing in item.items()
    for key, value in item.items():
        print(key)
        print(value)
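But I get the following output: http://dpaste.com/287496/
I added print(item.items()), and it turns out to be just an empty list. Each item is an Element, yet for some reason it has no items.
Edit 2: I can get the information I want with the following code, but it seems lxml must offer a simpler way (this approach looks inefficient):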
for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    title_text = ""
    author_text = ""
    isbn_text = ""
    for isbn in item.iterfind(".//"+AMAZON_NS+"ISBN"):
        isbn_text = isbn.text
    for title in item.iterfind(".//"+AMAZON_NS+"Title"):
        title_text = title.text
    for author in item.iterfind(".//"+AMAZON_NS+"Author"):
        author_text = author.text
    print(title_text + " by " + author_text + " has ISBN: " + isbn_text)
Answer 0 (score: 2)
Tested with lxml.etree and xml.etree.cElementTree on Python 2.7.1.
import lxml.etree as ET
# Also works with cElementTree (Python 2; on Python 3 use xml.etree.ElementTree).
# Use this import:
# import xml.etree.cElementTree as ET
t = ET.fromstring(xmlstring) # your data -- with 2 missing tags added at the end :-)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
# Find all ItemAttributes elements.
for ia in t.iter(AMAZON_NS+'ItemAttributes'):
    # An ItemAttributes element has *children* named ISBN, Title, Author, etc.
    # NOTE WELL: *children* not *attributes*
    for tag in ('ISBN', 'Title'):
        # Find the first child with that name ...
        elem = ia.find(AMAZON_NS+tag)
        # Parenthesized form prints the same on Python 2 and Python 3
        print("%s: %r" % (tag, elem.text))
Output:
ISBN: '0534950973'
Title: 'Introduction to the Theory of Computation'
If you want to build a dictionary of all the children of the ItemAttributes node, only a small change is needed:
import lxml.etree as ET
# Also works with cElementTree (Python 2; on Python 3 use xml.etree.ElementTree).
# Use this import:
# import xml.etree.cElementTree as ET
from pprint import pprint as pp
t = ET.fromstring(xmlstring)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
TAGPOS = len(AMAZON_NS)
# Find all ItemAttributes elements.
for ia in t.iter(AMAZON_NS+'ItemAttributes'):
    item = {}
    # Iterate over all the children of the ItemAttributes node
    for elem in ia:
        # remove namespace stuff from key, remove extraneous whitespace from value
        # (elem.text is None for an element with no text at all, hence the "or ''")
        item[elem.tag[TAGPOS:]] = (elem.text or '').strip()
    pp(item)
The output is:
{'Author': 'Michael Sipser',
 'Binding': 'Hardcover',
 'DeweyDecimalNumber': '511.35',
 'EAN': '9780534950972',
 'Edition': '2',
 'ISBN': '0534950973',
 'IsEligibleForTradeIn': '1',
 'Label': 'Course Technology',
 'Languages': '',
 'ListPrice': '',
 'Manufacturer': 'Course Technology',
 'NumberOfItems': '1',
 'NumberOfPages': '400',
 'PackageDimensions': '',
 'ProductGroup': 'Book',
 'ProductTypeName': 'ABIS_BOOK',
 'PublicationDate': '2005-02-15',
 'Publisher': 'Course Technology',
 'Studio': 'Course Technology',
 'Title': 'Introduction to the Theory of Computation',
 'TradeInValue': ''}
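The "children, not attributes" point is why item.items() came back empty in the question. Here is a minimal sketch of the difference, assuming a made-up fragment with one XML attribute (lang is hypothetical and does not appear in Amazon's response) and using the standard library's ElementTree, which exposes the same calls as lxml.etree here:

```python
import xml.etree.ElementTree as ET  # the same calls exist in lxml.etree

# Hypothetical fragment: one XML attribute (lang) and two child elements.
elem = ET.fromstring(
    '<ItemAttributes lang="en">'
    "<ISBN>0534950973</ISBN>"
    "<Title>Introduction to the Theory of Computation</Title>"
    "</ItemAttributes>"
)

# items() lists the element's XML attributes as (name, value) pairs.
attributes = list(elem.items())

# Iterating over the element itself yields its child elements.
children = [(child.tag, child.text) for child in elem]

print(attributes)
print(children)
```

In the real response the ItemAttributes element carries no XML attributes at all, so items() has nothing to return, while the ISBN and Title are reachable as children.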
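Answer 1 (score: 1)
I would suggest trying pyaws first; then you don't have to worry about XML parsing at all. If that is not an option, you can use something like this: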
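from lxml import etree
# etree.parse() expects a filename or file-like object; for a string use etree.fromstring()
tree = etree.parse(xmlResponse)
# The response declares a default namespace, so the XPath has to name it through a prefix
ns = {'aws': 'http://webservices.amazon.com/AWSECommerceService/2009-10-01'}
tree.xpath('//aws:ISBN', namespaces=ns)[0].text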
Answer 2 (score: 1)
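from lxml import etree
root = etree.XML("YourXMLData")  # placeholder: pass the real XML string here
# A bare tag name only matches when the document declares no default namespace
items = root.findall('.//ItemAttributes')
for eachitem in items:
    # ISBN and Title are child elements, so iterate over the children;
    # eachitem.items() would return XML attributes instead
    for child in eachitem:
        if child.tag == 'ISBN':
            pass  # Do your stuff
        if child.tag == 'Title':
            pass  # Do your stuff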
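This is one way to do it. Instead of loading the XML as a string, you could also use the parse method. The key point is to use the find method and its friends to get to your particular node, and then iterate over the contents of that node.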
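To illustrate the choice between loading a string and using parse, here is a small sketch. It assumes a made-up, namespace-free document and uses the standard library's ElementTree; lxml.etree provides fromstring() and parse() with the same basic behavior.

```python
import io
import xml.etree.ElementTree as ET  # lxml.etree provides parse() and fromstring() too

# Hypothetical stand-in for the real response body.
xml_text = "<Items><Item><Title>Example</Title></Item></Items>"

# fromstring() takes the XML text itself and returns the root element.
root_from_string = ET.fromstring(xml_text)

# parse() takes a filename or file-like object and returns an ElementTree;
# getroot() then gives the same kind of root element.
tree = ET.parse(io.BytesIO(xml_text.encode("utf-8")))
root_from_file = tree.getroot()

print(root_from_string.tag, root_from_file.tag)
print(root_from_file.findtext("Item/Title"))
```

Either route ends with an Element that supports find, findall and iteration, so the rest of the code does not need to care how the document was loaded.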
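Answer 3 (score: 1)
Since you get the whole response as one big XML string, you can use lxml's fromstring method to parse it and get the root Element of the tree. You can then use the findall function (or rather, since you want to iterate over the results, the iterfind function), but there is a catch: Amazon's XML response is namespaced, so you have to take that into account for lxml to search it correctly. Something like this should do the trick: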
from lxml import etree

root = etree.fromstring(responseFromAmazon)
# this creates a constant with the namespace in the form that lxml can use it
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
# this searches the tree and iterates over results, taking the namespace into account
for eachitem in root.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    # ISBN and Title are child elements, so iterate over the children;
    # eachitem.items() would return XML attributes, and there are none here
    for child in eachitem:
        if child.tag == AMAZON_NS + 'ISBN':
            pass  # Do your stuff
        if child.tag == AMAZON_NS + 'Title':
            pass  # Do your stuff
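To see why the namespace matters, here is a minimal sketch that runs three searches against the same document. It assumes a heavily shortened, made-up response and uses the standard library's ElementTree, which treats these paths the same way lxml.etree does; the aws prefix is an arbitrary name chosen for the example.

```python
import xml.etree.ElementTree as ET  # lxml.etree handles these paths the same way

NS_URI = "http://webservices.amazon.com/AWSECommerceService/2009-10-01"
AMAZON_NS = "{%s}" % NS_URI

# Hypothetical, shortened document that declares the default namespace.
doc = (
    '<ItemSearchResponse xmlns="%s"><Items><Item><ItemAttributes>'
    "<ISBN>0534950973</ISBN>"
    "</ItemAttributes></Item></Items></ItemSearchResponse>" % NS_URI
)
root = ET.fromstring(doc)

# The parser stores the namespace inside every tag name.
print(root.tag)

# A bare tag name therefore matches nothing ...
bare = root.findall(".//ISBN")
# ... while the qualified name, or a prefix mapped to the URI, does.
qualified = root.findall(".//" + AMAZON_NS + "ISBN")
prefixed = root.findall(".//aws:ISBN", {"aws": NS_URI})

print(len(bare), len(qualified), len(prefixed))
```

The prefix-map form tends to be easier to read than string concatenation once the paths get longer, but both address the same elements.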
Edit 1
See if this works better:
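root = etree.fromstring(responseFromAmazon)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
item = {}
for attr in root.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    # copy every child element into the dict, keyed by its tag without the namespace
    # (with several items in the response, later ones overwrite earlier ones)
    for child in attr:
        item[child.tag.replace(AMAZON_NS, "")] = child.text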
You can then access item["Title"], item["ISBN"], and so on as needed.
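If the response contains several items, one option is to build a separate record per Item element instead of sharing one dictionary. The following is only a sketch under assumptions: the sample document is made up and heavily shortened, extract_books and the aws prefix are hypothetical names, and the standard library's ElementTree stands in for lxml.etree, which accepts the same findtext and iterfind calls.

```python
import xml.etree.ElementTree as ET  # lxml.etree offers the same find/findtext calls

# Hypothetical, heavily shortened response; the real one has many more fields.
SAMPLE = """<ItemSearchResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">
  <Items>
    <Item><ItemAttributes>
      <Author>Michael Sipser</Author>
      <ISBN>0534950973</ISBN>
      <Title>Introduction to the Theory of Computation</Title>
    </ItemAttributes></Item>
    <Item><ItemAttributes>
      <Title>A title without an ISBN</Title>
    </ItemAttributes></Item>
  </Items>
</ItemSearchResponse>"""

# Map a prefix of our own choosing to the namespace URI.
NS = {"aws": "http://webservices.amazon.com/AWSECommerceService/2009-10-01"}


def extract_books(xml_text):
    """Return one dict per Item element with its Title, Author and ISBN."""
    root = ET.fromstring(xml_text)
    books = []
    for item in root.iterfind(".//aws:Item", NS):
        record = {}
        for field in ("Title", "Author", "ISBN"):
            # findtext returns the text of the first match, or the default if none
            path = "aws:ItemAttributes/aws:" + field
            record[field] = item.findtext(path, default="", namespaces=NS)
        books.append(record)
    return books


if __name__ == "__main__":
    for book in extract_books(SAMPLE):
        print("%(Title)s by %(Author)s has ISBN: %(ISBN)s" % book)
```

Using a default of "" keeps the string formatting from failing when a field is missing, at the cost of not distinguishing an absent element from an empty one.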