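Trying to return the text inside a tag using lxml

Asked: 2010-12-21 19:08:27

Tags: python xml lxml

I am trying to use lxml to return the text inside a tag, e.g. <ImageSet><LargeImage><URL>this text</URL></LargeImage></ImageSet>. My code only returns None for the text under each tag.

Here is my code: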

# I am trying to get the URL text using lxml

for attr_list in tree.iterfind(".//"+settings.AMAZON_NS+"ImageSet"):
    for image_list in tree.find(".//"+settings.AMAZON_NS+"LargeImage"):
        print(etree.tostring(image_list))
        print(image_list.findtext(".//"+settings.AMAZON_NS+"URL")) # This is only printing None.

Here is the output of the code:

<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>
None
<Width xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">349</Width>
None
<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>
None
<Width xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">349</Width>
None
<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>
None
<Width xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">349</Width>
None
<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>
None
<Width xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">349</Width>
None
<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>
None
<Width xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">349</Width>
None
<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>

Lines 11, 17, 23, ... should show a URL instead of None.

Edit 1: Let me try to clarify my question above...

This is the code I am using:

for item in tree.iterfind(".//"+settings.AMAZON_NS+"ImageSet"):
    for image_set in item.find(".//"+settings.AMAZON_NS+"LargeImage"):
        print(etree.tostring(image_set))

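This is the output I get: http://dpaste.com/289187/

How do I get specifically the content inside the URL tag?

I tried the following (neither attempt works, but perhaps the failed attempts show the general idea of what I am trying to do):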

for item in tree.iterfind(".//"+settings.AMAZON_NS+"ImageSet"):
    for image_set in item.find(".//"+settings.AMAZON_NS+"LargeImage"):
        for image_url_set in image_set.find(".//"+settings.AMAZON_NS+"URL"):
            print(etree.tostring(image_url_set))

This is the error I get:

    for image_url_set in image_set.find(".//"+settings.AMAZON_NS+"URL"):
TypeError: 'NoneType' object is not iterable

for item in tree.iterfind(".//"+settings.AMAZON_NS+"ImageSet"):
    for image_set in item.find(".//"+settings.AMAZON_NS+"LargeImage"):
        for image_link in image_set.iter(".//"+settings.AMAZON_NS+"URL"):
            print(image_link.text)

This one does not print anything at all.

2 Answers:

Answer 0 (score: 1)

from io import BytesIO
from lxml import etree

URL_TAG = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}URL"

tree = etree.fromstring(body)
print(tree.findtext(".//%s" % (URL_TAG,)))  # 1st way: text of the first URL element

for ev, el in etree.iterparse(BytesIO(body), tag=URL_TAG):  # 2nd approach: every URL element
    print(el.text)

Here body is your XML text (a bytes object under Python 3).

Output

http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg
http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg
http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg
http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg
http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg
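
If you want to keep the ImageSet / LargeImage structure from the question rather than grab every URL element, a prefix map passed through the namespaces argument is one way to do it. This is only a sketch: the SAMPLE document, its Response wrapper element, the "aws" prefix and the large_image_urls helper are all made up for illustration, and the real Amazon response contains far more elements. The import falls back to the standard library's ElementTree, which offers the same find/findtext/iterfind calls, so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback exposing the same find/iterfind API
    import xml.etree.ElementTree as etree

# Hypothetical, heavily trimmed stand-in for the real response.
SAMPLE = b"""<Response xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">
  <ImageSet>
    <LargeImage>
      <URL>http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
      <Height Units="pixels">500</Height>
      <Width Units="pixels">349</Width>
    </LargeImage>
  </ImageSet>
</Response>"""

# The "aws" prefix is arbitrary; only the namespace URI has to match the document.
NS = {"aws": "http://webservices.amazon.com/AWSECommerceService/2009-10-01"}


def large_image_urls(xml_bytes):
    """Return the text of LargeImage/URL for every ImageSet in the document."""
    root = etree.fromstring(xml_bytes)
    urls = []
    for image_set in root.iterfind(".//aws:ImageSet", namespaces=NS):
        # findtext() returns the text of the first match, or None if there is none.
        url = image_set.findtext("aws:LargeImage/aws:URL", namespaces=NS)
        if url is not None:
            urls.append(url)
    return urls


print(large_image_urls(SAMPLE))
```

Calling findtext() on each ImageSet, instead of looping over the result of find(), avoids iterating over the children of LargeImage by accident.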

Answer 1 (score: 0)

Try replacing

print(image_list.findtext(".//"+settings.AMAZON_NS+"URL"))

with just

print(image_list.text)
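
Why this helps, as far as I can tell: find() returns a single element, and looping over an element yields its children. So in the question's inner loop image_list is already the URL, Height or Width element, and searching below it for another URL finds nothing. The sketch below reproduces that on a small made-up document; SAMPLE, AMAZON_NS (assumed to have the same value as settings.AMAZON_NS) and describe_children are illustrative names, not part of the original code.

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback exposing the same API for this example
    import xml.etree.ElementTree as etree

NS_URI = "http://webservices.amazon.com/AWSECommerceService/2009-10-01"
AMAZON_NS = "{%s}" % NS_URI  # assumption: same value as settings.AMAZON_NS

# Hypothetical LargeImage fragment modelled on the output shown in the question.
SAMPLE = (
    "<LargeImage xmlns='%s'>"
    "<URL>http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>"
    "<Height Units='pixels'>500</Height>"
    "<Width Units='pixels'>349</Width>"
    "</LargeImage>" % NS_URI
).encode("ascii")

large_image = etree.fromstring(SAMPLE)


def describe_children(element):
    """For each child: (local tag name, nested URL lookup, the child's own text)."""
    rows = []
    for child in element:  # iterating an element walks its direct children
        local_name = child.tag.split("}", 1)[-1]
        nested = child.findtext(".//" + AMAZON_NS + "URL")  # looks *below* the child
        rows.append((local_name, nested, child.text))
    return rows


rows = describe_children(large_image)
for row in rows:
    print(row)
```

Note that image_list.text also prints the Height and Width values, because the loop still visits all three children. If only the URL is wanted, calling findtext(AMAZON_NS + "URL") on the LargeImage element itself, without the inner loop, is probably closer to the intent.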