Question

I am trying to use lxml to return the text inside the tags <ImageSet><LargeImage><URL>this text</URL></LargeImage></ImageSet>. My code only returns None for the text under each tag.

Here is my code:

# I am trying to get the URL text using lxml

for attr_list in tree.iterfind(".//"+settings.AMAZON_NS+"ImageSet"):
    for image_list in tree.find(".//"+settings.AMAZON_NS+"LargeImage"):
        print(etree.tostring(image_list))
        print(image_list.findtext(".//"+settings.AMAZON_NS+"URL")) # This is only printing None.

Here is the output of the code:

<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>
None
<Width xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">349</Width>
None
<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>
None
<Width xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">349</Width>
None
<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>
None
<Width xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">349</Width>
None
<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>
None
<Width xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">349</Width>
None
<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>
None
<Width xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">349</Width>
None
<URL xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
None
<Height xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01" Units="pixels">500</Height>

Lines 11, 17, 23, ... should show a URL instead of None.

Edit 1: Let me try to clarify my question above.

Here is the code I am using:

for item in tree.iterfind(".//"+settings.AMAZON_NS+"ImageSet"):
    for image_set in item.find(".//"+settings.AMAZON_NS+"LargeImage"):
        print(etree.tostring(image_set))

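Here is the output I get: http://dpaste.com/289187/

How do I get specifically the content inside the URL tag?

I tried the following. Neither attempt works, but perhaps the failed attempts show the general idea of what I am trying to do: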

for item in tree.iterfind(".//"+settings.AMAZON_NS+"ImageSet"):
    for image_set in item.find(".//"+settings.AMAZON_NS+"LargeImage"):
        for image_url_set in image_set.find(".//"+settings.AMAZON_NS+"URL"):
            print(etree.tostring(image_url_set))

Here is the error I get:

for image_url_set in image_set.find(".//"+settings.AMAZON_NS+"URL"):
TypeError: 'NoneType' object is not iterable

for item in tree.iterfind(".//"+settings.AMAZON_NS+"ImageSet"):
    for image_set in item.find(".//"+settings.AMAZON_NS+"LargeImage"):
        for image_link in image_set.iter(".//"+settings.AMAZON_NS+"URL"):
            print(image_link.text)

This one does not print anything at all.

Answer 1

# Python 3 version; body is assumed to hold the XML response as bytes
from io import BytesIO
from lxml import etree

URL_TAG = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}URL"

tree = etree.fromstring(body)
print(tree.findtext(".//%s" % (URL_TAG,)))  # 1st way: text of the first URL element

for ev, el in etree.iterparse(BytesIO(body), tag=URL_TAG):  # 2nd way: every URL element
    print(el.text)

Here body is your XML text.

Output:

http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg
http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg
http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg
http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg
http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg
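
Both ways above match every URL element in the document, whichever parent it sits under. If only the URL under ImageSet/LargeImage is wanted, the path can be spelled out with a namespace prefix map. The following is a minimal sketch under stated assumptions, not a drop-in solution: the `aws` prefix, the sample `body` and its example.com URLs are made up for illustration, and the stdlib fallback import is only there so the snippet runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback; findall() takes the same arguments
    import xml.etree.ElementTree as etree

# "aws" is an arbitrary prefix chosen for this sketch; only the URI matters.
NS = {"aws": "http://webservices.amazon.com/AWSECommerceService/2009-10-01"}

# Made-up sample standing in for the real response body.
body = b"""<ImageSets xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">
  <ImageSet>
    <SmallImage><URL>http://example.com/small1.jpg</URL></SmallImage>
    <LargeImage><URL>http://example.com/large1.jpg</URL></LargeImage>
  </ImageSet>
  <ImageSet>
    <SmallImage><URL>http://example.com/small2.jpg</URL></SmallImage>
    <LargeImage><URL>http://example.com/large2.jpg</URL></LargeImage>
  </ImageSet>
</ImageSets>"""

tree = etree.fromstring(body)

# Only URL elements that are direct children of LargeImage inside an ImageSet.
large_urls = [
    url.text
    for url in tree.findall(".//aws:ImageSet/aws:LargeImage/aws:URL", namespaces=NS)
]
print(large_urls)
```

The SmallImage URLs are skipped because the path names LargeImage explicitly.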

Answer 2

Try replacing

print(image_list.findtext(".//"+settings.AMAZON_NS+"URL"))

with just

print(image_list.text)
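
The reason this works appears to be that iterating over the element returned by find() walks its children, so image_list is already the URL, Height or Width element, and searching below it for another URL finds nothing. A small sketch of that behaviour follows; `AMAZON_NS` is a stand-in for settings.AMAZON_NS and `SAMPLE` is a made-up fragment shaped like the output in the question. The stdlib fallback import is an assumption added so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback; these calls behave the same way
    import xml.etree.ElementTree as etree

# Stand-in for settings.AMAZON_NS from the question.
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# Made-up fragment shaped like the elements printed in the question.
SAMPLE = b"""<ImageSets xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">
  <ImageSet>
    <LargeImage>
      <URL>http://ecx.images-amazon.com/images/I/51dSYJcTaTL.jpg</URL>
      <Height Units="pixels">500</Height>
      <Width Units="pixels">349</Width>
    </LargeImage>
  </ImageSet>
</ImageSets>"""

tree = etree.fromstring(SAMPLE)
large_image = tree.find(".//" + AMAZON_NS + "LargeImage")

# Iterating over an element yields its direct children.
child_tags = [child.tag.replace(AMAZON_NS, "") for child in large_image]

# Searching from the URL element for a nested URL finds nothing.
url_element = large_image[0]
nested = url_element.findtext(".//" + AMAZON_NS + "URL")

# The text sits on the child itself, or can be fetched from the parent.
direct = url_element.text
from_parent = large_image.findtext(AMAZON_NS + "URL")

print(child_tags, nested, direct, from_parent)
```

Calling findtext on the LargeImage element itself, without the inner loop, gives the URL text without also visiting Height and Width.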
