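How can I modify the code below so that it picks out the source of any images found in the description elements, which contain HTML? At the moment it just gets the full text from inside the element, and I'm not sure how to modify it to get the source of any img tags.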
>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', description.text
I also tried this as the last line:
print(description.xpath("//img/@src"))
which gave me 'None'.
The XML structure is:
<guides>
  <guide>
    <id>guide 1</id>
    <group>
      <id></id>
      <type></type>
      <name></name>
    </group>
    <pages>
      <page>
        <id>page 1</id>
        <name></name>
        <description><p>Some text. <br /><img
          width="81"
          src="http://www.example.com/img.jpg"
          alt="wave" height="63" style="float:
          right;" /></p></description>
        <boxes>
          <box>
            <id></id>
            <name></name>
            <type></type>
            <map_id></map_id>
            <column></column>
            <position></position>
            <hidden></hidden>
            <created></created>
            <updated></updated>
            <assets>
              <asset>
                <id></id>
                <name></name>
                <type></type>
                <description><img src="https://www.example.com/image.jpg" alt="image" height="42" width="42"></description>
                <url/>
                <owner>
                  <id></id>
                  <email></email>
                  <first_name></first_name>
                  <last_name></last_name>
                </owner>
              </asset>
            </assets>
          </box>
        </boxes>
      </page>
    </pages>
  </guide>
Answer 0 (score: 1)
The content of the description element is HTML. There are several ways to parse it; one of them is lxml's html module.
>>> description.text
'<img src="https://www.example.com/image.jpg" alt="image" height="42" width="42">'
>>> from lxml import html
>>> img = html.fromstring(description.text)
>>> img.attrib['src']
'https://www.example.com/image.jpg'
Edit, in response to a comment:
>>> from lxml import etree, html
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', html.fromstring(description.text).attrib['src']
...
('---', 'guide 1')
('------', 'page 1')
('---------', 'https://www.example.com/image.jpg')
Edit: handling exceptions.
Replace
'---------', html.fromstring(description.text).attrib['src']
with
try:
    '---------', html.fromstring(description.text).attrib['src']
except KeyError:
    '--------- No image URL present'
Edit, in response to the November 9 comment:
from lxml import etree, html
tree = etree.parse('guides.xml')
for guide in tree.xpath('guide'):
    print('---', guide.xpath('id')[0].text)
    for pages in guide.xpath('.//pages'):
        for page in pages:
            print('------', page.xpath('id')[0].text)
            for description in page.xpath('.//asset/description'):
                try:
                    print('---------', html.fromstring(description.text).attrib['src'])
                except (TypeError, KeyError):
                    # TypeError: the description has no text to parse
                    # KeyError: the parsed element has no src attribute
                    print('--------- no src identifiable')
This version was run against an XML file in which the second guide element contains no HTML at all and the third contains HTML without a src attribute.
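Reading attrib['src'] from the parsed root works when the description holds a single img tag. For a description like the page-level one in the question, where the img sits inside a p, something along the following lines may be closer to "any image in the description". This is only a sketch: the helper name image_sources is hypothetical, and it assumes the HTML is stored as escaped text, so that description.text returns the markup as a string.

```python
from lxml import etree, html


def image_sources(description_text):
    """Return the src of every <img> in an HTML fragment (hypothetical helper)."""
    # Empty elements give None, and blank text cannot be parsed as HTML.
    if not description_text or not description_text.strip():
        return []
    try:
        fragment = html.fromstring(description_text)
    except etree.ParserError:
        # Raised by lxml when the input cannot be turned into a document.
        return []
    # iter('img') visits the root itself (if it is an <img>) and all
    # descendants; images without a src attribute are left out.
    return [img.get('src') for img in fragment.iter('img') if img.get('src')]
```

Inside the loop, the print call could then become print('---------', image_sources(description.text)), which yields an empty list instead of raising when no image is present.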
Answer 1 (score: 0)
You can try this solution:
description.xpath("//img/@src")
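One caveat, sketched under assumptions: when the HTML is stored as text inside description rather than as child elements, this expression has no img elements to match on the XML tree and comes back empty, which fits what the question reports. The same expression does find the image once the text has been parsed as HTML. The XML string below is a cut-down stand-in for the question's file, with the HTML escaped; it is an assumption, not the asker's actual data.

```python
from lxml import etree, html

# Cut-down sample (assumed): the HTML is escaped, so it is text, not elements.
xml = b"""<guides><guide><pages><page><assets><asset>
<description>&lt;p&gt;&lt;img src="https://www.example.com/image.jpg"&gt;&lt;/p&gt;</description>
</asset></assets></page></pages></guide></guides>"""

root = etree.fromstring(xml)
description = root.find('.//asset/description')

# On the XML element there are no <img> elements anywhere in the tree.
on_xml = description.xpath('//img/@src')

# After parsing the text as HTML, the same expression finds the image.
fragment = html.fromstring(description.text)
on_html = fragment.xpath('//img/@src')

print(on_xml)
print(on_html)
```

Note that a leading // searches from the root of whatever tree the element belongs to; .//img/@src is the form that stays relative to the element it is called on.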