Question

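How can I modify the code below so that it picks out the source of any images found in the description elements, which contain HTML? At the moment it just gets the full text from inside the element, and I don't know how to modify it to get the source of any img tags.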

>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', description.text

I also tried this at the end:

print(description.xpath("//img/@src"))

which gave me 'None'.

The XML structure is:

<guides>
<guide>
    <id>guide 1</id>
    <group>
    <id></id> 
    <type></type>
    <name></name>
    </group>
    <pages>
        <page>
            <id>page 1</id>
            <name></name>
            <description>&lt;p&gt;Some text. &lt;br /&gt;&lt;img 
            width=&quot;81&quot; 
            src=&quot;http://www.example.com/img.jpg&quot; 
             alt=&quot;wave&quot; height=&quot;63&quot; style=&quot;float: 
              right;&quot; /&gt;&lt;/p&gt;</description>
            <boxes>
                <box>
                    <id></id>
                    <name></name>
                    <type></type>
                    <map_id></map_id>
                    <column></column>
                    <position></position>
                    <hidden></hidden>
                    <created></created>
                    <updated></updated>
                    <assets>
                        <asset>
                            <id></id>
                            <name></name>
                            <type></type>
                            <description>&lt;img src=&quot;https://www.example.com/image.jpg&quot; alt=&quot;image&quot; height=&quot;42&quot; width=&quot;42&quot;&gt;</description>
                            <url/>
                            <owner>
                                <id></id>
                                <email></email>
                                <first_name></first_name>
                                <last_name></last_name>
                            </owner>
                        </asset>
                    </assets>
                </box>
            </boxes>
        </page>
    </pages>
</guide>
</guides>

Answer 1

The content of the description element is HTML. There are various ways of parsing it, one of which is lxml's html module.

>>> from lxml import etree, html
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', html.fromstring(description.text).attrib['src']
... 
('---', 'guide 1')
('------', 'page 1')
('---------', 'https://www.example.com/image.jpg')

Edit, in response to a comment:

>>> description.text
'<img src="https://www.example.com/image.jpg" alt="image" height="42" width="42">'
>>> from lxml import html
>>> img = html.fromstring(description.text)
>>> img.attrib['src']
'https://www.example.com/image.jpg'

Edit: handling exceptions.

Replace

'---------', html.fromstring(description.text).attrib['src']

with

try:
    '---------', html.fromstring(description.text).attrib['src']
except KeyError:
    '--------- No image URL present'

Edit, in reply to the comment of 9 November:

For an XML file in which the second guide element contains no HTML at all and the third contains HTML without a src attribute:

from lxml import etree, html
tree = etree.parse('guides.xml')
for guide in tree.xpath('guide'):
    print('---', guide.xpath('id')[0].text)
    for pages in guide.xpath('.//pages'):
        for page in pages:
            print('------', page.xpath('id')[0].text)
            for description in page.xpath('.//asset/description'):
                try:
                    print('---------', html.fromstring(description.text).attrib['src'])
                except (TypeError, KeyError):
                    # TypeError: description.text is None (empty element)
                    # KeyError: the parsed root element has no src attribute
                    print('--------- no src identifiable')
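The loop above reads src from the root of the parsed fragment, so it only finds an image when the img tag is the outermost element of the description. For descriptions such as the page-level one in the question, where the img sits inside a p, a variant along these lines may be more forgiving. This is only a sketch: the SAMPLE document and the image_sources helper are made up for illustration and are not part of the original answer.

```python
from lxml import etree, html

# Hypothetical sample mirroring the structure shown in the question.
SAMPLE = b"""<guides>
<guide>
  <id>guide 1</id>
  <pages>
    <page>
      <id>page 1</id>
      <description>&lt;p&gt;Some text. &lt;img src=&quot;http://www.example.com/img.jpg&quot; /&gt;&lt;/p&gt;</description>
      <boxes><box><assets>
        <asset><description>&lt;img src=&quot;https://www.example.com/image.jpg&quot; alt=&quot;image&quot;&gt;</description></asset>
        <asset><description></description></asset>
        <asset><description>plain text, no markup</description></asset>
      </assets></box></boxes>
    </page>
  </pages>
</guide>
</guides>"""


def image_sources(description):
    """Return every img src found in the escaped HTML of one description element."""
    text = description.text
    if text is None or not text.strip():
        return []  # empty element: nothing to parse
    fragment = html.fromstring(text)
    # descendant-or-self also matches when the img is the root of the fragment
    return fragment.xpath('descendant-or-self::img/@src')


root = etree.fromstring(SAMPLE)
sources = [image_sources(d) for d in root.xpath('.//description')]
for found in sources:
    print(found)
```

Because the helper returns a list, a description holding several images yields all of their URLs, and one without any image yields an empty list instead of raising an exception.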

Answer 2

You can try this solution:

description.xpath("//img/@src")
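Note that this is the same expression the question reports as returning nothing. A likely reason is that the HTML inside description is escaped, so the XML tree contains text rather than img elements, and an XPath evaluated on the XML tree has nothing to match. The sketch below, built on a made-up one-element document, contrasts the two cases; the variable names are illustrative only.

```python
from lxml import etree, html

# Hypothetical one-element document; the HTML is escaped, as in the question.
doc = etree.fromstring(
    '<asset><description>'
    '&lt;img src=&quot;https://www.example.com/image.jpg&quot;&gt;'
    '</description></asset>'
)
description = doc.find('description')

# The escaped markup is plain text, so the XML tree holds no img element.
direct = description.xpath('//img/@src')

# Once the text is parsed as HTML, an XPath on the new fragment finds the attribute.
parsed = html.fromstring(description.text).xpath('descendant-or-self::img/@src')

print(direct)
print(parsed)
```

The first print shows an empty list and the second shows the image URL, which suggests the XPath only becomes useful after the description text has been parsed as HTML.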

XPath for the img src within an element
