Question

I have this code:

from lxml.html import fromstring, tostring

html = "<p><img src='some_pic.jpg' />Here is some text</p>"

doc = fromstring(html)
img = doc.find('.//img')
doc.remove(img)

print(tostring(doc, encoding='unicode'))

The output is: <p></p>

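Why does removing the img tag also remove the text that follows it? In other words, why isn't the printed result <p>Here is some text</p>? How can I remove the tag without removing the text? Note that I get the same result even if I include an explicit closing tag on the img, i.e.: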

html = "<p><img src='some_pic.jpg'></img>Here is some text</p>"

Answer 1

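The text "Here is some text" is the tail of the img element. In lxml, the tail is stored on the element it follows, so it is removed together with that element.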
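A minimal sketch of where lxml stores that text, using the same HTML as the question. It only inspects the `text` and `tail` attributes and changes nothing:

```python
from lxml.html import fromstring

doc = fromstring("<p><img src='some_pic.jpg' />Here is some text</p>")
img = doc.find('.//img')

# Text that appears before the first child of <p>; there is none here.
print(repr(doc.text))

# Text that follows <img>; lxml stores it on the <img> element itself.
print(repr(img.tail))
```

Because the string lives on `img.tail` rather than on the `<p>` element, removing `img` takes the string with it.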

To keep the tail, assign it to the text of the img's parent:

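from lxml.html import fromstring, tostring

html = "<p><img src='some_pic.jpg' />Here is some text</p>"

doc = fromstring(html)
img = doc.find('.//img')
parent = img.getparent()
# img is the first child here, so its tail belongs in the parent's text.
# Append rather than overwrite, in case the parent already has leading text.
parent.text = (parent.text or '') + (img.tail or '')
parent.remove(img)

print(tostring(doc, encoding='unicode'))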

This prints:

<p>Here is some text</p>

Removing the img tag in lxml
