Partial loss of elements' text content using lxml

Time: 2017-04-14 22:31:30

Tags: python html python-3.x lxml

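I have some HTML markup from which I want to remove some <b> child elements of <center> elements (it's legacy markup...).

Problem: part of the containing <center> element's text disappears when I remove the children using Python lxml.

Example program (with simplified, illustrative markup):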

#!/usr/bin/env python3

from lxml import html, etree
from lxml.etree import tostring

html_snippet = """
<center>
    <b>IT wisdoms</b>
    <b>
        for your <a href="#">brain</a>:
    </b>
    NEVER <a href="#">change a running system</a> before the holidays!
</center>"""

tree = html.fromstring(html_snippet)
center_elem = tree.xpath("//center")[0]

print('----- BEFORE -----')
print(tostring(center_elem, pretty_print=True, encoding='unicode'))
for elem in center_elem.xpath("b"):
    elem.getparent().remove(elem)
print('----- AFTER -----')
print(tostring(center_elem, pretty_print=True, encoding='unicode'))

Output:

----- BEFORE -----
<center>
    <b>IT wisdoms</b>
    <b>
        for your <a href="#">brain</a>:
    </b>
    NEVER <a href="#">change a running system</a> before the holidays!
</center>

----- AFTER -----
<center>
    <a href="#">change a running system</a> before the holidays!
</center>

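As you can see, the <b> children are gone, but the word NEVER has disappeared as well, while the <a> element and the text "before the holidays!" remain.

I can't figure out how to keep it!

1 answer:

Answer 0 (score: 2)

Try using drop_tree() on the elements you want to remove: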

tree = html.fromstring(html_snippet)
center_elem = tree.xpath("//center")[0]
print('----- BEFORE -----')
print(etree.tostring(center_elem, pretty_print=True, encoding='unicode'))
for elem in center_elem.xpath("b"):
    elem.drop_tree()
print('----- AFTER -----')
print(etree.tostring(center_elem, pretty_print=True, encoding='unicode'))

This returns:

----- BEFORE -----
<center>
    <b>IT wisdoms</b>
    <b>
        for your <a href="#">brain</a>:
    </b>
    NEVER <a href="#">change a running system</a> before the holidays!
</center>

----- AFTER -----
<center>


    NEVER <a href="#">change a running system</a> before the holidays!
</center>
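
Some background on why this works: lxml stores the text that follows an element in that element's .tail attribute, so the word NEVER belongs to the second <b> and is removed together with it by remove(). drop_tree(), which is available on lxml.html elements, joins the tail to the previous sibling or to the parent before dropping the element. For plain lxml.etree elements, which do not have drop_tree(), the same effect can be approximated by hand. The following is only a minimal sketch; remove_keep_tail is a hypothetical helper name, not part of lxml.

```python
from lxml import html

html_snippet = """
<center>
    <b>IT wisdoms</b>
    <b>
        for your <a href="#">brain</a>:
    </b>
    NEVER <a href="#">change a running system</a> before the holidays!
</center>"""


def remove_keep_tail(elem):
    """Hypothetical helper: remove elem from its parent but keep its tail text."""
    parent = elem.getparent()
    tail = elem.tail or ""
    prev = elem.getprevious()
    if prev is not None:
        # Append the tail to the previous sibling's tail.
        prev.tail = (prev.tail or "") + tail
    else:
        # No previous sibling: the tail becomes part of the parent's text.
        parent.text = (parent.text or "") + tail
    parent.remove(elem)


tree = html.fromstring(html_snippet)
center_elem = tree.xpath("//center")[0]

# The text after the closing </b> tag is the tail of the second <b> element.
second_b = center_elem.xpath("b")[1]
tail_before = second_b.tail
print(repr(tail_before))

for elem in center_elem.xpath("b"):
    remove_keep_tail(elem)

result = html.tostring(center_elem, encoding="unicode")
print(result)
```

With lxml.html elements, drop_tree() remains the simpler choice; the manual version is mainly useful when the tree was parsed with lxml.etree.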