This is my Python code using lxml:
import urllib.request
from lxml import etree
#import lxml.html as html
from copy import deepcopy
from lxml import html
some_xml_data = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
root = etree.fromstring(some_xml_data)
[c] = root.xpath('//span')
print(etree.tostring(root))  # b'<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>', as expected
# But if I make some changes:
for e in c.iterchildren("*"):
    if e.tag == 'div':
        e.getparent().remove(e)
print(etree.tostring(root))  # b'<span>text1</span>': text2 and text3 were removed! How can I prevent this?
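It looks like when I change the lxml tree (by removing some tags), lxml also removes some of the unwrapped text! How can I stop lxml from doing this and keep the unwrapped text?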
Answer 0: (score: 3)
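The text that follows a node is called its tail. lxml stores the tail on that node, so removing the node removes the tail with it. You can keep the tail by appending it to the parent's text. Here is an example: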
In [1]: from lxml import html
In [2]: s = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
   ...:
In [3]: tree = html.fromstring(s)
In [4]: for node in tree.iterchildren("div"):
   ...:     if node.tail:
   ...:         node.getparent().text += node.tail
   ...:     node.getparent().remove(node)
   ...:
In [5]: html.tostring(tree)
Out[5]: b'<span>text1text2text3</span>'
I used html rather than etree because the markup looks more like HTML than XML. You can pass "div" to iterchildren to avoid the extra tag check.
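Appending every tail to the parent's text works for this input, but it would reorder text if other elements sat between the removed ones, and `+=` fails when the parent has no leading text (`text` is None). A minimal sketch of a more general variant follows. The helper name `remove_keep_tail` and the sample markup with the extra `<b>` element are assumptions made up for illustration; the idea is to attach the tail to the previous sibling when there is one, and to the parent's text otherwise.

```python
from lxml import etree


def remove_keep_tail(node):
    """Remove `node` from its parent but keep the text that follows it.

    Hypothetical helper for illustration, not part of lxml.
    """
    parent = node.getparent()
    if parent is None:
        raise ValueError("cannot remove the root element")
    if node.tail:
        previous = node.getprevious()
        if previous is not None:
            # The tail now follows the previous sibling.
            previous.tail = (previous.tail or "") + node.tail
        else:
            # No earlier sibling: the tail joins the parent's leading text.
            parent.text = (parent.text or "") + node.tail
    parent.remove(node)


# Sample markup (an assumption): a <b> element sits between the two <div>s.
data = "<span>text1<div>ddd</div>text2<b>keep</b>text3<div>ddd</div>text4</span>"

root = etree.fromstring(data)
# Materialize the list first so the tree is not modified while iterating.
for div in list(root.iterchildren("div")):
    remove_keep_tail(div)
result = etree.tostring(root)
print(result)

# lxml also ships etree.strip_elements; with_tail=False keeps the tail text.
root2 = etree.fromstring(data)
etree.strip_elements(root2, "div", with_tail=False)
print(etree.tostring(root2))
```

With this input both approaches should leave text3 and text4 after the `<b>` element rather than moving them in front of it. If the built-in is enough for your case, `etree.strip_elements(..., with_tail=False)` is probably the simpler choice; note that it strips matching elements at any depth, not only direct children.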