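I have to deal with two types of inline tags in an XML document. The first type encloses text that I want to keep. I can handle that with lxml's
etree.tostring(element, method="text", encoding='utf-8')
The second type encloses text that I don't want to keep. How can I get rid of these tags and their text? I would prefer not to use regular expressions if possible.
Thanks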
Answer 0 (score: 10)
I think strip_tags and strip_elements are what you want, one for each case. For example, this script:
from lxml import etree
text = "<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>"
tree = etree.fromstring(text)
# encoding="unicode" makes tostring() return str rather than bytes on Python 3
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))
# Remove the <z> tags, but keep their contents:
etree.strip_tags(tree, 'z')
print('-' * 72)
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))
# Remove all the <y> elements including their contents,
# but keep the text that follows each one (with_tail=False):
etree.strip_elements(tree, 'y', with_tail=False)
print('-' * 72)
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))
... produces the following output:
<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>

------------------------------------------------------------------------
<x>hello, keep me and <y>ignore me</y>, and here's some <y>more</y> text</x>

------------------------------------------------------------------------
<x>hello, keep me and , and here's some  text</x>
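If you need both steps in one place, they can be wrapped in a small helper. This is only a sketch for Python 3: the name clean_inline and its parameters are made up for illustration and are not part of lxml; only etree.fromstring, etree.strip_tags, etree.strip_elements and etree.tostring are real lxml calls.

```python
from lxml import etree


def clean_inline(xml_text, unwrap_tags, drop_tags):
    """Hypothetical helper: unwrap some inline tags and delete others."""
    root = etree.fromstring(xml_text)
    # Remove the tags themselves but merge their text into the parent.
    etree.strip_tags(root, *unwrap_tags)
    # Remove the elements together with their content.
    # with_tail=False keeps the text that follows each removed element.
    etree.strip_elements(root, *drop_tags, with_tail=False)
    return etree.tostring(root, encoding="unicode")


sample = (
    "<x>hello, <z>keep me</z> and <y>ignore me</y>, "
    "and here's some <y>more</y> text</x>"
)
result = clean_inline(sample, ["z"], ["y"])
print(result)
```

Note that with_tail defaults to True, in which case the text following each removed element is dropped as well, so passing with_tail=False is usually what you want for inline markup.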