Question

I want to remove everything that is not inside an XML tag (a clean-up), and optionally put the results in a list. I have some XML like this:

<tag>some text</tag> unwanted text <tag>some text</tag>

I would like to get this result with Python (regex):

('<tag>some text</tag>','<tag>some text</tag>')

I tried:

cleanup = re.findall(r"^<.>.*</.>$",  input)

But I think the whole input also matches the regex. How can I solve this?

UPDATE1:

I tried to load it with:

import xml.etree.ElementTree as ET
root = ET.fromstring(str(cleanup))

Answer 1

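I just want to expand on what has already been answered here, because I think the right approach is NOT to use regular expressions on XML-like content. You should use an XML parser. The unwanted content is called the tail of an element, and you can CLEAN it while walking the parsed tree. Here is one way: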

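import xml.etree.ElementTree as ET

# The fragments are wrapped in a single <root> element so the string is well-formed XML.
s = '''<root><tag>some text</tag> unwanted text <tag>some text</tag></root>'''

tree = ET.fromstring(s)

cleaned_tree = []

for node in tree:
    # Text that follows an element's closing tag is stored in its .tail attribute.
    node.tail = ''
    # On Python 3, encoding='unicode' makes tostring() return str instead of bytes.
    cleaned_tree.append(ET.tostring(node, encoding='unicode'))

print(cleaned_tree)
# prints: ['<tag>some text</tag>', '<tag>some text</tag>']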
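
To make the tail idea more concrete, here is a small sketch (Python 3, using the same sample string; the variable names are only for illustration) that lists each element's text and tail before anything is cleared:

```python
import xml.etree.ElementTree as ET

s = '<root><tag>some text</tag> unwanted text <tag>some text</tag></root>'
root = ET.fromstring(s)

# Collect (tag, text, tail) for each direct child of <root>.
# .text is the content inside the element; .tail is whatever follows its closing tag.
parts = [(node.tag, node.text, node.tail) for node in root]

for tag, text, tail in parts:
    print(tag, repr(text), repr(tail))
```

The first element carries ' unwanted text ' as its tail, while the second has no tail at all (None), which is why clearing the tail is enough to drop the stray text.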

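As a side note: you can inspect str(cleanup) and see that, unlike my example, it lacks a tag like root. The fact that fromstring() fails on it may be a hint that something is wrong with your XML source.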
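
If the input really has no single enclosing element, as in the question, one possible workaround is to wrap it in a temporary root before parsing. This is only a sketch for Python 3 and assumes the fragments themselves are well-formed; the helper name strip_loose_text and the wrapper tag name are made up for illustration:

```python
import xml.etree.ElementTree as ET


def strip_loose_text(fragment, wrapper="root"):
    """Return the top-level elements of `fragment` as strings,
    dropping any text that sits between them.

    Hypothetical helper: assumes every element in `fragment` is well-formed XML.
    """
    # An XML document needs exactly one root element, so add a temporary one.
    root = ET.fromstring("<{0}>{1}</{0}>".format(wrapper, fragment))
    cleaned = []
    for node in root:
        node.tail = None  # discard the text that follows this element
        cleaned.append(ET.tostring(node, encoding="unicode"))
    return cleaned


raw = "<tag>some text</tag> unwanted text <tag>some text</tag>"
result = strip_loose_text(raw)
print(result)
```

Text that appears before the first element ends up in the wrapper's own text rather than in any child, so it is dropped as well because only the children are serialized.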

Python: remove non-tag content from XML
