Question

A piece of software is handing me bad XML:

<sometag> some textnode with < and > characters in the middle of it</sometag>
So you can potentially have <notatag> but <isatag>some text</isatag>

So when I try to feed it to minidom's XML parser, it is understandably not happy.

My goal is to translate the stray < and > characters into the proper escape sequences:

<sometag> some textnode with &lt; and &gt; characters in the middle of it</sometag>
So you can potentially have &lt;notatag&gt; but <isatag>some text</isatag>

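I have looked at lxml's parser recover option (http://lxml.de/parsing.html), but it either tries to complete and close anything that looks like a tag, or it drops the wild < and > characters. I want the text kept exactly as it is, with everything that does not form a valid tag converted to escaped characters.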

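I don't have a complete list of the possible tags,
but tag names can only contain alphanumeric characters. No attributes, no weird stuff.
Tags can be nested.
I currently process the file line by line before running the parser, to do other things; putting the fix there would save me time and would be greatly appreciated. A pair of tags is never split across two lines.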
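
For illustration, here is a minimal sketch of the kind of per-line fix described above. It is only one possible approach, not a tested solution. It relies on the stated constraints (alphanumeric tag names, no attributes, an opening tag and its closing tag always on the same line) and treats a tag-like token as a real tag only when it has a matching partner on that line; everything else gets its < and > escaped. The names `escape_stray_brackets`, `TAG_RE` and `_escape` are made up for this example, the pattern additionally assumes tag names start with a letter (so the result stays valid XML), and existing & characters are left untouched.

```python
import re

# Hypothetical pattern: an opening or closing tag with a purely
# alphanumeric name and no attributes (assumed to start with a letter).
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def _escape(text):
    """Escape angle brackets in text that is not part of a real tag."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_stray_brackets(line):
    """Escape every < and > on a line that is not part of a matched tag pair."""
    tokens = list(TAG_RE.finditer(line))
    keep = set()   # indices of tokens that belong to a matched pair
    stack = []     # (name, token index) of opening candidates seen so far

    for i, match in enumerate(tokens):
        is_closing, name = match.group(1), match.group(2)
        if not is_closing:
            stack.append((name, i))
            continue
        # Look for the nearest still-open candidate with the same name.
        for pos in range(len(stack) - 1, -1, -1):
            if stack[pos][0] == name:
                keep.add(stack[pos][1])
                keep.add(i)
                # Candidates opened after it were never closed: not tags.
                del stack[pos:]
                break

    pieces = []
    last = 0
    for i, match in enumerate(tokens):
        if i in keep:
            pieces.append(_escape(line[last:match.start()]))
            pieces.append(match.group(0))
            last = match.end()
    pieces.append(_escape(line[last:]))
    return "".join(pieces)


if __name__ == "__main__":
    samples = [
        "<sometag> some textnode with < and > characters in the middle of it</sometag>",
        "So you can potentially have <notatag> but <isatag>some text</isatag>",
    ]
    for sample in samples:
        print(escape_stray_brackets(sample))
```

Because the function works on one line at a time, it could be called from the existing line-by-line preprocessing step before the result is passed to minidom.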

Thank you for your time and your help.

Fixing broken XML with wild < and > signs

0 answers: