Question

我有一个XML文件，其中包含以下字符串：

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>

在XML的主体中，我有>和<个字符，这些字符与XML规范不兼容。我需要替换它们，以便>和<位于：

 ' "> ' 
 ' " > ' and 
 ' </ '

分别应该 NOT 被替换，>和<的所有其他出现应该被“大于”和“小于”的字符串替换。所以结果应该是：

 <field name="id">abcdef</field>
 <field name="intro" > pqrst</field>
 <field name="desc"> this is a test file. We will show 5 greater than 2 and 3 less than 5 and try to remove non xml compatible characters.</field>

我怎么能用Python做到这一点？

Answer 1

似乎我是为>做的：

re.sub('(?<! " )(?<! ")(?! )>','greater than', xml_string)

?<! - 负面的背后断言，

?! - 负面前瞻断言，

(...)(...)是合乎逻辑的AND，

所以整个表达意味着“替换'＆gt;'的所有出现'哪个（不以'“开头）和（不以'”开头）和（不以''结尾）

案例<类似

Answer 2

您可以将lxml.etree.XMLParser与recover=True选项一起使用：

import sys
from lxml import etree

invalid_xml = """
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and
try to remove non xml compatible characters.</field>
"""
root = etree.fromstring("<root>%s</root>" % invalid_xml,
                        parser=etree.XMLParser(recover=True))
root.getroottree().write(sys.stdout)

输出

<root>
<field name="id">abcdef</field>
<field name="intro"> pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 35 and
try to remove non xml compatible characters.</field>
</root>

注意：>在文档中保留为>，<被完全删除（作为xml文本中的无效字符）。

基于正则表达式的解决方案

对于类似xml的简单内容，您可以使用re.split()将标记与文本分开，并在非标记文本区域中进行替换：

import re
from itertools import izip_longest
from xml.sax.saxutils import escape  # '<' -> '&lt;'

# assumptions:
#   doc = *( start_tag / end_tag / text )
#   start_tag = '<' name *attr [ '/' ] '>'
#   end_tag = '<' '/' name '>'
ws = r'[ \t\r\n]*'  # allow ws between any token
name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
attr = '{name} {ws} = {ws} "[^"]*"'  # note: fragile against missing '"'; no "'"
start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
tag = '{start_tag} | {end_tag}'

assert '{{' not in tag
while '{' in tag: # unwrap definitions
    tag = tag.format(**vars())

tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)

# escape &, <, > in the text
iters = [iter(tag_regex.split(invalid_xml))] * 2
pairs = izip_longest(*iters, fillvalue='')  # iterate 2 items at a time
print(''.join(escape(text) + tag for text, tag in pairs))

为避免标记误报，您可以删除上面的'{ws}'部分内容。

输出

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and
try to remove non xml compatible characters.</field>

注意：<>都会在文本中转义。

你可以调用任何函数而不是escape(text)，例如，

def escape4human(text):
    return text.replace('<', 'less than').replace('>', 'greater than')

Answer 3

使用ElementTree进行XML解析。

Python中的匹配模式

3 个答案:

输出

基于正则表达式的解决方案

输出