Question

I am parsing an XML file with Python.

from xml.dom import minidom
xmldoc = minidom.parse('selections.xml')

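But when I run it, I get this error: xml.parsers.expat.ExpatError: not well-formed (invalid token). After checking the file, I found that there are many stray < and > characters inside the tags' content. So I want to use a regular expression to escape those < and > characters as &lt; and &gt;. For example, in the text tag below, I want to escape the < and > around "Winning 11".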

<writing>
    <topic id="10">I am a fun</topic>
    <date>2012-03-1</date>   
    <grade>86</grade>
    <text>
          You know he is a soccer fan,so you'd better to buy the game is <Winning 11>!
    </text>
</writing>

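I know that &lt; and &gt; are the escaped forms of < and >. Since my XML file has too many tags to fix by hand, I want to solve this with a regular expression in vim.

Can anyone give me some ideas? I am new to regular expressions.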

Answer 1

Details:

:%s/    #search and replace on all lines in file
\(      #open \1 group
<text>\n #find the <text> tag followed by a newline
.*      #grab all text until next match
\)      #close \1  group
<       #the `<` mark we're looking for
\(      #open \2 group
.*\n    #grab all text until end of line
.*      #grab text on the next line
<\/text> #find </text> tag
\)      #close \2 group
/       #vi replace with
\1      #paste \1 group in
\&lt;   #replace `<` with its escaped version
\2      #paste \2 group in
/g      #Do on all occurrences

:%s/\(<text>\n.*\)<\(.*\n.*<\/text>\)/\1\&lt;\2/g

The second command is the same as the first, except that < is replaced with > and &lt; with &gt;:

:%s/\(<text>\n.*\)>\(.*\n.*<\/text>\)/\1\&gt;\2/g

Combine the two with |:

:%s/\(<text>\n.*\)<\(.*\n.*<\/text>\)/\1\&lt;\2/g | %s/\(<text>\n.*\)>\(.*\n.*<\/text>\)/\1\&gt;\2/g

Reference:
Capturing Groups and Backreferences

For the < part: in the regex without vim escaping, the first group matches everything up to the < mark, and the second group matches everything after it.
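
Since the question starts from Python, the two substitutions above can also be sketched with the re module, where groups and alternation need no backslash escaping. This is only an illustration under the assumption that the content sits on a single line between <text> and </text>, as in the sample; the variable names are made up for the example.

```python
import re

sample = """<text>
      You know he is a soccer fan,so you'd better to buy the game is <Winning 11>!
</text>"""

# Same shape as the vim patterns: group 1 is "<text>", the newline and the
# line up to the stray bracket; group 2 is the rest through "</text>".
lt_pattern = re.compile(r"(<text>\n.*)<(.*\n.*</text>)")
gt_pattern = re.compile(r"(<text>\n.*)>(.*\n.*</text>)")

# In a Python replacement string "&" is literal, so no "\&" is needed.
step1 = lt_pattern.sub(r"\1&lt;\2", sample)
fixed = gt_pattern.sub(r"\1&gt;\2", step1)
print(fixed)
```

Because `.*` is greedy and each match covers the whole <text> block, one pass of this sketch escapes only one bracket of each kind per block. A block with several stray brackets would need the substitutions repeated until nothing changes.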

Answer 2

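This is really not a good situation to be in.

However, if you know the valid XML tags in your file, the following matches only the "wrong" tags that you want to escape: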

<(?!/?grade|/?text)([^>]+)>

Add more valid tags to that list in the form |/?tag.

Then you can replace each match with

&lt;$1&gt;

See it here on Regexr.

If you need to do this in vim, you will have to convert it to a vim regex, which is not quite the same syntax.
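
As an alternative to converting the pattern for vim, the same idea can be applied in Python before parsing. The sketch below is an assumption-laden illustration, not a general fix: VALID_TAGS is a hypothetical list that must contain every real element name in the file, escape_stray_tags is a made-up helper name, and the pattern adds a \b after the tag name (a small variation on the regex above) so that a longer name starting with a valid tag name is not treated as valid.

```python
import re
from xml.dom import minidom

# Assumption: these are all the element names that legitimately occur in the file.
VALID_TAGS = ["writing", "topic", "date", "grade", "text"]

# Builds <(?!/?(?:writing|topic|...)\b)([^>]+)> from the list above.
alternatives = "|".join(re.escape(tag) for tag in VALID_TAGS)
stray = re.compile(r"<(?!/?(?:" + alternatives + r")\b)([^>]+)>")

def escape_stray_tags(xml_text):
    """Escape the angle brackets around anything that is not a known tag."""
    return stray.sub(r"&lt;\1&gt;", xml_text)

broken = """<writing>
    <topic id="10">I am a fun</topic>
    <date>2012-03-1</date>
    <grade>86</grade>
    <text>
          You know he is a soccer fan,so you'd better to buy the game is <Winning 11>!
    </text>
</writing>"""

fixed = escape_stray_tags(broken)
xmldoc = minidom.parseString(fixed)  # parses once the stray brackets are escaped

# Join the text nodes of <text>; the parser turns &lt; and &gt; back into < and >.
text_node = xmldoc.getElementsByTagName("text")[0]
content = "".join(node.data for node in text_node.childNodes)
print(content.strip())
```

For a file on disk, the text would be read first, passed through the helper, and then handed to minidom.parseString instead of calling minidom.parse on the file directly.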

How to escape '<' and '>' in XML tags using regex with vim?