Question

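I have been struggling to get a regular expression to work for me, but I am stuck on the last part. My goal is to remove an XML element when it is contained in a specific parent element. The sample XML looks like this: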

<ac:image ac:width="500">
    <ri:attachment ri:filename="image2013-10-31 11:21:16.png">
        <ri:page ri:content-title="Banana Farts" />             /* REMOVE THIS */
    </ri:attachment>
</ac:image>

The expression I wrote is:

(<ac:image.*?>)(<ri:attachment.*?)(<ri:page.*? />)(</ri:attachment></ac:image>)

In a more readable format, I am searching for four groups:

(<ac:image.*?>)                   //Find open image tag
(<ri:attachment.*?)               //Find open attachment tag
(<ri:page.*? />)                  //Find the page tag
(</ri:attachment></ac:image>)     //Find close image and attachment tags

This basically works, since I can remove the page element in Notepad++ by using this as the replacement:

\1\2\4

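My problem is that the search is too greedy. In the example below, it grabs everything from start to finish, when really only the second image tag is a valid find.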

<ac:image ac:width="500">
    <ri:attachment ri:filename="image2013-10-31 11:21:16.png" />
</ac:image>
<ac:image ac:width="500">
    <ri:attachment ri:filename="image2013-10-31 11:21:16.png">
        <ri:page ri:content-title="Employee Portal Editor" />
    </ri:attachment>
</ac:image>

Can anyone help me finish this off? I thought all I had to do was add a ? to make the closing-tag group non-greedy, but it doesn't work.

Answer 1

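Keep in mind that a regex engine will do everything it can to make the pattern succeed. Because you use several .*? in your pattern, you give the regex engine a lot of flexibility to do so. The pattern must be more constrained.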
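A minimal sketch of that flexibility, using Python's `re` module rather than Notepad++. It assumes the two image blocks sit on one line with no whitespace between tags (the pattern in the question allows none), and the file names `a.png` and `b.png` are shortened placeholders, not the real ones.

```python
import re

# Assumption: the question's two <ac:image> blocks collapsed onto one line,
# with made-up short file names.
data = (
    '<ac:image ac:width="500"><ri:attachment ri:filename="a.png" /></ac:image>'
    '<ac:image ac:width="500"><ri:attachment ri:filename="b.png">'
    '<ri:page ri:content-title="Employee Portal Editor" />'
    '</ri:attachment></ac:image>'
)

# The pattern from the question, unchanged.
lazy = re.compile(
    r'(<ac:image.*?>)(<ri:attachment.*?)(<ri:page.*? />)'
    r'(</ri:attachment></ac:image>)'
)

m = lazy.search(data)

# The match starts at the FIRST image tag, and group 2 has stretched across
# the first block's closing tag to reach the <ri:page> in the second block.
overmatched = m.start() == 0 and '</ac:image>' in m.group(2)
print(overmatched)
```

A lazy quantifier only prefers the shortest match; it is still allowed to grow past a `>` when that is the only way for the rest of the pattern to succeed.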

To do that, you can replace every .*? with [^>]*

Don't forget to allow optional whitespace, \s*, between the tags in the pattern.

Example:

(<ac:image[^>]*> \s* <ri:attachment[^>]*> )     # group 1
 \s* <ri:page[^>]*/> \s*                        # what you need to remove
(</ri:attachment> \s* </ac:image>)              # group 2

Replacement: $1$2
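The constrained pattern can be tried outside Notepad++ as well. The following is a sketch in Python, not part of the original answer: the names `sample`, `page_in_image` and `result` are illustrative, and `re.VERBOSE` is used because the spaced-out layout with `#` comments only works in free-spacing mode. Python writes the replacement as `\1\2`, which corresponds to `$1$2`.

```python
import re

# The two image blocks from the question; only the second contains <ri:page>.
sample = """\
<ac:image ac:width="500">
    <ri:attachment ri:filename="image2013-10-31 11:21:16.png" />
</ac:image>
<ac:image ac:width="500">
    <ri:attachment ri:filename="image2013-10-31 11:21:16.png">
        <ri:page ri:content-title="Employee Portal Editor" />
    </ri:attachment>
</ac:image>
"""

# [^>]* cannot cross a tag boundary, so each piece stays inside one tag.
page_in_image = re.compile(r"""
    (<ac:image[^>]*> \s* <ri:attachment[^>]*>)   # group 1: opening tags
    \s* <ri:page[^>]*/> \s*                      # the element to drop
    (</ri:attachment> \s* </ac:image>)           # group 2: closing tags
""", re.VERBOSE)

# Keep both groups and drop what lies between them.
result = page_in_image.sub(r"\1\2", sample)
print(result)
```

The first block is left untouched because its self-closing attachment tag is followed by `</ac:image>` rather than `<ri:page`, so the pattern cannot match there. Note that the whitespace around the removed element is dropped along with it.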
