Question

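My HTML content contains <pre> tags that in turn contain other tags. All angle brackets inside the <pre> content should be escaped with HTML entities. In other words, every < should become &lt; and every > should become &gt;.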

For starters, I just want to find out which files have offending content. Can anyone think of a way to do this with a regular expression?

BAD: the RegEx should match this.

<body>
    <h1>My Content</h1>
    <pre class="some-class">
        <foo>
            <bar>Content</bar>
            <script>
                alert('Hi!');
            </script>
        </foo>
        <br>
    </pre>

    <p>The middle</p>

    <pre class="other-class">
        <bar>
            <foo>Text</foo>
            <script>
                alert('Bye!');
            </script>
        </bar>
        <br>
    </pre>
    <p>The end</p>
</body>

GOOD: the RegEx should not match this.

<body>
    <h1>My Content</h1>
    <pre class="some-class">
        &lt;foo&gt;
            &lt;bar&gt;Content&lt;/bar&gt;
            &lt;script&gt;
                alert('Hi!');
            &lt;/script&gt;
        &lt;/foo&gt;
        &lt;br&gt;
    </pre>

    <p>The middle</p>

    <pre class="other-class">
        &lt;bar&gt;
            &lt;foo&gt;Text&lt;/foo&gt;
            &lt;script&gt;
                alert('Bye!');
            &lt;/script&gt;
        &lt;/bar&gt;
        &lt;br&gt;
    </pre>
    <p>The end</p>
</body>

Answer 1

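To find the shortest match in a regular expression, use the reluctant quantifier .*?. In addition, for . to match line breaks, DOTALL mode is needed, which the inline flag (?s) switches on.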
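As a rough illustration of why both pieces matter, the sketch below counts matches for three variants of the pattern on a small input with two <pre> blocks. The class name ReluctantDemo and the helper countMatches are hypothetical names chosen for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantDemo {
    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<pre>\n<a>\n</pre>\n<p>mid</p>\n<pre>\n<b>\n</pre>";

        // Greedy with DOTALL: a single match from the first <pre> to the last </pre>.
        System.out.println(countMatches("(?s)<pre[^>]*>.*</pre>", html));  // prints 1
        // Reluctant with DOTALL: one match per <pre> block.
        System.out.println(countMatches("(?s)<pre[^>]*>.*?</pre>", html)); // prints 2
        // Reluctant without DOTALL: '.' cannot cross a line break, so nothing matches.
        System.out.println(countMatches("<pre[^>]*>.*?</pre>", html));     // prints 0
    }
}
```

With the greedy form, the paragraph between the two blocks would be swallowed into the match and escaped along with the <pre> content.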

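import java.util.regex.Matcher;
import java.util.regex.Pattern;

// (?s) lets . match line breaks, (?i) ignores case.
// Group 1 is the opening <pre ...> tag, group 2 is the block's content.
Pattern prePattern = Pattern.compile("(?si)(<pre[^>]*>)(.*?)</pre>");
StringBuffer sb = new StringBuffer(html.length() + 1000);
Matcher m = prePattern.matcher(html);
while (m.find()) {
    String text = m.group(2);
    text = text.replace("<", "&lt;").replace(">", "&gt;");
    // quoteReplacement keeps any $ or \ in the content from being read as a group reference.
    m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1) + text + "</pre>"));
}
m.appendTail(sb);
html = sb.toString();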
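For the first step asked about in the question (only finding out which files are affected), the same pattern can be used without rewriting anything. The following is a minimal sketch under some assumptions: the class PreTagAudit and its methods are hypothetical names, files are assumed to be UTF-8 and to end in .html, and nested <pre> elements are not handled.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class PreTagAudit {
    // Same pattern as in the answer above: group 2 is the content of one <pre> block.
    static final Pattern PRE = Pattern.compile("(?si)(<pre[^>]*>)(.*?)</pre>");

    // Returns true if any <pre> block still contains a raw angle bracket.
    static boolean hasUnescapedPre(String html) {
        Matcher m = PRE.matcher(html);
        while (m.find()) {
            String body = m.group(2);
            if (body.indexOf('<') >= 0 || body.indexOf('>') >= 0) {
                return true;
            }
        }
        return false;
    }

    // Walks a directory tree and collects the .html files that need fixing.
    // Assumption: files are UTF-8 encoded.
    static List<Path> findOffendingFiles(Path root) throws IOException {
        List<Path> offenders = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p) && p.toString().endsWith(".html")
                        && hasUnescapedPre(new String(Files.readAllBytes(p), StandardCharsets.UTF_8))) {
                    offenders.add(p);
                }
            }
        }
        return offenders;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findOffendingFiles(root)) {
            System.out.println(p);
        }
    }
}
```

Run against the BAD sample from the question, hasUnescapedPre should return true; against the GOOD sample it should return false, because the only angle brackets left belong to the <pre> tags themselves.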

Answer 2

Thanks to @Jens and @Joop, I ended up with a solution that combines the JSoup parser with a RegEx.

Find all <pre> elements that contain child elements:

Document doc = Jsoup.parse(html);
Elements badPres = doc.select("pre:has(*)");

Then loop over them using @Joop's RegEx solution.
