Question

I want to extract the tags between <body> and </body>.

String patternHtml = "(*?)<body>(.*?)</body>(*?)";
Pattern rHtml = Pattern.compile(pattern, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher mHtml = rHtml.matcher(html);

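I don't know why, but this extracts all of the code, including <head> and <style>...

Please: I need to use a regular expression, so please don't suggest alternatives such as a parser library...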

Answer 1

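If you only want to (I quote) "extract tags", which I interpret as the opening nodes inside the body section of your HTML text, you can use the solution below.

Note that this is barbaric. You should not "parse" HTML with regular expressions (I know you know, but other readers may not).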

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// simple html file, has head/body and line breaks
String html = "<html>\r\n<head>\r\n<title>Foo</title>\r\n</head>\r\n" +
        "<body>\r\n<h1>Blah</h1>\r\n<h3>Meh</h3>\r\n</body>\r\n</html>";
// the pattern only checks for opening nodes without attributes
Pattern tagsWithinBody = Pattern.compile("<\\p{Alnum}+>");
// matcher is applied to the text between the "<body>" open and close nodes
// (start right after the whole "<body>" tag, not one character into it)
Matcher matcher = tagsWithinBody.matcher(
        html.substring(html.indexOf("<body>") + "<body>".length(), html.indexOf("</body>")));
// iterates over matcher as long as it finds another opening node
while (matcher.find()) {
    System.out.println(matcher.group());
}

Output:

<h1>
<h3>
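
Since the question asks for a regex-only approach, here is a minimal sketch of a variant that also uses a regular expression to isolate the body, instead of indexOf/substring. The class name BodyTagExtractor and the method openingTagsInBody are made up for illustration, and the patterns rest on assumptions: there is a single body element, and no attribute value contains a ">" character. Unlike the snippet above, it tolerates an upper-case <BODY> and tags that carry attributes. It remains just as barbaric as the original.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class BodyTagExtractor {
    // Group 1 captures everything between the opening and closing body tags.
    // DOTALL lets "." cross line breaks; CASE_INSENSITIVE accepts <BODY>.
    private static final Pattern BODY = Pattern.compile(
            "<body\\b[^>]*>(.*?)</body\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Opening tags only, with or without attributes; closing tags such as
    // </h1> are skipped because "<" must be followed by a letter.
    // Assumption: attribute values never contain ">".
    private static final Pattern OPENING_TAG =
            Pattern.compile("<[a-zA-Z][a-zA-Z0-9]*\\b[^>]*>");

    static List<String> openingTagsInBody(String html) {
        List<String> tags = new ArrayList<>();
        Matcher body = BODY.matcher(html);
        if (!body.find()) {
            return tags; // no body element found
        }
        // Search for opening tags only inside the captured body content.
        Matcher tag = OPENING_TAG.matcher(body.group(1));
        while (tag.find()) {
            tags.add(tag.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<html>\r\n<head>\r\n<title>Foo</title>\r\n</head>\r\n"
                + "<body class=\"main\">\r\n<h1>Blah</h1>\r\n"
                + "<h3 id=\"x\">Meh</h3>\r\n</body>\r\n</html>";
        for (String t : openingTagsInBody(html)) {
            System.out.println(t);
        }
    }
}
```

Run against the sample in main, this prints <h1> and <h3 id="x">; the <title> inside <head> is not reported because only the captured body content is searched.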
