Question

This is my problem:

String pattern1 = "<pre.*?>(.+?)</pre>";
Matcher m = Pattern.compile(pattern1).matcher(html);
if(m.find()) {
    String temp = m.group(1);
    System.out.println(temp);
}

temp does not retain the line breaks; it flows as a single line. How can I keep the line breaks within temp?

Answer 1

You should not parse HTML with a regex, but to fix this particular problem, use the DOTALL modifier (?s):

String pattern1 = "(?s)<pre[^>]*>(.+?)</pre>";
                   ↑↑↑↑
                     |_______ Forces the . to span across newline sequences.
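To see what the flag changes, here is a minimal self-contained sketch. The class name DotallDemo, the helper firstPre, and the sample HTML are hypothetical and only serve the illustration; the two patterns are the one from this answer with and without (?s).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {

    // Hypothetical helper: returns the body of the first <pre> element,
    // or null when the pattern does not match at all.
    static String firstPre(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample input (an assumption): a <pre> body spanning two lines.
        String html = "<pre class=\"code\">line one\nline two</pre>";

        // Without (?s), '.' cannot cross '\n', so the pattern never
        // reaches </pre> and firstPre returns null.
        System.out.println(firstPre("<pre[^>]*>(.+?)</pre>", html));

        // With (?s), '.' also matches line terminators, so the body is
        // captured with its line break intact.
        System.out.println(firstPre("(?s)<pre[^>]*>(.+?)</pre>", html));
    }
}
```

With a multi-line body, the pattern without the flag finds no match at all, so the flag is what lets the group capture text across lines.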

Answer 2

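Using JSoup: an HTML parser

It is well known that you should not use a regex to parse HTML content; you should use an HTML parser instead. Below you can see how to do it with JSoup: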

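import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<p>lorem ipsum</p><pre>Hello World</pre><p>dolor sit amet</p>";
Document document = Jsoup.parse(html);
// Select every <pre> element in the document
Elements pres = document.select("pre");

for (Element pre : pres) {
    System.out.println(pre.text());
}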

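Pattern.DOTALL: the single-line compile flag

However, if you still want to use a regex, remember that the dot is a wildcard that does not match \n unless you explicitly tell it to. There are several ways to do that, for example by passing Pattern.DOTALL:

String pattern1 = "<pre.*?>(.+?)</pre>";
Matcher m = Pattern.compile(pattern1, Pattern.DOTALL).matcher(html);
if(m.find()) {
    String temp = m.group(1);
    System.out.println(temp);
}

Inline single-line flag:

Or use the inline (?s) flag inside the regex, like this:

String pattern1 = "(?s)<pre.*?>(.+?)</pre>";
Matcher m = Pattern.compile(pattern1).matcher(html);
if(m.find()) {
    String temp = m.group(1);
    System.out.println(temp);
}

Regex trick

Or you can use a regex trick: a complementary set such as [\s\S], [\d\D], or [\w\W], which matches any character, line breaks included. For example, [\s\S]+? can take the place of .+? in the capturing group.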

But as nhahtdh pointed out in a comment, this trick may hurt the performance of the regex engine.
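The code that originally accompanied this trick did not survive on this page, so the following is only a sketch of what it could look like. The class name ComplementSetDemo, the helper firstPre, and the sample input are hypothetical; the pattern is the question's pattern with [\s\S] substituted for the dot in the capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ComplementSetDemo {

    // [\s\S] means "whitespace or non-whitespace", i.e. any character,
    // including '\n'. No DOTALL flag is needed for the capturing group.
    private static final Pattern PRE =
            Pattern.compile("<pre.*?>([\\s\\S]+?)</pre>");

    // Hypothetical helper: body of the first <pre> element, or null.
    static String firstPre(String html) {
        Matcher m = PRE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample input (an assumption): a <pre> body spanning two lines.
        String html = "<p>intro</p><pre>first\nsecond</pre>";
        // The captured text still contains the line break.
        System.out.println(firstPre(html));
    }
}
```

Note that the dot in <pre.*?> is left untouched here, so an opening tag whose attributes are split across lines would still not match; [\s\S] only affects the part of the pattern where it replaces the dot.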

<pre> tag does not retain line breaks when using regex Java
