Replacing newline characters inside child elements with Jsoup

Time: 2013-12-10 18:32:18

Tags: java html jsoup

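I'm having trouble replacing newline characters inside all <pre> elements of a given HTML document using Jsoup. Here is what I have tried so far and the problem I'm facing. I'm trying to replace every \n character with <br>, but only in the innerHtml of the <pre> tags; I want to leave the rest of the content untouched. The code is: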

String body = "<p>This is the output:</p>\n<pre class=\"lang-xml prettyprint prettyprinted\">\n<code><span class=\"dec\">&lt;!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\"&gt;</span><span class=\"pln\">\n</span><span class=\"tag\">&lt;HTML&gt;</span><span class=\"pln\">\n    </span><span class=\"tag\">&lt;HEAD&gt;</span><span class=\"pln\">\n        </span><span class=\"tag\">&lt;META</span><span class=\"pln\"> </span><span class=\"atn\">http-equiv</span><span class=\"pun\">=</span><span class=\"atv\">\"Content-Type\"</span><span class=\"pln\"> </span><span class=\"atn\">content</span><span class=\"pun\">=</span><span class=\"atv\">\"text/html; charset=iso-8859-1\"</span><span class=\"tag\">&gt;</span><span class=\"pln\">\n        </span><span class=\"tag\">&lt;TITLE&gt;</span><span class=\"pln\">GeteBayOfficialTime</span><span class=\"tag\">&lt;/TITLE&gt;</span><span class=\"pln\">\n    </span><span class=\"tag\">&lt;/HEAD&gt;</span><span class=\"pln\">\n    </span><span class=\"tag\">&lt;BODY&gt;</span><span class=\"pln\">\n\n* About to connect() to api.ebay.com port 443 (#0)\n*   Trying 66.135.211.100... * Timeout\n*   Trying 66.135.211.140... * Timeout\n*   Trying 66.211.179.150... * Timeout\n*   Trying 66.211.179.180... * Timeout\n*   Trying 66.135.211.101... * Timeout\n*   Trying 66.211.179.148... * Timeout\n* connect() timed out!\n* Closing connection #0\n</span><span class=\"tag\">&lt;P&gt;</span><span class=\"pln\">Error sending request</span></code></pre>";
            log.info("printing before creating a Jsoup Doc "+  body);
            Document bodyDom = Jsoup.parse(body);
            log.info("printing after creating a Jsoup Doc "+  bodyDom.html());

            Elements preTags = bodyDom.getElementsByTag("pre");

            for (Element pre : preTags) {
                pre.html(pre.html().replaceAll("(\r\n|\n)", "<br />"));
                log.info("Pre element with linebreaks replaced -" + pre);
            }

            body = bodyDom.html();

Here is the log. It seems the HTML source loses its line breaks once it has been parsed into a Jsoup document:

**2013-12-10 10:14:59 INFO  FormattingTest:166** - printing before creating a Jsoup Doc <p>This is the output:</p>
<pre class="lang-xml prettyprint prettyprinted">
<code><span class="dec">&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;</span><span class="pln">
</span><span class="tag">&lt;HTML&gt;</span><span class="pln">
    </span><span class="tag">&lt;HEAD&gt;</span><span class="pln">
        </span><span class="tag">&lt;META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">"Content-Type"</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">"text/html; charset=iso-8859-1"</span><span class="tag">&gt;</span><span class="pln">
        </span><span class="tag">&lt;TITLE&gt;</span><span class="pln">GeteBayOfficialTime</span><span class="tag">&lt;/TITLE&gt;</span><span class="pln">
    </span><span class="tag">&lt;/HEAD&gt;</span><span class="pln">
    </span><span class="tag">&lt;BODY&gt;</span><span class="pln">

* About to connect() to api.ebay.com port 443 (#0)
*   Trying 66.135.211.100... * Timeout
*   Trying 66.135.211.140... * Timeout
*   Trying 66.211.179.150... * Timeout
*   Trying 66.211.179.180... * Timeout
*   Trying 66.135.211.101... * Timeout
*   Trying 66.211.179.148... * Timeout
* connect() timed out!
* Closing connection #0
</span><span class="tag">&lt;P&gt;</span><span class="pln">Error sending request</span></code></pre>


**2013-12-10 10:14:59 INFO  FormattingTest:168** - printing after creating a Jsoup Doc <html>
 <head></head>
 <body>
  <p>This is the output:</p> 
  <pre class="lang-xml prettyprint prettyprinted">
<code><span class="dec">&lt;!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 4.01 Transitional//EN&quot; &quot;http://www.w3.org/TR/html4/loose.dtd&quot;&gt;</span><span class="pln"> </span><span class="tag">&lt;HTML&gt;</span><span class="pln"> </span><span class="tag">&lt;HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">&quot;Content-Type&quot;</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">&quot;text/html; charset=iso-8859-1&quot;</span><span class="tag">&gt;</span><span class="pln"> </span><span class="tag">&lt;TITLE&gt;</span><span class="pln">GeteBayOfficialTime</span><span class="tag">&lt;/TITLE&gt;</span><span class="pln"> </span><span class="tag">&lt;/HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;BODY&gt;</span><span class="pln"> * About to connect() to api.ebay.com port 443 (#0) * Trying 66.135.211.100... * Timeout * Trying 66.135.211.140... * Timeout * Trying 66.211.179.150... * Timeout * Trying 66.211.179.180... * Timeout * Trying 66.135.211.101... * Timeout * Trying 66.211.179.148... * Timeout * connect() timed out! * Closing connection #0 </span><span class="tag">&lt;P&gt;</span><span class="pln">Error sending request</span></code></pre>
 </body>
</html>
2013-12-10 10:14:59 INFO  FormattingTest:174 - Pre element with linebreaks replaced -  <pre class="lang-xml prettyprint prettyprinted"><code><span class="dec">&lt;!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 4.01 Transitional//EN&quot; &quot;http://www.w3.org/TR/html4/loose.dtd&quot;&gt;</span><span class="pln"> </span><span class="tag">&lt;HTML&gt;</span><span class="pln"> </span><span class="tag">&lt;HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">&quot;Content-Type&quot;</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">&quot;text/html; charset=iso-8859-1&quot;</span><span class="tag">&gt;</span><span class="pln"> </span><span class="tag">&lt;TITLE&gt;</span><span class="pln">GeteBayOfficialTime</span><span class="tag">&lt;/TITLE&gt;</span><span class="pln"> </span><span class="tag">&lt;/HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;BODY&gt;</span><span class="pln"> * About to connect() to api.ebay.com port 443 (#0) * Trying 66.135.211.100... * Timeout * Trying 66.135.211.140... * Timeout * Trying 66.211.179.150... * Timeout * Trying 66.211.179.180... * Timeout * Trying 66.135.211.101... * Timeout * Trying 66.211.179.148... * Timeout * connect() timed out! * Closing connection #0 </span><span class="tag">&lt;P&gt;</span><span class="pln">Error sending request</span></code></pre>

I'm not sure what the issue is. The same code works with another HTML source, "\nResponse:\n some text \n \ndsjkhskjdh sdjhasjkdas \n",

which is converted correctly to:


Response :
some text

dsjkhskjdh sdjhasjkdas

I'm not sure why the first sample isn't!

1 answer:

Answer 0 (score: 3)

The problem is that when you try this:

    Jsoup.parse("\nText\nNex").html();

you get:

    Text Nex

Following this question, you can do:

    Document bodyDom = Jsoup.parse(body.replaceAll("(\\r\\n|\\n)", "<br />"));

This replaces the line breaks before the document is parsed.
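
As a minimal sketch of what this whole-body replacement does, using only the standard library: the class and method names below are hypothetical, and the sample input is invented for illustration. Note that the line break between </p> and <pre> is replaced too, which is why the next section narrows the replacement to the <pre> blocks.

```java
final class NewlineToBr {
    // Replaces every CRLF or LF anywhere in the input with a <br /> tag.
    static String replaceAllNewlines(String html) {
        return html.replaceAll("\\r\\n|\\n", "<br />");
    }

    public static void main(String[] args) {
        // Invented sample: one line break outside <pre>, one inside.
        String body = "<p>Intro</p>\n<pre>a\nb</pre>";
        System.out.println(replaceAllNewlines(body));
    }
}
```

The result would then be passed to Jsoup.parse as in the line above.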

Replacing only the <pre> code

To replace line breaks only between the pre tags, extract those blocks with a regular expression and replace within them:

    Pattern preP = Pattern.compile("<pre.*?>.+?</pre>", Pattern.DOTALL
            | Pattern.CASE_INSENSITIVE);
    Matcher m = preP.matcher(body);
    while (m.find()) {
        String toReplace = m.group();
        String replaced = toReplace.replaceAll("(\r\n|\n)", "<br />");
        body = body.replace(toReplace, replaced);
    }
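
A possible single-pass variant of the loop above, sketched with Matcher.appendReplacement so the body is rebuilt once instead of calling body.replace for every match. The class and method names are hypothetical, and the pattern is slightly tightened compared with the one above (`\b[^>]*` for the opening tag, `.*?` so empty blocks also match); treat both as assumptions rather than the answer's own code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PreNewlines {
    // Reluctant .*? stops at the first </pre>, so each block is matched separately.
    private static final Pattern PRE_BLOCK = Pattern.compile(
            "<pre\\b[^>]*>.*?</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Replaces CRLF/LF with <br /> inside <pre> blocks only.
    static String replaceNewlinesInPre(String html) {
        Matcher m = PRE_BLOCK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replaced = m.group().replaceAll("\\r\\n|\\n", "<br />");
            // quoteReplacement keeps any '$' or '\' in the block literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replaced));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String body = "<p>x</p>\n<pre>a\nb</pre>\n<pre>c\nd</pre>";
        System.out.println(replaceNewlinesInPre(body));
    }
}
```

Line breaks outside the <pre> blocks are left as they were.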

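.+? is a reluctant (lazy) quantifier, so the match ends at the first occurrence of </pre>. You can attempt this with a regular expression, but regular expressions cannot reliably parse HTML; see this answer for a better explanation. I suggest you use the next option.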
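
To illustrate the difference between the reluctant and the greedy quantifier, here is a small sketch on an invented input with two <pre> blocks; the class and helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<pre>a</pre><p>x</p><pre>b</pre>";
        // Reluctant: ends at the first </pre>.
        System.out.println(firstMatch("<pre>.+?</pre>", html));
        // Greedy: runs on to the last </pre>, swallowing the <p> in between.
        System.out.println(firstMatch("<pre>.+</pre>", html));
    }
}
```

With the greedy form, content between two <pre> blocks would also have its line breaks replaced, which is not what the question asks for.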

You can see an example of the regular expression here.

Cleaning the string before parsing

Following the second answer, you can use:

    Document.OutputSettings outputSettings = new Document.OutputSettings()
            .prettyPrint(false);
    body = Jsoup.clean(body, "", Whitelist.relaxed(), outputSettings);

Then (as in the original code):

    pre.html(pre.html().replaceAll("(\r\n|\n)", "<br />"));

With prettyPrint turned off, the clean method keeps the line breaks, and the parser then handles them correctly.

Cheers