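I'm having trouble using Jsoup to replace the line breaks inside all <pre> elements of a given HTML document.
Here is what I have tried so far and the problem I am facing.
I am trying to replace every \n character with <br>, but only in the innerHtml of the <pre> tags. I want to leave the rest of the content untouched.
The code is: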
String body = "<p>This is the output:</p>\n<pre class=\"lang-xml prettyprint prettyprinted\">\n<code><span class=\"dec\"><!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\"></span><span class=\"pln\">\n</span><span class=\"tag\"><HTML></span><span class=\"pln\">\n </span><span class=\"tag\"><HEAD></span><span class=\"pln\">\n </span><span class=\"tag\"><META</span><span class=\"pln\"> </span><span class=\"atn\">http-equiv</span><span class=\"pun\">=</span><span class=\"atv\">\"Content-Type\"</span><span class=\"pln\"> </span><span class=\"atn\">content</span><span class=\"pun\">=</span><span class=\"atv\">\"text/html; charset=iso-8859-1\"</span><span class=\"tag\">></span><span class=\"pln\">\n </span><span class=\"tag\"><TITLE></span><span class=\"pln\">GeteBayOfficialTime</span><span class=\"tag\"></TITLE></span><span class=\"pln\">\n </span><span class=\"tag\"></HEAD></span><span class=\"pln\">\n </span><span class=\"tag\"><BODY></span><span class=\"pln\">\n\n* About to connect() to api.ebay.com port 443 (#0)\n* Trying 66.135.211.100... * Timeout\n* Trying 66.135.211.140... * Timeout\n* Trying 66.211.179.150... * Timeout\n* Trying 66.211.179.180... * Timeout\n* Trying 66.135.211.101... * Timeout\n* Trying 66.211.179.148... * Timeout\n* connect() timed out!\n* Closing connection #0\n</span><span class=\"tag\"><P></span><span class=\"pln\">Error sending request</span></code></pre>";
log.info("printing before creating a Jsoup Doc "+ body);
Document bodyDom = Jsoup.parse(body);
log.info("printing after creating a Jsoup Doc "+ bodyDom.html());
Elements preTags = bodyDom.getElementsByTag("pre");
for (Element pre : preTags) {
pre.html(pre.html().replaceAll("(\r\n|\n)", "<br />"));
log.info("Pre element with linebreaks replaced -" + pre);
}
body = bodyDom.html();
Here is the log. It looks like the HTML source loses its line breaks once it has been parsed into a Jsoup document:
**2013-12-10 10:14:59 INFO FormattingTest:166** - printing before creating a Jsoup Doc <p>This is the output:</p>
<pre class="lang-xml prettyprint prettyprinted">
<code><span class="dec"><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"></span><span class="pln">
</span><span class="tag"><HTML></span><span class="pln">
</span><span class="tag"><HEAD></span><span class="pln">
</span><span class="tag"><META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">"Content-Type"</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">"text/html; charset=iso-8859-1"</span><span class="tag">></span><span class="pln">
</span><span class="tag"><TITLE></span><span class="pln">GeteBayOfficialTime</span><span class="tag"></TITLE></span><span class="pln">
</span><span class="tag"></HEAD></span><span class="pln">
</span><span class="tag"><BODY></span><span class="pln">
* About to connect() to api.ebay.com port 443 (#0)
* Trying 66.135.211.100... * Timeout
* Trying 66.135.211.140... * Timeout
* Trying 66.211.179.150... * Timeout
* Trying 66.211.179.180... * Timeout
* Trying 66.135.211.101... * Timeout
* Trying 66.211.179.148... * Timeout
* connect() timed out!
* Closing connection #0
</span><span class="tag"><P></span><span class="pln">Error sending request</span></code></pre>
**2013-12-10 10:14:59 INFO FormattingTest:168** - printing after creating a Jsoup Doc <html>
<head></head>
<body>
<p>This is the output:</p>
<pre class="lang-xml prettyprint prettyprinted">
<code><span class="dec"><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"></span><span class="pln"> </span><span class="tag"><HTML></span><span class="pln"> </span><span class="tag"><HEAD></span><span class="pln"> </span><span class="tag"><META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">"Content-Type"</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">"text/html; charset=iso-8859-1"</span><span class="tag">></span><span class="pln"> </span><span class="tag"><TITLE></span><span class="pln">GeteBayOfficialTime</span><span class="tag"></TITLE></span><span class="pln"> </span><span class="tag"></HEAD></span><span class="pln"> </span><span class="tag"><BODY></span><span class="pln"> * About to connect() to api.ebay.com port 443 (#0) * Trying 66.135.211.100... * Timeout * Trying 66.135.211.140... * Timeout * Trying 66.211.179.150... * Timeout * Trying 66.211.179.180... * Timeout * Trying 66.135.211.101... * Timeout * Trying 66.211.179.148... * Timeout * connect() timed out! * Closing connection #0 </span><span class="tag"><P></span><span class="pln">Error sending request</span></code></pre>
</body>
</html>
2013-12-10 10:14:59 INFO FormattingTest:174 - Pre element with linebreaks replaced - <pre class="lang-xml prettyprint prettyprinted"><code><span class="dec"><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"></span><span class="pln"> </span><span class="tag"><HTML></span><span class="pln"> </span><span class="tag"><HEAD></span><span class="pln"> </span><span class="tag"><META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">"Content-Type"</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">"text/html; charset=iso-8859-1"</span><span class="tag">></span><span class="pln"> </span><span class="tag"><TITLE></span><span class="pln">GeteBayOfficialTime</span><span class="tag"></TITLE></span><span class="pln"> </span><span class="tag"></HEAD></span><span class="pln"> </span><span class="tag"><BODY></span><span class="pln"> * About to connect() to api.ebay.com port 443 (#0) * Trying 66.135.211.100... * Timeout * Trying 66.135.211.140... * Timeout * Trying 66.211.179.150... * Timeout * Trying 66.211.179.180... * Timeout * Trying 66.135.211.101... * Timeout * Trying 66.211.179.148... * Timeout * connect() timed out! * Closing connection #0 </span><span class="tag"><P></span><span class="pln">Error sending request</span></code></pre>
I'm not sure what the problem is. The same code works with another HTML source, "\nResponse:\n some text \n\ndsjkhskjdh sdjhasjkdas\n",
which is converted correctly to:
Response :
some text
dsjkhskjdh sdjhasjkdas
I'm not sure why the first sample isn't!
Answer 0 (score: 3)
The problem is that when you try to do this:
Jsoup.parse("\nText\nNex").html();
You get:
Text Nex
Following this question, you can do:
Document bodyDom = Jsoup.parse(body.replaceAll("(\\r\\n|\\n)", "<br />"));
This replaces the line breaks before the document is parsed.
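One thing to keep in mind with this approach: the replacement runs over the whole string, so line breaks outside <pre> are converted too. A minimal sketch using only the JDK (the class name, method name, and sample string are made up for illustration):

```java
class GlobalBreakDemo {
    // Same regex as in the answer: replaces every CRLF or LF in the input,
    // regardless of which element it sits in.
    static String replaceAllBreaks(String html) {
        return html.replaceAll("(\\r\\n|\\n)", "<br />");
    }

    public static void main(String[] args) {
        // Hypothetical sample: one line break between <p> and <pre>, one inside <pre>.
        String sample = "<p>intro</p>\n<pre>a\nb</pre>";
        // Both line breaks are replaced, including the one outside <pre>.
        System.out.println(replaceAllBreaks(sample));
    }
}
```

If the content outside <pre> has to stay exactly as it is, the replacement needs to be limited to the <pre> blocks.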
<pre> code
To replace only the line breaks between the two pre tags, use a regular expression to extract them and replace:
Pattern preP = Pattern.compile("<pre.*?>.+?</pre>", Pattern.DOTALL
| Pattern.CASE_INSENSITIVE);
Matcher m = preP.matcher(body);
while (m.find()) {
String toReplace = m.group();
String replaced = toReplace.replaceAll("(\r\n|\n)", "<br />");
body = body.replace(toReplace, replaced);
}
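.+? is a reluctant (non-greedy) quantifier, so each match ends at the first occurrence of </pre>. You can try to do this with regular expressions, but handling HTML reliably with them is not possible in general; see this answer for a better explanation. I suggest you use the next option.
You can see an example of the regular expression here.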
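As a variation on the regex approach, here is a sketch that rewrites each match in place with Matcher.appendReplacement instead of calling replace on the whole string. It is not the answer's original code; the class name, method name, and sample input are made up, and it shares the limitations of any regex-based HTML handling (nested or malformed <pre> tags, for example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreLineBreaks {
    // Same pattern as in the answer: a <pre ...> tag, then reluctantly
    // everything up to the first closing </pre>.
    private static final Pattern PRE_BLOCK = Pattern.compile(
            "<pre.*?>.+?</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String replaceInPre(String html) {
        Matcher m = PRE_BLOCK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Replace line breaks only inside the matched <pre> block.
            String replaced = m.group().replaceAll("(\\r\\n|\\n)", "<br />");
            // quoteReplacement keeps '$' and '\' in the block from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replaced));
        }
        // Copy the text after the last match unchanged.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<p>intro</p>\n<pre>a\nb</pre>\n<p>end</p>";
        System.out.println(replaceInPre(sample));
    }
}
```

Text outside the matched blocks is copied through untouched, so line breaks between other elements are preserved.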
Clean the string before parsing
Following the second answer, you can use:
Document.OutputSettings outputSettings = new Document.OutputSettings()
.prettyPrint(false);
body = Jsoup.clean(body, "", Whitelist.relaxed(), outputSettings);
And afterwards (as in the original code):
pre.html(pre.html().replaceAll("(\r\n|\n)", "<br />"));
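With prettyPrint set to false, the clean method keeps the line breaks in its output, so the later parse and replacement handle them correctly.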
Cheers