My input is HTML encoded in UTF-8. In this input, accented characters appear as HTML entities. For example:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>&aacute;rv&iacute;zt&#369;r&#337;&lt;b</body>
</html>
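My goal is to "normalize" the HTML in Java by replacing the HTML entities with UTF-8 characters. In other words, replace all entities except &lt; &gt; &amp; &quot; &apos;.
Target: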
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>
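I need this so that HTML documents are easier to compare in tests and easier to read with the naked eye (lots of escaped accented characters make them hard to read).
I don't care about CDATA sections (there is no CDATA in the input).
I tried JSOUP (https://jsoup.org/) and Apache Commons Text (https://commons.apache.org/proper/commons-text/) without success: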
public void test() throws Exception {
    String html =
        "<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
        "</head><body>&aacute;rv&iacute;zt&#369;r&#337;&lt;b</body></html>";
    // this is not good, keeps only the text content
    String s1 = Jsoup.parse(html).text();
    System.out.println("s1: " + s1);
    // this is better, but it unescapes the &lt; which is not what I want
    String s2 = StringEscapeUtils.unescapeHtml4(html);
    System.out.println("s2: " + s2);
}
StringEscapeUtils.unescapeHtml4() is almost what I need, but unfortunately it also unescapes the &lt;:
<body>árvíztűrő<b</body>
How can I do this?
Answer 0 (score: 0)
Looking at the Commons Text source code, it is clear that StringEscapeUtils.unescapeHtml4() delegates the work to an AggregateTranslator, which is composed of four CharSequenceTranslators:
new AggregateTranslator(
    new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
    new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
    new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
    new NumericEntityUnescaper()
);
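I need only three of these translators to reach my goal: everything except the BASIC_UNESCAPE lookup, which is the one that turns &lt;, &gt;, &amp; and &quot; back into raw characters.
So this is it: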
// this is what I needed!
String s3 = new AggregateTranslator(
    new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
    new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
    new NumericEntityUnescaper()
).translate(html);
System.out.println("s3: " + s3);
The whole method:
@Test
public void test() throws Exception {
    String html =
        "<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
        "</head><body>&aacute;rv&iacute;zt&#369;r&#337;&lt;b</body></html>";
    // this is what I needed!
    CharSequenceTranslator UNESCAPE_HTML_EXCEPT_BASIC = new AggregateTranslator(
        new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
        new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
        new NumericEntityUnescaper()
    );
    String s3 = UNESCAPE_HTML_EXCEPT_BASIC.translate(html);
    System.out.println("s3: " + s3);
}
Result:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>
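One caveat: as far as I can tell, NumericEntityUnescaper decodes every numeric reference, so an input containing &#60; or &#x26; would still end up with a raw < or &. If that matters, the same idea can be sketched without any library, using only the JDK (Java 9+ assumed). The class name EntityNormalizer and its tiny named-entity table are hypothetical and only for illustration; a real table would need the full ISO-8859-1 and HTML 4.0 extended sets.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityNormalizer {

    // Hypothetical, deliberately incomplete named-entity table (illustration only).
    private static final Map<String, String> NAMED = Map.of(
            "aacute", "\u00E1",
            "iacute", "\u00ED",
            "eacute", "\u00E9",
            "ouml", "\u00F6",
            "uuml", "\u00FC");

    // Characters that must stay escaped so the markup remains unambiguous.
    private static final String KEEP_ESCAPED = "<>&\"'";

    // Matches decimal (&#123;), hexadecimal (&#x1F;) and named (&name;) references.
    private static final Pattern ENTITY = Pattern.compile(
            "&(?:#([0-9]+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));");

    public static String normalize(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded = decode(m);
            // Unknown or protected entities are copied through unchanged.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(decoded != null ? decoded : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the entity should be left as-is.
    private static String decode(Matcher m) {
        if (m.group(3) != null) {
            // lt, gt, amp, quot and apos are not in the table, so they yield null.
            return NAMED.get(m.group(3));
        }
        int codePoint;
        try {
            codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1))
                    : Integer.parseInt(m.group(2), 16);
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
        if (!Character.isValidCodePoint(codePoint)) {
            return null;
        }
        String s = new String(Character.toChars(codePoint));
        // Numeric forms of < > & " ' stay escaped as well.
        return KEEP_ESCAPED.contains(s) ? null : s;
    }
}
```

Calling EntityNormalizer.normalize(html) on the input above should decode the accented characters and leave &lt; in place; unlike the three-translator version, it also leaves &#60; untouched.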