Question

I want to preserve HTML entities when using JSoup. Here is a UTF-8 test string from a website:

String html = "<html><body>hello &#151; world</body></html>";

String parsed = Jsoup.parse(html).toString();

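If I print the parsed output in UTF-8, the sequence &#151; appears to be converted to the character with code point 151.

Is there a way to make JSoup keep the original entities when the output is UTF-8? If I output with ASCII encoding: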

Document.OutputSettings settings = new Document.OutputSettings();
settings.charset(Charset.forName("ascii"));
Jsoup.parse(html).outputSettings(settings).toString();

I get:

hello &#x97; world

which is what I am looking for.

Answer 1

You have found a feature that Jsoup is missing (as of this writing, Jsoup 1.8.3).

I can see three options:

Option 1

Send a feature request at https://github.com/jhy/jsoup. I am not sure it would be added any time soon...

Option 2

Use the workaround provided in this SO answer: https://stackoverflow.com/a/34493022/363573

Option 3

Write a custom NodeVisitor that converts characters back to their equivalent HTML escape sequences using their code point values.
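
For Option 3, the core step is the conversion from a character to a numeric escape sequence. Below is a minimal sketch of that step only, in plain Java with no Jsoup dependency. The class name NumericEntityEscaper and the method escapeNonAscii are hypothetical names chosen for illustration, and the threshold of 127 (escape everything outside ASCII) is an assumption based on the ASCII output shown in the question. Hooking this into a NodeVisitor is not shown here.

```java
public class NumericEntityEscaper {

    // Hypothetical helper, not part of Jsoup: replaces every code point
    // above 127 with a hexadecimal numeric character reference (e.g. &#x97;)
    // and copies ASCII characters through unchanged.
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp > 127) {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u0097 is the character with code point 151 from the question.
        System.out.println(escapeNonAscii("hello \u0097 world"));
        // prints: hello &#x97; world
    }
}
```

One caveat: as far as I know, writing the escaped string back into a text node would cause the & to be escaped again on output, so a visitor would more likely apply this while building its own output string rather than while modifying the document.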
