Question

I have the following sample code.

String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";

Document doc = Jsoup.parse(sample);
String output = doc.body().text();

The output I get is

This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup

But I want the output to be

This is a sample on              parsing html body using jsoup
This is a sample on              parsing html body using jsoup

How can I parse it so that I get this output? Or is there another way to do this in Java?

Answer 1

You can disable pretty printing for the document to get the output you want. But you also have to change .text() to .html().

Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();

Answer 2

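The HTML specification requires that multiple whitespace characters be collapsed into a single space. So when parsing your sample, the parser correctly eliminates the redundant whitespace.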

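I don't think you can change the way the parser works. You could add a preprocessing step that replaces runs of multiple spaces with non-breaking spaces (&nbsp;), which are not collapsed. The side effect, of course, is that those spaces become non-breaking (which doesn't matter if you really only want to use the rendered text, as in doc.body().text()).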
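The preprocessing step described above could look roughly like the sketch below. It uses only the Java standard library. The class name NbspPreprocessor and the method name preserveSpaceRuns are made up for illustration and are not part of jsoup. It is deliberately naive: it rewrites every run of two or more spaces, including any that happen to sit inside tags or attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: rewrites runs of spaces before parsing.
class NbspPreprocessor {
    // Two or more consecutive ASCII spaces.
    private static final Pattern SPACE_RUN = Pattern.compile(" {2,}");

    // Replaces each space in a run of 2+ spaces with the &nbsp; entity.
    // Single spaces are left alone.
    static String preserveSpaceRuns(String html) {
        Matcher m = SPACE_RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder nbsp = new StringBuilder();
            for (int i = 0; i < m.group().length(); i++) {
                nbsp.append("&nbsp;");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(nbsp.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<body>\nThis is a sample on              parsing html body using jsoup\n</body>";
        // The result is what you would hand to Jsoup.parse(...) instead of the raw string.
        System.out.println(preserveSpaceRuns(sample));
    }
}
```

The preprocessed string would then go through Jsoup.parse(...) as in the question. Whether .text() leaves the resulting non-breaking space characters (U+00A0) untouched may depend on the jsoup version, so check the output against the version you use. Note that this only addresses runs of spaces, not the line breaks.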

Avoid removal of spaces and newlines when parsing HTML with jsoup
