Question

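I am using Jsoup to read text from a URL. The following question gives some tips on keeping new lines when converting the body to text: How do I preserve line breaks when using jsoup to convert html to plain text?

I use the following lines to convert the tags: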

    String prettyPrintedBodyFragment = Jsoup.clean(body, "",
            Whitelist.none().addTags("br", "p", "h1"),
            new OutputSettings().prettyPrint(true));
    System.out.println(prettyPrintedBodyFragment);

I still get the body/content as a single line. Any clues?

Edit: here is the complete source code; I only see one line of output.

    public static void main(String[] args) throws Exception {
        Connection conn = Jsoup.connect("http://finance.yahoo.com/");
        Document doc = conn.get();

        String body = doc.body().text();

        String prettyPrintedBodyFragment = Jsoup.clean(body, "",
                Whitelist.none().addTags("br", "p", "h1"),
                new OutputSettings().prettyPrint(true));

        System.out.println(prettyPrintedBodyFragment);
    }

Answer 1

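Change:

String body = doc.body().text();

to:

String body = doc.body().html();

text() has already thrown away the markup, so by the time the string reaches Jsoup.clean there are no br, p, or h1 tags left for the Whitelist to keep when pretty-printing.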
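To make the difference concrete, here is a minimal sketch that runs both variants on an inline HTML string instead of a live URL. The class name, the helper cleanKeepingBreaks, and the sample markup are made up for illustration. It assumes an older jsoup release where Whitelist is still available (newer releases renamed it to Safelist), and the exact whitespace of the pretty-printed result may differ between versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

// Hypothetical demo class, not part of the original question.
class PreserveLineBreaks {

    // Parses the HTML, takes the body as markup (html(), not text()),
    // and keeps only <br>, <p> and <h1> while cleaning.
    static String cleanKeepingBreaks(String html) {
        Document doc = Jsoup.parse(html);
        String body = doc.body().html(); // markup is still present here
        return Jsoup.clean(body, "",
                Whitelist.none().addTags("br", "p", "h1"),
                new Document.OutputSettings().prettyPrint(true));
    }

    public static void main(String[] args) {
        // Sample input, assumed for illustration only.
        String html = "<h1>Title</h1><p>First<br>Second <a href=\"#\">link</a></p>";

        // text() flattens everything: no tags survive for the whitelist to keep.
        System.out.println(Jsoup.parse(html).body().text());

        // html() keeps the tags, so the whitelisted ones remain in the output.
        System.out.println(cleanKeepingBreaks(html));
    }
}
```

Tags that are not on the whitelist (the a element here) are dropped while their text is kept, so the structural tags are the only markup left in the result.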
