Question

I want to convert HTML to plain text.

Requirements:

Line breaks should be preserved.
The conversion should give the same result for an HTML string and an HTML file.

FileContent ::

<html>

<body>

  <h1>header tag</h1>

  <p>paragraph tag</p>

  <div>div1</div>

  <div>div2</div>

  <div>div3</div>

</body>

</html>

Here is the code:

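import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Whitelist; // newer jsoup releases name this class Safelist

public class HtmlToText {
    public static void main(String[] args) {
        // Read html.html line by line, re-appending a line separator after each line.
        StringBuilder contentBuilder = new StringBuilder();
        try {
            BufferedReader in = new BufferedReader(new FileReader("html.html"));
            String str;
            while ((str = in.readLine()) != null) {
                contentBuilder.append(str);
                contentBuilder.append(System.getProperty("line.separator"));
            }
            in.close();
        } catch (IOException e) {
            System.out.println(e);
        }
        String content = contentBuilder.toString();

        // The same markup as the file, but on a single line with no whitespace between tags.
        String html = "<html><body><h1>header tag</h1><p>paragraph tag</p><div>div1</div><div>div2</div><div>div3</div></body></html>";
        String output = HTMLtoPlainText(html);
        System.out.println("with  string::\n" + output);
        output = HTMLtoPlainText(content);
        System.out.println("with  html file::\n" + output);
    }

    private static String HTMLtoPlainText(String html) {
        Document document = Jsoup.parse(html);
        document.outputSettings(new Document.OutputSettings().prettyPrint(false));
        // Put a line break in front of every div so it survives tag removal.
        document.select("div").before("\n");

        // Strip all tags; pretty-printing is off so existing line breaks are kept as-is.
        String output = Jsoup.clean(document.html(), "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
        output = Parser.unescapeEntities(output, false);
        return output;
    }
}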

The output is:

with  string::

header tagparagraph tag

div1

div2

div3

with  html file::

header tag

paragraph tag

div1

div2

div3

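After running the same method on these two different inputs, we get different outputs.

For FileContent:

I get a new line before the header tag, and a new line between the header tag and the paragraph tag. [My analysis is that there are \n characters between the tags, and the clean method does not replace them automatically.]

So what should I use to get the same result for these two different inputs?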
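
One possible direction, given the analysis above, is to remove the whitespace that sits purely between tags before handing the markup to HTMLtoPlainText, so that the file content and the one-line string become the same input. The following is only a minimal sketch using the standard library; the class name InterTagWhitespace and the method normalize are hypothetical names introduced for illustration, and the regex approach assumes that whitespace between tags is never significant (it would also drop a meaningful space between two inline elements such as `</b> <i>`).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: collapses whitespace between tags.
class InterTagWhitespace {
    // Matches whitespace (including line breaks) found only between '>' and the next '<'.
    private static final Pattern BETWEEN_TAGS = Pattern.compile(">\\s+<");

    // Removes inter-tag whitespace and any leading/trailing whitespace.
    // Text inside elements (e.g. "header tag") is left untouched.
    static String normalize(String html) {
        return BETWEEN_TAGS.matcher(html).replaceAll("><").trim();
    }

    public static void main(String[] args) {
        // Same markup as html.html, with the line breaks and indentation of the file.
        String fromFile = String.join("\n",
                "<html>",
                "<body>",
                "  <h1>header tag</h1>",
                "  <p>paragraph tag</p>",
                "  <div>div1</div>",
                "  <div>div2</div>",
                "  <div>div3</div>",
                "</body>",
                "</html>") + "\n";
        String inline = "<html><body><h1>header tag</h1><p>paragraph tag</p>"
                + "<div>div1</div><div>div2</div><div>div3</div></body></html>";

        // After normalization both inputs should be the same string,
        // so any later conversion step sees identical markup.
        System.out.println(normalize(fromFile).equals(inline));
    }
}
```

With this in place, calling HTMLtoPlainText(InterTagWhitespace.normalize(content)) would feed the file content through the same path as the string. The header and paragraph would then still run together as in the string case, so block elements other than div would need their own line-break handling.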

\n is not removed when converting HTML text using Jsoup

0 answers: