Question

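I am using jsoup to parse HTML files. I have successfully removed all the tags from the HTML, but I also want to remove the headers at the start of the file. For example: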

WARC/1.0

WARC-Type: response

WARC-Date: 2012-02-10T20:37:13Z

HTTP/1.1 200 OK

Server: Apache

Here is my code:

 static String readFile(String path, Charset encoding) throws IOException 
 {
     byte[] encoded = Files.readAllBytes(Paths.get(path));
     return new String(encoded, encoding);
 }
 String file=indexer.readFile("C:\\Users\\umair\\Downloads\\Compressed\\Assignment 1 Data IR\\Assignment 1 Data IR\\corpus\\corpus\\corpus\\clueweb12-0000tw-14-17002.txt", StandardCharsets.UTF_8);
 System.out.println(Jsoup.parse(file).text());

Any idea how to remove these headers?

Answer 1

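You can use

doc.body()

to get only the body of the HTML document, without any headers. Of course, this assumes that you are dealing with a proper HTML document.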
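
When the file starts with WARC and HTTP header lines rather than markup, as in the question, one option is to cut those lines off before handing the text to jsoup. The following is only a minimal sketch under one assumption: the HTML payload begins at the first `<!DOCTYPE` or `<html` marker. The class name `WarcHeaderStripper` and the method `stripLeadingHeaders` are hypothetical helpers made up for illustration; they are not part of jsoup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of jsoup): drops everything that
// precedes the first HTML marker in a raw WARC-style record.
final class WarcHeaderStripper {

    // Assumption: the HTML payload starts at "<!DOCTYPE" or "<html" (any case).
    private static final Pattern HTML_START =
            Pattern.compile("<!doctype|<html", Pattern.CASE_INSENSITIVE);

    // Returns the input from the first HTML marker onward.
    // If no marker is found, the input is returned unchanged.
    static String stripLeadingHeaders(String raw) {
        Matcher m = HTML_START.matcher(raw);
        return m.find() ? raw.substring(m.start()) : raw;
    }

    public static void main(String[] args) {
        String raw = "WARC/1.0\n"
                + "WARC-Type: response\n"
                + "\n"
                + "HTTP/1.1 200 OK\n"
                + "Server: Apache\n"
                + "\n"
                + "<html><body><p>Hello</p></body></html>";
        // The stripped string can then be passed to
        // Jsoup.parse(...).body().text() to extract the visible text.
        System.out.println(stripLeadingHeaders(raw));
    }
}
```

If a record contains neither marker, the helper leaves the text untouched, so such files would still need separate handling.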
