Question

Is it possible to somehow use Tika's --text-main and --html options together to get the main content of a page as HTML?

Answer 1

You cannot do this with the command-line tika-app.jar; you need to write some Java code instead.

As shown in one of the Apache Tika examples, your code needs to look something like this:

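import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

// Wrap the XML serializer so that only the content inside <body> is written out as XHTML.
ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
String bodyHtml;

// try-with-resources closes the stream even if parsing throws.
try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
    parser.parse(stream, handler, metadata);
    bodyHtml = handler.toString();
}
System.out.println(bodyHtml);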

Run against a Word document containing the single paragraph "test", the output is just:

<p xmlns="http://www.w3.org/1999/xhtml">test</p>
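To make the idea of a body-only handler more concrete, here is a minimal JDK-only sketch of the same filtering pattern: a SAX handler that ignores everything until it sees <body> and then re-serializes what is nested inside it. This is not Tika's implementation; the class name BodyOnlyDemo and the helper extractBody are hypothetical, and the sketch deliberately drops attributes and does not re-escape text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyOnlyDemo {

    // Hypothetical helper: returns only the markup nested inside <body>.
    // Simplifications: attributes are dropped and text is not re-escaped.
    static String extractBody(String xhtml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inBody = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (inBody) {
                    out.append('<').append(qName).append('>');
                } else if (qName.equals("body")) {
                    inBody = true; // start forwarding after <body> itself
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("body")) {
                    inBody = false; // stop forwarding at </body>
                } else if (inBody) {
                    out.append("</").append(qName).append('>');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inBody) {
                    out.append(ch, start, length);
                }
            }
        };
        // The default factory is not namespace-aware, so qName carries the tag name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>ignored</title></head>"
                + "<body><p>test</p></body></html>";
        System.out.println(extractBody(page));
    }
}
```

In the Tika snippet the two roles are split: BodyContentHandler does the filtering and ToXMLContentHandler does the serialization, which is why the real output keeps details such as the xmlns attribute that this sketch discards.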
