How do I read large files with Tika?

Time: 2015-06-26 18:02:05

Tags: apache-tika

I am using Tika to parse large PDF and Word documents, but I get this error message:

Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).

How can I increase the limit?

2 answers:

Answer 0: (score: 17)

Assuming you are basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of -1, which disables the write limit, as explained in the javadocs.

Your code would then look something like this (inspired by the example):

BodyContentHandler handler = new BodyContentHandler(-1);

InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc");
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
try {
    parser.parse(stream, handler, metadata);
    return handler.toString();
} finally {
    stream.close();
}
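To make the effect of the write limit easier to see, here is a minimal sketch that uses only the SAX classes shipped with the JDK. Note that `LimitedTextHandler` and `collect` are hypothetical names made up for this illustration; this is not Tika's actual implementation, only an assumption about how a handler with a "-1 means unlimited" convention could behave.

```java
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical handler (not Tika's class): collects text up to a limit. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int writeLimit; // -1 means "no limit" in this sketch

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit == -1 || text.length() + length <= writeLimit) {
                text.append(ch, start, length);
            } else {
                // Keep whatever still fits, then signal that the limit was hit.
                text.append(ch, start, writeLimit - text.length());
                throw new SAXException("write limit of " + writeLimit + " characters reached");
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds totalChars characters to a handler in chunks of up to 1000 and returns how many were kept. */
    public static int collect(int writeLimit, int totalChars) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] chunk = new char[1000];
        Arrays.fill(chunk, 'x');
        try {
            for (int sent = 0; sent < totalChars; sent += chunk.length) {
                handler.characters(chunk, 0, Math.min(chunk.length, totalChars - sent));
            }
        } catch (SAXException e) {
            // Limit reached: the text collected so far is still available.
        }
        return handler.toString().length();
    }

    public static void main(String[] args) {
        System.out.println(collect(100000, 250000)); // prints 100000
        System.out.println(collect(-1, 250000));     // prints 250000
    }
}
```

With a limit of 100000 the sketch stops collecting at exactly that many characters but keeps the text gathered so far, which matches the wording of the error message in the question; with -1 nothing is cut off.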

Answer 1: (score: 1)

I disagree with @Gagravarr's use of a write limit of -1, because the default value chosen in the -1 case is actually 100000.

If I'm not mistaken, the documentation for Tika's BodyContentHandler > WriteOutContentHandler says:

"The internal string buffer is bounded at 100k characters."

However, the best way to achieve this is to pass a StringWriter object as the argument instead of -1.

StringWriter any = new StringWriter();

and then

BodyContentHandler handler = new BodyContentHandler(any);
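As a rough illustration of why handing over your own Writer avoids any internal buffer bound: a handler that simply forwards character events to a caller-supplied Writer has no buffer of its own to limit. The sketch below uses only JDK classes; `WriterForwardingHandler` and `extract` are hypothetical names invented for this example and are not Tika's code.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriterSinkSketch {

    /** Hypothetical handler (not Tika's class): forwards all text to a Writer. */
    static class WriterForwardingHandler extends DefaultHandler {
        private final Writer out;

        WriterForwardingHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                out.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException("failed to write character data", e);
            }
        }
    }

    /** Sends totalChars characters through the handler and returns what the writer received. */
    public static String extract(int totalChars) {
        StringWriter any = new StringWriter();
        WriterForwardingHandler handler = new WriterForwardingHandler(any);
        char[] chunk = new char[1000];
        Arrays.fill(chunk, 'x');
        try {
            for (int sent = 0; sent < totalChars; sent += chunk.length) {
                handler.characters(chunk, 0, Math.min(chunk.length, totalChars - sent));
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return any.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(250000).length()); // prints 250000
    }
}
```

The StringWriter grows as needed, so the only practical bound in this sketch is available memory. For very large documents it may be worth passing a file-backed Writer instead of a StringWriter.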