Apache Tika - read a file one chunk at a time?

Time: 2014-08-25 19:16:36

Tags: apache apache-tika

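Is there a way to use the Tika API to read a file one chunk at a time, rather than reading the whole file at once?

My code is below. As you can see, I am reading the entire file. I would like to read one chunk at a time and write the contents to a text file.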

InputStream stream = new FileInputStream(file);
Parser p = new AutoDetectParser();
Metadata meta = new Metadata();
// A write limit of -1 disables the limit, so all extracted text is buffered in memory
WriteOutContentHandler handler = new WriteOutContentHandler(-1);
ParseContext context = new ParseContext();

....
p.parse(stream, handler, meta, context);
...

String content = handler.toString();

1 answer:

Answer 0 (score: 1)

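There is (now) an Apache Tika example that shows how to capture the plain-text output and return it in chunks, based on a maximum allowed chunk size. You can find it in ContentHandlerExample; the method is parseToPlainTextChunks.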
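To make the chunking idea concrete, here is a minimal sketch that uses only the JDK's SAX classes, so it can be tried without Tika on the classpath. It is not the code from the Tika example: the class name ChunkCollector and its methods are hypothetical, and it assumes the chunk limit is counted in characters.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: groups SAX character events
// into in-memory chunks, starting a new chunk whenever the next event
// would push the current one past maxChunkChars.
class ChunkCollector extends DefaultHandler {
    private final int maxChunkChars;
    private final List<String> chunks = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    ChunkCollector(int maxChunkChars) {
        if (maxChunkChars <= 0) {
            throw new IllegalArgumentException("maxChunkChars must be positive");
        }
        this.maxChunkChars = maxChunkChars;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Close the current chunk if this event would overflow it.
        // A single event longer than the limit still ends up as one
        // oversized chunk; events are never split in this sketch.
        if (current.length() > 0 && current.length() + length > maxChunkChars) {
            flush();
        }
        current.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // Emit whatever text is still pending when parsing finishes.
        flush();
    }

    private void flush() {
        if (current.length() > 0) {
            chunks.add(current.toString());
            current.setLength(0);
        }
    }

    List<String> getChunks() {
        return chunks;
    }
}
```

With a limit of 5, feeding the events "abc", "de" and "fgh" followed by endDocument() leaves the chunks "abcde" and "fgh". Because DefaultHandler implements ContentHandler, an instance can be passed as the handler argument of Parser.parse.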

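Building on that example, if you would rather write the output to files, one file per chunk, you could adapt it to something like: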

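import java.io.*;
import java.nio.charset.StandardCharsets;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.SAXException;

// The limit is counted in characters written, not bytes on disk
final int MAXIMUM_TEXT_CHUNK_SIZE = 100 * 1024 * 1024;
final File outputDir = new File("/tmp/");

private class ChunkHandler extends ContentHandlerDecorator {
   private int size = 0;
   private int fileNumber = -1;
   private OutputStreamWriter out = null;

   @Override
   public void characters(char[] ch, int start, int length) throws SAXException {
      try {
         // Roll over to a new file when this event would exceed the limit
         if (out == null || size + length > MAXIMUM_TEXT_CHUNK_SIZE) {
            if (out != null) out.close();
            fileNumber++;
            File f = new File(outputDir, "output-" + fileNumber + ".txt");
            out = new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8);
            size = 0;
         }
         out.write(ch, start, length);
         size += length;
      } catch (IOException e) {
         // ContentHandler methods may only throw SAXException, so wrap it
         throw new SAXException(e);
      }
   }
   public void close() throws IOException {
      if (out != null) out.close();
   }
}

public void parse(File file) throws IOException, SAXException, TikaException {
   Parser p = new AutoDetectParser();
   Metadata meta = new Metadata();
   ChunkHandler handler = new ChunkHandler();
   ParseContext context = new ParseContext();

   try (InputStream stream = new FileInputStream(file)) {
      p.parse(stream, handler, meta, context);
   } finally {
      // Close the last chunk file even if parsing fails
      handler.close();
   }
}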

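This gives you plain-text files in the given directory, each no larger than the maximum size. All HTML tags are ignored, so you get only the plain text content.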