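Is there a way to read a file one chunk at a time (instead of reading the whole file) using the Tika API?
Below is my code. As you can see, I am reading the whole file at once. I would like to read one chunk at a time and write the content out to text files.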
InputStream stream = new FileInputStream(file);
Parser p = new AutoDetectParser();
Metadata meta = new Metadata();
// A write limit of -1 disables the limit, so the whole document is buffered in memory
WriteOutContentHandler handler = new WriteOutContentHandler(-1);
ParseContext context = new ParseContext();
....
p.parse(stream, handler, meta, context);
...
String content = handler.toString();
Answer 0 (score: 1)
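There is now an Apache Tika example that shows how to capture the plain-text output and return it in chunks, based on a maximum allowed chunk size. You can find it in ContentHandlerExample, in the method parseToPlainTextChunks.
Based on that, if you want to write the output to files instead, one file per chunk, you could adapt it to something like this: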
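import java.io.*;
import java.nio.charset.StandardCharsets;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.SAXException;

final int MAXIMUM_TEXT_CHUNK_SIZE = 100 * 1024 * 1024;
final File outputDir = new File("/tmp/");

private class ChunkHandler extends ContentHandlerDecorator {
    private int size = 0;          // characters written to the current file
    private int fileNumber = -1;
    private OutputStreamWriter out = null;

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Start a new file on the first call, or when this write would exceed the limit
            if (out == null || size + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                if (out != null) out.close();
                fileNumber++;
                File f = new File(outputDir, "output-" + fileNumber + ".txt");
                out = new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8);
                size = 0;
            }
            out.write(ch, start, length);
            size += length;
        } catch (IOException e) {
            // characters() may only throw SAXException, so wrap the I/O failure
            throw new SAXException(e);
        }
    }

    public void close() throws IOException {
        if (out != null) out.close();
    }
}

public void parse(File file) throws IOException, SAXException, TikaException {
    Parser p = new AutoDetectParser();
    Metadata meta = new Metadata();
    ChunkHandler handler = new ChunkHandler();
    ParseContext context = new ParseContext();
    try (InputStream stream = new FileInputStream(file)) {
        p.parse(stream, handler, meta, context);
    } finally {
        handler.close();   // flush and close the last chunk file
    }
}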
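This gives you plain-text files in the given directory, none larger than the maximum size (the limit is compared against character counts, not bytes). All HTML tags are ignored, so you get only the plain text content.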
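If you want to try the rollover logic on its own, without Tika on the classpath, here is a minimal sketch that uses only the JDK's SAX classes. The class name RollingTextHandler, the method filesWritten and the constructor parameters are made up for this illustration and are not part of Tika. Unlike the handler above, it splits a single characters() call across files so the per-file limit is never exceeded, even when one call delivers more text than the limit.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): writes SAX character data to
// output-0.txt, output-1.txt, ... with at most maxCharsPerFile chars in each.
final class RollingTextHandler extends DefaultHandler implements AutoCloseable {
    private final Path outputDir;
    private final int maxCharsPerFile;
    private Writer out;
    private int written;          // chars written to the current file
    private int fileNumber = -1;

    RollingTextHandler(Path outputDir, int maxCharsPerFile) {
        if (maxCharsPerFile <= 0) {
            throw new IllegalArgumentException("maxCharsPerFile must be positive");
        }
        this.outputDir = outputDir;
        this.maxCharsPerFile = maxCharsPerFile;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            while (length > 0) {
                // Open the first file, or roll over once the current one is full
                if (out == null || written >= maxCharsPerFile) {
                    roll();
                }
                // Write only as much as still fits into the current file.
                // Note: this may split a surrogate pair at a file boundary.
                int n = Math.min(length, maxCharsPerFile - written);
                out.write(ch, start, n);
                written += n;
                start += n;
                length -= n;
            }
        } catch (IOException e) {
            // ContentHandler methods may only throw SAXException
            throw new SAXException(e);
        }
    }

    private void roll() throws IOException {
        if (out != null) {
            out.close();
        }
        fileNumber++;
        Path target = outputDir.resolve("output-" + fileNumber + ".txt");
        out = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
        written = 0;
    }

    int filesWritten() {
        return fileNumber + 1;
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
            out = null;
        }
    }
}
```

Since DefaultHandler implements ContentHandler, an instance of this class should be usable as the handler argument of p.parse(stream, handler, meta, context) in the same way as ChunkHandler above, though I have only exercised it by calling characters() directly.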