How do I index a very large text file with Lucene? I created the minimal example below, which hits an OutOfMemoryError when given a 2 GB text file. I expected that passing a FileReader to the Field constructor would let the file contents be streamed during indexing, but that does not seem to be the case.
import java.io.FileReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.store.SimpleFSLockFactory;

public static void main(String[] args) {
    try {
        String indexDir = "c:/temp/ix";
        SimpleFSDirectory simpleFsDir = new SimpleFSDirectory(Paths.get(indexDir), SimpleFSLockFactory.INSTANCE);

        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setCommitOnClose(true);
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = new IndexWriter(simpleFsDir, config);

        FieldType storedNotIndexed = new FieldType();
        storedNotIndexed.setStored(true);
        storedNotIndexed.setIndexOptions(IndexOptions.NONE);

        FieldType indexedNotStored = new FieldType();
        indexedNotStored.setStored(false);
        indexedNotStored.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);

        Field idField = new Field("Id", "1", storedNotIndexed);
        Field contentField = new Field("Content", new FileReader("c:/temp/twoGbTextFile.txt"), indexedNotStored);

        Document document = new Document();
        document.add(idField);
        document.add(contentField);

        writer.addDocument(document);
        writer.commit();
        writer.close();
    }
    catch (Exception ex) {
        System.out.println(ex.toString());
    }
}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.&lt;init&gt;(FreqProxTermsWriterPerField.java:209)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:230)
    at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:46)
    at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:250)
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:271)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:149)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:447)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1569)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1314)
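One workaround I have been considering, since (as I understand it) Lucene must buffer the inverted-index data for a single document in memory until that document is complete, is to split the file into many smaller documents so the writer can flush between them. Below is a minimal pure-Java sketch of the chunking part; the chunk size is an arbitrary choice of mine, and the Lucene calls are only sketched in comments rather than real API usage in this snippet:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Splits a large text file into fixed-size character chunks so that each
// chunk could be added to the index as its own, smaller document.
public class ChunkReader {

    public static List<String> readChunks(String path, int chunkChars) throws IOException {
        List<String> chunks = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            char[] buf = new char[chunkChars];
            while (true) {
                // Fill the buffer completely unless EOF is reached first,
                // since read() may return fewer chars than requested.
                int filled = 0;
                int n;
                while (filled < buf.length
                        && (n = reader.read(buf, filled, buf.length - filled)) != -1) {
                    filled += n;
                }
                if (filled == 0) {
                    break; // end of file
                }
                // In the real indexing loop, each chunk would become a new
                // Document with its own "Content" field, passed to
                // writer.addDocument(...), letting the writer flush in between.
                chunks.add(new String(buf, 0, filled));
            }
        }
        return chunks;
    }
}
```

Collecting all chunks into a list is only for illustration; for a 2 GB file I would add each chunk's document to the IndexWriter as soon as it is read, so only one chunk is held in memory at a time.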