How to index a very large file with Lucene (OutOfMemoryError)

Time: 2018-02-15 10:59:25

Tags: java lucene out-of-memory

How can I index a very large text file with Lucene? The minimal example below runs into an OutOfMemoryError on a 2 GB text file. I expected that passing a FileReader to the Field constructor would let the file content be streamed during indexing, but that does not seem to be the case.

    import java.io.FileReader;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.index.IndexOptions;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.SimpleFSDirectory;
    import org.apache.lucene.store.SimpleFSLockFactory;

    public class LargeFileIndexing {
        public static void main(String[] args) {
            try {
                String indexDir = "c:/temp/ix";
                SimpleFSDirectory simpleFsDir =
                        new SimpleFSDirectory(Paths.get(indexDir), SimpleFSLockFactory.INSTANCE);
                StandardAnalyzer analyzer = new StandardAnalyzer();
                IndexWriterConfig config = new IndexWriterConfig(analyzer);
                config.setCommitOnClose(true);
                config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

                IndexWriter writer = new IndexWriter(simpleFsDir, config);

                // "Id": stored for retrieval, not indexed.
                FieldType storedNotIndexed = new FieldType();
                storedNotIndexed.setStored(true);
                storedNotIndexed.setIndexOptions(IndexOptions.NONE);

                // "Content": indexed with docs/freqs/positions, not stored.
                FieldType indexedNotStored = new FieldType();
                indexedNotStored.setStored(false);
                indexedNotStored.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);

                Field idField = new Field("Id", "1", storedNotIndexed);
                // Pass a Reader in the hope that the content is streamed.
                Field contentField = new Field("Content",
                        new FileReader("c:/temp/twoGbTextFile.txt"), indexedNotStored);

                Document document = new Document();
                document.add(idField);
                document.add(contentField);

                writer.addDocument(document);

                writer.commit();
                writer.close();
            }
            catch (Exception ex) {
                System.out.println(ex.toString());
            }
        }
    }
  

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:209)
        at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:230)
        at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:46)
        at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:250)
        at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:271)
        at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:149)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:447)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1569)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1314)
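From the stack trace it looks like the analyzer does consume the Reader as a stream, but Lucene buffers the entire inverted state of a single document in memory (the growing FreqProxPostingsArray / BytesRefHash above) until that document is finished; setRAMBufferSizeMB only bounds memory between documents, not within one. So the workaround I am considering is to split the file into several smaller documents. Below is a minimal sketch of that idea, assuming chunked documents are acceptable for my search use case; the names ChunkedIndexer / indexInChunks, the "Chunk" field, and the chunk size are my own illustration, not anything from the Lucene API:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;

    public class ChunkedIndexer {
        // Index one huge text file as a sequence of smaller documents so that
        // only one chunk's postings have to fit in the heap at a time.
        public static void indexInChunks(IndexWriter writer, String path, String id) throws Exception {
            final int CHUNK_CHARS = 16 * 1024 * 1024; // ~32 MB of text per document; tune to the heap
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                char[] buffer = new char[CHUNK_CHARS];
                int read;
                int chunkNo = 0;
                while ((read = reader.read(buffer, 0, buffer.length)) != -1) {
                    Document doc = new Document();
                    doc.add(new StringField("Id", id, Field.Store.YES));   // same Id on every chunk
                    doc.add(new IntPoint("Chunk", chunkNo++));             // ordinal, for reassembly
                    doc.add(new TextField("Content", new String(buffer, 0, read), Field.Store.NO));
                    writer.addDocument(doc);
                }
            }
            writer.commit();
        }
    }

Two caveats with this naive fixed-size cut: a token can be split at a chunk boundary (cutting back to the last whitespace would avoid losing terms), and phrase queries will not match across chunk boundaries.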

0 answers:

No answers yet.