How do I enable the new optional BlockPostingsFormat in Lucene 4.0?

Time: 2012-10-12 22:07:25

Tags: solr lucene

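The release notes for Lucene Core 4.0 mention one notable change:

• A new "Block" PostingsFormat offers improved search performance and index compression. This will likely become the default format in a future release.

According to this blog post, BlockPostingsFormat produces a smaller index and is faster (for most queries) than the previous format.

However, I can't find any mention of how to select this format in 4.0. Where can I specify the new BlockPostingsFormat in place of the old default?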

2 answers:

Answer 0: (score: 4)

A few steps:

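  1. Pick a codec, then "modify" it to use BlockPostingsFormat as its PostingsFormat class. You can either extend the codec's class or use FilterCodec, which lets you override individual settings of an existing codec.
  2. Create a file at META-INF/services/org.apache.lucene.codecs.Codec. It should list the fully qualified class name of the codec class you created in the previous step. This is needed because of the way Lucene 4 loads codecs.
  3. Call IndexWriterConfig.setCodec(Codec) to specify the codec you just created.
  4. Use the IndexWriterConfig object as usual.
  5. According to the Javadoc, BlockPostingsFormat creates .doc and .pos files in the index directory, whereas Lucene40PostingsFormat creates .frq and .prx files. So this is one way to tell whether Lucene is really using the block postings format.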
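
The file-extension check from step 5 can be automated. Below is a minimal sketch that uses only the JDK, not Lucene; the class name PostingsFormatProbe and its Result values are made up for illustration. It assumes the postings files are visible as separate files in the index directory. As far as I know, segments written as compound files (.cfs) pack those files inside the .cfs, in which case this check would report UNKNOWN.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: guesses the postings format from index file extensions. */
final class PostingsFormatProbe {

    enum Result { BLOCK, LUCENE40, MIXED, UNKNOWN }

    /** Classifies a list of index file names by the extensions named in step 5. */
    static Result detect(List<String> fileNames) {
        boolean block = false;
        boolean lucene40 = false;
        for (String name : fileNames) {
            // .doc/.pos are written by BlockPostingsFormat
            if (name.endsWith(".doc") || name.endsWith(".pos")) {
                block = true;
            }
            // .frq/.prx are written by Lucene40PostingsFormat
            if (name.endsWith(".frq") || name.endsWith(".prx")) {
                lucene40 = true;
            }
        }
        if (block && lucene40) {
            return Result.MIXED;
        }
        if (block) {
            return Result.BLOCK;
        }
        if (lucene40) {
            return Result.LUCENE40;
        }
        return Result.UNKNOWN;
    }

    /** Convenience overload that lists the files of an index directory. */
    static Result detect(File indexDir) {
        String[] names = indexDir.list(); // null if the path is not a directory
        if (names == null) {
            return detect(Collections.<String>emptyList());
        }
        return detect(Arrays.asList(names));
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "/index_dir");
        System.out.println(dir + ": " + detect(dir));
    }
}
```

Run it against the index directory after indexing. MIXED would suggest that some segments were written before the codec was switched.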

    I modified the example from the Lucene core Javadoc to test the block postings format. Here is the code (hope it helps):


    org.apache.lucene.codecs.Codec

    # See http://www.romseysoftware.co.uk/2012/07/04/writing-a-new-lucene-codec/
    # This file should be in /somewhere_in_your_classpath/META-INF/services/org.apache.lucene.codecs.Codec
    # 
    # List of codecs
    lucene4examples.Lucene40WithBlockCodec
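
The file above follows the standard Java service-provider layout: one fully qualified class name per line, with '#' starting a comment and blank lines ignored. As a rough illustration of how such a file is read (this is not Lucene's actual loader code, and the class name ServiceFileSketch is invented for the example), a minimal parser might look like this:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: extracts provider class names from service-file lines. */
final class ServiceFileSketch {

    static List<String> providerNames(List<String> lines) {
        List<String> names = new ArrayList<String>();
        for (String line : lines) {
            // Everything after '#' is a comment.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            // Blank (or comment-only) lines carry no provider.
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }
}
```

Fed the lines of the file shown above, the only name left is lucene4examples.Lucene40WithBlockCodec, which is the class that must be on the classpath with a public no-argument constructor.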
    

    Lucene40WithBlockCodec.java

    package lucene4examples;
    
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.block.BlockPostingsFormat;
    import org.apache.lucene.codecs.lucene40.Lucene40Codec;
    
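    // Lucene 4.0 codec that delegates to Lucene40Codec but swaps in the block postings format

    public class Lucene40WithBlockCodec extends FilterCodec {

        // Reuse a single PostingsFormat instance instead of creating one per call.
        private final PostingsFormat postingsFormat = new BlockPostingsFormat();

        // The codec is looked up by this name, so the constructor must stay public and no-arg.
        public Lucene40WithBlockCodec() {
            super("Lucene40WithBlock", new Lucene40Codec());
        }

        @Override
        public PostingsFormat postingsFormat() {
            return postingsFormat;
        }

    }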
    

    BlockPostingsFormatExample.java

    package lucene4examples;
    
    import java.io.File;
    import java.io.IOException;
    
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    
    // This example is based on the one that comes with Lucene 4.0.0 core API Javadoc
    // (http://lucene.apache.org/core/4_0_0/core/overview-summary.html)
    
    public class BlockPostingsFormatExample {
    
        public static void main(String[] args) throws IOException, ParseException {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

            // Store the index on disk:
            Directory directory = FSDirectory.open(new File("/index_dir"));
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
                    analyzer);

            // If the following line of code is commented out, the original Lucene
            // 4.0 codec will be used.
            // Else, the Lucene 4.0 codec with block posting format
            // (http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html)
            // will be used.
            config.setCodec(new Lucene40WithBlockCodec());

            IndexWriter iwriter = new IndexWriter(directory, config);
            Document doc = new Document();
            String text = "This is the text to be indexed.";
            doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
            iwriter.addDocument(doc);
            iwriter.close();

            // Now search the index:
            DirectoryReader ireader = DirectoryReader.open(directory);
            IndexSearcher isearcher = new IndexSearcher(ireader);
            // Parse a simple query that searches for "text":
            QueryParser parser = new QueryParser(Version.LUCENE_40, "fieldname",
                    analyzer);
            Query query = parser.parse("text");
            ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
            System.out.println("hits.length = " + hits.length);
            // Iterate through the results:
            for (int i = 0; i < hits.length; i++) {
                Document hitDoc = isearcher.doc(hits[i].doc);
                System.out.println("text: " + hitDoc.get("fieldname"));
            }
            ireader.close();
            directory.close();
        }
    
    }
    

Answer 1: (score: 3)

Follow the instructions here, but use BlockPostingsFormat instead of SimpleText.

http://wiki.apache.org/solr/SimpleTextCodecExample