Overview
I want to implement a Lucene Indexer/Searcher that uses the new payload feature, which allows meta information to be attached to text. In my particular case, I add weights (which can be read as % probabilities, between 0 and 100) to concept tags, so that they can be used to override Lucene's standard TF-IDF weighting. I am puzzled by the resulting behavior; I believe something is wrong with the Similarity class I overrode, but I cannot figure out what.
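To make the intended scoring concrete: with PayloadTermQuery and AveragePayloadFunction, the value returned by scorePayload is averaged over a term's occurrences in a document and multiplied into the base span score (this assumes the default includeSpanScore behavior; treat it as a hedged sketch, not Lucene's exact formula). A stdlib-only Java illustration using this post's example weights as hypothetical inputs:

```java
// Sketch (no Lucene): how a payload-aware score is roughly combined when
// using AveragePayloadFunction -- score = spanScore * average(payloads).
public class PayloadScoreSketch {

    // Average of the decoded payload values for one document.
    static float averagePayload(float[] payloads) {
        if (payloads.length == 0) return 1.0f;
        float sum = 0f;
        for (float p : payloads) sum += p;
        return sum / payloads.length;
    }

    public static void main(String[] args) {
        float spanScore = 0.2518424f; // the identical base score from the test run
        // With correct payload decoding, the three images would separate:
        System.out.println(spanScore * averagePayload(new float[] { 100.0f })); // image 1
        System.out.println(spanScore * averagePayload(new float[] { 50.0f }));  // image 2
        System.out.println(spanScore * averagePayload(new float[] { 1.0f }));   // image 3
    }
}
```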
Example
When I run a search query (e.g. "concept:red"), I find that every payload passed through MyPayloadSimilarity is always the first number (1.0 in the code example below) instead of 1.0, 50.0, and 100.0. As a result, all documents get the same payload and the same score. Given the data, however, image #1 should come first with a payload of 100.0, followed by image #2, then image #3, all with clearly different scores. I can't get my head around it.
Here is the output of a test run:
Query: concept:red
===> docid: 0 payload: 1.0
===> docid: 1 payload: 1.0
===> docid: 2 payload: 1.0
Number of results:3
-> docid: 3.jpg score: 0.2518424
-> docid: 2.jpg score: 0.2518424
-> docid: 1.jpg score: 0.2518424
What is wrong here? Am I misunderstanding payloads?
Code
I am sharing my code as a self-contained example to make it as easy as possible for you to run it, should you consider doing so.
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.analysis.payloads.PayloadEncoder;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class PayloadShowcase {

    public static void main(String s[]) {
        PayloadShowcase p = new PayloadShowcase();
        p.run();
    }

    public void run() {
        // Step 1: indexing
        MyPayloadIndexer indexer = new MyPayloadIndexer();
        indexer.index();

        // Step 2: searching
        MyPayloadSearcher searcher = new MyPayloadSearcher();
        searcher.search("red");
    }

    public class MyPayloadAnalyzer extends Analyzer {

        private PayloadEncoder encoder;

        MyPayloadAnalyzer(PayloadEncoder encoder) {
            this.encoder = encoder;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new WhitespaceTokenizer(reader);
            TokenStream filter = new LowerCaseFilter(source);
            filter = new DelimitedPayloadTokenFilter(filter, '|', encoder);
            return new TokenStreamComponents(source, filter);
        }
    }

    public class MyPayloadIndexer {

        public MyPayloadIndexer() {}

        public void index() {
            try {
                Directory dir = FSDirectory.open(new File("D:/data/indices/sandbox"));
                Analyzer analyzer = new MyPayloadAnalyzer(new FloatEncoder());
                IndexWriterConfig iwconfig = new IndexWriterConfig(Version.LUCENE_4_10_1, analyzer);
                iwconfig.setSimilarity(new MyPayloadSimilarity());
                iwconfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

                // load mappings and classifiers
                HashMap<String, String> mappings = this.loadDataMappings();
                HashMap<String, HashMap> cMaps = this.loadData();

                IndexWriter writer = new IndexWriter(dir, iwconfig);
                indexDocuments(writer, mappings, cMaps);
                writer.close();
            } catch (IOException e) {
                System.out.println("Exception while indexing: " + e.getMessage());
            }
        }

        private void indexDocuments(IndexWriter writer, HashMap<String, String> fileMappings, HashMap<String, HashMap> concepts) throws IOException {
            Set fileSet = fileMappings.keySet();
            Iterator<String> iterator = fileSet.iterator();

            while (iterator.hasNext()) {
                // unique file information
                String fileID = iterator.next();
                String filePath = fileMappings.get(fileID);

                // create a new, empty document
                Document doc = new Document();

                // path of the indexed file
                Field pathField = new StringField("path", filePath, Field.Store.YES);
                doc.add(pathField);

                // look up all concept probabilities for this fileID
                Iterator<String> conceptIterator = concepts.keySet().iterator();
                while (conceptIterator.hasNext()) {
                    String conceptName = conceptIterator.next();
                    HashMap conceptMap = concepts.get(conceptName);
                    // emits tokens of the form "red|100.0" for the payload filter
                    doc.add(new TextField("concept", ("" + conceptName + "|").trim() + (conceptMap.get(fileID) + "").trim(), Field.Store.YES));
                }
                writer.addDocument(doc);
            }
        }

        public HashMap<String, String> loadDataMappings() {
            HashMap<String, String> h = new HashMap<>();
            h.put("1", "1.jpg");
            h.put("2", "2.jpg");
            h.put("3", "3.jpg");
            return h;
        }

        public HashMap<String, HashMap> loadData() {
            HashMap<String, HashMap> h = new HashMap<>();
            HashMap<String, String> green = new HashMap<>();
            green.put("1", "50.0");
            green.put("2", "1.0");
            green.put("3", "100.0");
            HashMap<String, String> red = new HashMap<>();
            red.put("1", "100.0");
            red.put("2", "50.0");
            red.put("3", "1.0");
            HashMap<String, String> blue = new HashMap<>();
            blue.put("1", "1.0");
            blue.put("2", "50.0");
            blue.put("3", "100.0");
            h.put("green", green);
            h.put("red", red);
            h.put("blue", blue);
            return h;
        }
    }

    class MyPayloadSimilarity extends DefaultSimilarity {

        @Override
        public float scorePayload(int docID, int start, int end, BytesRef payload) {
            float pload = 1.0f;
            if (payload != null) {
                pload = PayloadHelper.decodeFloat(payload.bytes);
            }
            System.out.println("===> docid: " + docID + " payload: " + pload);
            return pload;
        }
    }

    public class MyPayloadSearcher {

        public MyPayloadSearcher() {}

        public void search(String queryString) {
            try {
                IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("D:/data/indices/sandbox")));
                IndexSearcher searcher = new IndexSearcher(reader);
                searcher.setSimilarity(new MyPayloadSimilarity());

                PayloadTermQuery query = new PayloadTermQuery(new Term("concept", queryString),
                        new AveragePayloadFunction());
                System.out.println("Query: " + query.toString());
                TopDocs topDocs = searcher.search(query, 999);
                ScoreDoc[] hits = topDocs.scoreDocs;
                System.out.println("Number of results:" + hits.length);

                // output
                for (int i = 0; i < hits.length; i++) {
                    Document doc = searcher.doc(hits[i].doc);
                    System.out.println("-> docid: " + doc.get("path") + " score: " + hits[i].score);
                }
                reader.close();
            } catch (Exception e) {
                System.out.println("Exception while searching: " + e.getMessage());
            }
        }
    }
}
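As background on the token format: each `concept` field value has the shape `red|100.0`, and DelimitedPayloadTokenFilter splits every whitespace token at the `|` delimiter into the token text and its payload value. A Lucene-free sketch of that split (the class and method names here are made up for illustration):

```java
// Sketch of the "token|payload" split performed by DelimitedPayloadTokenFilter.
public class DelimitedTokenDemo {

    // Split a delimited token into { token text, payload text }.
    static String[] split(String token, char delim) {
        int i = token.lastIndexOf(delim);
        return new String[] { token.substring(0, i), token.substring(i + 1) };
    }

    public static void main(String[] args) {
        for (String field : new String[] { "red|100.0", "green|50.0", "blue|1.0" }) {
            String[] parts = split(field, '|');
            // the payload text is then encoded to bytes by FloatEncoder
            System.out.println("token=" + parts[0] + " payload=" + Float.parseFloat(parts[1]));
        }
    }
}
```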
Answer (score: 1)
In MyPayloadSimilarity, the PayloadHelper.decodeFloat call is incorrect. You also need to pass the payload.offset parameter, like this:
pload = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
I hope that helps.
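To illustrate why the offset matters: a BytesRef may point into a byte array shared with other payloads, so the four bytes of a given payload need not start at index 0. A stand-alone sketch (plain Java, no Lucene) of big-endian float decoding at an offset, mimicking what PayloadHelper.decodeFloat does:

```java
import java.nio.ByteBuffer;

public class PayloadOffsetDemo {

    // Decode a big-endian float starting at the given offset -- the correct way.
    static float decodeFloat(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4).getFloat();
    }

    public static void main(String[] args) {
        // Simulate three 4-byte payloads stored back-to-back in one shared array.
        byte[] shared = ByteBuffer.allocate(12)
                .putFloat(1.0f).putFloat(50.0f).putFloat(100.0f).array();

        // Ignoring the offset re-reads the first payload every time...
        System.out.println(decodeFloat(shared, 0)); // 1.0, regardless of the term hit
        // ...while honoring it yields the per-occurrence values.
        System.out.println(decodeFloat(shared, 4)); // 50.0
        System.out.println(decodeFloat(shared, 8)); // 100.0
    }
}
```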