Question

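I want to read the index produced by my Indexer.

The result I'm after is, for each document, all of its terms along with their TF-IDF values.

Could you suggest some sample code? Thanks :)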

Answer 1

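First, get a list of documents. An alternative might be to iterate over the indexed terms, but the method IndexReader.terms() seems to have been removed in 4.0 (it still exists in AtomicReader, which may be worth a look). The best way I know of to get all documents is to simply iterate over the document ids: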

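// reader is your IndexReader, however you go about opening/managing it.
// IndexReader.isDeleted(int) no longer exists in 4.0; the live-docs bits from
// org.apache.lucene.index.MultiFields (type org.apache.lucene.util.Bits) replace it.
Bits liveDocs = MultiFields.getLiveDocs(reader); // null means the index has no deletions
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue; // skip deleted documents
    // operate on the document with id = i ...
}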
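To make the relationship between document ids, deletions, and the loop bound concrete, here is a small toy model in plain Java. It does not touch Lucene at all: a java.util.BitSet stands in for the index's liveness bits, and the names LiveDocsSketch and countLive are made up for this illustration.

```java
import java.util.BitSet;

// Toy model only: a BitSet plays the role of the "is this doc id still live?" bits.
class LiveDocsSketch {

    // Walks every id in [0, maxDoc) and counts the ones that are not deleted.
    static int countLive(BitSet liveDocs, int maxDoc) {
        int live = 0;
        for (int i = 0; i < maxDoc; i++) {
            if (!liveDocs.get(i))
                continue; // a deleted document leaves a hole in the id range
            live++;
        }
        return live;
    }

    public static void main(String[] args) {
        int maxDoc = 5;               // ids 0..4 have been handed out
        BitSet liveDocs = new BitSet(maxDoc);
        liveDocs.set(0, maxDoc);      // all documents start out live
        liveDocs.clear(2);            // pretend document 2 was deleted
        System.out.println("maxDoc=" + maxDoc + " live=" + countLive(liveDocs, maxDoc));
    }
}
```

The point of the sketch is that the loop must run up to the highest id ever assigned, while the number of documents actually visited can be smaller.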

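Next you need to list all of the indexed terms. I'm assuming we aren't interested in stored fields, since the data you want doesn't make sense for them. To retrieve the terms you can use IndexReader.getTermVectors(int), which only returns data for fields that were indexed with term vectors. Note that I'm not actually retrieving the document itself, since we don't need to access it directly. Continuing from where we left off: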

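// Uses org.apache.lucene.index.*, org.apache.lucene.util.Bits, org.apache.lucene.util.BytesRef
// and org.apache.lucene.search.similarities.DefaultSimilarity (Lucene 4.0 API).
String field;
BytesRef term;
FieldsEnum fieldsiterator;
TermsEnum termsiterator;
// To simplify, you can rely on DefaultSimilarity to calculate tf and idf for you.
DefaultSimilarity freqcalculator = new DefaultSimilarity();
// numDocs and maxDoc are not the same thing: maxDoc also counts deleted documents.
int numDocs = reader.numDocs();
int maxDoc = reader.maxDoc();
Bits liveDocs = MultiFields.getLiveDocs(reader); // null means the index has no deletions

for (int i = 0; i < maxDoc; i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue; // skip deleted documents
    Fields vectors = reader.getTermVectors(i);
    if (vectors == null)
        continue; // no term vectors were stored for this document
    fieldsiterator = vectors.iterator();
    while ((field = fieldsiterator.next()) != null) {
        termsiterator = fieldsiterator.terms().iterator(null);
        while ((term = termsiterator.next()) != null) {
            // i = document id, field = field name
            // String representation of the current term
            String termtext = term.utf8ToString();
            // Get idf, using docFreq from the term vector's enum.
            // I haven't tested this, and I'm not quite 100% sure of the context of this method
            // (inside a term vector it may only see the current document).
            // If it doesn't work, idfalternate below should.
            float idf = freqcalculator.idf(termsiterator.docFreq(), numDocs);
            float idfalternate = freqcalculator.idf(reader.docFreq(field, term), numDocs);
        }
    }
}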
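The loop above stops at idf, while the question asks for TF-IDF. As a rough sketch of how the two pieces combine, the following plain-Java example reproduces the formulas that DefaultSimilarity uses as far as I know (tf as the square root of the in-document frequency, idf as 1 + ln(numDocs / (docFreq + 1))) on a tiny in-memory corpus. It does not read a Lucene index; the class name TfIdfSketch, its helper methods, and the sample documents are all invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: mirrors the classic tf and idf formulas without using Lucene.
class TfIdfSketch {

    // tf: square root of how often the term occurs in one document.
    static double tf(long freq) {
        return Math.sqrt(freq);
    }

    // idf: 1 + ln(numDocs / (docFreq + 1)), so rarer terms score higher.
    static double idf(long docFreq, long numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Counts how often each term occurs in a single tokenized document.
    static Map<String, Integer> termFreqs(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Number of documents that contain the term at least once.
    static int docFreq(List<List<String>> docs, String term) {
        int n = 0;
        for (List<String> d : docs) {
            if (d.contains(term)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical corpus standing in for the per-document term vectors.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "index", "index"),
            Arrays.asList("lucene", "reader"),
            Arrays.asList("term", "vector"));

        for (int id = 0; id < docs.size(); id++) {
            for (Map.Entry<String, Integer> e : termFreqs(docs.get(id)).entrySet()) {
                double score = tf(e.getValue()) * idf(docFreq(docs, e.getKey()), docs.size());
                System.out.printf("doc=%d term=%s tf-idf=%.4f%n", id, e.getKey(), score);
            }
        }
    }
}
```

In the Lucene loop, the in-document frequency would presumably come from the term vector (for example via termsiterator.totalTermFreq()) and be passed to freqcalculator.tf(...), then multiplied by the idf value; I have not verified that against a real index.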
