Question

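I've seen lots of questions like this, or similar ones, on stackoverflow and in other online sources. However, the relevant part of Lucene's API seems to have changed quite a lot, so to sum it up: I did not find any example that works with the latest Lucene version.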

What I have:

Lucene Index + IndexReader + IndexSearcher
A bunch of documents (and their IDs)

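What I want: for all terms that occur in at least one of the selected documents, I want the TF-IDF value for each document. In other words: for every term that occurs in any of the selected documents, I want its TF-IDF values, e.g., as an array (one TF-IDF value per selected document).

Any help is highly appreciated! :-)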

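Here is what I have come up with so far, but there are two problems:

It uses a temporarily created RAMDirectory that contains only the selected documents. Is there a way to work on the original index instead, or does that not make sense?
It does not compute a document-based TF-IDF, only an index-based one, i.e. over all documents. That means I get only one TF-IDF value per term, not one per document and term.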

public void getTfidf(IndexReader reader, Writer out, String field) throws IOException {

    Bits liveDocs = MultiFields.getLiveDocs(reader);
    TermsEnum termEnum = MultiFields.getTerms(reader, field).iterator(null);
    BytesRef term = null;
    TFIDFSimilarity tfidfSim = new DefaultSimilarity();
    int docCount = reader.numDocs();

    while ((term = termEnum.next()) != null) {
        String termText = term.utf8ToString();
        Term termInstance = new Term(field, term);
        // term and doc frequency in all documents
        long indexTf = reader.totalTermFreq(termInstance);
        long indexDf = reader.docFreq(termInstance);
        // TFIDFSimilarity.idf takes (docFreq, numDocs), in that order
        double tfidf = tfidfSim.tf(indexTf) * tfidfSim.idf(indexDf, docCount);
        // store it, but that's not the problem
    }
}

Answer 1

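totalTermFreq does what its name suggests: it gives the frequency across the entire index. The TF in the calculation should be the term frequency within the document, not across the entire index. That is why everything you get here is constant: all of your variables are constant across the entire index, and none of them depend on the document. To get the term frequency for a document, you should use DocsEnum.freq(). Perhaps something like this: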

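while ((term = termEnum.next()) != null) {
    Term termInstance = new Term(field, term);
    long indexDf = reader.docFreq(termInstance);

    // per-document postings for the current term, skipping deleted documents
    DocsEnum docs = termEnum.docs(MultiFields.getLiveDocs(reader), null);
    while (docs.nextDoc() != DocsEnum.NO_MORE_DOCS) {
        // docs.freq() is the term frequency within the current document (docs.docID())
        double tfidf = tfidfSim.tf(docs.freq()) * tfidfSim.idf(indexDf, docCount);
        // store it
    }
}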
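
To make the per-document point concrete without tying it to one Lucene version, here is a minimal stand-alone sketch in plain Java. It is not Lucene code: the class TfIdfSketch, the method tfidf and the in-memory postings map are hypothetical names made up for illustration. The tf and idf helpers reuse the formulas of Lucene 4.x DefaultSimilarity, sqrt(freq) and ln(numDocs / (docFreq + 1)) + 1. The sketch also assumes that the document frequency is taken from the whole index while the term frequency comes from each selected document, which is one way to avoid building a temporary index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Stand-alone illustration; does not use Lucene. All names are hypothetical. */
final class TfIdfSketch {

    /** Same formula as Lucene 4.x DefaultSimilarity.tf: sqrt(freq). */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Same formula as Lucene 4.x DefaultSimilarity.idf: ln(numDocs / (docFreq + 1)) + 1. */
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /**
     * postings: term -> (docId -> frequency of the term in that document), for the whole index.
     * Returns term -> one TF-IDF value per entry of selectedDocs (0 where the term is absent),
     * restricted to terms that occur in at least one selected document.
     */
    static Map<String, float[]> tfidf(Map<String, Map<Integer, Integer>> postings,
                                      int numDocs, int[] selectedDocs) {
        Map<String, float[]> result = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> entry : postings.entrySet()) {
            Map<Integer, Integer> freqByDoc = entry.getValue();
            // assumption: document frequency is counted over the whole index
            float idfValue = idf(freqByDoc.size(), numDocs);
            float[] row = new float[selectedDocs.length];
            boolean occurs = false;
            for (int i = 0; i < selectedDocs.length; i++) {
                Integer freq = freqByDoc.get(selectedDocs[i]);
                if (freq != null) {
                    // term frequency is per document, so each document gets its own value
                    row[i] = tf(freq) * idfValue;
                    occurs = true;
                }
            }
            if (occurs) {
                result.put(entry.getKey(), row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // a made-up index of three documents with IDs 0, 1 and 2
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        Map<Integer, Integer> lucene = new HashMap<>();
        lucene.put(0, 4);
        lucene.put(1, 1);
        Map<Integer, Integer> search = new HashMap<>();
        search.put(2, 9);
        postings.put("lucene", lucene);
        postings.put("search", search);

        int[] selected = {0, 2};
        for (Map.Entry<String, float[]> e : tfidf(postings, 3, selected).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

Against a real index, the inner map would be replaced by walking the term's DocsEnum and reading freq() for the selected doc IDs, as in the loop above.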

Lucene 4.9: Get TF-IDF for a few selected documents from an index
