Question

I've built an index with Lucene from a collection of documents. Each document has 2 fields and was added to the index like so:

Document doc = new Document();
doc.add(new TextField("Title", "I am a title", Field.Store.NO));
doc.add(new TextField("Text", "random text content", Field.Store.NO));
indexWriter.addDocument(doc);

I want to read the index and get the term frequency for each (term, doc) pair.

If I only had 1 field, say "Text", I'd use the following code:

IndexReader indexReader = ...;
Terms terms = MultiFields.getTerms(indexReader, "Text"); // get all terms of this field
TermsEnum termsIterator = terms.iterator();
BytesRef term;
// For every term in the "Text" Field:
while ((term = termsIterator.next()) != null) {
    String termString = term.utf8ToString(); // The term
    PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(indexReader,
        "Text", term, PostingsEnum.FREQS);
    int i;
    // For every doc which contains the current term in the "Text" field:
    while ((i = postingsEnum.nextDoc()) != PostingsEnum.NO_MORE_DOCS) {
        Document doc = indexReader.document(i); // The document
        int freq = postingsEnum.freq(); // Frequency of term in doc
    }
}

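However, since I have 2 fields ("Title" and "Text"), getting the total term frequency for a (term, doc) pair means I would first need to get every (term, doc) pair frequency for the "Title" field and keep them in memory, then get every (term, doc) pair frequency for the "Text" field, and finally combine them by hand for each unique (term, doc) pair returned.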

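So this approach is very likely to visit the same (term, doc) pair more than once, because the same pair can exist in both the "Title" and the "Text" field (when a document has the same term in its "Title" and its "Text").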

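Is there any way with the Lucene API to iterate over all the fields combined instead, so that the same pair is not visited more than once?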

Answer 1

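You have two fields, and for each document you need the frequency of every token as the sum of its per-field frequencies in that document.

Keep in mind that BytesRef (and Integer) implement the Comparable interface: your stream of tokens (TermsEnum) and every associated stream of documents (PostingsEnum) is ordered.

So you are merging two ordered streams, twice: once at the term level and once at the document level. You never have to keep more in memory than the head of each stream.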
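
The document-level merge can be sketched roughly as follows. This is only an illustration in plain Java, not Lucene code: the class `PostingsMergeSketch`, the method `mergeSumming`, and the `{docId, freq}` int arrays are hypothetical stand-ins for two `PostingsEnum` streams of the same term (one per field), and the sketch assumes both inputs arrive sorted by ascending doc ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PostingsMergeSketch {

    // Returns the next posting of a stream, or null when it is exhausted.
    private static int[] advance(Iterator<int[]> it) {
        return it.hasNext() ? it.next() : null;
    }

    /**
     * Merges two posting streams for the same term, one per field.
     * Each posting is {docId, freq}; both streams must be sorted by ascending
     * docId (an assumption of this sketch). Only the head of each stream is
     * held at any time; the result is collected into a list purely so it can
     * be inspected here.
     */
    public static List<int[]> mergeSumming(Iterator<int[]> title, Iterator<int[]> text) {
        List<int[]> merged = new ArrayList<>();
        int[] a = advance(title);
        int[] b = advance(text);
        while (a != null || b != null) {
            if (b == null || (a != null && a[0] < b[0])) {
                merged.add(a);                        // doc only in "Title"
                a = advance(title);
            } else if (a == null || b[0] < a[0]) {
                merged.add(b);                        // doc only in "Text"
                b = advance(text);
            } else {
                merged.add(new int[] {a[0], a[1] + b[1]}); // doc in both: sum
                a = advance(title);
                b = advance(text);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up postings for one term in each field.
        List<int[]> title = Arrays.asList(new int[] {1, 2}, new int[] {4, 1});
        List<int[]> text = Arrays.asList(new int[] {1, 3}, new int[] {2, 1}, new int[] {4, 5});
        for (int[] p : mergeSumming(title.iterator(), text.iterator())) {
            System.out.println("doc " + p[0] + " -> freq " + p[1]);
        }
    }
}
```

The same two-pointer pattern would apply one level up, comparing the current term of each field's TermsEnum instead of doc IDs, so every (term, doc) pair is produced exactly once.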

How to combine Term-Doc frequencies from multiple fields?
