How to get the most frequent terms from an IndexReader using Apache Lucene 6.4.0

Time: 2017-01-31 18:48:04

Tags: lucene

The Apache Lucene API seems to change with every release. How can I get the most frequent terms from an IndexReader in Apache Lucene 6.4.0?

I looked at "Get highest frequency terms from Lucene index", but that answer does not work with Apache Lucene 6.4.0.

1 answer:

Answer 0 (score: 1)

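Here is code that works with Lucene 6.4. It finds the single most frequent term across all fields. To find the most frequent term for each field separately, adjust the code so that the maximum is tracked per field.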

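        // Imports needed by this snippet (Lucene 6.x):
        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.index.Fields;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.MultiFields;
        import org.apache.lucene.index.Terms;
        import org.apache.lucene.index.TermsEnum;
        import org.apache.lucene.util.BytesRef;

        // dir is an already opened org.apache.lucene.store.Directory.
        long maxFreq = Long.MIN_VALUE;
        String freqTerm = "";
        // try-with-resources closes the reader when done.
        try (IndexReader reader = DirectoryReader.open(dir)) {
            final Fields fields = MultiFields.getFields(reader);
            // Fields is an Iterable over the indexed field names.
            for (final String field : fields) {
                final Terms terms = MultiFields.getTerms(reader, field);
                if (terms == null) {
                    continue; // getTerms may return null for a field without terms
                }
                final TermsEnum it = terms.iterator();
                BytesRef term = it.next();
                while (term != null) {
                    // Total occurrences of the term across all documents;
                    // use docFreq() to count documents containing it instead.
                    final long freq = it.totalTermFreq();
                    if (freq > maxFreq) {
                        maxFreq = freq;
                        freqTerm = term.utf8ToString();
                    }
                    term = it.next();
                }
            }
        }

        System.out.println(freqTerm + " " + maxFreq);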
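
The question asks for the most frequent terms (plural), while the loop above keeps only one maximum. One possible way to extend it is to collect each term's `totalTermFreq()` into a map and then select the top N with a bounded min-heap. The sketch below shows only that selection step in plain Java, without any Lucene calls. The class `TopTermsSketch`, the method `topN`, and the sample counts are hypothetical names and data made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene: picks the n entries with the
// highest counts from a term -> frequency map.
class TopTermsSketch {

    static List<Map.Entry<String, Long>> topN(Map<String, Long> freqs, int n) {
        List<Map.Entry<String, Long>> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap ordered by frequency: the head is the weakest candidate kept so far.
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<Map.Entry<String, Long>>(
                        (a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> e : freqs.entrySet()) {
            if (heap.size() < n) {
                heap.add(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                // The new entry beats the weakest kept entry, so swap them.
                heap.poll();
                heap.add(e);
            }
        }
        result.addAll(heap);
        // Order the survivors from most to least frequent.
        result.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts standing in for values read via TermsEnum.totalTermFreq().
        Map<String, Long> freqs = new LinkedHashMap<>();
        freqs.put("lucene", 42L);
        freqs.put("index", 17L);
        freqs.put("reader", 8L);
        freqs.put("term", 29L);
        for (Map.Entry<String, Long> e : topN(freqs, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
        // prints:
        // lucene 42
        // term 29
    }
}
```

The heap never holds more than n entries, so memory for the selection stays small even for a large vocabulary. The map itself still holds every term, so for very large indexes it may be preferable to feed entries to the heap directly inside the TermsEnum loop.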