How to get term frequency and document frequency from an index generated by Lucene (version 5.3)

Time: 2015-12-19 08:12:27

Tags: java lucene

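I am trying to get the term frequency and document frequency from the index files generated by Lucene (5.3). My implementation is shown below: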

private static void showIndex(String iNDEX_DIR2) throws IOException {
    // TODO Auto-generated method stub
    System.out.println("INDEX_DIR:" + iNDEX_DIR2);
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(iNDEX_DIR2).toPath()));

    int num_doc = reader.numDocs();
    System.out.println("number of docs: "+String.valueOf(num_doc));
    for(int docNum=0; docNum<num_doc; docNum++){
        Document doc = reader.document(docNum);
        System.out.println("Processing file:"+doc.get("id"));

        System.out.println("doc is null? "+ String.valueOf(doc==null));
        Terms termVector = reader.getTermVector(docNum, "content");
        TermsEnum itr = termVector.iterator();
        BytesRef term = null;

        while((term = itr.next()) != null){
            try{
                String termText = term.utf8ToString();
                Term termInstance = new Term("contents",term);
                long termFreq = reader.totalTermFreq(termInstance);
                long docCount = reader.docFreq(termInstance);

                System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
            }catch(Exception e){
                System.out.println(e);
            }
        }       
    }
} 

When I run this code, I get the following output:

INDEX_DIR:F:\Information Retrieval\project\TEST\INDEX
number of docs: 4
Processing file:null
doc is null? false
Exception in thread "main" java.lang.NullPointerException
   at IndexManager.showIndex

However, the output shows that doc is not null.

Can anyone help me solve this problem? Thanks a lot!

1 Answer:

Answer 0: (score: 1)

I would guess that the NPE is thrown at:

TermsEnum itr = termVector.iterator();

IndexReader.getTermVector returns null if the field was not stored with TermVectors, which, for instance, a TextField is not.

You can set a field to store TermVectors in its FieldType. If you want a TextField with TermVectors, you can pass the TextField's FieldType to the FieldType constructor to create a mutable copy of it, for example:

FieldType myFieldType = new FieldType(TextField.TYPE_STORED);
myFieldType.setStoreTermVectors(true);

doc.add(new Field("contents", fieldContents, myFieldType));
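To tie the two halves together, here is a minimal end-to-end sketch, assuming Lucene 5.3 with lucene-core and lucene-analyzers-common on the classpath. The class name `TermVectorDemo`, the helpers `buildIndex` and `termStats`, and the sample texts are hypothetical names made up for illustration. It indexes the field with term vectors enabled, checks for a null return from `getTermVector` before iterating, and uses the same field name ("contents") for both the term vector lookup and the `Term`, whereas the code in the question reads the vector from "content" but builds terms for "contents".

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;

public class TermVectorDemo {
    // One field name used for both indexing and lookup (hypothetical choice).
    static final String FIELD = "contents";

    // Builds a small in-memory index whose field stores term vectors.
    public static Directory buildIndex(String... texts) throws IOException {
        Directory dir = new RAMDirectory();
        FieldType withVectors = new FieldType(TextField.TYPE_STORED); // mutable copy
        withVectors.setStoreTermVectors(true);
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String text : texts) {
                Document doc = new Document();
                doc.add(new Field(FIELD, text, withVectors));
                writer.addDocument(doc);
            }
        } // closing the writer commits the documents
        return dir;
    }

    // Collects per-document and index-wide statistics for every term.
    public static List<String> termStats(Directory dir) throws IOException {
        List<String> lines = new ArrayList<>();
        try (IndexReader reader = DirectoryReader.open(dir)) {
            // Assumes no deleted documents; otherwise live docs should be checked.
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                Terms vector = reader.getTermVector(docId, FIELD);
                if (vector == null) {
                    continue; // field has no term vectors for this document
                }
                TermsEnum it = vector.iterator();
                BytesRef term;
                while ((term = it.next()) != null) {
                    Term t = new Term(FIELD, term);
                    lines.add("doc=" + docId
                        + " term=" + term.utf8ToString()
                        + " freqInDoc=" + it.totalTermFreq()          // within this document
                        + " totalTermFreq=" + reader.totalTermFreq(t) // across the index
                        + " docFreq=" + reader.docFreq(t));           // documents containing it
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Directory dir = buildIndex("apple banana apple", "banana cherry");
        for (String line : termStats(dir)) {
            System.out.println(line);
        }
    }
}
```

Note that an index built without term vectors has to be rebuilt after changing the FieldType; the setting only affects documents added after the change.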