I am trying to get the term frequency and document frequency from index files generated by Lucene (5.3). The implementation is shown below:
private static void showIndex(String iNDEX_DIR2) throws IOException {
    // TODO Auto-generated method stub
    System.out.println("INDEX_DIR:" + iNDEX_DIR2);
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(iNDEX_DIR2).toPath()));
    int num_doc = reader.numDocs();
    System.out.println("number of docs: "+String.valueOf(num_doc));
    for(int docNum=0; docNum<num_doc; docNum++){
        Document doc = reader.document(docNum);
        System.out.println("Processing file:"+doc.get("id"));
        System.out.println("doc is null? "+ String.valueOf(doc==null));
        Terms termVector = reader.getTermVector(docNum, "content");
        TermsEnum itr = termVector.iterator();
        BytesRef term = null;
        while((term = itr.next()) != null){
            try{
                String termText = term.utf8ToString();
                Term termInstance = new Term("contents",term);
                long termFreq = reader.totalTermFreq(termInstance);
                long docCount = reader.docFreq(termInstance);
                System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
            }catch(Exception e){
                System.out.println(e);
            }
        }
    }
}
When I run this snippet, I get the following output:
INDEX_DIR:F:\Information Retrieval\project\TEST\INDEX
number of docs: 4
Processing file:null
doc is null? false
Exception in thread "main" java.lang.NullPointerException
at IndexManager.showIndex
However, the output shows that doc is not null.
Can anyone help me solve this? Thanks a lot!
Answer 0 (score: 1)
My guess is that the NPE is thrown at this line:
TermsEnum itr = termVector.iterator();
IndexReader.getTermVector returns null if the field was not indexed with term vectors, and a TextField, for example, is not.
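Whether a field stores term vectors is set on its FieldType. If you need a TextField with term vectors, you can pass the TextField's FieldType to the FieldType constructor to create a mutable copy of it, for example: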
FieldType myFieldType = new FieldType(TextField.TYPE_STORED);
myFieldType.setStoreTermVectors(true);
doc.add(new Field("contents", fieldContents, myFieldType));
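For completeness, here is a rough round-trip sketch of the idea above: index with term vectors enabled, then guard against a null result when reading. It assumes Lucene 5.x with lucene-core and lucene-analyzers-common on the classpath. The class name TermVectorSketch, the helper names buildIndex and docFreqs, the in-memory RAMDirectory, and the sample texts are all made up for illustration and are not part of the question's project. Note also that the question reads the vector from "content" but builds its Term against "contents"; the sketch uses a single constant so the two cannot drift apart.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, written against the Lucene 5.x API.
class TermVectorSketch {
    // One constant for the field name, used for both indexing and reading.
    static final String FIELD = "contents";

    // Builds a small in-memory index whose FIELD stores term vectors.
    static Directory buildIndex(String... texts) throws IOException {
        Directory dir = new RAMDirectory();
        // Mutable copy of the TextField type, with term vectors switched on.
        FieldType withVectors = new FieldType(TextField.TYPE_STORED);
        withVectors.setStoreTermVectors(true);
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            for (String text : texts) {
                Document doc = new Document();
                doc.add(new Field(FIELD, text, withVectors));
                writer.addDocument(doc);
            }
        }
        return dir;
    }

    // Maps each term of one document to its index-wide document frequency.
    // Returns an empty map when the field has no term vector for that document.
    static Map<String, Integer> docFreqs(Directory dir, int docNum, String field)
            throws IOException {
        Map<String, Integer> result = new LinkedHashMap<>();
        try (IndexReader reader = DirectoryReader.open(dir)) {
            Terms termVector = reader.getTermVector(docNum, field);
            if (termVector == null) {
                return result; // no term vectors stored: skip instead of throwing an NPE
            }
            TermsEnum itr = termVector.iterator();
            BytesRef term;
            while ((term = itr.next()) != null) {
                result.put(term.utf8ToString(), reader.docFreq(new Term(field, term)));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Directory dir = buildIndex("apple banana", "apple cherry");
        System.out.println(docFreqs(dir, 0, FIELD));
        // A misspelled field name yields an empty map rather than an exception.
        System.out.println(docFreqs(dir, 0, "content"));
    }
}
```

An existing index that was built without term vectors has to be re-indexed before getTermVector returns anything; the null check only keeps the reader from crashing in the meantime.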