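Suppose I have indexed a set of documents and am now given a set of terms that I know were produced by the indexing process. I want to get every occurrence of each term, i.e., which document it appears in and at what offsets. For each term, I use a PostingsEnum to iterate over the documents in which the term occurs; then, for each of those documents, I use a second PostingsEnum, obtained from the document's term vector, which carries the offset information for the term in that document.
But this is not very efficient: it is a loop inside a loop, with a term vector lookup for every (term, document) pair, and it can be very slow. The code is below. Any suggestions on whether this can be done in a better way?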
Field schema:
<field name="terms" type="token_ngram" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>
Code:
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

IndexReader indexReader = ...//init an index reader
Set<String> termSet = .... //set containing e.g., 10000 terms.
for (String term : termSet) {
    // BytesRef(CharSequence) encodes as UTF-8; term.getBytes() would use the platform charset
    BytesRef termBytes = new BytesRef(term);
    // get a PostingsEnum used to iterate through the docs containing the term
    // this "postings" does not have valid offset information (see comment below)
    PostingsEnum postings =
        MultiFields.getTermDocsEnum(indexReader, "terms", termBytes);
    /* I also tried:
     * PostingsEnum postings =
     *     MultiFields.getTermDocsEnum(indexReader, "terms", termBytes, PostingsEnum.OFFSETS);
     * But the resulting "postings" object also does not contain valid offset info (always -1)
     */
    if (postings == null) {
        continue; // the term does not occur in the field
    }
    // now go through each document
    int docId = postings.nextDoc();
    while (docId != PostingsEnum.NO_MORE_DOCS) {
        // get the term vector for that document
        // (ngramInfoFieldname is assumed to name the "terms" field from the schema above)
        Terms vector = indexReader.getTermVector(docId, ngramInfoFieldname);
        TermsEnum it = (vector == null) ? null : vector.iterator();
        // find the term of interest; skip the document if it is missing from the vector
        if (it != null && it.seekExact(termBytes)) {
            // get its posting info. this will contain offset info
            PostingsEnum postingsInDoc = it.postings(null, PostingsEnum.OFFSETS);
            // From Line A to Line B below, if I replace "postingsInDoc" with "postings",
            // "postings.startOffset()" and "postings.endOffset()" always return -1
            postingsInDoc.nextDoc(); // Line A
            int totalFreq = postingsInDoc.freq();
            for (int i = 0; i < totalFreq; i++) {
                postingsInDoc.nextPosition();
                System.out.println(postingsInDoc.startOffset() + "," + postingsInDoc.endOffset());
            } // Line B
        }
        docId = postings.nextDoc();
    }
}
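
To make the cost concrete, here is a rough sketch of one possible restructuring, with plain Java collections standing in for the index. None of it is Lucene API: the class `OffsetLookupSketch`, the method `collectOffsets`, and the three maps are hypothetical stand-ins for the postings lists and the per-document term vectors. The idea is to group the wanted terms by document first, so that each document's term vector would be fetched once rather than once per (term, document) pair. Whether this actually helps on a real index is an assumption I have not measured.

```java
import java.util.*;

public class OffsetLookupSketch {

    /**
     * postings:    term  -> ids of the documents containing it (stand-in for the postings lists)
     * termVectors: docId -> (term -> list of {startOffset, endOffset}) (stand-in for term vectors)
     * Returns:     term  -> (docId -> list of {startOffset, endOffset})
     */
    static Map<String, Map<Integer, List<int[]>>> collectOffsets(
            Map<String, List<Integer>> postings,
            Map<Integer, Map<String, List<int[]>>> termVectors,
            Set<String> wantedTerms) {

        // Pass 1: walk the postings once per term and group the wanted terms by document.
        Map<Integer, List<String>> termsByDoc = new TreeMap<>();
        for (String term : wantedTerms) {
            List<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptyList());
            for (int docId : docs) {
                termsByDoc.computeIfAbsent(docId, k -> new ArrayList<>()).add(term);
            }
        }

        // Pass 2: look up each document's term vector exactly once,
        // then read the offsets of every wanted term found in it.
        Map<String, Map<Integer, List<int[]>>> result = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> entry : termsByDoc.entrySet()) {
            int docId = entry.getKey();
            Map<String, List<int[]>> vector = termVectors.get(docId);
            if (vector == null) {
                continue; // no term vector stored for this document
            }
            for (String term : entry.getValue()) {
                List<int[]> offsets = vector.get(term);
                if (offsets != null) {
                    result.computeIfAbsent(term, k -> new TreeMap<>()).put(docId, offsets);
                }
            }
        }
        return result;
    }
}
```

The trade-off in this sketch is memory: the intermediate document-to-terms map has to hold every (term, document) pair before any offsets are read.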