Lucene: efficiently getting the offsets of a set of terms in documents

Time: 2015-09-30 09:52:00

Tags: lucene

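Suppose I have indexed a set of documents, and I am now given a set of terms that I know were produced by the indexing process. I want to get every occurrence of each term, i.e. which document it is in and at what offset. For each term, I use one PostingsEnum to iterate over the set of documents in which the term appears; then, within each document, I use a second PostingsEnum, obtained from that document's term vector, which carries the offset information for the term in that document.

However, this is not very efficient, because it nests one loop inside another and can be very slow. The code is below. Any suggestions on whether this can be done in a better way?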

Field schema:

<field name="terms" type="token_ngram" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>

Code:

IndexReader indexReader = ...; // init an index reader
Set<String> termSet = ...; // set containing e.g. 10000 terms
String field = "terms"; // the field defined in the schema above
for (String term : termSet) {
    BytesRef termBytes = new BytesRef(term); // encodes the term as UTF-8
    // get a PostingsEnum used to iterate through the docs containing the term;
    // this "postings" does not have valid offset information (see comment below)
    PostingsEnum postings = MultiFields.getTermDocsEnum(indexReader, field, termBytes);
    /* I also tried:
     * PostingsEnum postings =
     *        MultiFields.getTermDocsEnum(indexReader, field, termBytes, PostingsEnum.OFFSETS);
     * but the resulting "postings" object does not contain valid offset info either (always -1)
     */
    if (postings == null) {
        continue; // term not present in this field
    }

    // now go through each document
    int docId = postings.nextDoc();
    while (docId != PostingsEnum.NO_MORE_DOCS) {
        // get the term vector for that document
        TermsEnum it = indexReader.getTermVector(docId, field).iterator();
        // find the term of interest
        it.seekExact(termBytes);
        // get its posting info; this will contain offset info
        PostingsEnum postingsInDoc = it.postings(null, PostingsEnum.OFFSETS);

        // From Line A to Line B below, if I replace "postingsInDoc" with "postings",
        // startOffset() and endOffset() always return -1.
        postingsInDoc.nextDoc(); // Line A

        int totalFreq = postingsInDoc.freq();
        for (int i = 0; i < totalFreq; i++) {
            postingsInDoc.nextPosition();
            System.out.println(postingsInDoc.startOffset() + ", " + postingsInDoc.endOffset());
        } // Line B

        docId = postings.nextDoc();
    }
}
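
One possible direction, offered as a sketch and not as a tested Lucene solution: as far as I know, startOffset() and endOffset() on a PostingsEnum return -1 when offsets were not indexed into the postings themselves, and termOffsets="true" only stores them in the term vectors. Lucene can index offsets alongside positions (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS; in Solr, I believe the field attribute is storeOffsetsWithPositions="true"), which should let a single PostingsEnum per term supply the offsets and remove the per-document term vector lookup. The toy model below is plain Java, not Lucene code; the class OffsetPostingsSketch and its methods are hypothetical names, and the whitespace tokenizer is an assumption made purely for illustration. It only shows the shape of the data: when each posting carries its offsets, one lookup per term yields every (document, start, end) triple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of postings that carry character offsets (not Lucene code). */
final class OffsetPostingsSketch {

    /** One occurrence of a term: document id plus start/end character offsets. */
    static final class Occurrence {
        final int docId;
        final int startOffset;
        final int endOffset; // exclusive

        Occurrence(int docId, int startOffset, int endOffset) {
            this.docId = docId;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return "doc=" + docId + " [" + startOffset + "," + endOffset + ")";
        }
    }

    // term -> occurrences, in the order documents were added
    private final Map<String, List<Occurrence>> postings = new HashMap<>();

    /** Splits the text on whitespace and records every token with its offsets. */
    void addDocument(int docId, String text) {
        int i = 0;
        while (i < text.length()) {
            // skip whitespace between tokens
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // consume one token
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                postings.computeIfAbsent(text.substring(start, i), k -> new ArrayList<>())
                        .add(new Occurrence(docId, start, i));
            }
        }
    }

    /** One lookup per term; there is no second, per-document lookup. */
    Map<String, List<Occurrence>> occurrencesOf(Collection<String> terms) {
        Map<String, List<Occurrence>> result = new LinkedHashMap<>();
        for (String term : terms) {
            List<Occurrence> hits = postings.get(term);
            if (hits != null) { // terms that were never indexed are skipped
                result.put(term, hits);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        OffsetPostingsSketch index = new OffsetPostingsSketch();
        index.addDocument(0, "red fox red");
        index.addDocument(1, "blue fox");
        index.occurrencesOf(Arrays.asList("red", "fox", "absent"))
             .forEach((term, hits) -> System.out.println(term + " -> " + hits));
    }
}
```

Whether the equivalent change pays off in a real index depends on re-indexing with offsets in the postings and on the index size growth that implies, so it would need to be measured against the term vector approach above.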

0 answers:

No answers