Question

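I am using the newly released Lucene 4, and I understand that the API related to document term vectors has changed a lot. I have read the migration docs and the various related blog and mailing list posts, and I believe I am using the API correctly. However, I always get a null Terms reference back from IndexReader.getTermVector(). This is what I'm doing: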

// Indexing, given "bodyString" as a String containing document text
Document doc = new Document();
doc.add(new TextField("body", bodyString, Field.Store.YES));
MyIndexWriter.addDocument(doc);


// much later, enumerating document term vectors for "body" field for every doc
for (int i = 0; i < Reader.maxDoc(); ++i) {
  final Terms terms = Reader.getTermVector(i, "body");
  if (terms != null) {
    int numTerms = 0;
    // record term occurrences for corpus terms above threshold
    TermsEnum term = terms.iterator(null);
    while (term.next() != null) {
      ++numTerms;
    }
    System.out.println("Document " + i + " had " + numTerms + " terms");
  }
  else {
    System.err.println("Document " + i + " had a null terms vector for body");
  }
}

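Of course, it prints that every doc has a null term vector, i.e. Reader.getTermVector(i, "body") always returns null.

When I look at the index in Luke, I can see documents with the body field stored. However, when I click the "TV" button (in the Documents tab) with the body field highlighted, Luke tells me "Term vectors not available". Is there some other option I need to set at index time to record this information?

Any ideas? Thanks!

Jon

Update: I should note that the IndexReader in question is an instance of SlowCompositeReaderWrapper wrapping a DirectoryReader. I use SlowCompositeReaderWrapper because I also want corpus term frequencies, and it isn't clear to me how to iterate over all the documents across all the IndexReader leaves (can doc IDs be reused between them? etc.)

Is SlowCompositeReaderWrapper the culprit?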

Answer 1

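According to the TextField API, it is "a field that is indexed and tokenized, without term vectors." If you want term vectors to be stored, you should use a Field and configure its FieldType to store term vectors.

Something like: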

Document doc = new Document();
FieldType type = new FieldType();
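type.setIndexed(true);           // make the field searchable
type.setStored(true);            // keep the original text in the index
type.setStoreTermVectors(true);  // record per-document term vectors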
Field field = new Field("body", bodyString, type);
doc.add(field);
MyIndexWriter.addDocument(doc);

Answer 2

You are using TextField, "a field that is indexed and tokenized, without term vectors". That is why you get null from getTermVector(). Instead of TextField, construct a Field with a custom FieldType on which setStoreTermVectors is set to true.
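To make the round trip concrete, here is a minimal sketch that indexes a few strings with term vectors enabled and then reads them back. It assumes Lucene 4.0 with the core and analyzers-common jars on the classpath. The class name TermVectorSketch, the method countVectorTerms, the in-memory RAMDirectory, and the choice of StandardAnalyzer are illustrative assumptions, not something taken from the question. It copies TextField.TYPE_STORED and switches on term vectors, which should be equivalent to building the FieldType by hand.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical helper class, written against the Lucene 4.0 API.
class TermVectorSketch {

    /**
     * Indexes each text as one document whose "body" field stores term
     * vectors, then returns the number of distinct terms found in each
     * document's term vector (-1 if the vector is missing).
     */
    static int[] countVectorTerms(String... texts) throws IOException {
        Directory dir = new RAMDirectory(); // in-memory index, for illustration only

        // Start from TextField's stored type and additionally enable term vectors.
        FieldType type = new FieldType(TextField.TYPE_STORED);
        type.setStoreTermVectors(true);
        type.freeze();

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(dir, config);
        for (String text : texts) {
            Document doc = new Document();
            doc.add(new Field("body", text, type));
            writer.addDocument(doc);
        }
        writer.close();

        // Read the vectors back through a plain DirectoryReader.
        DirectoryReader reader = DirectoryReader.open(dir);
        int[] counts = new int[reader.maxDoc()];
        for (int i = 0; i < reader.maxDoc(); ++i) {
            Terms terms = reader.getTermVector(i, "body");
            if (terms == null) {
                counts[i] = -1; // no term vector stored for this document
                continue;
            }
            TermsEnum termsEnum = terms.iterator(null);
            while (termsEnum.next() != null) {
                ++counts[i];
            }
        }
        reader.close();
        return counts;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Arrays.toString(
                countVectorTerms("quick brown fox", "quick quick fox")));
    }
}
```

Swapping the Field for a plain TextField in this sketch should reproduce the null result described in the question, since TextField's types do not store term vectors.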

In Lucene 4, IndexReader.getTermVector(docID, fieldName) returns null for every doc
