Question

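I'm indexing in Lucene and am only interested in getting the IDs of the relevant documents out of Lucene (i.e., not field values or any highlighting information). Given these requirements, which term vector option should I use without affecting search performance (speed) or quality (results)? I will also be using MoreLikeThis, so don't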

TermVector.YES—Records the unique terms that occurred, and their counts, in each document, but doesn’t store any positions or offsets information

TermVector.WITH_POSITIONS—Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets

TermVector.WITH_OFFSETS—Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions

TermVector.WITH_POSITIONS_OFFSETS—Stores unique terms and their counts, along with positions and offsets
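To make the difference between the four options concrete, here is a rough sketch of the per-document data each one would keep. It is a toy model in plain Java, not Lucene code: the class `TermVectorSketch`, the `Occurrence` holder, and the space-based tokenization are all made up for illustration, and a real analyzer may tokenize differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of what a term vector can record for one document.
// Hypothetical names; tokens are split on spaces, no Lucene analyzer is involved.
class TermVectorSketch {

    /** One occurrence of a term: its token position and character offsets. */
    static final class Occurrence {
        final int position;     // index of the token in the token stream
        final int startOffset;  // index of the first character (inclusive)
        final int endOffset;    // index just past the last character (exclusive)

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Groups every token occurrence under its term, sorted by term. */
    static Map<String, List<Occurrence>> analyze(String text) {
        Map<String, List<Occurrence>> vector = new TreeMap<>();
        int position = 0;
        int start = 0;
        while (start < text.length()) {
            int end = text.indexOf(' ', start);
            if (end < 0) {
                end = text.length();
            }
            if (end > start) { // skip empty tokens caused by repeated spaces
                String term = text.substring(start, end);
                vector.computeIfAbsent(term, k -> new ArrayList<>())
                      .add(new Occurrence(position++, start, end));
            }
            start = end + 1;
        }
        return vector;
    }

    public static void main(String[] args) {
        String doc = "blah date blah ID";
        for (Map.Entry<String, List<Occurrence>> e : analyze(doc).entrySet()) {
            List<Occurrence> occ = e.getValue();
            StringBuilder positions = new StringBuilder();
            StringBuilder offsets = new StringBuilder();
            for (Occurrence o : occ) {
                positions.append(o.position).append(' ');
                offsets.append(o.startOffset).append('-').append(o.endOffset).append(' ');
            }
            // YES keeps only term + count; WITH_POSITIONS adds the positions,
            // WITH_OFFSETS adds the offsets, WITH_POSITIONS_OFFSETS keeps both.
            System.out.println(e.getKey()
                    + " count=" + occ.size()
                    + " positions=" + positions.toString().trim()
                    + " offsets=" + offsets.toString().trim());
        }
    }
}
```

In this model, the size of each list corresponds to what TermVector.YES records, while the `position` and offset fields are the extra data the other three options add on top.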

Thanks.

Answer 1

It depends on your type of queries. If your IDs have any associated data, then you will want to have the positions and/or offsets.

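If you have a document like this: "blah blah blah date blah ID blah name blah"

and you just want to find that specific ID, then TermVector.YES is enough. However, if you want to find the ID based on its proximity to the date or name (using an advanced query), you will need the term positions as well.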
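The distinction can be sketched in a few lines of plain Java. This is only an illustration of why a proximity check needs positions while an exact lookup does not; it is not how Lucene evaluates queries, and the class `ProximitySketch` and its methods are hypothetical names, with whitespace splitting standing in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration: term presence vs. term proximity.
// Hypothetical helper, not part of Lucene.
class ProximitySketch {

    /** Maps each whitespace-separated token to the positions where it occurs. */
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            result.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
        }
        return result;
    }

    /** True if some occurrence of a is within maxDistance tokens of some occurrence of b. */
    static boolean near(Map<String, List<Integer>> pos, String a, String b, int maxDistance) {
        for (int pa : pos.getOrDefault(a, Collections.<Integer>emptyList())) {
            for (int pb : pos.getOrDefault(b, Collections.<Integer>emptyList())) {
                if (Math.abs(pa - pb) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> pos =
                positions("blah blah blah date blah ID blah name blah");

        // Exact lookup: only needs to know that the term occurs at all.
        System.out.println("contains ID: " + pos.containsKey("ID"));

        // Proximity lookup: impossible without the recorded positions.
        System.out.println("ID within 2 tokens of date: " + near(pos, "ID", "date", 2));
        System.out.println("ID within 1 token of date: " + near(pos, "ID", "date", 1));
    }
}
```

If only the terms and their counts were kept, `near` could not be answered at all, which is the trade-off described above.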

You can always just try it out. It's an easy enough change, assuming you don't have to unit test a billion-record index or something :)

BTW, check out the book "Lucene in Action", which covers all of this information.

Which term vector option to use in Lucene?
