Weighting Lucene search result scores for specific fields that share the same name

Time: 2011-04-14 02:16:15

Tags: java scala lucene

I am currently using Lucene as our full-text search engine, but we need to rank the search results according to a specific field.

For example, suppose our index contains the following three documents, which have identical contents apart from the id field.

    val document01 = new Document()
    val field0100 = new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED)
    val field0101 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
    val field0102 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
    document01.add(field0100)
    document01.add(field0101)
    document01.add(field0102)

    val document02 = new Document()
    val field0200 = new Field("id", "2", Field.Store.YES, Field.Index.ANALYZED)
    val field0201 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
    val field0202 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
    document02.add(field0200)
    document02.add(field0201)
    document02.add(field0202)

    val document03 = new Document()
    val field0300 = new Field("id", "3", Field.Store.YES, Field.Index.ANALYZED)
    val field0301 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
    val field0302 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
    document03.add(field0300)
    document03.add(field0301)
    document03.add(field0302)

Now, when I search for Linux with IndexSearcher, I get the following results:

Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>

When I search for Windows, I get the same results in the same order.

Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>

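The question is: is it possible to weight a specific field when building the index? For example, I would like field0201 to score higher when it matches at search time.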

In other words, when I search for Linux, I would like to get the results in the following order:

Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>

And when I search for Windows, the results should stay in the original order, as follows:

Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>

I tried using field0201.setBoost(), but it changes the ordering of the search results whether I search for Linux or for Windows.

1 answer:

Answer 0 (score: 4)

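I think this should be possible if you put the data from the different sources into fields with different names. You can set a boost at index time, but if you use the same name, I believe the boost applies to all fields with that name, judging by the setBoost javadoc. So, if you do this: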

    val field0201 = new Field("content-high", "This is a test: Linux", ...)
    field0201.setBoost(1.5f)
    val field0202 = new Field("content-low", "This is a test: Windows", ...)

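and then query with content-high:Linux content-low:Linux (using a BooleanQuery, with both clauses set to the term Linux), then the boost on content-high should raise the document's score when the match is in that field. Use explain to check whether it works.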
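To make the suggestion concrete, here is a minimal sketch of the whole round trip, assuming Lucene 3.1 (IndexWriterConfig and Version.LUCENE_31), an in-memory RAMDirectory, and StandardAnalyzer. The object name SplitFieldBoostSketch and the helpers makeDoc, buildIndex, and search are hypothetical names introduced only for illustration. It also assumes every document uses the same content-high / content-low layout and that only document 2 gets the 1.5f boost. Because StandardAnalyzer lowercases tokens at index time, the sketch lowercases the term before building each TermQuery; a query parser would normally do that for you.

```scala
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field}
import org.apache.lucene.index.{IndexReader, IndexWriter, IndexWriterConfig, Term}
import org.apache.lucene.search.{BooleanClause, BooleanQuery, IndexSearcher, TermQuery}
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

object SplitFieldBoostSketch {

  // Hypothetical helper: puts the two texts into differently named fields
  // so that an index-time boost on one does not leak into the other.
  private def makeDoc(id: String, highBoost: Float): Document = {
    val doc = new Document()
    doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED))
    val high = new Field("content-high", "This is a test: Linux",
      Field.Store.YES, Field.Index.ANALYZED)
    high.setBoost(highBoost) // index-time boost, folded into this field's norm
    doc.add(high)
    doc.add(new Field("content-low", "This is a test: Windows",
      Field.Store.YES, Field.Index.ANALYZED))
    doc
  }

  // Builds a three-document in-memory index; only document 2 is boosted.
  def buildIndex(): RAMDirectory = {
    val dir = new RAMDirectory()
    val config = new IndexWriterConfig(Version.LUCENE_31,
      new StandardAnalyzer(Version.LUCENE_31))
    val writer = new IndexWriter(dir, config)
    writer.addDocument(makeDoc("1", 1.0f))
    writer.addDocument(makeDoc("2", 1.5f))
    writer.addDocument(makeDoc("3", 1.0f))
    writer.close()
    dir
  }

  // Searches both fields for the same term and returns the ids in ranked order.
  def search(dir: RAMDirectory, word: String): Seq[String] = {
    val term = word.toLowerCase // match the lowercased tokens in the index
    val query = new BooleanQuery()
    query.add(new TermQuery(new Term("content-high", term)), BooleanClause.Occur.SHOULD)
    query.add(new TermQuery(new Term("content-low", term)), BooleanClause.Occur.SHOULD)

    val reader = IndexReader.open(dir)
    val searcher = new IndexSearcher(reader)
    try {
      val hits = searcher.search(query, 10).scoreDocs
      // Print the score breakdown for each hit to see where the boost enters.
      hits.foreach(h => println(searcher.explain(query, h.doc)))
      hits.map(h => searcher.doc(h.doc).get("id")).toSeq
    } finally {
      searcher.close()
      reader.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val dir = buildIndex()
    println(search(dir, "Linux"))
    println(search(dir, "Windows"))
  }
}
```

With this layout, a search for Linux only touches content-high, where document 2 carries the boost, while a search for Windows only touches content-low, where all three documents are indexed identically. The explain output printed for each hit is the place to confirm that the boost shows up in the fieldNorm of content-high and nowhere else.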