When I search for "iphone", I get the following records and scores:
Record1: FieldName - DisplayName: "Iphone" FieldName - Name: "Iphone"
11.654595 = (MATCH) sum of:
11.654595 = (MATCH) max plus 0.01 times others of:
7.718274 = (MATCH) weight(DisplayName:iphone^10.0 in 915195), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
11.598244 = (MATCH) fieldWeight(DisplayName:iphone in 915195), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
1.0 = fieldNorm(field=DisplayName, doc=915195)
11.577413 = (MATCH) weight(Name:iphone^15.0 in 915195), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
11.598244 = (MATCH) fieldWeight(Name:iphone in 915195), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
1.0 = fieldNorm(field=Name, doc=915195)
Record2: FieldName - DisplayName: "Iphone Book" FieldName - Name: "Iphone Book"
7.284122 = (MATCH) sum of:
7.284122 = (MATCH) max plus 0.01 times others of:
4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 453681), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 453681), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=DisplayName, doc=453681)
7.2358828 = (MATCH) weight(Name:iphone^15.0 in 453681), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(Name:iphone in 453681), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=Name, doc=453681)
Record3: FieldName - DisplayName: "iPhone" FieldName - Name: "iPhone"
7.284122 = (MATCH) sum of:
7.284122 = (MATCH) max plus 0.01 times others of:
4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 5737775), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 5737775), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=DisplayName, doc=5737775)
7.2358828 = (MATCH) weight(Name:iphone^15.0 in 5737775), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(Name:iphone in 5737775), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=Name, doc=5737775)
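Why do Record2 and Record3 have the same score when Record2 has 3 words and Record3 has just one word? Record3 should therefore have a higher relevance than Record2. Why is the fieldNorm the same for both Record2 and Record3?
QueryParser: Dismax. FieldType: the text field type, as in the default solrconfig.xml.
Adding the data feed: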
Record1:Iphone
{
"ListPrice":1184.526,
"ShipsTo":1,
"OID":"190502",
"EAN":"9780596804299",
"ISBN":"0596804296",
"Author":"Pogue, David",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"24.9900",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":0,
"COD":0,
"PublicationDate":"2009-08-07 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Hardware",
"Binding":"Paperback",
"Category_fq":"Non Classifiable",
"ShippingCharges":"0",
"OIDType":8,
"Pages":"397",
"CallOrder":"0",
"TrackInventory":"Ingram",
"Author_fq":"Pogue, David",
"DisplayName":"Iphone",
"url":"/iphone-pogue-david/books/9780596804299.htm",
"CurrencyType":"USD",
"SubSubCategory":"Handheld Devices",
"Mask":0,
"Publisher":"Oreilly & Associates Inc",
"Name":"Iphone",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9780596804299"
}
Record2:Iphone Book
{
"ListPrice":1184.526,
"ShipsTo":1,
"OID":"94694",
"EAN":"9780321534101",
"ISBN":"0321534107",
"Author":"Kelby, Scott/ White, Terry",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"24.9900",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":1,
"COD":0,
"PublicationDate":"2007-08-13 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Handheld Devices",
"Binding":"Paperback",
"BAMcategory_src":"Computers",
"Category_fq":"Computers",
"ShippingCharges":"0",
"OIDType":8,
"Pages":"219",
"CallOrder":"0",
"TrackInventory":"Ingram",
"Author_fq":"Kelby, Scott/ White, Terry",
"DisplayName":"The Iphone Book",
"url":"/iphone-book-kelby-scott-white-terry/books/9780321534101.htm",
"CurrencyType":"USD",
"SubSubCategory":" Handheld Devices",
"BAMcategory_fq":"Computers",
"Mask":0,
"Publisher":"Pearson P T R",
"Name":"The Iphone Book",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9780321534101"
}
Record3:iPhone
{
"ListPrice":278.46,
"ShipsTo":1,
"OID":"694715",
"EAN":"9781411423527",
"ISBN":"1411423526",
"Author":"Quamut (COR)",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"5.9500",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":0,
"COD":0,
"PublicationDate":"2010-08-03 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Hardware",
"Binding":"Paperback",
"Category_fq":"Non Classifiable",
"ShippingCharges":"0",
"OIDType":8,
"CallOrder":"0",
"TrackInventory":"BNT",
"Author_fq":"Quamut (COR)",
"DisplayName":"iPhone",
"url":"/iphone-quamut-cor/books/9781411423527.htm",
"CurrencyType":"USD",
"SubSubCategory":"Handheld Devices",
"Mask":0,
"Publisher":"Sterling Pub Co Inc",
"Name":"iPhone",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9781411423527"
}
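Answer 0 (score: 5)
fieldNorm takes the field length into account, i.e. the number of terms.
The field type used for the DisplayName and Name fields is text, which includes the stopword filter and the word delimiter filter.
Record 1 - Iphone
Would generate a single token - Iphone
Record 2 - The Iphone Book
Would generate 2 tokens - Iphone, Book
"The" would be removed by the stopword filter.
Record 3 - iPhone
Would also generate 2 tokens - i, Phone
Since iPhone contains a case change, the word delimiter filter with splitOnCaseChange splits it into 2 tokens, i and Phone, which produces the same field norm as Record 2.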
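The token counts above can be reproduced with a rough sketch. This is not the Solr analysis chain: the class name FieldLengthSketch, the three-word stopword list, and the regex-based case-change split are assumptions made for illustration, and the real word delimiter filter has more options than are modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FieldLengthSketch {
    // Hypothetical stopword list; the real one comes from the field type's configuration.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an"));

    // Simplified stand-in for the analysis chain: whitespace split, stopword removal,
    // then a split at every lowercase-to-uppercase boundary (the splitOnCaseChange idea).
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty() || STOPWORDS.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            // Zero-width split between a lowercase and an uppercase letter.
            for (String part : word.split("(?<=\\p{Lower})(?=\\p{Upper})")) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Length norm before it is compressed for storage: 1 / sqrt(number of terms).
    public static double rawLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Iphone", "The Iphone Book", "iPhone"}) {
            List<String> tokens = analyze(name);
            System.out.printf("%-16s -> %s, raw lengthNorm = %.4f%n",
                    name, tokens, rawLengthNorm(tokens.size()));
        }
    }
}
```

Under these assumptions "Iphone" yields one token, while "The Iphone Book" and "iPhone" both yield two, so Records 2 and 3 share the same raw length norm of about 0.707. The explain output shows 0.625 rather than 0.707 because the norm is stored in a lossy single-byte encoding.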
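Answer 1 (score: 3)
This is an answer to user1021590's follow-up question/answer about the "da vinci code" search example.
The reason all the documents get the same score is a subtle implementation detail of lengthNorm. The Lucene TFIDFSimilarity documentation states the following about norm(t, d):
However the resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
If you dig into the code, you will find that this float-to-byte encoding is implemented as follows: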
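public static byte floatToByte315(float f)
{
    int bits = Float.floatToRawIntBits(f);
    // Drop the low 21 bits, keeping the sign, the exponent and the top mantissa bits.
    int smallfloat = bits >> (24 - 3);
    if (smallfloat <= ((63 - 15) << 3))
    {
        // Underflow: zero and negatives map to 0, tiny positives to the smallest non-zero byte.
        return (bits <= 0) ? (byte) 0 : (byte) 1;
    }
    if (smallfloat >= ((63 - 15) << 3) + 0x100)
    {
        // Overflow: clamp to the largest byte value (0xFF).
        return -1;
    }
    return (byte) (smallfloat - ((63 - 15) << 3));
}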
and the decoding of that byte back to a float is done as follows:
public static float byte315ToFloat(byte b)
{
    if (b == 0)
        return 0.0f;
    int bits = (b & 0xff) << (24 - 3);
    bits += (63 - 15) << 24;
    return Float.intBitsToFloat(bits);
}
lengthNorm is calculated as 1 / sqrt(number of terms in field). This value is then encoded for storage using floatToByte315. For a field with 3 terms, we get:
floatToByte315( 1/sqrt(3.0) ) = 120
and for a field with 4 terms, we get:
floatToByte315( 1/sqrt(4.0) ) = 120
so both of them are decoded to:
byte315ToFloat(120) = 0.5
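The round trip can be checked with a small standalone program. The two conversion methods are copied from the listings above; the class name NormQuantizationSketch and the helper storedNorm are made up for this illustration and are not part of Lucene.

```java
public class NormQuantizationSketch {
    // Copied from the encoding listing above.
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Copied from the decoding listing above.
    public static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Hypothetical helper: the norm as it comes back from the index for a
    // field with numTerms terms (no boosts applied).
    public static float storedNorm(int numTerms) {
        float raw = (float) (1.0 / Math.sqrt(numTerms));
        return byte315ToFloat(floatToByte315(raw));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            float raw = (float) (1.0 / Math.sqrt(n));
            byte encoded = floatToByte315(raw);
            System.out.printf("terms=%d raw=%.4f byte=%d decoded=%s%n",
                    n, raw, encoded, byte315ToFloat(encoded));
        }
    }
}
```

For 1 term the stored norm is 1.0, for 2 terms it is 0.625 (the value shown in the explain output of the original question), and for both 3 and 4 terms it is 0.5, which is why fields of those two lengths cannot be told apart by their norm.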
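The documentation also states this:
The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
Update: As of Solr 4.10, this implementation and the corresponding statement are part of DefaultSimilarity.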