Lucene: certain keywords in queries (e.g. "TO" in range queries) are case sensitive

Time: 2015-06-30 13:04:58

Tags: lucene

In Lucene, thanks to the standard analyzer, searches are case insensitive by default. This is what users expect, and it works fine.

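However, a few words, such as "TO" in range queries or "AND"/"OR", are treated as keywords only in a specific case: they are case sensitive. This is not what users expect.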

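  1. Is there a reason for this? Lucene basically "just works" by default, so this is a bit surprising. Maybe there is a good reason behind it and I shouldn't touch it.
  2. How can I make these keywords case insensitive? Since the rest of the query is case insensitive by default, could I simply convert the whole query to uppercase? Would I run into any problems if I did that? Is there a better way?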

1 answer:

Answer 0: (score: 2)

Is there a reason for this?

The real question here might not be "why does Lucene do this?", but rather "why does Google do this?", as I believe Google's use of this pattern predates Lucene's. Regardless, though, the reasoning isn't too hard to deduce. There needs to be a way of differentiating the word "and" from the query operator "AND".

Say my query is: Jack and Jill went up the hill

I'm just searching for a phrase that happens to contain the word "and". The end result I want is (eliminating stop words, and such):

field:jack field:jill field:went field:up field:hill

Rather than:

+field:jack +field:jill field:went field:up field:hill

If the word is uppercased, it's a decent indicator the user intended the word as an operator.

If every "and" became an operator, users might be confused why a search for "bread and butter pickles" (becomes +bread +butter pickles) turns up hits about toast, but not about other types of pickles.

The same goes for lists of things, like "Abby, Ben, Chris, Dave and Elmer" (becomes abby ben chris +dave +elmer): every hit would be required to contain Dave and Elmer, while the rest of the names would be optional.


How to make them case insensitive?

Uppercasing the whole thing, or every instance of an AND, OR or TO, could be a bit problematic. Take these, for example:

  • [to TO tz] works, [TO TO TZ] throws an exception
  • and another thing works, AND ANOTHER THING throws an exception

You could check for a ParseException after uppercasing, and try parsing the original query in that case. Might create a bit of an inconsistency, but it beats just failing entirely.
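
A minimal sketch of that fallback, assuming you only uppercase the bare words and/or/to rather than the whole query. The class `OperatorCaseFallback`, the `Parser` interface and both method names are hypothetical and exist only for illustration; `Parser` is a stand-in for whatever parser you actually use, so nothing here calls Lucene directly.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OperatorCaseFallback {

    /** Hypothetical stand-in for a real parser (e.g. a Lucene query parser). */
    @FunctionalInterface
    interface Parser<T> {
        T parse(String query) throws Exception;
    }

    // Whole-word and/or/to in any case. Assumption: these are the only
    // keywords you care about; the sketch does not skip quoted phrases
    // or field names that happen to be spelled like a keyword.
    private static final Pattern KEYWORD =
            Pattern.compile("\\b(and|or|to)\\b", Pattern.CASE_INSENSITIVE);

    /** Uppercases the keyword-like words and leaves everything else untouched. */
    static String uppercaseKeywords(String query) {
        Matcher m = KEYWORD.matcher(query);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String upper = m.group().toUpperCase(Locale.ROOT);
            m.appendReplacement(sb, Matcher.quoteReplacement(upper));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    /**
     * Tries the uppercased query first; if the parser rejects it,
     * falls back to the query exactly as the user typed it.
     */
    static <T> T parseLeniently(String query, Parser<T> parser) throws Exception {
        try {
            return parser.parse(uppercaseKeywords(query));
        } catch (Exception e) {
            // In real code, catch only the parser's own exception type.
            return parser.parse(query);
        }
    }
}
```

With Lucene's classic `QueryParser`, you would presumably pass something like `q -> queryParser.parse(q)` as the parser and narrow the catch to `ParseException`. Keep in mind that this still turns every successfully parsed "and" into an operator, so the "bread and butter pickles" issue described above applies.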