Question

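I am currently struggling to get good performance from date range queries on a core of about 18 million documents with NRT indexing in Solr 4.10 (CDH 5.14). I have tried several strategies, but they all seem to fail.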

Each document has multiple versions (10 to 100), each valid during a distinct, non-overlapping time period (startTime / endTime).

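The query pattern is as follows: query on referenceNumber (or other criteria), but return only the documents that are valid at a given referenceDate (day precision). 75% of the queries use a referenceDate within the last 30 days. Queries without the referenceDate perform very well, but adding the referenceDate filter makes them about 100 times slower, even when the filter is forced to run as a post filter.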

Below are some performance tests run by a Python script that issues HTTP queries and records the QTime for 100 different referenceNumber values.

+----+-------------------------------------+----------------------+--------------------------+
| ID | Query                               | Results              | Comment                  |
+----+-------------------------------------+----------------------+--------------------------+
| 1  | q=referenceNumber:{referenceNumber} | 100 calls in <10ms   | Performance OK           |
+----+-------------------------------------+----------------------+--------------------------+
| 2  | q=referenceNumber:{referenceNumber} | 99 calls in <10ms    | 1 call to warm up        |
|    | &fq=startDate:[* TO NOW/DAY]        | 1 call   in >=1000ms | the cache then all       |
|    | AND    endDate:[NOW/DAY TO *]       |                      | queries hit the filter   |
|    |                                     |                      | cache. Problem: as       |
|    |                                     |                      | soon as new documents    |
|    |                                     |                      | come in, they invalidate |
|    |                                     |                      | the cache.               |
+----+-------------------------------------+----------------------+--------------------------+
| 3  | q=referenceNumber:{referenceNumber} | 99 calls in >=500ms  | The average of           |
|    | &fq={!cache=false cost=200}         | 1  call  in >=1000ms | calls is 734.5ms.        |
|    | startDate:[* TO NOW/DAY]            |                      |                          |
|    | AND    endDate:[NOW/DAY TO *]       |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
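
For readers who want to reproduce measurements like the ones above, here is a minimal sketch of such a timing script. It is not the original script: the endpoint URL, the core name `mycore`, and the bucket boundaries are assumptions chosen to match the tables.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

# Assumed endpoint; replace host, port and core name with your own.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_params(reference_number, fq=None):
    """Build the request parameters for one test call."""
    params = [("q", "referenceNumber:%s" % reference_number), ("wt", "json")]
    if fq:
        params.append(("fq", fq))
    return params


def qtime_ms(params, base_url=SOLR_SELECT_URL):
    """Run one query and return the QTime (in ms) reported by Solr."""
    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["responseHeader"]["QTime"]


def bucket(qtime):
    """Map a QTime to a bucket similar to those used in the tables."""
    if qtime < 10:
        return "<10ms"
    if qtime < 100:
        return ">=10ms"
    if qtime < 500:
        return ">=100ms"
    if qtime < 1000:
        return ">=500ms"
    return ">=1000ms"


def summarize(qtimes):
    """Count how many calls fall into each bucket."""
    return Counter(bucket(t) for t in qtimes)
```

A run would call `qtime_ms(build_params(ref, fq))` for each of the 100 reference numbers and pass the collected values to `summarize`.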

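How can an additional date range filter query make things 100 times slower? Based on this blog post, I expected the date range query to perform about as well as the query without the additional filter: http://yonik.com/advanced-filter-caching-in-solr/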

Or is the only option to change the softCommit / hardCommit delays, create 30 warm-up fq's for the last 30 days, and accept poor performance for 25% of the queries?
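
To make the warm-up idea concrete, here is a small sketch that generates one filter string per day for the last 30 days. It assumes the field names `startDate` and `endDate` from the tables and that client queries would send exactly the same day-rounded timestamps, so that they hit the same filter cache entries. How the strings are fed to Solr (for example through a warming listener or a script run after each commit) is left open.

```python
from datetime import date, timedelta


def warmup_filters(today, days=30):
    """Return one validity filter per day, newest first."""
    filters = []
    for offset in range(days):
        day = today - timedelta(days=offset)
        # Day precision: midnight UTC of the reference date.
        stamp = day.strftime("%Y-%m-%dT00:00:00Z")
        filters.append(
            "startDate:[* TO %s] AND endDate:[%s TO *]" % (stamp, stamp)
        )
    return filters
```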

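Edit 1: Thanks for the answer. Unfortunately, using integers instead of tdate does not seem to bring any performance gain. It can only take advantage of caching, as in query ID 2 above, which means we would need a warm-up strategy of 30+ fq's.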

+----+-------------------------------------+----------------------+--------------------------+
| ID | Query                               | Results              | Comment                  |
+----+-------------------------------------+----------------------+--------------------------+
| 4  | fq={!cache=false}                   | 35 calls in <10ms    |                          |
|    | referenceNumber:{referenceNumber}   | 65 calls in >10ms    |                          |
+----+-------------------------------------+----------------------+--------------------------+
| 5  | fq={!cache=false}                   | 9 calls in >100ms    |                          |
|    | referenceNumber:{referenceNumber}   | 6 calls in >500ms    |                          |
|    | AND versionNumber:[2 TO *]          | 85 calls in >1000ms  |                          |
+----+-------------------------------------+----------------------+--------------------------+

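Edit 2: It seems that moving my referenceNumber from q to fq and setting different costs improves the query time (not perfect, but better). What is strange, though, is that a cost >= 100 is supposed to make the filter run as a postFilter, yet changing the cost from 20 to 200 does not seem to affect performance at all. Does anyone know how to check whether an fq parameter is executed as a post filter?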

+----+-------------------------------------+----------------------+--------------------------+
| 6  | fq={!cache=false cost=0}            | 89 calls in >100ms   |                          |
|    | referenceNumber:{referenceNumber}   | 11 calls in >500ms   |                          |
|    | &fq={!cache=false cost=200}         |                      |                          |
|    | startDate:[* TO NOW] AND            |                      |                          |
|    | endDate:[NOW TO *]                  |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
| 7  | fq={!cache=false cost=0}            | 36 calls in >100ms   |                          |
|    | referenceNumber:{referenceNumber}   | 64 calls in >500ms   |                          |
|    | &fq={!cache=false cost=20}          |                      |                          |
|    | startDate:[* TO NOW] AND            |                      |                          |
|    | endDate:[NOW TO *]                  |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
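
One possible explanation for the cost setting having no effect: according to the blog post linked above, post filtering only applies to query types that implement the PostFilter interface, such as `{!frange}`, and plain range queries are not among them. The sketch below only builds request parameters that express the same validity test as two `frange` filters. It is an untested assumption that this suits this index (for instance, how documents without an `endDate` behave is not covered), and whether it is actually faster would need to be measured.

```python
def post_filter_params(reference_number, cost=200):
    """Build parameters using frange filters that can run as post filters."""
    if cost < 100:
        raise ValueError("post filtering requires cost >= 100")
    local = "{!frange l=0 cache=false cost=%d}" % cost
    return [
        ("q", "referenceNumber:%s" % reference_number),
        # ms(a,b) is a - b in milliseconds; l=0 keeps documents where a >= b.
        ("fq", local + "ms(NOW/DAY,startDate)"),  # startDate <= NOW/DAY
        ("fq", local + "ms(endDate,NOW/DAY)"),    # endDate >= NOW/DAY
    ]
```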

Answer 1

Hi, I have another solution for you that should give good performance for the same query against Solr.

My suggestion is to store the dates in int format; please see the example below.

 Your start date: 2017-03-01
 Your end date: 2029-03-01

 Suggested int format:
 Start date: 20170301
 End date: 20290301

When you run the same query using int values instead of dates, it performs as expected.

 So your query will be:
q=referenceNumber:{referenceNumber}
&fq=startNewDate:[* TO YYYYMMDD]
AND    endNewDate:[YYYYMMDD TO *]
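
As an illustration of the int encoding suggested here (2017-03-01 becoming 20170301), a minimal sketch of the conversion and the resulting filter string could look like this. The helper names are hypothetical, and `startNewDate` / `endNewDate` are the field names used in this answer.

```python
from datetime import date


def date_to_int(day):
    """Encode a date as an integer in YYYYMMDD form."""
    return day.year * 10000 + day.month * 100 + day.day


def int_date_filter(day):
    """Build the fq selecting documents valid on the given day."""
    value = date_to_int(day)
    return "startNewDate:[* TO %d] AND endNewDate:[%d TO *]" % (value, value)
```

Because the encoding keeps year, month and day in order of significance, comparing the integers gives the same ordering as comparing the dates.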

Hope this helps.
