Elasticsearch term vector mapping options and index size

Time: 2016-08-28 13:03:02

Tags: elasticsearch lucene

ES allows storing term_vector information with various options (with_positions_offsets, etc.). The default option (i.e. not passing any explicit mapping) appears to store the same information as the with_positions_offsets option, but with a smaller index size. Does anyone know why?

Here are some examples (in Sense):

default

PUT test_default_text
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "string"
        }
      }
    }
  }
}


PUT test_default_text/doc/1
{
  "text": "The good news is that we brought even more improvements to the document store in Lucene 5.0. More and more users are indexing huge amounts of data and in such cases the bottleneck is often I/O, which can be improved by heavier compression. Lucene 5.0 still has the same default codec as Lucene 4.1 but now allows you to use DEFLATE (the compression algorithm behind zip, gzip and png) instead of LZ4, if you would like to have better compression. We know this is something which has been long awaited, especially by our logging users."
}

with_positions_offsets

PUT test_full_text
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":        "string",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}


PUT test_full_text/doc/1
{
  "text": "The good news is that we brought even more improvements to the document store in Lucene 5.0. More and more users are indexing huge amounts of data and in such cases the bottleneck is often I/O, which can be improved by heavier compression. Lucene 5.0 still has the same default codec as Lucene 4.1 but now allows you to use DEFLATE (the compression algorithm behind zip, gzip and png) instead of LZ4, if you would like to have better compression. We know this is something which has been long awaited, especially by our logging users."
}

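Store size

GET test_default_text/_stats/store

...
  "store": {
        "size_in_bytes": 5661,
        "throttle_time_in_millis": 0
      }
...


GET test_full_text/_stats/store

...
  "store": {
        "size_in_bytes": 6373,
        "throttle_time_in_millis": 0
      }
...

The default-mapped index is smaller, yet it seems to contain the same information: submitting

GET test_default_text/doc/1/_termvectors?fields=text

returns term vector data including positions and offsets. Even setting

"term_vector": "yes"

creates a larger index (size 6217 here) but returns only a subset of the term vector data that the default returns, i.e. the index that stores "less" is larger in size.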
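The request for the "term_vector": "yes" variant is not shown in the post. A minimal sketch of how it could be set up, following the same pattern as the other two mappings, is given here; the index name test_yes_text is hypothetical and the document body would be the same text as in the other examples:

```
# test_yes_text is a hypothetical index name, not taken from the original post
PUT test_yes_text
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":        "string",
          "term_vector": "yes"
        }
      }
    }
  }
}

# after indexing the same document as doc 1, compare the store size
GET test_yes_text/_stats/store
```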

This seems to be stable, and it is more pronounced on larger indices.

Does anyone know what is going on here?

Thanks!

1 Answer:

Answer 0: (score: 0)

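I find the behavior quirky too. You should take a look at "Example 2. Generating term vectors on the fly" on the documentation page for term vectors in current ES:

"Term vectors which are not explicitly stored in the index are automatically computed on the fly. The following request returns all information and statistics for the fields in document 1 even though the terms haven't been explicitly stored in the index. Note that for the field text, the terms are not re-generated."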
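Applied to the default index from the question, such a request might look like the sketch below. The offsets, positions, term_statistics and field_statistics flags are parameters of the term vectors API; the values chosen here are an assumption for illustration. Since test_default_text has no term_vector setting in its mapping, the response would be computed on the fly rather than read from stored term vectors, which would be consistent with the smaller store size reported above.

```
# sketch only: flag values are illustrative, not taken from the original answer
GET test_default_text/doc/1/_termvectors
{
  "fields": ["text"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}
```

Running the same request against test_full_text should return comparable positions and offsets, but there they would come from the term vectors stored at index time.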