在汇总中启用doc_value时为什么会看到字段数据?

时间:2018-11-28 10:28:24

标签: elasticsearch

基于Elastic Documents,除文本(经分析的字符串)外,每种类型都支持doc_values,我认为该类型在可用时应在聚合中完全省略fielddata

但是,对我而言并非如此,每当我基于keywordip类型进行术语汇总时,我都会看到它们以fieldata的形式加载,尽管其他情况并未发生类型(例如 session_id 作为long类型)

这是正确的行为吗?如果为true,如何防止创建fielddata

我正在使用Elasticsearch 6.5,这是我的映射

{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 0,
      "codec": "best_compression"
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "time": {
          "type": "date",
          "format": "epoch_millis"
        },
        "session_token": {
          "type": "keyword"
        },
        "session_ref": {
          "type": "keyword"
        },
        "session_id": {
          "type": "long"
        },
        "src": {
          "type": "ip"
        },
        "version": {
          "type": "byte"
        }
      }
    }
  }
}

这是一个示例聚合,导致fielddata被加载

GET test_ind/_search?size=0
  {
  "aggs" : {
    "by_token":{
      "terms":{ 
        "field": "token",
        "size": 100
      }
    }
  }
  }

这是聚合后的字段数据状态

"test_ind" : {
  "uuid" : "DiB6d7EgSXm7jeiSgoo-mQ",
  "primaries" : {
    "fielddata" : {
      "memory_size_in_bytes" : 1564696,
      "evictions" : 0,
      "fields" : {
        "session_ref" : {
          "memory_size_in_bytes" : 0
        },
        "session_token" : {
          "memory_size_in_bytes" : 1564696
        }
      }
    }
  },
  "total" : {
    "fielddata" : {
      "memory_size_in_bytes" : 1564696,
      "evictions" : 0,
      "fields" : {
        "session_ref" : {
          "memory_size_in_bytes" : 0
        },
        "session_token" : {
          "memory_size_in_bytes" : 1564696
        }
      }
    }
  }
}

这是细分统计

"test_ind" : {
  "uuid" : "DiB6d7EgSXm7jeiSgoo-mQ",
  "primaries" : {
    "segments" : {
      "count" : 8,
      "memory_in_bytes" : 472939,
      "terms_memory_in_bytes" : 423365,
      "stored_fields_memory_in_bytes" : 3504,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 0,
      "points_memory_in_bytes" : 41598,
      "doc_values_memory_in_bytes" : 4472,
      "index_writer_memory_in_bytes" : 0,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "total" : {
    "segments" : {
      "count" : 8,
      "memory_in_bytes" : 472939,
      "terms_memory_in_bytes" : 423365,
      "stored_fields_memory_in_bytes" : 3504,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 0,
      "points_memory_in_bytes" : 41598,
      "doc_values_memory_in_bytes" : 4472,
      "index_writer_memory_in_bytes" : 0,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  }
}

1 个答案:

答案 0 :(得分:0)

Apparently Global Ordinals memory usage are shown in fielddata

可以在映射中将“全局顺序”设置为“急切”或“延迟”,前者将在刷新时强制加载它们,而后者在查询时(默认)强制加载

为防止在术语聚合中使用全局序号,我们可以使用"execution_hint": "map",在我的情况下为:

GET test_ind/_search?size=0
  {
  "aggs" : {
    "by_token":{
      "terms":{ 
        "field": "token",
        "execution_hint": "map"
        "size": 100
      }
    }
  }
  }

尽管它有其自身的警告,但使用更多的内存来执行查询,并且运行速度较慢。