Why does the keyword type take up more space than text in Elasticsearch?

Time: 2018-07-14 02:47:29

Tags: elasticsearch elasticsearch-5

Environment: Elasticsearch 5.5.1

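I have two indices in Elasticsearch, and the only difference between them is the message field: in index1 the message field has type keyword, while in index2 it has type text.

To make sure the comparison is not affected by other fields, I removed the message field and compared the sizes before and after.

Before removing the message field:

[image: before removing the message field]

After removing the message field:

[image: after removing the message field]

Clearly the message field takes up a lot of space, and the keyword type takes up far more space than text. What I don't understand is why keyword is so much larger than text. Can anyone help?

Here is the mapping of index1: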

    "mappings": {
      "system": {
        "dynamic": "true",
        "_all": {
          "enabled": false
        },
        "dynamic_date_formats": [
          "yyyy-MM-dd HH:mm:ss.SSS"
        ],
        "dynamic_templates": [
          {
            "geo2": {
              "match": "*_geo",
              "mapping": {
                "type": "geo_point"
              }
            }
          },
          {
            "strings2": {
              "match_mapping_type": "string",
              "mapping": {
                "type": "keyword"
              }
            }
          }
        ],
        "numeric_detection": false,
        "properties": {
          "@agent_timestamp": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss.SSS"
          },
          "@timestamp": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss.SSS"
          },
          "Kafkaspeed": {
            "type": "keyword"
          },
          "_index_name": {
            "type": "keyword"
          },
          "count": {
            "type": "long"
          },
          "datex": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss.SSS"
          },
          "defaultWriteESspeed": {
            "type": "double"
          },
          "filepathname": {
            "type": "keyword"
          },
          "jsonmessage": {
            "type": "text"
          },
          "key": {
            "type": "keyword"
          },
          "logcount": {
            "type": "long"
          },
          "loglevel": {
            "type": "keyword"
          },
          "message": {
            "type": "keyword"
          },
          "paredspeed": {
            "type": "float"
          },
          "seccount": {
            "type": "long"
          },
          "sn": {
            "type": "long"
          },
          "sourceName": {
            "type": "keyword"
          },
          "sourceip": {
            "type": "keyword"
          },
          "sourcename": {
            "type": "keyword"
          },
          "sourceport": {
            "type": "long"
          },
          "sucesscount": {
            "type": "long"
          },
          "time_str": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss.SSS"
          },
          "timestamp": {
            "type": "long"
          },
          "totalcount": {
            "type": "long"
          },
          "uniqueid": {
            "type": "keyword"
          }
        }
      }
    }

And its settings:

    "settings": {
      "index": {
        "refresh_interval": "1s",
        "number_of_shards": "3",
        "translog": {
          "flush_threshold_size": "1024mb",
          "sync_interval": "60s",
          "durability": "async"
        },
        "provided_name": "index1",
        "creation_date": "1531389785215",
        "analysis": {
          "analyzer": {
            "optionIK": {
              "filter": [
                "word_delimiter"
              ],
              "type": "custom",
              "tokenizer": "ik_max_word"
            }
          }
        },
        "number_of_replicas": "0",
        "uuid": "zd8oVbwUQbys1UJ8hJZRmQ",
        "version": {
          "created": "5050099"
        }
      }
    }

Here is the mapping of index2:

    "mappings": {
      "system": {
        "dynamic": "true",
        "_all": {
          "enabled": false
        },
        "dynamic_date_formats": [
          "yyyy-MM-dd HH:mm:ss.SSS"
        ],
        "dynamic_templates": [
          {
            "geo2": {
              "match": "*_geo",
              "mapping": {
                "type": "geo_point"
              }
            }
          },
          {
            "strings2": {
              "match_mapping_type": "string",
              "mapping": {
                "type": "keyword"
              }
            }
          }
        ],
        "numeric_detection": false,
        "properties": {
          "@agent_timestamp": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss.SSS"
          },
          "@timestamp": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss.SSS"
          },
          "CommunicationReturnCode": {
            "type": "keyword"
          },
          "Kafkaspeed": {
            "type": "keyword"
          },
          "_index_name": {
            "type": "keyword"
          },
          "action": {
            "type": "keyword"
          },
          "count": {
            "type": "long"
          },
          "datex": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss.SSS"
          },
          "defaultWriteESspeed": {
            "type": "double"
          },
          "filepathname": {
            "type": "keyword"
          },
          "jsonmessage": {
            "type": "text"
          },
          "key": {
            "type": "keyword"
          },
          "logcount": {
            "type": "long"
          },
          "loglevel": {
            "type": "keyword"
          },
          "message": {
            "type": "text"
          },
          "msgid": {
            "type": "keyword"
          },
          "msgname": {
            "type": "keyword"
          },
          "nodetype": {
            "type": "keyword"
          },
          "orgid": {
            "type": "keyword"
          },
          "orgname": {
            "type": "keyword"
          },
          "paredspeed": {
            "type": "float"
          },
          "processingState": {
            "type": "keyword"
          },
          "processingStatecode": {
            "type": "keyword"
          },
          "seccount": {
            "type": "long"
          },
          "sn": {
            "type": "long"
          },
          "sourceName": {
            "type": "keyword"
          },
          "sourceip": {
            "type": "keyword"
          },
          "sourcename": {
            "type": "keyword"
          },
          "sourceport": {
            "type": "long"
          },
          "sucesscount": {
            "type": "long"
          },
          "thread": {
            "type": "keyword"
          },
          "time_str": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss.SSS"
          },
          "timestamp": {
            "type": "long"
          },
          "totalcount": {
            "type": "long"
          },
          "transDescription": {
            "type": "keyword"
          },
          "transactionErrorCode": {
            "type": "keyword"
          },
          "transactionTimeConsuming": {
            "type": "keyword"
          },
          "transcode": {
            "type": "keyword"
          },
          "uniqueid": {
            "type": "keyword"
          }
        }
      }
    }

And its settings:

    "settings": {
      "index": {
        "refresh_interval": "1s",
        "number_of_shards": "2",
        "translog": {
          "flush_threshold_size": "1024mb",
          "sync_interval": "60s",
          "durability": "async"
        },
        "provided_name": "index2",
        "creation_date": "1531467294314",
        "analysis": {
          "analyzer": {
            "optionIK": {
              "filter": [
                "word_delimiter"
              ],
              "type": "custom",
              "tokenizer": "ik_max_word"
            }
          }
        },
        "number_of_replicas": "0",
        "uuid": "yROU2MrMTzip4VXH_zWEXQ",
        "version": {
          "created": "5050099"
        }
      }
    }

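Below is the index file structure of one of the two shards of the index with the text-type field: [image: text field], and of the index with the keyword-type field: [image: keyword field]

You can trust that both folders contain the same number of documents, and that the only difference is the type of the message field.

Can you explain this? Thanks a lot!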

1 Answer:

Answer 0 (score: 0):

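In Elasticsearch, keyword fields have doc_values enabled by default, while text fields do not (text fields do not support doc_values at all). This means that for your keyword field the entire field value is also stored in a column-oriented fashion, so that aggregations and sorting can be performed without relying on fielddata.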

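In addition, once a string has been tokenized, stemmed, lowercased, and so on, it compresses much better. Analysis breaks each message into terms that repeat across many documents, whereas a keyword field indexes each whole message as a single term that is usually unique.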
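To see what analysis does to a message, a request body like the one below could be sent to the `_analyze` API of the text index (for example `GET index2/_analyze`). The sample log line is made up for illustration; substitute one of your own messages.

```json
{
  "field": "message",
  "text": "2018-07-14 02:47:29.123 ERROR connection timed out, connection closed"
}
```

The response lists the individual terms produced for the text field. With the keyword mapping, the same string would be indexed as one single term.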

If you do not aggregate or sort on this field, you can try disabling doc_values on it.
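A minimal sketch of such a mapping, assuming the `system` type and `message` field from the question, with all other fields omitted. As far as I know, doc_values cannot be changed on an existing field, so this would have to go into a new index (the target index name is up to you) followed by a reindex.

```json
{
  "mappings": {
    "system": {
      "properties": {
        "message": {
          "type": "keyword",
          "doc_values": false
        }
      }
    }
  }
}
```

With doc_values disabled the field can still be searched with exact-match queries, but sorting and aggregating on it will no longer work.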