Elasticsearch store size is 1,000 times the document byte size

Time: 2017-02-08 00:53:50

Tags: elasticsearch

Note: this is cross-posted on the Elastic discussion forum (https://discuss.elastic.co/t/store-size-1-000-times-the-document-byte-size/74258/4).

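I am seeing a store.size roughly 1,000 times the byte size of my documents. I have a very simple mapping with very small documents (less than 1 KB). I have compared my mapping with the mapping Elasticsearch reports internally and they are identical, so no dynamic mapping appears to be taking place.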

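So far I have ingested 60,437 documents and the store.size is 19.6 GB (an average of about 300 KB per document), but the average byte size of the JSON (String.getBytes().length) is 300-400 bytes per document. In another run, the average was around 1 MB to 3 MB per document.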
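As a quick sanity check on those figures, the arithmetic can be sketched as follows. This is only an illustration: the decimal interpretation of "GB" and the 350-byte midpoint for the JSON size are assumptions, not values reported by Elasticsearch.

```python
# Figures quoted in the question. Decimal units (1 GB = 10**9 bytes) are an assumption.
STORE_SIZE_BYTES = 19.6e9
DOC_COUNT = 60_437
JSON_BYTES_PER_DOC = 350  # assumed midpoint of the reported 300-400 bytes


def store_bytes_per_doc(store_size_bytes: float, doc_count: int) -> float:
    """Average on-disk bytes attributed to each indexed document."""
    return store_size_bytes / doc_count


def blow_up_factor(store_size_bytes: float, doc_count: int, json_bytes: float) -> float:
    """How many times larger the store is than the raw JSON."""
    return store_bytes_per_doc(store_size_bytes, doc_count) / json_bytes


if __name__ == "__main__":
    per_doc = store_bytes_per_doc(STORE_SIZE_BYTES, DOC_COUNT)
    factor = blow_up_factor(STORE_SIZE_BYTES, DOC_COUNT, JSON_BYTES_PER_DOC)
    print(f"{per_doc / 1000:.0f} KB per document, roughly {factor:.0f}x the JSON size")
```

With these inputs the store works out to a little over 300 KB per document, which is where the "about 1,000 times" in the title comes from.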

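I am running Elasticsearch 5.2 on an m4.2xlarge EC2 instance. Elasticsearch is installed with almost all defaults, apart from what was needed to pass the bootstrap checks and bind to a non-local IP. I have allocated 16 GB to Elasticsearch (half of the physical memory).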

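I previously ran Elasticsearch 2.x and was ingesting FAR more fields and much larger documents than these few fields, and that came to only about 20 KB per document, which was manageable but still large.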

If anyone can point to anything that would fix this, I would be grateful. Or is there an ES 5.x setting I have not seen that addresses this?

Below is my mapping.

{
    "settings": {
        "index.query.default_field": "tweetText"
    },
    "mappings": {
        "tweet": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "tweetDate": {
                    "type": "date",
                    "format": "EEE MMM dd HH:mm:ss Z YYYY||strict_date_optional_time||epoch_millis"
                },
                "userId": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "screenName": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "tweetText": {
                    "type": "text"
                },
                "cleanedText": {
                    "type": "text"
                },
                "tweetId": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "location": {
                    "type": "geo_point",
                    "ignore_malformed": true
                },
                "placeName": {
                    "type": "keyword",
                    "doc_values": true,
                    "eager_global_ordinals": false
                },
                "placeCountry": {
                    "type": "keyword",
                    "doc_values": true,
                    "eager_global_ordinals": true
                },
                "placeCountryCode": {
                    "type": "keyword",
                    "doc_values": false,
                    "eager_global_ordinals": false,
                    "index": false
                },
                "placeBoundingBox": {
                    "type": "geo_shape",
                    "tree": "quadtree",
                    "precision": "1m"
                },
                "resolvedUrls": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "hashtags": {
                    "type": "text"
                },
                "mentions": {
                    "type": "text"
                },
                "geoInferences": {
                    "properties": {
                        "matchedName": {
                            "type": "text"
                        },
                        "asciiName": {
                            "type": "keyword",
                            "doc_values": true,
                            "eager_global_ordinals": false
                        },
                        "country": {
                            "type": "keyword",
                            "doc_values": true,
                            "eager_global_ordinals": true
                        },
                        "county": {
                            "type": "text"
                        },
                        "countryCode": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "city": {
                            "type": "text"
                        },
                        "admin1Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin2Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin3Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin4Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "confidence": {
                            "type": "float",
                            "doc_values": false,
                            "ignore_malformed": false,
                            "index": false
                        },
                        "coordinates": {
                            "type": "geo_point",
                            "ignore_malformed": true
                        }
                    }
                },
                "temporalInferences": {
                    "type": "date",
                    "ignore_malformed": true
                }
            }
        }
    }
}

Sample document:

{
  "_index": "twitter",
  "_type": "tweet",
  "_id": "AVoZivLca9LOhnR10_ll",
  "_score": null,
  "_source": {
    "tweetDate": 1486487211000,
    "userId": "123456789",
    "screenName": "removed",
    "tweetText": "RT @wef: America’s dominance is over. By 2030, we'll have a handful of global powers https://www.weforum.org/agenda/2016/11/america-s-dominance-is-over/?utm_content=buffer73cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer #wef17 https://twitter.com/wef/status/828994745200435200/photo/1",
    "cleanedText": "RT @wef: America s dominance is over. By 2030, we'll have a handful of global powers https://www.weforum.org/agenda/2016/11/america-s-dominance-is-over/?utm_content=buffer73cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer #wef17 https://twitter.com/wef/status/828994745200435200/photo/1",
    "tweetId": "829013568288796672",
    "resolvedUrls": [
      "https://www.weforum.org/agenda/2016/11/america-s-dominance-is-over/?utm_content=buffer73cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"
    ],
    "hashtags": [
      "wef17"
    ],
    "mentions": [
      "wef"
    ],
    "geoInferences": [
      {
        "matchedName": "America",
        "asciiName": "United States",
        "country": "United States",
        "countryCode": "US",
        "coordinates": [
          -98.5,
          39.76
        ],
        "admin1Code": "00",
        "admin2Code": "",
        "admin3Code": "",
        "admin4Code": "",
        "confidence": 1
      }
    ],
    "temporalInferences": [
      1893474000000
    ]
  },
  "fields": {
    "temporalInferences": [
      1893474000000
    ],
    "tweetDate": [
      1486487211000
    ]
  },
  "sort": [
    1486487211000
  ]
}

Output of:
GET /_cat/indices/twitter?pri&v&h=health,index,pri,rep,docs.count,mt,pri,rep,docs.count,store.size,pri.store.size

health | index | pri | rep | docs.count | mt | pri.mt | store.size | pri.store.size | pri.store.size
yellow | twitter | 5 | 1 | 26860 | 74 | 74 | 10.1gb | 10.1gb | 10.1gb

Output of:
GET /twitter/_stats

{
  "_shards": {
    "total": 10,
    "successful": 5,
    "failed": 0
  },
  "_all": {
    "primaries": {
      "docs": {
        "count": 26860,
        "deleted": 0
      },
      "store": {
        "size_in_bytes": 11027965678,
        "throttle_time_in_millis": 0
      },
      "indexing": {
        "index_total": 27397,
        "index_time_in_millis": 3568991,
        "index_current": 1,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 195961
      },
      "get": {
        "total": 0,
        "time_in_millis": 0,
        "exists_total": 0,
        "exists_time_in_millis": 0,
        "missing_total": 0,
        "missing_time_in_millis": 0,
        "current": 0
      },
      "search": {
        "open_contexts": 0,
        "query_total": 55,
        "query_time_in_millis": 294,
        "query_current": 0,
        "fetch_total": 36,
        "fetch_time_in_millis": 3209,
        "fetch_current": 0,
        "scroll_total": 0,
        "scroll_time_in_millis": 0,
        "scroll_current": 0,
        "suggest_total": 0,
        "suggest_time_in_millis": 0,
        "suggest_current": 0
      },
      "merges": {
        "current": 0,
        "current_docs": 0,
        "current_size_in_bytes": 0,
        "total": 76,
        "total_time_in_millis": 350987,
        "total_docs": 45409,
        "total_size_in_bytes": 4027595474,
        "total_stopped_time_in_millis": 0,
        "total_throttled_time_in_millis": 48633,
        "total_auto_throttle_in_bytes": 82233108
      },
      "refresh": {
        "total": 857,
        "total_time_in_millis": 2994887,
        "listeners": 0
      },
      "flush": {
        "total": 15,
        "total_time_in_millis": 291939
      },
      "warmer": {
        "current": 0,
        "total": 876,
        "total_time_in_millis": 534
      },
      "query_cache": {
        "memory_size_in_bytes": 0,
        "total_count": 0,
        "hit_count": 0,
        "miss_count": 0,
        "cache_size": 0,
        "cache_count": 0,
        "evictions": 0
      },
      "fielddata": {
        "memory_size_in_bytes": 24808,
        "evictions": 0
      },
      "completion": {
        "size_in_bytes": 0
      },
      "segments": {
        "count": 139,
        "memory_in_bytes": 186032131,
        "terms_memory_in_bytes": 185758725,
        "stored_fields_memory_in_bytes": 43976,
        "term_vectors_memory_in_bytes": 0,
        "norms_memory_in_bytes": 77888,
        "points_memory_in_bytes": 714,
        "doc_values_memory_in_bytes": 150828,
        "index_writer_memory_in_bytes": 1316180948,
        "version_map_memory_in_bytes": 42250,
        "fixed_bit_set_memory_in_bytes": 0,
        "max_unsafe_auto_id_timestamp": -1,
        "file_sizes": {

        }
      },
      "translog": {
        "operations": 11997,
        "size_in_bytes": 5555179
      },
      "request_cache": {
        "memory_size_in_bytes": 0,
        "evictions": 0,
        "hit_count": 195,
        "miss_count": 195
      },
      "recovery": {
        "current_as_source": 0,
        "current_as_target": 0,
        "throttle_time_in_millis": 0
      }
    },
    "total": {
      "docs": {
        "count": 26860,
        "deleted": 0
      },
      "store": {
        "size_in_bytes": 11027965678,
        "throttle_time_in_millis": 0
      },
      "indexing": {
        "index_total": 27397,
        "index_time_in_millis": 3568991,
        "index_current": 1,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 195961
      },
      "get": {
        "total": 0,
        "time_in_millis": 0,
        "exists_total": 0,
        "exists_time_in_millis": 0,
        "missing_total": 0,
        "missing_time_in_millis": 0,
        "current": 0
      },
      "search": {
        "open_contexts": 0,
        "query_total": 55,
        "query_time_in_millis": 294,
        "query_current": 0,
        "fetch_total": 36,
        "fetch_time_in_millis": 3209,
        "fetch_current": 0,
        "scroll_total": 0,
        "scroll_time_in_millis": 0,
        "scroll_current": 0,
        "suggest_total": 0,
        "suggest_time_in_millis": 0,
        "suggest_current": 0
      },
      "merges": {
        "current": 0,
        "current_docs": 0,
        "current_size_in_bytes": 0,
        "total": 76,
        "total_time_in_millis": 350987,
        "total_docs": 45409,
        "total_size_in_bytes": 4027595474,
        "total_stopped_time_in_millis": 0,
        "total_throttled_time_in_millis": 48633,
        "total_auto_throttle_in_bytes": 82233108
      },
      "refresh": {
        "total": 857,
        "total_time_in_millis": 2994887,
        "listeners": 0
      },
      "flush": {
        "total": 15,
        "total_time_in_millis": 291939
      },
      "warmer": {
        "current": 0,
        "total": 876,
        "total_time_in_millis": 534
      },
      "query_cache": {
        "memory_size_in_bytes": 0,
        "total_count": 0,
        "hit_count": 0,
        "miss_count": 0,
        "cache_size": 0,
        "cache_count": 0,
        "evictions": 0
      },
      "fielddata": {
        "memory_size_in_bytes": 24808,
        "evictions": 0
      },
      "completion": {
        "size_in_bytes": 0
      },
      "segments": {
        "count": 139,
        "memory_in_bytes": 186032131,
        "terms_memory_in_bytes": 185758725,
        "stored_fields_memory_in_bytes": 43976,
        "term_vectors_memory_in_bytes": 0,
        "norms_memory_in_bytes": 77888,
        "points_memory_in_bytes": 714,
        "doc_values_memory_in_bytes": 150828,
        "index_writer_memory_in_bytes": 1316180948,
        "version_map_memory_in_bytes": 42250,
        "fixed_bit_set_memory_in_bytes": 0,
        "max_unsafe_auto_id_timestamp": -1,
        "file_sizes": {

        }
      },
      "translog": {
        "operations": 11997,
        "size_in_bytes": 5555179
      },
      "request_cache": {
        "memory_size_in_bytes": 0,
        "evictions": 0,
        "hit_count": 195,
        "miss_count": 195
      },
      "recovery": {
        "current_as_source": 0,
        "current_as_target": 0,
        "throttle_time_in_millis": 0
      }
    }
  },
  "indices": {
    "twitter": {
      "primaries": {
        "docs": {
          "count": 26860,
          "deleted": 0
        },
        "store": {
          "size_in_bytes": 11027965678,
          "throttle_time_in_millis": 0
        },
        "indexing": {
          "index_total": 27397,
          "index_time_in_millis": 3568991,
          "index_current": 1,
          "index_failed": 0,
          "delete_total": 0,
          "delete_time_in_millis": 0,
          "delete_current": 0,
          "noop_update_total": 0,
          "is_throttled": false,
          "throttle_time_in_millis": 195961
        },
        "get": {
          "total": 0,
          "time_in_millis": 0,
          "exists_total": 0,
          "exists_time_in_millis": 0,
          "missing_total": 0,
          "missing_time_in_millis": 0,
          "current": 0
        },
        "search": {
          "open_contexts": 0,
          "query_total": 55,
          "query_time_in_millis": 294,
          "query_current": 0,
          "fetch_total": 36,
          "fetch_time_in_millis": 3209,
          "fetch_current": 0,
          "scroll_total": 0,
          "scroll_time_in_millis": 0,
          "scroll_current": 0,
          "suggest_total": 0,
          "suggest_time_in_millis": 0,
          "suggest_current": 0
        },
        "merges": {
          "current": 0,
          "current_docs": 0,
          "current_size_in_bytes": 0,
          "total": 76,
          "total_time_in_millis": 350987,
          "total_docs": 45409,
          "total_size_in_bytes": 4027595474,
          "total_stopped_time_in_millis": 0,
          "total_throttled_time_in_millis": 48633,
          "total_auto_throttle_in_bytes": 82233108
        },
        "refresh": {
          "total": 857,
          "total_time_in_millis": 2994887,
          "listeners": 0
        },
        "flush": {
          "total": 15,
          "total_time_in_millis": 291939
        },
        "warmer": {
          "current": 0,
          "total": 876,
          "total_time_in_millis": 534
        },
        "query_cache": {
          "memory_size_in_bytes": 0,
          "total_count": 0,
          "hit_count": 0,
          "miss_count": 0,
          "cache_size": 0,
          "cache_count": 0,
          "evictions": 0
        },
        "fielddata": {
          "memory_size_in_bytes": 24808,
          "evictions": 0
        },
        "completion": {
          "size_in_bytes": 0
        },
        "segments": {
          "count": 139,
          "memory_in_bytes": 186032131,
          "terms_memory_in_bytes": 185758725,
          "stored_fields_memory_in_bytes": 43976,
          "term_vectors_memory_in_bytes": 0,
          "norms_memory_in_bytes": 77888,
          "points_memory_in_bytes": 714,
          "doc_values_memory_in_bytes": 150828,
          "index_writer_memory_in_bytes": 1316180948,
          "version_map_memory_in_bytes": 42250,
          "fixed_bit_set_memory_in_bytes": 0,
          "max_unsafe_auto_id_timestamp": -1,
          "file_sizes": {

          }
        },
        "translog": {
          "operations": 11997,
          "size_in_bytes": 5555179
        },
        "request_cache": {
          "memory_size_in_bytes": 0,
          "evictions": 0,
          "hit_count": 195,
          "miss_count": 195
        },
        "recovery": {
          "current_as_source": 0,
          "current_as_target": 0,
          "throttle_time_in_millis": 0
        }
      },
      "total": {
        "docs": {
          "count": 26860,
          "deleted": 0
        },
        "store": {
          "size_in_bytes": 11027965678,
          "throttle_time_in_millis": 0
        },
        "indexing": {
          "index_total": 27397,
          "index_time_in_millis": 3568991,
          "index_current": 1,
          "index_failed": 0,
          "delete_total": 0,
          "delete_time_in_millis": 0,
          "delete_current": 0,
          "noop_update_total": 0,
          "is_throttled": false,
          "throttle_time_in_millis": 195961
        },
        "get": {
          "total": 0,
          "time_in_millis": 0,
          "exists_total": 0,
          "exists_time_in_millis": 0,
          "missing_total": 0,
          "missing_time_in_millis": 0,
          "current": 0
        },
        "search": {
          "open_contexts": 0,
          "query_total": 55,
          "query_time_in_millis": 294,
          "query_current": 0,
          "fetch_total": 36,
          "fetch_time_in_millis": 3209,
          "fetch_current": 0,
          "scroll_total": 0,
          "scroll_time_in_millis": 0,
          "scroll_current": 0,
          "suggest_total": 0,
          "suggest_time_in_millis": 0,
          "suggest_current": 0
        },
        "merges": {
          "current": 0,
          "current_docs": 0,
          "current_size_in_bytes": 0,
          "total": 76,
          "total_time_in_millis": 350987,
          "total_docs": 45409,
          "total_size_in_bytes": 4027595474,
          "total_stopped_time_in_millis": 0,
          "total_throttled_time_in_millis": 48633,
          "total_auto_throttle_in_bytes": 82233108
        },
        "refresh": {
          "total": 857,
          "total_time_in_millis": 2994887,
          "listeners": 0
        },
        "flush": {
          "total": 15,
          "total_time_in_millis": 291939
        },
        "warmer": {
          "current": 0,
          "total": 876,
          "total_time_in_millis": 534
        },
        "query_cache": {
          "memory_size_in_bytes": 0,
          "total_count": 0,
          "hit_count": 0,
          "miss_count": 0,
          "cache_size": 0,
          "cache_count": 0,
          "evictions": 0
        },
        "fielddata": {
          "memory_size_in_bytes": 24808,
          "evictions": 0
        },
        "completion": {
          "size_in_bytes": 0
        },
        "segments": {
          "count": 139,
          "memory_in_bytes": 186032131,
          "terms_memory_in_bytes": 185758725,
          "stored_fields_memory_in_bytes": 43976,
          "term_vectors_memory_in_bytes": 0,
          "norms_memory_in_bytes": 77888,
          "points_memory_in_bytes": 714,
          "doc_values_memory_in_bytes": 150828,
          "index_writer_memory_in_bytes": 1316180948,
          "version_map_memory_in_bytes": 42250,
          "fixed_bit_set_memory_in_bytes": 0,
          "max_unsafe_auto_id_timestamp": -1,
          "file_sizes": {

          }
        },
        "translog": {
          "operations": 11997,
          "size_in_bytes": 5555179
        },
        "request_cache": {
          "memory_size_in_bytes": 0,
          "evictions": 0,
          "hit_count": 195,
          "miss_count": 195
        },
        "recovery": {
          "current_as_source": 0,
          "current_as_target": 0,
          "throttle_time_in_millis": 0
        }
      }
    }
  }
}
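One way to read these stats is to check which component dominates segment memory. The sketch below copies a few values from the "primaries" section above and computes their shares. The interpretation that a very high terms share points at the terms dictionary of some indexed field is a reading of the numbers, not something the stats API states.

```python
# Values copied from the "primaries" section of the _stats output above.
STORE_SIZE_IN_BYTES = 11_027_965_678
DOC_COUNT = 26_860
SEGMENTS = {
    "memory_in_bytes": 186_032_131,
    "terms_memory_in_bytes": 185_758_725,
    "stored_fields_memory_in_bytes": 43_976,
    "norms_memory_in_bytes": 77_888,
    "points_memory_in_bytes": 714,
    "doc_values_memory_in_bytes": 150_828,
}


def memory_shares(segments: dict) -> dict:
    """Fraction of total segment memory used by each component."""
    total = segments["memory_in_bytes"]
    return {
        name: value / total
        for name, value in segments.items()
        if name != "memory_in_bytes"
    }


if __name__ == "__main__":
    print(f"store bytes per document: {STORE_SIZE_IN_BYTES / DOC_COUNT:,.0f}")
    ranked = sorted(memory_shares(SEGMENTS).items(), key=lambda kv: -kv[1])
    for name, share in ranked:
        print(f"{name}: {share:.2%}")
```

Here terms memory accounts for nearly all of the segment memory, which is consistent with one field producing an unusually large number of indexed terms.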

Edit 1: I have found the source of the problem, although I do not know why it is a problem. It appears to be the bounding box.

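As soon as I removed the bounding box from the ingested data, the index was a normal size (600 documents -> 550 KB), but as soon as I added the bounding box back (with a brand new index), the size skyrocketed (3,593 documents -> 1.6 GB), with only 84 documents containing a bounding box.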
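Those two runs allow a rough estimate of what each bounding box costs on disk. The sketch below assumes decimal units and assumes that documents without a bounding box cost the same in both runs. Neither assumption was verified, so treat the result as an order of magnitude only.

```python
# Run without bounding boxes (figures from the paragraph above, decimal units assumed).
BASELINE_DOCS = 600
BASELINE_BYTES = 550e3

# Run with bounding boxes.
BBOX_RUN_DOCS = 3_593
BBOX_RUN_BYTES = 1.6e9
DOCS_WITH_BBOX = 84


def bytes_per_bounding_box() -> float:
    """Extra store bytes attributable to each document that carries a bounding box."""
    baseline_per_doc = BASELINE_BYTES / BASELINE_DOCS
    expected_without_bbox = baseline_per_doc * BBOX_RUN_DOCS
    extra = BBOX_RUN_BYTES - expected_without_bbox
    return extra / DOCS_WITH_BBOX


if __name__ == "__main__":
    print(f"about {bytes_per_bounding_box() / 1e6:.0f} MB per bounding box")
```

Under these assumptions each of the 84 shapes accounts for tens of megabytes, while an ordinary document costs under 1 KB.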

Below is the JSON for the bounding box:

"placeBoundingBox": {
    "type": "polygon",
    "coordinates": [
      [
        [
          -71.191421,
          42.227797
        ],
        [
          -71.191421,
          42.399542
        ],
        [
          -70.986004,
          42.399542
        ],
        [
          -70.986004,
          42.227797
        ],
        [
          -71.191421,
          42.227797
        ]
      ]
    ]
  }

The mapping associated with the bounding box (from calling GET /INDEX_NAME):

"placeBoundingBox": {
    "type": "geo_shape",
    "tree": "quadtree",
    "precision": "1.0m"
  }

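To prove that the mapping is indeed working and that a proper geo_shape is being created (even though Kibana does not recognize it as a geo_shape), I ran the following query and got successful hits: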

GET /_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {

        }
      },
      "filter": {
        "geo_shape": {
          "placeBoundingBox": {
            "shape": {
              "type": "polygon",
              "coordinates": [
                [
                  [
                    -71.191421,
                    42.227797
                  ],
                  [
                    -71.191421,
                    42.399542
                  ],
                  [
                    -70.986004,
                    42.399542
                  ],
                  [
                    -70.986004,
                    42.227797
                  ],
                  [
                    -71.191421,
                    42.227797
                  ]
                ]
              ]
            },
            "relation": "within"
          }
        }
      }
    }
  }
}

I would like to keep the bounding box. Is there something wrong with the mapping or the data? Is 1.0 m too fine?

1 Answer:

Answer 0 (score: 0):

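The problem was the precision in the mapping, which was simply a typo (our Elasticsearch 2.x index had a precision of 1km). One small letter makes all the difference...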

A precision of 1 meter ("1m") produces an extremely bloated index.
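To get a feel for why the precision matters so much, here is a rough back-of-the-envelope sketch for the bounding box from the question. It is not Elasticsearch's actual algorithm: the level formula, the flat-earth perimeter, and the idea that the number of finest-level cells scales with perimeter divided by precision are all simplifying assumptions, and the real prefix tree also indexes coarser ancestor and interior cells.

```python
import math

EARTH_EQUATOR_M = 40_075_017  # approximate equatorial circumference in meters
METERS_PER_DEGREE = EARTH_EQUATOR_M / 360


def approx_tree_levels(precision_m: float) -> int:
    """Quadtree depth at which one cell is no wider than precision_m at the equator."""
    return math.ceil(math.log2(EARTH_EQUATOR_M / precision_m))


def approx_perimeter_m(min_lon: float, min_lat: float, max_lon: float, max_lat: float) -> float:
    """Flat-earth estimate of a lon/lat box perimeter in meters."""
    mid_lat = math.radians((min_lat + max_lat) / 2)
    width = (max_lon - min_lon) * METERS_PER_DEGREE * math.cos(mid_lat)
    height = (max_lat - min_lat) * METERS_PER_DEGREE
    return 2 * (width + height)


def approx_boundary_cells(perimeter_m: float, precision_m: float) -> int:
    """Cells of side precision_m needed to trace the box outline (very rough)."""
    return math.ceil(perimeter_m / precision_m)


if __name__ == "__main__":
    # Corner coordinates of the bounding box shown in the question.
    perimeter = approx_perimeter_m(-71.191421, 42.227797, -70.986004, 42.399542)
    for label, meters in (("1m", 1), ("50m", 50), ("1km", 1000)):
        levels = approx_tree_levels(meters)
        cells = approx_boundary_cells(perimeter, meters)
        print(f"{label}: ~{levels} levels, ~{cells} boundary cells")
```

Under these assumptions the outline of this one box needs tens of thousands of finest-level cells at 1 m, against well under a hundred at 1 km, so a three-orders-of-magnitude difference in index size is plausible.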

Removing the "precision" field from the mapping altogether makes it default to 50 meters and gives a reasonably sized index.
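The two fixes described in this answer can be sketched as a small helper that builds the placeBoundingBox mapping fragment. The function name is hypothetical and exists only for illustration. The field values are taken from the mapping in the question, with the precision either corrected to "1km" or left out so the default applies.

```python
import json
from typing import Optional


def place_bounding_box_mapping(precision: Optional[str] = None) -> dict:
    """Build the geo_shape mapping fragment, omitting precision when none is given."""
    mapping = {"type": "geo_shape", "tree": "quadtree"}
    if precision is not None:
        mapping["precision"] = precision
    return mapping


if __name__ == "__main__":
    # Option 1: the value that was intended all along, as used in the 2.x index.
    print(json.dumps({"placeBoundingBox": place_bounding_box_mapping("1km")}, indent=4))
    # Option 2: no explicit precision, so the default is used.
    print(json.dumps({"placeBoundingBox": place_bounding_box_mapping()}, indent=4))
```

Either fragment would replace the "placeBoundingBox" entry in the original mapping. The precision of an existing field cannot simply be edited in place, so in practice this means creating a new index and reindexing.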