注意:这是在弹性搜索论坛(https://discuss.elastic.co/t/store-size-1-000-times-the-document-byte-size/74258/4)上交叉发布的。
我在store.size上的文件字节大小增加了大约1,000倍。我有一个非常简单的映射与非常小的文档(小于1kb),我已经将我的映射与Elasticsearch的内部映射进行了比较,它们是相同的,因此似乎没有任何动态映射。
到目前为止,我已经摄取了60,437个文档,并且store.size为19.6Gb(平均每个文档300kb),但JSON的平均字节大小(String.getBytes(。。length)是300-400字节每份文件。在另一次运行中,文档平均每个文档大约1MB - 3MB。
我在M4.2xlarge EC2实例上使用Elasticsearch 5.2。除了为了传递boostrap检查并绑定到非本地IP而需要做的事情之外,Elasticsearch几乎都安装了所有默认值。我已经为Elasticsearch分配了16GB(物理内存的一半)。
我以前运行Elasticsearch 2.x并且正在摄取FAR更多的字段和更大的文档,而不仅仅是这些少数字段,并且只有大约20k /文件,虽然可以管理但仍然很大。
如果有人能指出任何可以解决这个问题的事情,我将不胜感激。或者是否有我没见过的ES 5.x配置会解决这个问题?
以下是我的映射。
{
"settings": {
"index.query.default_field": "tweetText"
},
"mappings": {
"tweet": {
"_all": {
"enabled": false
},
"properties": {
"tweetDate": {
"type": "date",
"format": "EEE MMM dd HH:mm:ss Z YYYY||strict_date_optional_time||epoch_millis"
},
"userId": {
"type": "text",
"index": "not_analyzed"
},
"screenName": {
"type": "text",
"index": "not_analyzed"
},
"tweetText": {
"type": "text"
},
"cleanedText": {
"type": "text"
},
"tweetId": {
"type": "text",
"index": "not_analyzed"
},
"location": {
"type": "geo_point",
"ignore_malformed": true
},
"placeName": {
"type": "keyword",
"doc_values": true,
"eager_global_ordinals": false
},
"placeCountry": {
"type": "keyword",
"doc_values": true,
"eager_global_ordinals": true
},
"placeCountryCode": {
"type": "keyword",
"doc_values": false,
"eager_global_ordinals": false,
"index": false
},
"placeBoundingBox": {
"type": "geo_shape",
"tree": "quadtree",
"precision": "1m"
},
"resolvedUrls": {
"type": "text",
"index": "not_analyzed"
},
"hashtags": {
"type": "text"
},
"mentions": {
"type": "text"
},
"geoInferences": {
"properties": {
"matchedName": {
"type": "text"
},
"asciiName": {
"type": "keyword",
"doc_values": true,
"eager_global_ordinals": false
},
"country": {
"type": "keyword",
"doc_values": true,
"eager_global_ordinals": true
},
"county": {
"type": "text"
},
"countryCode": {
"type": "keyword",
"doc_values": false,
"eager_global_ordinals": false,
"index": false
},
"city": {
"type": "text"
},
"admin1Code": {
"type": "keyword",
"doc_values": false,
"eager_global_ordinals": false,
"index": false
},
"admin2Code": {
"type": "keyword",
"doc_values": false,
"eager_global_ordinals": false,
"index": false
},
"admin3Code": {
"type": "keyword",
"doc_values": false,
"eager_global_ordinals": false,
"index": false
},
"admin4Code": {
"type": "keyword",
"doc_values": false,
"eager_global_ordinals": false,
"index": false
},
"confidence": {
"type": "float",
"doc_values": false,
"ignore_malformed": false,
"index": false
},
"coordinates": {
"type": "geo_point",
"ignore_malformed": true
}
}
},
"temporalInferences": {
"type": "date",
"ignore_malformed": true
}
}
}
}
}
示例文档:
{
"_index": "twitter",
"_type": "tweet",
"_id": "AVoZivLca9LOhnR10_ll",
"_score": null,
"_source": {
"tweetDate": 1486487211000,
"userId": "123456789",
"screenName": "removed",
"tweetText": "RT @wef: America’s dominance is over. By 2030, we'll have a handful of global powers https://www.weforum.org/agenda/2016/11/america-s-dominance-is-over/?utm_content=buffer73cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer #wef17 https://twitter.com/wef/status/828994745200435200/photo/1",
"cleanedText": "RT @wef: America s dominance is over. By 2030, we'll have a handful of global powers https://www.weforum.org/agenda/2016/11/america-s-dominance-is-over/?utm_content=buffer73cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer #wef17 https://twitter.com/wef/status/828994745200435200/photo/1",
"tweetId": "829013568288796672",
"resolvedUrls": [
"https://www.weforum.org/agenda/2016/11/america-s-dominance-is-over/?utm_content=buffer73cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"
],
"hashtags": [
"wef17"
],
"mentions": [
"wef"
],
"geoInferences": [
{
"matchedName": "America",
"asciiName": "United States",
"country": "United States",
"countryCode": "US",
"coordinates": [
-98.5,
39.76
],
"admin1Code": "00",
"admin2Code": "",
"admin3Code": "",
"admin4Code": "",
"confidence": 1
}
],
"temporalInferences": [
1893474000000
]
},
"fields": {
"temporalInferences": [
1893474000000
],
"tweetDate": [
1486487211000
]
},
"sort": [
1486487211000
]
}
的输出
GET /_cat/indices/twitter?pri&v&h=health,index,pri,rep,docs.count,mt,pri,rep,docs.count,store.size,pri.store.size
health | index | pri | rep | docs.count | mt | pri.mt | store.size | pri.store.size | pri.store.size
yellow | twitter | 5 | 1 | 26860 | 74 | 74 | 10.1gb | 10.1gb | 10.1gb
来自:
的输出GET /twitter/_stats
{
"_shards": {
"total": 10,
"successful": 5,
"failed": 0
},
"_all": {
"primaries": {
"docs": {
"count": 26860,
"deleted": 0
},
"store": {
"size_in_bytes": 11027965678,
"throttle_time_in_millis": 0
},
"indexing": {
"index_total": 27397,
"index_time_in_millis": 3568991,
"index_current": 1,
"index_failed": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 195961
},
"get": {
"total": 0,
"time_in_millis": 0,
"exists_total": 0,
"exists_time_in_millis": 0,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 55,
"query_time_in_millis": 294,
"query_current": 0,
"fetch_total": 36,
"fetch_time_in_millis": 3209,
"fetch_current": 0,
"scroll_total": 0,
"scroll_time_in_millis": 0,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 76,
"total_time_in_millis": 350987,
"total_docs": 45409,
"total_size_in_bytes": 4027595474,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 48633,
"total_auto_throttle_in_bytes": 82233108
},
"refresh": {
"total": 857,
"total_time_in_millis": 2994887,
"listeners": 0
},
"flush": {
"total": 15,
"total_time_in_millis": 291939
},
"warmer": {
"current": 0,
"total": 876,
"total_time_in_millis": 534
},
"query_cache": {
"memory_size_in_bytes": 0,
"total_count": 0,
"hit_count": 0,
"miss_count": 0,
"cache_size": 0,
"cache_count": 0,
"evictions": 0
},
"fielddata": {
"memory_size_in_bytes": 24808,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 139,
"memory_in_bytes": 186032131,
"terms_memory_in_bytes": 185758725,
"stored_fields_memory_in_bytes": 43976,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 77888,
"points_memory_in_bytes": 714,
"doc_values_memory_in_bytes": 150828,
"index_writer_memory_in_bytes": 1316180948,
"version_map_memory_in_bytes": 42250,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"file_sizes": {
}
},
"translog": {
"operations": 11997,
"size_in_bytes": 5555179
},
"request_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 195,
"miss_count": 195
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
},
"total": {
"docs": {
"count": 26860,
"deleted": 0
},
"store": {
"size_in_bytes": 11027965678,
"throttle_time_in_millis": 0
},
"indexing": {
"index_total": 27397,
"index_time_in_millis": 3568991,
"index_current": 1,
"index_failed": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 195961
},
"get": {
"total": 0,
"time_in_millis": 0,
"exists_total": 0,
"exists_time_in_millis": 0,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 55,
"query_time_in_millis": 294,
"query_current": 0,
"fetch_total": 36,
"fetch_time_in_millis": 3209,
"fetch_current": 0,
"scroll_total": 0,
"scroll_time_in_millis": 0,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 76,
"total_time_in_millis": 350987,
"total_docs": 45409,
"total_size_in_bytes": 4027595474,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 48633,
"total_auto_throttle_in_bytes": 82233108
},
"refresh": {
"total": 857,
"total_time_in_millis": 2994887,
"listeners": 0
},
"flush": {
"total": 15,
"total_time_in_millis": 291939
},
"warmer": {
"current": 0,
"total": 876,
"total_time_in_millis": 534
},
"query_cache": {
"memory_size_in_bytes": 0,
"total_count": 0,
"hit_count": 0,
"miss_count": 0,
"cache_size": 0,
"cache_count": 0,
"evictions": 0
},
"fielddata": {
"memory_size_in_bytes": 24808,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 139,
"memory_in_bytes": 186032131,
"terms_memory_in_bytes": 185758725,
"stored_fields_memory_in_bytes": 43976,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 77888,
"points_memory_in_bytes": 714,
"doc_values_memory_in_bytes": 150828,
"index_writer_memory_in_bytes": 1316180948,
"version_map_memory_in_bytes": 42250,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"file_sizes": {
}
},
"translog": {
"operations": 11997,
"size_in_bytes": 5555179
},
"request_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 195,
"miss_count": 195
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
}
},
"indices": {
"twitter": {
"primaries": {
"docs": {
"count": 26860,
"deleted": 0
},
"store": {
"size_in_bytes": 11027965678,
"throttle_time_in_millis": 0
},
"indexing": {
"index_total": 27397,
"index_time_in_millis": 3568991,
"index_current": 1,
"index_failed": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 195961
},
"get": {
"total": 0,
"time_in_millis": 0,
"exists_total": 0,
"exists_time_in_millis": 0,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 55,
"query_time_in_millis": 294,
"query_current": 0,
"fetch_total": 36,
"fetch_time_in_millis": 3209,
"fetch_current": 0,
"scroll_total": 0,
"scroll_time_in_millis": 0,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 76,
"total_time_in_millis": 350987,
"total_docs": 45409,
"total_size_in_bytes": 4027595474,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 48633,
"total_auto_throttle_in_bytes": 82233108
},
"refresh": {
"total": 857,
"total_time_in_millis": 2994887,
"listeners": 0
},
"flush": {
"total": 15,
"total_time_in_millis": 291939
},
"warmer": {
"current": 0,
"total": 876,
"total_time_in_millis": 534
},
"query_cache": {
"memory_size_in_bytes": 0,
"total_count": 0,
"hit_count": 0,
"miss_count": 0,
"cache_size": 0,
"cache_count": 0,
"evictions": 0
},
"fielddata": {
"memory_size_in_bytes": 24808,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 139,
"memory_in_bytes": 186032131,
"terms_memory_in_bytes": 185758725,
"stored_fields_memory_in_bytes": 43976,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 77888,
"points_memory_in_bytes": 714,
"doc_values_memory_in_bytes": 150828,
"index_writer_memory_in_bytes": 1316180948,
"version_map_memory_in_bytes": 42250,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"file_sizes": {
}
},
"translog": {
"operations": 11997,
"size_in_bytes": 5555179
},
"request_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 195,
"miss_count": 195
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
},
"total": {
"docs": {
"count": 26860,
"deleted": 0
},
"store": {
"size_in_bytes": 11027965678,
"throttle_time_in_millis": 0
},
"indexing": {
"index_total": 27397,
"index_time_in_millis": 3568991,
"index_current": 1,
"index_failed": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 195961
},
"get": {
"total": 0,
"time_in_millis": 0,
"exists_total": 0,
"exists_time_in_millis": 0,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 55,
"query_time_in_millis": 294,
"query_current": 0,
"fetch_total": 36,
"fetch_time_in_millis": 3209,
"fetch_current": 0,
"scroll_total": 0,
"scroll_time_in_millis": 0,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 76,
"total_time_in_millis": 350987,
"total_docs": 45409,
"total_size_in_bytes": 4027595474,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 48633,
"total_auto_throttle_in_bytes": 82233108
},
"refresh": {
"total": 857,
"total_time_in_millis": 2994887,
"listeners": 0
},
"flush": {
"total": 15,
"total_time_in_millis": 291939
},
"warmer": {
"current": 0,
"total": 876,
"total_time_in_millis": 534
},
"query_cache": {
"memory_size_in_bytes": 0,
"total_count": 0,
"hit_count": 0,
"miss_count": 0,
"cache_size": 0,
"cache_count": 0,
"evictions": 0
},
"fielddata": {
"memory_size_in_bytes": 24808,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 139,
"memory_in_bytes": 186032131,
"terms_memory_in_bytes": 185758725,
"stored_fields_memory_in_bytes": 43976,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 77888,
"points_memory_in_bytes": 714,
"doc_values_memory_in_bytes": 150828,
"index_writer_memory_in_bytes": 1316180948,
"version_map_memory_in_bytes": 42250,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"file_sizes": {
}
},
"translog": {
"operations": 11997,
"size_in_bytes": 5555179
},
"request_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 195,
"miss_count": 195
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
}
}
}
}
编辑1 我发现了这个问题的根源。虽然我不知道为什么这似乎是错误的边界框。
一旦我从被摄取的数据中删除了边界框,索引就是正常尺寸(600个文档 - > 550kb),但只要我重新添加边界框(带有一个全新的索引),大小突飞猛进(3,593个文件 - > 1.6GB),只有84个文件包含一个边界框。
下面是边界框的JSON:
"placeBoundingBox": {
"type": "polygon",
"coordinates": [
[
[
-71.191421,
42.227797
],
[
-71.191421,
42.399542
],
[
-70.986004,
42.399542
],
[
-70.986004,
42.227797
],
[
-71.191421,
42.227797
]
]
]
}
与边界框关联的映射(来自调用GET / INDEX_NAME):
"placeBoundingBox": {
"type": "geo_shape",
"tree": "quadtree",
"precision": "1.0m"
}
为了证明映射确实有效并且正在创建一个合适的geo_shape(即使Kibana不将其识别为geo_shape),我运行了以下查询并获得了成功的命中:
GET /_search
{
"query": {
"bool": {
"must": {
"match_all": {
}
},
"filter": {
"geo_shape": {
"placeBoundingBox": {
"shape": {
"type": "polygon",
"coordinates": [
[
[
-71.191421,
42.227797
],
[
-71.191421,
42.399542
],
[
-70.986004,
42.399542
],
[
-70.986004,
42.227797
],
[
-71.191421,
42.227797
]
]
]
},
"relation": "within"
}
}
}
}
}
}
我想保留边界框,是否有映射或数据有问题? 1.0米太精细了吗?
答案 0 :(得分:0)
问题在于映射的精确度,这只是一个错字(我们的Elasticsearch 2.x索引的精度为1km)。一封小信完全不同......
1米(" 1米")精度会产生极度膨胀的指数。
删除"精度"完全映射的字段将默认为50米和一个大小合适的索引。