We're having a problem with our production Elasticsearch cluster: Elasticsearch seems to consume all of the RAM on every server over time. Each box has 128GB of RAM, so we run two instances per machine with 30GB allocated to the JVM heap for each, leaving the remaining 68GB for the OS and Lucene. We restarted every server last week, and each Elasticsearch process started out at exactly 24% of RAM. Now, roughly a week later, memory consumption has climbed to about 40% per Elasticsearch instance. I've attached our config files in the hope that someone can help figure out why Elasticsearch is exceeding the limits we set for memory utilization.
We're currently running ES 1.3.2, but will be upgrading to 1.4.2 with our next release next week.
Here is the top output right after the restart (extra fields removed for clarity):
  PID USER      %MEM    TIME+
 2178 elastics  24.1    1:03.49
 2197 elastics  24.3    1:07.32
And here is one from today:
  PID USER      %MEM    TIME+
 2178 elastics  40.5    2927:50
 2197 elastics  40.1    3000:44
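(In case it helps, since top's %MEM reflects each process's total resident memory, here is how I can separate the JVM heap from everything else by pulling heap numbers straight from the node stats API. A rough sketch only, assuming the first instance answers HTTP on localhost:9200 and the PIDs are the ones above:)

# Resident set size of the two ES processes (what top's %MEM is based on)
ps -o pid,rss,vsz,args -p 2178,2197

# Heap actually in use versus the 30g ceiling, per node
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' \
  | grep -E 'heap_used_percent|heap_used_in_bytes|heap_max_in_bytes'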
elasticsearch-0.yml:
cluster.name: PROD
node.name: "PROD6-0"
node.master: true
node.data: true
node.rack: PROD6
cluster.routing.allocation.awareness.force.rack.values: PROD4,PROD5,PROD6,PROD7,PROD8,PROD9,PROD10,PROD11,PROD12
cluster.routing.allocation.awareness.attributes: rack
node.max_local_storage_nodes: 2
path.data: /es_data1
path.logs: /var/log/elasticsearch
bootstrap.mlockall: true
transport.tcp.port: 9300
http.port: 9200
http.max_content_length: 400mb
gateway.recover_after_nodes: 17
gateway.recover_after_time: 1m
gateway.expected_nodes: 18
cluster.routing.allocation.node_concurrent_recoveries: 20
indices.recovery.max_bytes_per_sec: 200mb
discovery.zen.minimum_master_nodes: 10
discovery.zen.ping.timeout: 3s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: XXX
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s
action.auto_create_index: .marvel-*
action.disable_delete_all_indices: true
indices.cache.filter.size: 10%
index.refresh_interval: -1
threadpool.search.type: fixed
threadpool.search.size: 48
threadpool.search.queue_size: 10000000
cluster.routing.allocation.cluster_concurrent_rebalance: 6
indices.store.throttle.type: none
index.reclaim_deletes_weight: 4.0
index.merge.policy.max_merge_at_once: 5
index.merge.policy.segments_per_tier: 5
marvel.agent.exporter.es.hosts: ["1.1.1.1:9200","1.1.1.1:9200"]
marvel.agent.enabled: true
marvel.agent.interval: 30s
script.disable_dynamic: false
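(One thing I can double-check from this config is whether bootstrap.mlockall actually took effect at startup; if the memory lock silently fails, the heap can get paged out and the numbers in top are harder to interpret. A quick check, assuming the instance answers on localhost:9200:)

# Each node should report "mlockall" : true; false means the lock failed at startup
curl -s 'http://localhost:9200/_nodes/process?pretty' | grep mlockall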
Here is /etc/sysconfig/elasticsearch-0:
# Directory where the Elasticsearch binary distribution resides
ES_HOME=/usr/share/elasticsearch

# Heap Size (defaults to 256m min, 1g max)
ES_HEAP_SIZE=30g

# Heap new generation
#ES_HEAP_NEWSIZE=

# max direct memory
#ES_DIRECT_SIZE=

# Additional Java OPTS
#ES_JAVA_OPTS=

# Maximum number of open files
MAX_OPEN_FILES=65535

# Maximum amount of locked memory
MAX_LOCKED_MEMORY=unlimited

# Maximum number of VMA (Virtual Memory Areas) a process can own
MAX_MAP_COUNT=262144

# Elasticsearch log directory
LOG_DIR=/var/log/elasticsearch

# Elasticsearch data directory
DATA_DIR=/es_data1

# Elasticsearch work directory
WORK_DIR=/tmp/elasticsearch

# Elasticsearch conf directory
CONF_DIR=/etc/elasticsearch

# Elasticsearch configuration file (elasticsearch.yml)
CONF_FILE=/etc/elasticsearch/elasticsearch-0.yml

# User to run as, change this to a specific elasticsearch user if possible
# Also make sure, this user can write into the log directories in case you change them
# This setting only works for the init script, but has to be configured separately for systemd startup
ES_USER=elasticsearch

# Configure restart on package upgrade (true, every other setting will lead to not restarting)
#RESTART_ON_UPGRADE=true
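(And a quick way I can confirm the running processes actually picked these values up, using the PIDs from the top output above:)

# Heap flags the JVMs were really started with (-Xms30g -Xmx30g expected)
ps -o args -p 2178,2197 | tr ' ' '\n' | grep -E '^-Xm[sx]'

# Locked-memory limit as seen by the process ("unlimited" expected)
grep 'Max locked memory' /proc/2178/limits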
If there is any other data I can provide, please let me know. Thanks in advance for your help.
             total       used       free     shared    buffers     cached
Mem:        129022     119372       9650          0        219      46819
-/+ buffers/cache:      72333      56689
Swap:        28603          0      28603
Answer 0 (score 0):
What you are seeing isn't the heap blowing out; the heap will always be constrained by what you set in the configuration. free -m and top report OS-level usage, so the memory consumed there is most likely the OS caching filesystem calls (the page cache).
This will not cause a Java OOM.
If you are hitting Java OOMs that are directly related to running out of Java heap space, then something else is at play. Your logs may provide some information.
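You can already see this in the free output posted above: the -/+ buffers/cache line is the one that reflects memory actually held by processes. A rough breakdown of that output (values in MB, my interpretation):

used (Mem line)                   119372
- buffers                            219
- cached (filesystem page cache)   46819
= used by processes               ~72333   (the -/+ buffers/cache "used" column, after rounding)

That ~72GB is roughly what two 30GB heaps plus normal JVM and off-heap overhead should occupy, while the ~46GB of "cached" is the filesystem cache doing its job. Also, if the indices use an mmap-based store (which I believe is the default on 64-bit Linux in 1.x, though that's an assumption), pages of mapped index files sitting in the page cache get counted in each process's resident size, which would explain top's %MEM climbing past the heap size even though the heap itself stays capped at 30GB.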
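For example (a sketch, using the log directory from your sysconfig; I'm assuming the default naming where the main log file is called after the cluster, here presumably PROD.log):

# A genuine heap exhaustion would show up as OutOfMemoryError in the ES logs
grep -ri 'OutOfMemoryError' /var/log/elasticsearch/

# Heavy GC pressure would show up as [gc] warnings from the monitor.jvm thresholds in your config
grep -i '\[gc\]' /var/log/elasticsearch/PROD.log | tail -20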