Question

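We run a cluster of Tomcat servers behind a very high-volume website. We have noticed that for the first 5-6 hours after an application reload, a Full GC runs every 2 minutes, pausing the application for anywhere between 5 and 20 seconds. After those 5-6 hours, Full GC does not run again until Tomcat is restarted. Traffic level is not a factor, since we get through peak hours without the problem.

The servers are all dual quad-core machines with 32GB of RAM running CentOS 5. We have been changing our Java opts daily, but the sample GC logs below correspond to the following settings: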

-server -Xmx27g -Xms27g  -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution  -Dsun.rmi.dgc.client.gcInterval=900000 -Dsun.rmi.dgc.server.gcInterval=900000 -XX:NewSize=8g -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

**Log samples shortly after application reload**

191.955: [Full GC 191.958: [CMS: 1815877K->1158107K(19922944K), 3.0376720 secs] 3118102K->1158107K(27845568K), [CMS Perm : 83787K->46767K(83960K)], 3.0415080 secs] [Times: user=2.95 sys=0.10, real=3.04 secs]

215.501: [GC 215.504: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:   50457968 bytes,   50457968 total
: 7456799K->111048K(7922624K), 0.0617110 secs] 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68 sys=0.00, real=0.07 secs]

215.577: [GC 215.579: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:      66288 bytes,      66288 total
- age   2:   50219144 bytes,   50285432 total
: 114868K->66525K(7922624K), 0.0381810 secs] 1272975K->1224632K(27845568K), 0.0413630 secs] [Times: user=0.46 sys=0.00, real=0.04 secs]

236.177: [GC 236.180: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:   45071064 bytes,   45071064 total
- age   2:      26112 bytes,   45097176 total
- age   3:   34785960 bytes,   79883136 total
: 7523165K->110355K(7922624K), 0.0921350 secs] 8681272K->1268462K(27845568K), 0.0969290 secs] [Times: user=0.95 sys=0.01, real=0.10 secs]

...

316.456: [GC 316.459: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:   41430416 bytes,   41430416 total
- age   3:   22728376 bytes,   64158792 total
- age   5:   19599960 bytes,   83758752 total
- age   6:   21847616 bytes,  105606368 total
- age   7:   27667592 bytes,  133273960 total
- age   8:      10904 bytes,  133284864 total
- age   9:   31824256 bytes,  165109120 total
: 7650333K->215213K(7922624K), 0.1332630 secs] 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45 sys=0.01, real=0.14 secs]

338.851: [GC 338.854: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:   40678840 bytes,   40678840 total
- age   2:   27075936 bytes,   67754776 total
- age   4:   20399720 bytes,   88154496 total
- age   6:   19271008 bytes,  107425504 total
- age   7:   21655032 bytes,  129080536 total
- age   8:   27118800 bytes,  156199336 total
- age   9:      10904 bytes,  156210240 total
- age  10:   31747808 bytes,  187958048 total
: 7671853K->285541K(7922624K), 0.1456470 secs] 8829960K->1443648K(27845568K), 0.1503540 secs] [Times: user=1.62 sys=0.01, real=0.15 secs]

343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K), 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm : 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87 sys=0.02, real=3.41 secs]

**Last Full GC**

20517.892: [GC 20517.898: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:   33948208 bytes,   33948208 total
- age   2:      88280 bytes,   34036488 total
- age   3:   19872472 bytes,   53908960 total
- age   4:   16072608 bytes,   69981568 total
- age   5:   15718712 bytes,   85700280 total
- age   6:   15771016 bytes,  101471296 total
- age   7:   16895976 bytes,  118367272 total
- age   8:   24233728 bytes,  142601000 total
: 7618727K->200950K(7922624K), 0.1728420 secs] 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21 sys=0.01, real=0.18 secs]

20526.469: [Full GC 20526.475: [CMS: 9175755K->9210800K(19922944K), 33.1161300 secs] 13632232K->9210800K(27845568K), [CMS Perm : 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12 sys=0.02, real=33.12 secs]


**Log samples after Full GC no longer runs**

74412.335: [GC 74412.340: [ParNew
Desired survivor size 238583808 bytes, new threshold 11 (max 15)
- age   1:   43614032 bytes,   43614032 total
- age   2:   41194144 bytes,   84808176 total
- age   3:   27392888 bytes,  112201064 total
- age   5:   22753896 bytes,  134954960 total
- age   7:   24439608 bytes,  159394568 total
- age   8:   24015704 bytes,  183410272 total
- age   9:   24080848 bytes,  207491120 total
- age  10:   24715800 bytes,  232206920 total
- age  11:   21844024 bytes,  254050944 total
: 7813778K->312911K(7922624K), 0.3329150 secs] 24426351K->16967791K(27845568K), 0.3416730 secs] [Times: user=3.69 sys=0.02, real=0.35 secs]

74445.007: [GC 74445.012: [ParNew
Desired survivor size 238583808 bytes, new threshold 11 (max 15)
- age   1:   42690688 bytes,   42690688 total
- age   2:   37055848 bytes,   79746536 total
- age   3:   37107464 bytes,  116854000 total
- age   4:   26223088 bytes,  143077088 total
- age   6:   22478672 bytes,  165555760 total
- age   8:   24259744 bytes,  189815504 total
- age   9:   23862672 bytes,  213678176 total
- age  10:   23911864 bytes,  237590040 total
- age  11:   24496888 bytes,  262086928 total
: 7769547K->344030K(7922624K), 0.3088470 secs] 24424428K->17021685K(27845568K), 0.3175830 secs] [Times: user=3.57 sys=0.01, real=0.32 secs]

74475.169: [GC 74475.175: [ParNew
Desired survivor size 238583808 bytes, new threshold 10 (max 15)
- age   1:   42011656 bytes,   42011656 total
- age   2:   33147608 bytes,   75159264 total
- age   3:   32391640 bytes,  107550904 total
- age   4:   36516584 bytes,  144067488 total
- age   5:   25940856 bytes,  170008344 total
- age   7:   22037464 bytes,  192045808 total
- age   9:   24130040 bytes,  216175848 total
- age  10:   23724672 bytes,  239900520 total
- age  11:   23329640 bytes,  263230160 total
: 7803184K->331046K(7922624K), 0.3091600 secs] 24480839K->17033619K(27845568K), 0.3179630 secs] [Times: user=3.56 sys=0.01, real=0.32 secs]

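If we do not restart this server for the next week, it will keep running fine.

Any help is greatly appreciated.

Edit: The problem only occurs when the WAR file is redeployed. Restarting Tomcat on its own does not cause it.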

Answer 1

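I would try connecting to one of the instances with jvisualvm with the VisualGC plugin installed. It will show you the initial distribution of memory across each pool in the JVM and how that changes over time. There is also a memory sampler and a profiler that can be used to determine the state of memory at a given point in time.

Also, I'm not sure how you came up with the JVM parameters you are using (27GB? Are you keeping some sort of in-memory cache in the process?), but I usually start with the bare minimum and only tune once a problem has been identified (e.g. a new pool that is too small, etc.). Try starting with something like: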
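If attaching a GUI to a production box is awkward, roughly the same per-pool view can be pulled from inside the JVM with the standard `java.lang.management` API. The following is only a minimal sketch: the class name `PoolSnapshot` is made up for illustration, and the pool names it prints depend on the JVM version and the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: prints one line per memory pool of the running JVM.
final class PoolSnapshot {

    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            long max = usage.getMax(); // -1 means the maximum is undefined
            String maxText = max < 0 ? "undefined" : (max / 1024) + " KB";
            lines.add(String.format("%-28s used=%d KB committed=%d KB max=%s",
                    pool.getName(),
                    usage.getUsed() / 1024,
                    usage.getCommitted() / 1024,
                    maxText));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describePools()) {
            System.out.println(line);
        }
    }
}
```

Calling `describePools()` periodically (from a scheduled task or a small admin endpoint, for example) and logging the result gives a crude time series of how each pool fills up after a redeploy.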

-Xmx.. -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -Xloggc:/tmp/gc.log -XX:+HeapDumpOnOutOfMemoryError

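With the amount of memory you have, Java should start in "server" mode automatically, and as long as there is enough memory it is usually quite good at sizing the memory pools dynamically.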

Answer 2

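A theory about what could be going on here might come from the statistics of your ehcache.

It is important to understand when the CMS collector kicks off a full GC. The following is from [Reference]:

"Unlike the other collectors, the CMS collector does not start a tenured generation collection when the tenured generation becomes full. Instead, it attempts to start a collection early enough so that it can complete before that happens."

Basically, the CMS collector decides when to run a GC based on previous occupancy levels and on how quickly the generation is filling up. This is done to reduce future pause times.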

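So when you see all of those full collections early after application startup, the JVM has probably determined that a lot of memory is being allocated, and it is running frequent GCs to protect itself from reaching the point where an OOM error would occur. If you look at the GC stats, the first full collection kicks off when only about 1.8GB of the tenured generation is in use (the log shows roughly 19GB of tenured capacity within the 27GB heap). The last one happens at about 9.2GB.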
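To check those numbers against the log rather than by eye, something like the following could pull the occupancy out of each `Full GC` line. It is a rough sketch that assumes the exact line shape shown in the question (CMS with `-XX:+PrintGCDetails` on that JVM); the class and field names are made up. It also reports the `CMS Perm` figures that are printed on the same lines, which may be worth looking at next to the tenured numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts pre-collection occupancy from "Full GC" log lines.
final class FullGcOccupancy {

    // Assumed shape: "<ts>: [Full GC <ts>: [CMS: a->b(c), ...] ..., [CMS Perm : d->e(f)], ..."
    private static final Pattern FULL_GC = Pattern.compile(
            "^(\\d+\\.\\d+): \\[Full GC .*?\\[CMS: (\\d+)K->(\\d+)K\\((\\d+)K\\)"
            + ".*?\\[CMS Perm : (\\d+)K->(\\d+)K\\((\\d+)K\\)\\]");

    static final class Sample {
        final double timestampSecs;
        final long oldBeforeK, oldCapacityK, permBeforeK, permCapacityK;

        Sample(double timestampSecs, long oldBeforeK, long oldCapacityK,
               long permBeforeK, long permCapacityK) {
            this.timestampSecs = timestampSecs;
            this.oldBeforeK = oldBeforeK;
            this.oldCapacityK = oldCapacityK;
            this.permBeforeK = permBeforeK;
            this.permCapacityK = permCapacityK;
        }

        double oldPercent()  { return 100.0 * oldBeforeK / oldCapacityK; }
        double permPercent() { return 100.0 * permBeforeK / permCapacityK; }
    }

    // Lines that are not Full GC events (ParNew, tenuring output) are ignored.
    static List<Sample> parse(List<String> logLines) {
        List<Sample> samples = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = FULL_GC.matcher(line);
            if (m.find()) {
                samples.add(new Sample(
                        Double.parseDouble(m.group(1)),
                        Long.parseLong(m.group(2)), Long.parseLong(m.group(4)),
                        Long.parseLong(m.group(5)), Long.parseLong(m.group(7))));
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        // The three Full GC lines quoted in the question.
        List<String> lines = Arrays.asList(
            "191.955: [Full GC 191.958: [CMS: 1815877K->1158107K(19922944K), 3.0376720 secs] 3118102K->1158107K(27845568K), [CMS Perm : 83787K->46767K(83960K)], 3.0415080 secs] [Times: user=2.95 sys=0.10, real=3.04 secs]",
            "343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K), 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm : 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87 sys=0.02, real=3.41 secs]",
            "20526.469: [Full GC 20526.475: [CMS: 9175755K->9210800K(19922944K), 33.1161300 secs] 13632232K->9210800K(27845568K), [CMS Perm : 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12 sys=0.02, real=33.12 secs]");
        for (Sample s : parse(lines)) {
            System.out.printf("t=%.0fs tenured=%.1f%% of %d MB, perm=%.1f%% of %d MB%n",
                    s.timestampSecs, s.oldPercent(), s.oldCapacityK / 1024,
                    s.permPercent(), s.permCapacityK / 1024);
        }
    }
}
```

Running it over the whole log file instead of three hand-picked lines would show how the occupancy at each Full GC evolves over the 5-6 hours.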

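By the time the full GCs stop, the collector has determined that it is not under pressure and that memory allocation has stabilized somewhat. Is it possible that at the 5-6 hour mark the application cache is fully populated and no more memory is being allocated for it? You could build a tool that tracks hits, misses, and cache size over time and monitor the cache that way. At some point the numbers stop growing, and you can see whether that coincides with the time the GCs stop. Personally I have only used home-grown tools, but you could try the EHCache Monitor tool available on its website.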
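A home-grown tracker along those lines does not need to be much. The sketch below is one possible shape, not an Ehcache API: `CacheGrowthTracker` and its methods are hypothetical, and how the size, hit, and miss counters are read from the cache is left to whatever statistics your cache version exposes. It records samples and reports when the size stopped moving, so that timestamp can be compared with the last Full GC in the log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical tracker: feed it periodic cache statistics, ask when growth levelled off.
final class CacheGrowthTracker {

    static final class Sample {
        final long timeSecs, size, hits, misses;

        Sample(long timeSecs, long size, long hits, long misses) {
            this.timeSecs = timeSecs;
            this.size = size;
            this.hits = hits;
            this.misses = misses;
        }

        double hitRatio() {
            long lookups = hits + misses;
            return lookups == 0 ? 0.0 : (double) hits / lookups;
        }
    }

    private final List<Sample> samples = new ArrayList<>();

    void record(long timeSecs, long size, long hits, long misses) {
        samples.add(new Sample(timeSecs, size, hits, misses));
    }

    /**
     * Time of the earliest sample from which the size stays within
     * tolerance (a fraction, e.g. 0.02) of the final size. Empty if the
     * cache was still growing at the last sample.
     */
    OptionalLong plateauStartSecs(double tolerance) {
        if (samples.size() < 2) {
            return OptionalLong.empty();
        }
        long finalSize = samples.get(samples.size() - 1).size;
        int start = samples.size() - 1;
        while (start > 0
                && Math.abs(samples.get(start - 1).size - finalSize) <= tolerance * finalSize) {
            start--;
        }
        return start == samples.size() - 1
                ? OptionalLong.empty()
                : OptionalLong.of(samples.get(start).timeSecs);
    }

    public static void main(String[] args) {
        // Synthetic hourly numbers, purely for illustration (not measured data).
        long[] sizes = {100_000, 300_000, 500_000, 650_000, 740_000, 790_000, 800_000, 801_000, 800_000};
        CacheGrowthTracker tracker = new CacheGrowthTracker();
        for (int hour = 0; hour < sizes.length; hour++) {
            tracker.record(hour * 3600L, sizes[hour], hour * 1_000_000L, 200_000L + sizes[hour]);
        }
        OptionalLong plateau = tracker.plateauStartSecs(0.02);
        System.out.println(plateau.isPresent()
                ? "cache size levelled off at t=" + plateau.getAsLong() + "s"
                : "cache still growing");
    }
}
```

If the reported plateau time lands close to the timestamp of the last Full GC, that would support the cache-warm-up theory; if it does not, the cause is probably elsewhere.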

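Also, have you run your GC logs through any tool (e.g. the IBM diagnostic tools or MAT) to get a breakdown of the throughput the application is achieving during this period? With the CMS collector not all phases are stop-the-world, so some of the pause times may be shorter than you think.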
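Short of a dedicated tool, a first approximation of throughput can be computed from the `real=` figures in the log. The sketch below makes a few assumptions: the log has the shape shown in the question, lines mentioning `CMS-concurrent` are concurrent phases and are skipped, and every remaining `real=` value counts as pause time. The class name `GcPauseSummary` is made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: sums logged pause times and relates them to the log window.
final class GcPauseSummary {

    private static final Pattern EVENT_START = Pattern.compile("^(\\d+\\.\\d+): \\[");
    private static final Pattern REAL_TIME = Pattern.compile("real=(\\d+\\.\\d+) secs");

    final int pauses;
    final double totalPauseSecs, longestPauseSecs, windowSecs;

    private GcPauseSummary(int pauses, double totalPauseSecs,
                           double longestPauseSecs, double windowSecs) {
        this.pauses = pauses;
        this.totalPauseSecs = totalPauseSecs;
        this.longestPauseSecs = longestPauseSecs;
        this.windowSecs = windowSecs;
    }

    static GcPauseSummary summarize(List<String> logLines) {
        int pauses = 0;
        double total = 0, longest = 0, lastPause = 0;
        double first = Double.NaN, last = Double.NaN;
        for (String line : logLines) {
            if (line.contains("CMS-concurrent")) {
                continue; // concurrent phases run alongside the application
            }
            Matcher start = EVENT_START.matcher(line);
            if (start.find()) {
                last = Double.parseDouble(start.group(1));
                if (Double.isNaN(first)) {
                    first = last;
                }
            }
            Matcher real = REAL_TIME.matcher(line);
            if (real.find()) {
                lastPause = Double.parseDouble(real.group(1));
                total += lastPause;
                longest = Math.max(longest, lastPause);
                pauses++;
            }
        }
        // Window runs from the first event to the end of the last pause.
        double window = Double.isNaN(first) ? 0 : (last + lastPause) - first;
        return new GcPauseSummary(pauses, total, longest, window);
    }

    /** Share of the window not spent in logged pauses; NaN if the window is empty. */
    double throughputPercent() {
        return windowSecs > 0 ? 100.0 * (1.0 - totalPauseSecs / windowSecs) : Double.NaN;
    }

    public static void main(String[] args) {
        // Excerpt from the question's log; the tenuring-distribution lines are left out.
        List<String> lines = Arrays.asList(
            "191.955: [Full GC 191.958: [CMS: 1815877K->1158107K(19922944K), 3.0376720 secs] 3118102K->1158107K(27845568K), [CMS Perm : 83787K->46767K(83960K)], 3.0415080 secs] [Times: user=2.95 sys=0.10, real=3.04 secs]",
            "215.501: [GC 215.504: [ParNew",
            ": 7456799K->111048K(7922624K), 0.0617110 secs] 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68 sys=0.00, real=0.07 secs]",
            "215.577: [GC 215.579: [ParNew",
            ": 114868K->66525K(7922624K), 0.0381810 secs] 1272975K->1224632K(27845568K), 0.0413630 secs] [Times: user=0.46 sys=0.00, real=0.04 secs]");
        GcPauseSummary s = summarize(lines);
        System.out.printf("pauses=%d total=%.2fs longest=%.2fs throughput=%.1f%%%n",
                s.pauses, s.totalPauseSecs, s.longestPauseSecs, s.throughputPercent());
    }
}
```

Comparing the result for the first few hours with the result for a window after the Full GCs stop would put a number on how much the application is actually losing.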

Tomcat Full GC every 2 minutes for the first 5 hours of application uptime
