Question

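We have about 40 JVM processes running on one of our machines (3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 x86_64 GNU/Linux). The server has 32 GB of memory, and each process consumes between 350 and 380 MB. Each process hosts a Spring Boot application. From time to time, we see one of the JVMs crash with the following error.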

#  SIGSEGV (0xb) at pc=0x00007f151891d5d0, pid=3049, tid=0x00007f14fa784700
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)

V  [libjvm.so+0x4685d0]  ClassLoaderData::metaspace_non_null()+0xc0
V  [libjvm.so+0x8a17e0]  Metaspace::allocate(ClassLoaderData*, unsigned long, bool, MetaspaceObj::Type, Thread*)+0x170
V  [libjvm.so+0x8b4b46]  MethodCounters::allocate(ClassLoaderData*, Thread*)+0x26
V  [libjvm.so+0x8ac8b1]  Method::build_method_counters(Method*, Thread*)+0x71
V  [libjvm.so+0x681adb]  InterpreterRuntime::build_method_counters(JavaThread*, Method*)+0x2b
j  sun.reflect.GeneratedSerializationConstructorAccessor18988.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+0
J 10626 C2 org.springframework.aop.framework.CglibAopProxy.getProxy(Ljava/lang/ClassLoader;)Ljava/lang/Object; (405 bytes) @ 0x00007f1509f7cca8 [0x00007f1509f77ec0+0x4de8]
J 11391 C2 org.springframework.cloud.netflix.eureka.DataCenterAwareMarshallingStrategy$PublishingApplicationsConverter.unmarshal(Lcom/thoughtworks/xstream/io/HierarchicalStreamReader;Lcom/thoughtworks/xstream/converters/UnmarshallingContext;)Ljava/lang/Object; (39 bytes) @ 0x00007f1509a8fee4 [0x00007f1509a8f4c0+0xa24]
J 11352 C2 org.springframework.cloud.netflix.eureka.DataCenterAwareMarshallingStrategy.unmarshal(Ljava/lang/Object;Lcom/thoughtworks/xstream/io/HierarchicalStreamReader;Lcom/thoughtworks/xstream/converters/DataHolder;Lcom/thoughtworks/xstream/converters/ConverterLookup;Lcom/thoughtworks/xstream/mapper/Mapper;)Ljava/lang/Object; (30 bytes) @ 0x00007f1509a344cc [0x00007f1509a33da0+0x72c]
J 12523 C2 com.sun.jersey.api.client.ClientResponse.getEntity(Ljava/lang/Class;Ljava/lang/reflect/Type;)Ljava/lang/Object; (246 bytes) @ 0x00007f150a3cbf94 [0x00007f150a3cb0e0+0xeb4]
J 11327 C2 com.netflix.discovery.DiscoveryClient.fetchRegistry(Z)Z (409 bytes) @ 0x00007f1509b4e7f0 [0x00007f1509b4e400+0x3f0]
J 11332 C2 com.netflix.discovery.DiscoveryClient$CacheRefreshThread.run()V (353 bytes) @ 0x00007f1509b49004 [0x00007f1509b48f60+0xa4]
...

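This happens to processes of different services and on different machines, without any apparent reason. It is always the same stack trace, though, with the Eureka client trying to unmarshal a response from the server. However, we don't know whether it fails at that point because of

some strange constellation in the code (Spring, cglib, xstream, Eureka, ...), or because
this is simply the most likely place for it to happen: the discovery client polls the server every 30 seconds, which makes it a likely candidate for hitting a strange constellation in the system (memory allocation, fragmentation, ...).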

Although we are using the Oracle JDK, out of desperation I checked the OpenJDK implementation of this method, but I have no immediate idea of what could be going wrong there.

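I filed a bug report with Oracle and exchanged some emails with them, but apart from saying that they cannot do anything without a reproducer for this issue, they did not give me an answer.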

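So my question is: what are the possible causes of such an error in the JVM? When we saw this error earlier on a system with less memory and only about 2% of it free, we suspected that excessive memory fragmentation was the cause, but I find that unlikely on the new system, where memory consumption is only around 70%. Are there any explanations for this failure other than a bug in the JVM implementation? And most importantly, is there anything we could try in order to reproduce this error reliably?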

Answer 1

It would be nice if you provided the full crash report. From this small stack trace, I see one interesting part:

sun.reflect.GeneratedSerializationConstructorAccessor18988

This means that you have about 18988 generated accessors in your application (I think that is a lot). See sun.reflect.MethodAccessorGenerator. When you use serialization or reflection, new classes can be generated (sun.reflect.MethodAccessorGenerator#generate), and each class is defined in a DelegatingClassLoader, which is created anew each time (sun.reflect.ClassDefiner#defineClass).
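To see how far the accessor counter has climbed in a given crash report, the numeric suffix can be pulled out of the stack frames. The following is only a rough sketch: the class name AccessorCounter and its method are hypothetical helpers, not part of any JDK or Spring API, and it assumes that the suffix of the generated class name is a running counter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any JDK or Spring API.
class AccessorCounter {
    // Matches generated accessor class names such as
    // GeneratedSerializationConstructorAccessor18988 and captures the suffix.
    private static final Pattern ACCESSOR = Pattern.compile(
            "Generated(?:Method|Constructor|SerializationConstructor)Accessor(\\d+)");

    /** Returns the highest accessor suffix found in the given lines, if any. */
    static OptionalInt highestSuffix(List<String> lines) {
        int max = -1;
        for (String line : lines) {
            Matcher m = ACCESSOR.matcher(line);
            while (m.find()) {
                max = Math.max(max, Integer.parseInt(m.group(1)));
            }
        }
        return max < 0 ? OptionalInt.empty() : OptionalInt.of(max);
    }

    public static void main(String[] args) {
        // Two frames taken from the crash report in the question.
        List<String> frames = Arrays.asList(
                "j  sun.reflect.GeneratedSerializationConstructorAccessor18988.newInstance([Ljava/lang/Object;)Ljava/lang/Object;+0",
                "J 11327 C2 com.netflix.discovery.DiscoveryClient.fetchRegistry(Z)Z (409 bytes)");
        System.out.println(highestSuffix(frames)); // prints OptionalInt[18988]
    }
}
```

Comparing this number across several crash reports might show whether the crashes correlate with a large number of generated accessors.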

Try increasing the inflation threshold:

-Dsun.reflect.inflationThreshold=<some very large number>

This defers the generation of new accessors.

You can also check via JMX how many classes have been unloaded (ClassLoading MBean), and you can run

jcmd %pid% PerfCounter.print

and monitor the counters sun.gc.metaspace.capacity, sun.gc.metaspace.maxCapacity, and sun.gc.metaspace.used.
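The same kind of information can also be read from inside the JVM through the standard java.lang.management API. This is a minimal sketch, assuming a HotSpot JVM on Java 8 or later, where the memory pool is expected to be named "Metaspace". The class ClassLoadingStats is a hypothetical helper, and it only reports on the JVM it runs in. Inspecting another process would need a remote JMX connection.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; reports on the current JVM only.
class ClassLoadingStats {
    /** One-line summary of the class loading counters of this JVM. */
    static String classLoadingSummary() {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        return String.format("loaded=%d totalLoaded=%d unloaded=%d",
                cl.getLoadedClassCount(),
                cl.getTotalLoadedClassCount(),
                cl.getUnloadedClassCount());
    }

    /** Usage of the pool named "Metaspace" (an assumption about the JVM). */
    static String metaspaceSummary() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                if (usage == null) {
                    break; // pool is no longer valid
                }
                // max is -1 when no limit such as -XX:MaxMetaspaceSize is set
                return String.format("metaspace used=%d committed=%d max=%d",
                        usage.getUsed(), usage.getCommitted(), usage.getMax());
            }
        }
        return "metaspace pool not found";
    }

    public static void main(String[] args) {
        System.out.println(classLoadingSummary());
        System.out.println(metaspaceSummary());
    }
}
```

Logging these two lines periodically would show whether the unloaded class count and metaspace usage keep growing between Eureka registry fetches.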

What could cause a JVM to crash with a SIGSEGV in "ClassLoaderData::metaspace_non_null()"
