We are using Hazelcast 3.9.2.
We run a 2-node cluster: Windows Server 2012 R2 Standard with Oracle JAVA_VERSION="1.8.0_144"
Between 2 and 20 clients run on separate VMs: 3.10.0-327.28.3.el7.x86_64 #1 SMP Fri Aug 12 13:21:05 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux with IBM JAVA_VERSION="1.7.1_64"
hazelcast.xml snippet:
<map name="lock*">
<in-memory-format>BINARY</in-memory-format>
<statistics-enabled>true</statistics-enabled>
<backup-count>1</backup-count>
<eviction-policy>NONE</eviction-policy>
</map>
This is our hazelcast-client.xml:
<hazelcast-client xmlns="http://www.hazelcast.com/schema/client-config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.hazelcast.com/schema/client-config file:///C:/caching/hazelcast-client-config-3.9.xsd">
<group>
<name>OUR_GROUP_NAME</name>
</group>
<properties>
<property name="hazelcast.client.shuffle.member.list">true</property>
<property name="hazelcast.client.heartbeat.timeout">60000</property>
<property name="hazelcast.client.heartbeat.interval">5000</property>
<property name="hazelcast.client.event.thread.count">10</property>
<property name="hazelcast.client.event.queue.capacity">1000000</property>
<property name="hazelcast.client.invocation.timeout.seconds">35</property>
<property name="hazelcast.client.statistics.enabled">true</property>
</properties>
<network>
<cluster-members>
<address>tvlcacheqa1.blqa.qa:5709</address>
<address>tvlcacheqa2.blqa.qa:5709</address>
</cluster-members>
<smart-routing>true</smart-routing>
<redo-operation>true</redo-operation>
<connection-attempt-period>15000</connection-attempt-period>
<connection-attempt-limit>1048576</connection-attempt-limit>
<socket-options>
<tcp-no-delay>false</tcp-no-delay>
<keep-alive>true</keep-alive>
<reuse-address>true</reuse-address>
<linger-seconds>5</linger-seconds>
<timeout>-1</timeout>
<buffer-size>64</buffer-size>
</socket-options>
</network>
<near-cache name="cache*">
<in-memory-format>OBJECT</in-memory-format>
<invalidate-on-change>true</invalidate-on-change>
<time-to-live-seconds>1800</time-to-live-seconds>
<max-idle-seconds>1800</max-idle-seconds>
<eviction eviction-policy="LRU" max-size-policy="ENTRY_COUNT" size="10000"/>
</near-cache>
</hazelcast-client>
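Every setting not mentioned above is left at its DEFAULT; we do not change any configuration in code.
During batch processing, we run this code concurrently on every client: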
synchronizer.startSyncSection(key, 100);
try {
doSomeCriticalStuff();
} finally {
synchronizer.endSyncSection(key);
}
This is our Synchronizer implementation, which is based on Hazelcast's IMap:
@Override
public void startSynchedSection(MultiKey<?> key, long tryLockTimeoutInMs, long releaseLockTimeoutInMs) {
keyNullCheck(key);
tryLockTimeoutInMs = Math.max(tryLockTimeoutInMs, minimumObtainLockTimeoutInMs);
if (isClusterReady()) {
boolean locked = false;
try {
locked = this.locks.tryLock(key, tryLockTimeoutInMs, TimeUnit.MILLISECONDS, releaseLockTimeoutInMs, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
throw new TechnicalException(e);
}
if (!locked) {
throw new SyncTimeoutException(FAILED_TO_OBTAIN_EXCEPTION + key);
}
int lockCounter = incrementLockCounter(key);
} else {
throw new SyncTimeoutException(CLUSTER_NOT_READY_EXCEPTION);
}
}
/** Note This should be called in a FINALLY section!!! */
@Override
public void endSynchedSection(MultiKey<?> key) {
keyNullCheck(key);
int lockCounterBefore = getThreadLocalCounter(key.toString()).get();
if (lockCounterBefore == 0) {
return;
}
try {
int lockCounterAfter = decrementLockCounter(key);
if (this.locks.isLocked(key)) {
this.locks.unlock(key);
}
} catch (OperationTimeoutException e) {
this.logger.warn("endSynchedSection - Lock-> {} was not released properly in Hazelcast because of exception:\n{}\n in Thread={}", key, e
.getMessage(), Thread.currentThread().getName());
}
}
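The counter helpers called above (incrementLockCounter, decrementLockCounter, getThreadLocalCounter) are not shown in this post. For readers following along, here is a minimal sketch of how such a per-thread, per-key reentrancy counter could look. The class name LockCounters, its method names and its internals are assumptions for illustration only; it is a simplified stand-in, not the production code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical per-thread, per-key reentrancy counter (illustration only). */
final class LockCounters {

    // Each thread sees its own map, so the map itself needs no synchronization.
    private final ThreadLocal<Map<String, AtomicInteger>> counters =
            new ThreadLocal<Map<String, AtomicInteger>>() {
                @Override
                protected Map<String, AtomicInteger> initialValue() {
                    return new HashMap<String, AtomicInteger>();
                }
            };

    /** Returns the calling thread's counter for the key, creating it at 0 if absent. */
    AtomicInteger getThreadLocalCounter(String key) {
        Map<String, AtomicInteger> map = counters.get();
        AtomicInteger counter = map.get(key);
        if (counter == null) {
            counter = new AtomicInteger(0);
            map.put(key, counter);
        }
        return counter;
    }

    /** Adds one hold for the calling thread and returns the new count. */
    int increment(String key) {
        return getThreadLocalCounter(key).incrementAndGet();
    }

    /** Removes one hold, never going below zero; drops the entry once it reaches zero. */
    int decrement(String key) {
        Map<String, AtomicInteger> map = counters.get();
        AtomicInteger counter = map.get(key);
        if (counter == null) {
            return 0;
        }
        int value = Math.max(0, counter.get() - 1);
        counter.set(value);
        if (value == 0) {
            map.remove(key);
        }
        return value;
    }

    public static void main(String[] args) throws InterruptedException {
        final LockCounters counters = new LockCounters();
        counters.increment("order-42");
        counters.increment("order-42");
        System.out.println("main thread count: "
                + counters.getThreadLocalCounter("order-42").get());

        // A different thread has its own, independent count for the same key.
        Thread other = new Thread(new Runnable() {
            public void run() {
                System.out.println("other thread count: "
                        + counters.getThreadLocalCounter("order-42").get());
            }
        });
        other.start();
        other.join();
    }
}
```

The point of the sketch is only that the early return in endSynchedSection depends on state local to the calling thread, so a counter of 0 on one thread says nothing about locks taken by other threads.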
Sometimes (usually while our batches are running) a thread gets stuck on this IMap call:
locked = this.locks.tryLock(key, tryLockTimeoutInMs, TimeUnit.MILLISECONDS, releaseLockTimeoutInMs, TimeUnit.MILLISECONDS);
where this.locks is declared as private IMap<String, String> locks; and tryLockTimeoutInMs = 100 ms.
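The thread can hang for as long as 2 minutes! Unfortunately, we cannot reproduce this in our test environment, but the Dynatrace tool shows reports like this in production: https://user-images.githubusercontent.com/12655866/39863775-74a3da5a-53fc-11e8-96b4-d55bea1f5e06.PNG
I went through the logs of every cluster member and client and found nothing unusual. There were no warnings or lost connections at that time.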
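Because the logs show nothing and the stall cannot be reproduced on demand, one option is to time the call ourselves and log whenever it overruns the requested timeout. The sketch below shows the idea with a java.util.concurrent.locks.ReentrantLock as a local stand-in for the IMap lock. The class TimedTryLock, the tolerance parameter and the printed message are assumptions for illustration, not part of Hazelcast or of our code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper that measures how long a timed tryLock really takes. */
final class TimedTryLock {

    /** Outcome of one attempt: whether the lock was obtained and the wall-clock duration. */
    static final class Result {
        final boolean locked;
        final long elapsedMs;

        Result(boolean locked, long elapsedMs) {
            this.locked = locked;
            this.elapsedMs = elapsedMs;
        }
    }

    /** True when the call took longer than the requested timeout plus a tolerance. */
    static boolean overran(long elapsedMs, long timeoutMs, long toleranceMs) {
        return elapsedMs > timeoutMs + toleranceMs;
    }

    /** Runs a timed tryLock and records its duration; an interrupt counts as "not locked". */
    static Result tryLockTimed(ReentrantLock lock, long timeoutMs) {
        long start = System.nanoTime();
        boolean locked = false;
        try {
            locked = lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new Result(locked, elapsedMs);
    }

    public static void main(String[] args) throws InterruptedException {
        final ReentrantLock lock = new ReentrantLock();
        lock.lock(); // the main thread holds the lock, so the worker's attempt must time out
        Thread worker = new Thread(new Runnable() {
            public void run() {
                Result r = tryLockTimed(lock, 100);
                System.out.println("locked=" + r.locked
                        + " elapsedMs=" + r.elapsedMs
                        + " overran=" + overran(r.elapsedMs, 100, 50));
            }
        });
        worker.start();
        worker.join();
        lock.unlock();
    }
}
```

Wrapping the real this.locks.tryLock call in the same way would at least tell us which key and which thread were involved when a 100 ms attempt takes minutes.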
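We acquire the lock with IMap.tryLock(key, tryTime, TimeUnit.MILLISECONDS, leaseTime, TimeUnit.MILLISECONDS); and release it ourselves, i.e.: if (this.locks.isLocked(key)) this.locks.unlock(key);
So could calling IMap.isLocked(key) and/or IMap.unlock(key) very frequently, concurrently with IMap.tryLock, be the cause? IMap.tryLock uses architecture-specific code (so it is 'unsafe'), and this method is where we see the hang. Any suggestions on what to look into?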
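To make the unlock pattern in question concrete: a check in the style of isLocked reports that somebody holds the lock, not that the calling thread does. The sketch below shows that distinction with java.util.concurrent.locks.ReentrantLock as a purely local analogue. It does not use Hazelcast and is not claimed to reproduce the hang; the class name OwnershipCheck and the method unlockIfOwner are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

/** Local analogue of "is it locked?" versus "do I hold it?" (illustration only). */
final class OwnershipCheck {

    /** Unlocks only when the calling thread is the owner; returns whether it unlocked. */
    static boolean unlockIfOwner(ReentrantLock lock) {
        if (lock.isHeldByCurrentThread()) {
            lock.unlock();
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        final ReentrantLock lock = new ReentrantLock();
        final CountDownLatch acquired = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        // Another thread takes the lock and keeps it until told to let go.
        Thread holder = new Thread(new Runnable() {
            public void run() {
                lock.lock();
                try {
                    acquired.countDown();
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    lock.unlock();
                }
            }
        });
        holder.start();
        acquired.await();

        // From the main thread the lock looks "locked", but main is not the owner.
        System.out.println("isLocked=" + lock.isLocked());
        System.out.println("heldByCurrentThread=" + lock.isHeldByCurrentThread());
        try {
            lock.unlock(); // the "if locked then unlock" pattern, run by a non-owner
        } catch (IllegalMonitorStateException e) {
            System.out.println("unlock by non-owner was rejected");
        }
        System.out.println("unlockIfOwner=" + unlockIfOwner(lock));

        release.countDown();
        holder.join();
    }
}
```

Whether the same distinction matters for IMap.isLocked and IMap.unlock in our setup is part of what we are unsure about.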