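I am trying to compute a batch of tasks on an Ignite cluster whose nodes use the job stealing policy.
Everything works fine, except when a new node joins the cluster while a batch is already running: the new node does not seem to be able to steal any tasks from the running batch. I get the following message: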
'SEVERE: Failed to send job stealing message to node: TcpDiscoveryNode [...]'
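I think an issue already exists for this: https://issues.apache.org/jira/browse/IGNITE-1267
According to that thread the issue appears to be resolved, but the problem still occurs in Ignite 2.6.0.
Here is my compute configuration: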
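// cfg is an existing IgniteConfiguration instance (its creation is not shown here).
JobStealingCollisionSpi spi = new JobStealingCollisionSpi();
spi.setWaitJobsThreshold(1);         // start stealing when the local wait queue drops below this size
spi.setMessageExpireTime(1000);      // steal request expiration time, in milliseconds
spi.setMaximumStealingAttempts(10);  // maximum number of times a single job may be stolen
spi.setActiveJobsThreshold(1);       // number of jobs allowed to run in parallel on this node
spi.setStealingEnabled(true);        // allow this node to steal jobs from other nodes
// Job stealing requires the matching failover SPI.
JobStealingFailoverSpi failoverSpi = new JobStealingFailoverSpi();
cfg.setCollisionSpi(spi);
cfg.setFailoverSpi(failoverSpi);
Ignite ignite = Ignition.start(cfg);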
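Am I doing something wrong?
EDIT: I tried to reproduce it, but now it seems to work as expected. This is very strange behavior!
EDIT2: I managed to reproduce the problem intermittently. Here is the stack trace: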
class org.apache.ignite.spi.IgniteSpiException: Failed to send message to remote node: TcpDiscoveryNode [id=f54e6f43-620c-418d-a840-bce51ad1f5f5, addrs=[0:0:0:0:0:0:0:1%lo, 10.36.3.4, 127.0.0.1], sockAddrs=[/10.36.3.4:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1543917557221, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2718)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2651)
at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1643)
at org.apache.ignite.internal.managers.communication.GridIoManager.sendToCustomTopic(GridIoManager.java:1703)
at org.apache.ignite.internal.managers.GridManagerAdapter$1.send(GridManagerAdapter.java:422)
at org.apache.ignite.spi.collision.jobstealing.JobStealingCollisionSpi.checkIdle(JobStealingCollisionSpi.java:1074)
at org.apache.ignite.spi.collision.jobstealing.JobStealingCollisionSpi.onCollision(JobStealingCollisionSpi.java:722)
at org.apache.ignite.internal.managers.collision.GridCollisionManager.onCollision(GridCollisionManager.java:119)
at org.apache.ignite.internal.processors.job.GridJobProcessor.handleCollisions(GridJobProcessor.java:712)
at org.apache.ignite.internal.processors.job.GridJobProcessor.access$3000(GridJobProcessor.java:111)
at org.apache.ignite.internal.processors.job.GridJobProcessor$JobDiscoveryListener.onEvent(GridJobProcessor.java:2008)
at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager$LocalListenerWrapper.onEvent(GridEventStorageManager.java:1384)
at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:873)
at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:858)
at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:341)
at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:307)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:2703)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:2920)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:2732)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=f54e6f43-620c-418d-a840-bce51ad1f5f5, addrs=[/10.36.3.4:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3422)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2958)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2841)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2692)
... 20 more
Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
... 23 more
Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
... 23 more
Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
... 23 more
Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
... 23 more
Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
... 23 more
Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
... 23 more
Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
... 23 more
Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
... 23 more
Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
... 23 more
Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
... 23 more
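
Every failed attempt in the trace above targets the communication port (47100), not the discovery port (47500). To rule out a plain network or firewall problem, a small check independent of Ignite can help. The following is only a minimal sketch: PortCheck and canConnect are hypothetical names, not part of the Ignite API, and the timeout values are arbitrary. It only tests whether a TCP connection can be opened. The handshake failure in the trace ("connection closed") happens after the connection is established, so a successful result here would not rule out the problem.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not part of Ignite): checks whether a plain TCP
// connection to host:port can be opened within the given timeout.
final class PortCheck {
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused, host unreachable, or timed out.
            return false;
        }
    }

    public static void main(String[] args) {
        // The defaults are placeholders; pass the remote node's address and
        // communication port (10.36.3.4 and 47100 in the trace above).
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 47100;
        System.out.println(host + ":" + port + " reachable: "
                + canConnect(host, port, 2000));
    }
}
```

Running it from the node that logs the error, with the joining node's address and port 47100 as arguments, shows whether the port is reachable at all at the moment the error appears.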