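I have been using GridGain successfully for more than 3 years and, apart from a few bumps, it has worked very smoothly. At least I was always able to figure out what went wrong (thanks also to the very solid documentation and examples). Well, until now...
For one of my projects I am trying to enable job stealing in a compute grid backed by GridGain 6.5.0. The configuration went smoothly, but from time to time I get a GridComputeExecutionRejectedException that bubbles all the way up to the client. The strange thing is that GridComputeExecutionRejectedException should be detected and routed by the failover policy provided in the result method of the standard GridComputeTaskAdapter (which I extend):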
public GridComputeJobResultPolicy result(GridComputeJobResult res, List<GridComputeJobResult> rcvd) throws GridException {
    GridException e = res.getException();
    // Try to failover if result is failed.
    if (e != null) {
        // Don't failover user's code errors.
        if (e instanceof GridComputeExecutionRejectedException ||
            e instanceof GridTopologyException ||
            // Failover exception is always wrapped.
            e.hasCause(GridComputeJobFailoverException.class))
            return FAILOVER;
        throw new GridException("Remote job threw user exception (override or implement GridComputeTask.result(..) " +
            "method if you would like to have automatic failover for this exception).", e);
    }
    // Wait for all job responses.
    return WAIT;
}
The exception thrown during collision handling is as follows:
014-10-26 23:57:33,190 [http-bio-8080-exec-13] ERROR errors.GrailsExceptionResolver - GridComputeExecutionRejectedException occurred when processing request: [POST] /evoRun/runEvolution
Job was cancelled before execution [jobSes=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=edu.banda.coel.server.grid.GridCollectionTask, dep=LocalDeployment [super=GridDeployment [ts=1414392425356, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, clsLdrId=4faab505941-ea582293-39ba-4648-9022-596e6626954b, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=edu.banda.coel.server.grid.GridCollectionTask, sesId=7f4e9505941-b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, startTime=1414392785621, endTime=9223372036854775807, taskNodeId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, subjId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3], jobId=55ee9505941-8522cc8b-10fb-4afd-945f-caa0e0c561f0], job=edu.banda.coel.server.grid.GridCollectionInputTask$1@380042f5]
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
. Stacktrace follows:
class org.gridgain.grid.compute.GridComputeExecutionRejectedException: Job was cancelled before execution [jobSes=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=edu.banda.coel.server.grid.GridCollectionTask, dep=LocalDeployment [super=GridDeployment [ts=1414392425356, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, clsLdrId=4faab505941-ea582293-39ba-4648-9022-596e6626954b, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=edu.banda.coel.server.grid.GridCollectionTask, sesId=7f4e9505941-b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, startTime=1414392785621, endTime=9223372036854775807, taskNodeId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, subjId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3], jobId=55ee9505941-8522cc8b-10fb-4afd-945f-caa0e0c561f0], job=edu.banda.coel.server.grid.GridCollectionInputTask$1@380042f5]
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.onBeforeActivateJob(GridJobProcessor.java:1190)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.access$1500(GridJobProcessor.java:62)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$CollisionJobContext.activate(GridJobProcessor.java:1469)
at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.checkBusy(GridJobStealingCollisionSpi.java:640)
at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.onCollision(GridJobStealingCollisionSpi.java:589)
at org.gridgain.grid.kernal.managers.collision.GridCollisionManager.onCollision(GridCollisionManager.java:124)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.handleCollisions(GridJobProcessor.java:669)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1089)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1732)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
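I also found that the code responsible for activating jobs in GridJobStealingCollisionSpi carries a comment along the lines of "we also need to make sure that the job is not rejected by another thread." Could the scenario described in that comment actually be happening? (I know there is a synchronized block in the code that should prevent it.)
In any case, I would greatly appreciate any help!
My configuration file is as follows: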
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="
        http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.1.xsd
        http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util-3.1.xsd">
    <bean id="grid.cfg" class="org.gridgain.grid.GridConfiguration">
        <property name="marshaller">
            <bean class="org.gridgain.grid.marshaller.optimized.GridOptimizedMarshaller">
                <property name="requireSerializable" value="false"/>
            </bean>
        </property>
        <property name="includeEventTypes">
            <util:constant static-field="org.gridgain.grid.events.GridEventType.EVTS_TASK_EXECUTION"/>
        </property>
        <property name="discoverySpi">
            <bean class="org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.gridgain.grid.spi.discovery.tcp.ipfinder.sharedfs.GridTcpDiscoverySharedFsIpFinder"/>
                </property>
            </bean>
        </property>
        <property name="loadBalancingSpi">
            <bean class="org.gridgain.grid.spi.loadbalancing.adaptive.GridAdaptiveLoadBalancingSpi">
                <property name="loadProbe">
                    <bean class="org.gridgain.grid.spi.loadbalancing.adaptive.GridAdaptiveProcessingTimeLoadProbe"/>
                </property>
            </bean>
        </property>
        <property name="collisionSpi">
            <bean class="org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi">
                <property name="activeJobsThreshold" value="28"/>
                <property name="waitJobsThreshold" value="0"/>
                <property name="messageExpireTime" value="3000"/>
                <property name="maximumStealingAttempts" value="5"/>
                <property name="stealingEnabled" value="true"/>
            </bean>
        </property>
        <property name="failoverSpi">
            <bean class="org.gridgain.grid.spi.failover.jobstealing.GridJobStealingFailoverSpi">
                <property name="maximumFailoverAttempts" value="5"/>
            </bean>
        </property>
        <property name="swapSpaceSpi">
            <bean class="org.gridgain.grid.spi.swapspace.noop.GridNoopSwapSpaceSpi"/>
        </property>
    </bean>
</beans>
EDIT: As requested, here is my abstract task class:
public abstract class GridCollectionInputTask<IN,OUT,JOB_OUT> extends GridComputeTaskSplitAdapter<Collection<IN>, OUT> {
    /** Auto-injected grid logger. */
    @GridLoggerResource
    private GridLogger log = null;
    private final ArgumentCallable<IN,JOB_OUT> callable;
    public GridCollectionInputTask(ArgumentCallable<IN,JOB_OUT> callable) {
        this.callable = callable;
    }
    @Override
    protected Collection<? extends GridComputeJob> split(int gridSize, Collection<IN> inputs) throws GridException {
        List<GridComputeJob> jobs = new ArrayList<GridComputeJob>(inputs.size());
        for (IN input : inputs) {
            jobs.add(new GridComputeJobAdapter(input) {
                @SuppressWarnings("unchecked")
                @Override
                public JOB_OUT execute() {
                    return callable.call((IN) argument(0));
                }
            });
        }
        return jobs;
    }
    @Override
    public OUT reduce(List<GridComputeJobResult> results) throws GridException {
        Collection<JOB_OUT> jobResults = new ArrayList<JOB_OUT>();
        for (GridComputeJobResult res : results)
            jobResults.add((JOB_OUT) res.getData());
        return createTaskOutput(jobResults);
    }
    protected abstract OUT createTaskOutput(Collection<JOB_OUT> jobResults);
}
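EDIT: After introducing a try-catch block in the service class (the one that calls the grid), I got the full stack trace, which apparently also contains a GridTopologyException: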
2014-10-29 19:43:07,896 [http-bio-8080-exec-32] ERROR impl.EvolutionServiceImpl - Evolution run failed!
edu.banda.coel.CoelRuntimeException: 'GridFitnessEvaluatorBOTaskAdapter' failed on grid.
at edu.banda.coel.server.grid.ComputationalGrid.runOnGridSync(ComputationalGrid.java:231)
...
at edu.banda.coel.server.service.impl.EvolutionServiceImpl.evolve(EvolutionServiceImpl.java:125)
at com.banda.math.domain.evo.EvoRunController.runEvolution(EvoRunController.groovy:119)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: class org.gridgain.grid.GridTopologyException: Failed to failover a job to another node (failover SPI returned null) [job=edu.banda.coel.server.grid.GridCollectionInputTask$1@47ba5075, node=GridTcpDiscoveryNode [id=368ffe13-76c7-42f6-9339-a34c772c0931, addrs=[xxx.xxx.xxx.xxx, 127.0.0.1], sockAddrs=[xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:47500, /xxx.xxx.xxx.xxx:47500, /127.0.0.1:47500], discPort=47500, order=24, loc=false, ver=6.5.0#20140925-sha1:48190079]]
at org.gridgain.grid.kernal.processors.task.GridTaskWorker.failover(GridTaskWorker.java:984)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:757)
at org.gridgain.grid.kernal.processors.task.GridTaskProcessor.processJobExecuteResponse(GridTaskProcessor.java:906)
at org.gridgain.grid.kernal.processors.task.GridTaskProcessor$JobMessageListener.onMessage(GridTaskProcessor.java:1138)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
... 3 more
Caused by: class org.gridgain.grid.compute.GridComputeExecutionRejectedException: Job was cancelled before execution [jobSes=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=edu.banda.coel.server.grid.GridCollectionTask, dep=LocalDeployment [super=GridDeployment [ts=1414636288878, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader@684be8b8, clsLdrId=3bab4ee5941-368ffe13-76c7-42f6-9339-a34c772c0931, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=edu.banda.coel.server.grid.GridCollectionTask, sesId=cc04ede5941-e05a00ce-2864-46a8-bf7c-4452f2a6d46e, startTime=1414636742023, endTime=9223372036854775807, taskNodeId=e05a00ce-2864-46a8-bf7c-4452f2a6d46e, clsLdr=sun.misc.Launcher$AppClassLoader@684be8b8, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, subjId=e05a00ce-2864-46a8-bf7c-4452f2a6d46e], jobId=21b4ede5941-368ffe13-76c7-42f6-9339-a34c772c0931], job=edu.banda.coel.server.grid.GridCollectionInputTask$1@1886b071]
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.onBeforeActivateJob(GridJobProcessor.java:1190)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.access$1500(GridJobProcessor.java:62)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$CollisionJobContext.activate(GridJobProcessor.java:1469)
at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.checkBusy(GridJobStealingCollisionSpi.java:640)
at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.onCollision(GridJobStealingCollisionSpi.java:589)
at org.gridgain.grid.kernal.managers.collision.GridCollisionManager.onCollision(GridCollisionManager.java:124)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.handleCollisions(GridJobProcessor.java:669)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.access$3000(GridJobProcessor.java:62)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobEventListener.onJobFinished(GridJobProcessor.java:1636)
at org.gridgain.grid.kernal.processors.job.GridJobWorker.finishJob(GridJobWorker.java:807)
at org.gridgain.grid.kernal.processors.job.GridJobWorker.execute0(GridJobWorker.java:533)
at org.gridgain.grid.kernal.processors.job.GridJobWorker.body(GridJobWorker.java:429)
... 4 more
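In the second trace the rejection is no longer the top-level exception: it is nested as the cause of a GridTopologyException, which is in turn wrapped by my own CoelRuntimeException. To check on the client side which exception types are present in such a chain, something like the following plain-Java sketch could be used. Note that CauseChain, hasCause and describe are hypothetical helper names of my own, and the exceptions built in main are JDK stand-ins for the GridGain types, not GridGain API.

```java
import java.util.ArrayList;
import java.util.List;

public class CauseChain {
    /** Returns true if t, or any exception in its cause chain, is an instance of type. */
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (type.isInstance(cur))
                return true;
        }
        return false;
    }

    /** Lists the simple class names along the cause chain, outermost first. */
    static List<String> describe(Throwable t) {
        List<String> names = new ArrayList<String>();
        for (Throwable cur = t; cur != null; cur = cur.getCause())
            names.add(cur.getClass().getSimpleName());
        return names;
    }

    public static void main(String[] args) {
        // Stand-ins only: IllegalStateException plays the role of the rejection,
        // the RuntimeExceptions play the topology exception and the client wrapper.
        Throwable rejected = new IllegalStateException("Job was cancelled before execution");
        Throwable topology = new RuntimeException("Failed to failover a job to another node", rejected);
        Throwable client = new RuntimeException("task failed on grid", topology);

        System.out.println(describe(client));
        System.out.println(hasCause(client, IllegalStateException.class));
    }
}
```

Calling describe(e) from the catch block in the service class would show the nesting at a glance, without having to read the full stack trace.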