GridComputeExecutionRejectedException not handled when using GridJobStealingCollisionSpi

Time: 2014-10-27 08:17:40

Tags: gridgain

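I have been using GridGain successfully for more than 3 years, and apart from a few bumps it has worked very smoothly. At least I could always figure out what had gone wrong (thanks in part to the very solid documentation and examples). Well, until now...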

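For one of my projects I am trying to enable job stealing in a compute grid backed by GridGain 6.5.0. The configuration went smoothly, but from time to time I get a GridComputeExecutionRejectedException that bubbles all the way up to the client. The strange thing is that GridComputeExecutionRejectedException is supposed to be detected and routed by the failover policy in the result method of the standard GridComputeTaskAdapter (which I extend):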

public GridComputeJobResultPolicy result(GridComputeJobResult res, List<GridComputeJobResult> rcvd) throws GridException {

    GridException e = res.getException();

    // Try to failover if result is failed.
    if (e != null) {
        // Don't failover user's code errors.
        if (e instanceof GridComputeExecutionRejectedException ||
            e instanceof GridTopologyException ||
            // Failover exception is always wrapped.
            e.hasCause(GridComputeJobFailoverException.class))
            return FAILOVER;

        throw new GridException("Remote job threw user exception (override or implement GridComputeTask.result(..) " +
        "method if you would like to have automatic failover for this exception).", e);
    }

    // Wait for all job responses.
    return WAIT;
}
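
The `e.hasCause(...)` test above exists because, as the comment says, the failover exception always arrives wrapped, so checking only the top-level type would miss it. The following is a minimal stand-alone sketch of that kind of cause-chain check using only `java.lang.Throwable`. It is not GridGain's implementation; `CauseChainCheck` and `WrappedFailoverException` are hypothetical names introduced for illustration.

```java
import java.util.Objects;

public class CauseChainCheck {
    /** Hypothetical stand-in for a wrapped failover exception; not a GridGain class. */
    static class WrappedFailoverException extends RuntimeException {
        WrappedFailoverException(String msg) {
            super(msg);
        }
    }

    /** Returns true if t, or any exception in its cause chain, is an instance of type. */
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        Objects.requireNonNull(type, "type");

        // Walk from the outermost exception down to the root cause.
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (type.isInstance(cur))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // The failover marker is nested one level deep, as it would be when wrapped.
        Throwable wrapped = new IllegalStateException("job failed",
            new WrappedFailoverException("failover requested"));

        System.out.println(hasCause(wrapped, WrappedFailoverException.class));
        System.out.println(hasCause(new IllegalStateException("user error"),
            WrappedFailoverException.class));
    }
}
```

Running `main` prints `true` for the wrapped exception and `false` for the plain one.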

The exception thrown during collision handling is as follows:

2014-10-26 23:57:33,190 [http-bio-8080-exec-13] ERROR errors.GrailsExceptionResolver  - GridComputeExecutionRejectedException occurred when processing request: [POST] /evoRun/runEvolution
Job was cancelled before execution [jobSes=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=edu.banda.coel.server.grid.GridCollectionTask, dep=LocalDeployment [super=GridDeployment [ts=1414392425356, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, clsLdrId=4faab505941-ea582293-39ba-4648-9022-596e6626954b, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=edu.banda.coel.server.grid.GridCollectionTask, sesId=7f4e9505941-b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, startTime=1414392785621, endTime=9223372036854775807, taskNodeId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, subjId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3], jobId=55ee9505941-8522cc8b-10fb-4afd-945f-caa0e0c561f0], job=edu.banda.coel.server.grid.GridCollectionInputTask$1@380042f5]
For more information see:
    Troubleshooting:      http://bit.ly/GridGain-Troubleshooting
    Documentation Center: http://bit.ly/GridGain-Documentation
. Stacktrace follows:
class org.gridgain.grid.compute.GridComputeExecutionRejectedException: Job was cancelled before execution [jobSes=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=edu.banda.coel.server.grid.GridCollectionTask, dep=LocalDeployment [super=GridDeployment [ts=1414392425356, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, clsLdrId=4faab505941-ea582293-39ba-4648-9022-596e6626954b, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=edu.banda.coel.server.grid.GridCollectionTask, sesId=7f4e9505941-b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, startTime=1414392785621, endTime=9223372036854775807, taskNodeId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, subjId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3], jobId=55ee9505941-8522cc8b-10fb-4afd-945f-caa0e0c561f0], job=edu.banda.coel.server.grid.GridCollectionInputTask$1@380042f5]
For more information see:
    Troubleshooting:      http://bit.ly/GridGain-Troubleshooting
    Documentation Center: http://bit.ly/GridGain-Documentation

    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.onBeforeActivateJob(GridJobProcessor.java:1190)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.access$1500(GridJobProcessor.java:62)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor$CollisionJobContext.activate(GridJobProcessor.java:1469)
    at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.checkBusy(GridJobStealingCollisionSpi.java:640)
    at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.onCollision(GridJobStealingCollisionSpi.java:589)
    at org.gridgain.grid.kernal.managers.collision.GridCollisionManager.onCollision(GridCollisionManager.java:124)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.handleCollisions(GridJobProcessor.java:669)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1089)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1732)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
    at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

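I also found that the code responsible for activating a job in GridJobStealingCollisionSpi carries a comment along the lines of "We also need to make sure that the job is not rejected by another thread." Could the scenario described in that comment actually be happening? (I know there is a synchronized block in the code that should prevent it.)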

In any case, I would greatly appreciate any help!

My configuration file is as follows:

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="
        http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.1.xsd
        http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util-3.1.xsd">

    <bean id="grid.cfg" class="org.gridgain.grid.GridConfiguration">

        <property name="marshaller">
            <bean class="org.gridgain.grid.marshaller.optimized.GridOptimizedMarshaller">
                <property name="requireSerializable" value="false"/>
            </bean>
        </property>

        <property name="includeEventTypes">
            <util:constant static-field="org.gridgain.grid.events.GridEventType.EVTS_TASK_EXECUTION"/>
        </property>

        <property name="discoverySpi">
            <bean class="org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.gridgain.grid.spi.discovery.tcp.ipfinder.sharedfs.GridTcpDiscoverySharedFsIpFinder"/>
                </property>
            </bean>
        </property>

        <property name="loadBalancingSpi">
            <bean class="org.gridgain.grid.spi.loadbalancing.adaptive.GridAdaptiveLoadBalancingSpi">
                <property name="loadProbe">
                    <bean class="org.gridgain.grid.spi.loadbalancing.adaptive.GridAdaptiveProcessingTimeLoadProbe"/>
                </property>
            </bean>
        </property>

        <property name="collisionSpi">
            <bean class="org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi">
                <property name="activeJobsThreshold" value="28"/>
                <property name="waitJobsThreshold" value="0"/>
                <property name="messageExpireTime" value="3000"/>
                <property name="maximumStealingAttempts" value="5"/>
                <property name="stealingEnabled" value="true"/>
            </bean>
        </property>

        <property name="failoverSpi">
            <bean class="org.gridgain.grid.spi.failover.jobstealing.GridJobStealingFailoverSpi">
                <property name="maximumFailoverAttempts" value="5"/>
            </bean>
        </property>

        <property name="swapSpaceSpi">
            <bean class="org.gridgain.grid.spi.swapspace.noop.GridNoopSwapSpaceSpi"/>
        </property>
    </bean>
</beans>

Edit: as requested, here is my abstract task class:

public abstract class GridCollectionInputTask<IN,OUT,JOB_OUT> extends GridComputeTaskSplitAdapter<Collection<IN>, OUT> {

    /** Auto-injected grid logger. */
    @GridLoggerResource
    private GridLogger log = null;

    private final ArgumentCallable<IN,JOB_OUT> callable;

    public GridCollectionInputTask(ArgumentCallable<IN,JOB_OUT> callable) {
        this.callable = callable;
    }

    @Override
    protected Collection<? extends GridComputeJob> split(int gridSize, Collection<IN> inputs) throws GridException {
        List<GridComputeJob> jobs = new ArrayList<GridComputeJob>(inputs.size());

        // Create one job per input element; each job applies the callable to its own argument.
        for (IN input : inputs) {
            jobs.add(new GridComputeJobAdapter(input) {

                @SuppressWarnings("unchecked")
                @Override
                public JOB_OUT execute() {
                    return callable.call((IN) argument(0));
                }
            });
        }
        return jobs;
    }

    @Override
    public OUT reduce(List<GridComputeJobResult> results) throws GridException {
        Collection<JOB_OUT> jobResults = new ArrayList<JOB_OUT>();
        for (GridComputeJobResult res : results)
            jobResults.add((JOB_OUT) res.getData());
        return createTaskOutput(jobResults);
    }

    protected abstract OUT createTaskOutput(Collection<JOB_OUT> jobResults);
}

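Edit: after introducing a try-catch block in the service class (the one that calls the grid), I got a complete stack trace, and apparently a GridTopologyException occurs as well: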

2014-10-29 19:43:07,896 [http-bio-8080-exec-32] ERROR impl.EvolutionServiceImpl  - Evolution run failed!
edu.banda.coel.CoelRuntimeException: 'GridFitnessEvaluatorBOTaskAdapter' failed on grid.
    at edu.banda.coel.server.grid.ComputationalGrid.runOnGridSync(ComputationalGrid.java:231)
        ...
    at edu.banda.coel.server.service.impl.EvolutionServiceImpl.evolve(EvolutionServiceImpl.java:125)
    at com.banda.math.domain.evo.EvoRunController.runEvolution(EvoRunController.groovy:119)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: class org.gridgain.grid.GridTopologyException: Failed to failover a job to another node (failover SPI returned null) [job=edu.banda.coel.server.grid.GridCollectionInputTask$1@47ba5075, node=GridTcpDiscoveryNode [id=368ffe13-76c7-42f6-9339-a34c772c0931, addrs=[xxx.xxx.xxx.xxx, 127.0.0.1], sockAddrs=[xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:47500, /xxx.xxx.xxx.xxx:47500, /127.0.0.1:47500], discPort=47500, order=24, loc=false, ver=6.5.0#20140925-sha1:48190079]]
    at org.gridgain.grid.kernal.processors.task.GridTaskWorker.failover(GridTaskWorker.java:984)
    at org.gridgain.grid.kernal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:757)
    at org.gridgain.grid.kernal.processors.task.GridTaskProcessor.processJobExecuteResponse(GridTaskProcessor.java:906)
    at org.gridgain.grid.kernal.processors.task.GridTaskProcessor$JobMessageListener.onMessage(GridTaskProcessor.java:1138)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
    at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
    ... 3 more
Caused by: class org.gridgain.grid.compute.GridComputeExecutionRejectedException: Job was cancelled before execution [jobSes=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=edu.banda.coel.server.grid.GridCollectionTask, dep=LocalDeployment [super=GridDeployment [ts=1414636288878, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader@684be8b8, clsLdrId=3bab4ee5941-368ffe13-76c7-42f6-9339-a34c772c0931, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=edu.banda.coel.server.grid.GridCollectionTask, sesId=cc04ede5941-e05a00ce-2864-46a8-bf7c-4452f2a6d46e, startTime=1414636742023, endTime=9223372036854775807, taskNodeId=e05a00ce-2864-46a8-bf7c-4452f2a6d46e, clsLdr=sun.misc.Launcher$AppClassLoader@684be8b8, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, subjId=e05a00ce-2864-46a8-bf7c-4452f2a6d46e], jobId=21b4ede5941-368ffe13-76c7-42f6-9339-a34c772c0931], job=edu.banda.coel.server.grid.GridCollectionInputTask$1@1886b071]
For more information see:
    Troubleshooting:      http://bit.ly/GridGain-Troubleshooting
    Documentation Center: http://bit.ly/GridGain-Documentation

    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.onBeforeActivateJob(GridJobProcessor.java:1190)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.access$1500(GridJobProcessor.java:62)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor$CollisionJobContext.activate(GridJobProcessor.java:1469)
    at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.checkBusy(GridJobStealingCollisionSpi.java:640)
    at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.onCollision(GridJobStealingCollisionSpi.java:589)
    at org.gridgain.grid.kernal.managers.collision.GridCollisionManager.onCollision(GridCollisionManager.java:124)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.handleCollisions(GridJobProcessor.java:669)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.access$3000(GridJobProcessor.java:62)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobEventListener.onJobFinished(GridJobProcessor.java:1636)
    at org.gridgain.grid.kernal.processors.job.GridJobWorker.finishJob(GridJobWorker.java:807)
    at org.gridgain.grid.kernal.processors.job.GridJobWorker.execute0(GridJobWorker.java:533)
    at org.gridgain.grid.kernal.processors.job.GridJobWorker.body(GridJobWorker.java:429)
    ... 4 more
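
In the trace above the GridComputeExecutionRejectedException is no longer the top-level error: it sits at the bottom of the chain, under a GridTopologyException ("failover SPI returned null") and the application's own CoelRuntimeException. When logging from a catch block like the one in the service class, it can help to pull out the innermost cause explicitly. The following is a small sketch using only the JDK; `RootCause` and `rootCause` are hypothetical names, and the exception types in `main` are stand-ins rather than GridGain classes.

```java
public class RootCause {
    /** Returns the innermost exception in the cause chain of t (t itself if it has no cause). */
    static Throwable rootCause(Throwable t) {
        if (t == null)
            return null;

        Throwable cur = t;
        // Descend until an exception with no further cause is reached.
        while (cur.getCause() != null)
            cur = cur.getCause();
        return cur;
    }

    public static void main(String[] args) {
        // Three-level chain mirroring the shape of the trace: app error -> failover error -> rejection.
        Throwable rejection = new UnsupportedOperationException("job was cancelled before execution");
        Throwable failover = new IllegalStateException("failed to fail over job", rejection);
        Throwable appError = new RuntimeException("task failed on grid", failover);

        Throwable root = rootCause(appError);
        System.out.println(root.getClass().getSimpleName() + ": " + root.getMessage());
    }
}
```

Logging the root cause next to the top-level message makes it easier to see that the rejection came first and the failed failover followed from it.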

0 answers:

No answers