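Distributed state machine's ZooKeeper ensemble fails with KeeperErrorCode = BadVersion while processing parallel regions

Date: 2018-05-14 12:06:44

Tags: apache-zookeeper spring-statemachine apache-curator

Background:

Figure: State machine UML state diagram

We have a normal state machine, shown in the figure, that monitors a Spring Batch microservice (deployed in a stream source/processor/sink design) for every batch that is started.

We receive a sequence of REST calls that internally trigger events, per batch ID, on the machine object of the corresponding batch. That is, a new state machine object is created for each batch ID.

Each machine has n parallel regions (each representing a chunk of the Spring Batch job), as shown in the figure.

The REST calls arrive in a multithreaded environment, where two simultaneous calls for the same batchId may target different region IDs of the BATCHPROCESSING state.

Until now we have run this state machine microservice on a single node (a single installation), but we now want to deploy it on multiple instances that all receive the REST calls. For this we want to introduce a distributed state machine. We use the following configuration to run it.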

@Configuration
@EnableStateMachine
public class StateMachineUMLWayConfiguration
        extends StateMachineConfigurerAdapter<String, String> {

..
..

@Override
public void configure(StateMachineModelConfigurer<String,String> model) 
throws Exception {
    model
        .withModel()
            .factory(stateMachineModelFactory());
}

@Bean
public StateMachineModelFactory<String, String> stateMachineModelFactory() {

    StorehubBatchUmlStateMachineModelFactory factory;

    try {
        factory = new StorehubBatchUmlStateMachineModelFactory(
                templateUMLInClasspath, stateMachineEnsemble());
    } catch (Exception e) {
        // Fail fast: continuing with a null factory would only defer the failure
        LOGGER.info("Config's State machine factory got exception :" + e);
        throw new IllegalStateException("Cannot create state machine model factory", e);
    }
    LOGGER.info("Config's State machine factory method Called:" + factory);

    factory.setStateMachineComponentResolver(stateMachineComponentResolver());
    return factory;
}


@Override
public void configure(StateMachineConfigurationConfigurer<String, String> config)
        throws Exception {
    config
        .withDistributed()
            .ensemble(stateMachineEnsemble());
}

@Bean
public StateMachineEnsemble<String, String> stateMachineEnsemble() throws Exception {
    return new ZookeeperStateMachineEnsemble<String, String>(curatorClient(), "/batchfoo1", true, 512);
}

@Bean
public CuratorFramework curatorClient() throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.builder()
            .defaultData(new byte[0])
            .retryPolicy(new ExponentialBackoffRetry(1000, 3))
            .connectString("localhost:2181")
            .build();
    client.start();
    return client;
}

The build method of StorehubBatchUmlStateMachineModelFactory:

    @Override
    public StateMachineModel<String, String> build(String batchChunkId) {

    Model model = null;
    try {
        model = UmlUtils.getModel(getResourceUri(resolveResource(batchChunkId)).getPath());
    } catch (IOException e) {
        throw new IllegalArgumentException("Cannot build model from resource " + resource + " or location " + location, e);
    }
    UmlModelParser parser = new UmlModelParser(model, this);
    DataHolder dataHolder = parser.parseModel();
    ConfigurationData<String, String> configurationData = new ConfigurationData<String, String>( null, new SyncTaskExecutor(),
            new ConcurrentTaskScheduler() , false, stateMachineEnsemble,
            new ArrayList<StateMachineListener<String, String>>(), false,
            null, null,
            null, null, false,
            null , batchChunkId, null,
            null ) ;
    return new DefaultStateMachineModel<String, String>(configurationData, dataHolder.getStatesData(), dataHolder.getTransitionsData());
}

We created a new custom method at the service interface level to replace DefaultStateMachineService.acquireStateMachine(machineId):

@Override
public StateMachine<String, String> acquireDistributedStateMachine(String machineId, boolean start) {

    synchronized (distributedMachines) {
        DistributedStateMachine<String, String> distributedStateMachine = distributedMachines.get(machineId);
        if (distributedStateMachine == null) {
            // First request for this machineId: build the machine and cache it
            StateMachine<String, String> machine = stateMachineFactory.getStateMachine(machineId);
            distributedStateMachine = (DistributedStateMachine<String, String>) machine;
            distributedMachines.put(machineId, distributedStateMachine);
        }

        return handleStart(distributedStateMachine, start);
    }
}
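
For comparison, the get-or-create caching in the method above could also be written with a concurrent map instead of a synchronized block. This is only a minimal sketch in plain Java: `MachineCache` and the factory function are hypothetical stand-ins for `distributedMachines` and `stateMachineFactory.getStateMachine(machineId)`, and nothing here touches Spring Statemachine itself.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class MachineCacheDemo {

    // Hypothetical helper (not part of Spring Statemachine):
    // keeps exactly one cached instance per machine id.
    static final class MachineCache<M> {
        private final ConcurrentMap<String, M> machines = new ConcurrentHashMap<>();
        private final Function<String, M> factory;

        MachineCache(Function<String, M> factory) {
            this.factory = factory;
        }

        // computeIfAbsent invokes the factory at most once per key,
        // even when several threads ask for the same id concurrently.
        M acquire(String machineId) {
            return machines.computeIfAbsent(machineId, factory);
        }

        int size() {
            return machines.size();
        }
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        MachineCache<String> cache = new MachineCache<>(id -> {
            created.incrementAndGet();
            return "machine-" + id; // stand-in for building a real machine
        });

        String first = cache.acquire("batch-1");
        String second = cache.acquire("batch-1"); // served from the cache
        cache.acquire("batch-2");

        System.out.println("same instance: " + (first == second));
        System.out.println("machines created: " + created.get());
    }
}
```

Note that a cache like this only deduplicates machines inside one JVM. Each microservice instance would still hold its own copy of the machine for a given batch ID.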

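Problem:

Deployed as a single instance, the microservice runs successfully even when the events come from a multithreaded environment, where one thread hits the event REST call for region 1 while another thread hits the one for region 2 of the same batch. The machine stays synchronized and processes the parallel regions successfully up to the final state, BATCHCOMPLETED. We also checked on the ZooKeeper side that the final BATCHCOMPLETED state is recorded in the current version of the znode.

However, when we deploy the same microservice app jar at another location as a second instance, which also accepts event REST calls (for example, by listening on a different Tomcat port, 9002), processing fails at random somewhere in the middle. The failure occurs after an event is triggered in any one of the parallel regions, when ensemble.setState() is called internally on that event's state change.

It fails with the following error:

    o.s.s.support.AbstractStateMachine : Interceptors threw exception, skipping state change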

org.springframework.statemachine.StateMachineException: Error persisting data; nested exception is org.springframework.statemachine.StateMachineException: Error persisting data; nested exception is org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
        at org.springframework.statemachine.zookeeper.ZookeeperStateMachineEnsemble.setState(ZookeeperStateMachineEnsemble.java:241) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
        at org.springframework.statemachine.ensemble.DistributedStateMachine$LocalStateMachineInterceptor.preStateChange(DistributedStateMachine.java:209) ~[spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.StateMachineInterceptorList.preStateChange(StateMachineInterceptorList.java:101) ~[spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.AbstractStateMachine.callPreStateChangeInterceptors(AbstractStateMachine.java:859) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.AbstractStateMachine.switchToState(AbstractStateMachine.java:880) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.AbstractStateMachine.access$500(AbstractStateMachine.java:81) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.AbstractStateMachine$3.transit(AbstractStateMachine.java:335) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.DefaultStateMachineExecutor.handleTriggerTrans(DefaultStateMachineExecutor.java:286) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.DefaultStateMachineExecutor.handleTriggerTrans(DefaultStateMachineExecutor.java:211) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.DefaultStateMachineExecutor.processTriggerQueue(DefaultStateMachineExecutor.java:449) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.DefaultStateMachineExecutor.access$200(DefaultStateMachineExecutor.java:65) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.DefaultStateMachineExecutor$1.run(DefaultStateMachineExecutor.java:323) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50) [spring-core-4.3.13.RELEASE.jar!/:4.3.13.RELEASE]
        at org.springframework.statemachine.support.DefaultStateMachineExecutor.scheduleEventQueueProcessing(DefaultStateMachineExecutor.java:352) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.DefaultStateMachineExecutor.execute(DefaultStateMachineExecutor.java:163) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.AbstractStateMachine.sendEventInternal(AbstractStateMachine.java:603) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.support.AbstractStateMachine.sendEvent(AbstractStateMachine.java:218) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
        at org.springframework.statemachine.ensemble.DistributedStateMachine.sendEvent(DistributedStateMachine.java:108) 
..skipping Lines....
Caused by: org.springframework.statemachine.StateMachineException: Error persisting data; nested exception is org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
    at org.springframework.statemachine.zookeeper.ZookeeperStateMachinePersist.write(ZookeeperStateMachinePersist.java:113) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
    at org.springframework.statemachine.zookeeper.ZookeeperStateMachinePersist.write(ZookeeperStateMachinePersist.java:50) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
    at org.springframework.statemachine.zookeeper.ZookeeperStateMachineEnsemble.setState(ZookeeperStateMachineEnsemble.java:235) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
    ... 73 common frames omitted
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) ~[zookeeper-3.4.8.jar!/:3.4.8--1]
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1006) ~[zookeeper-3.4.8.jar!/:3.4.8--1]
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910) ~[zookeeper-3.4.8.jar!/:3.4.8--1]
at org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
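
For context, ZooKeeper reports BadVersion when a write carries an expected znode version and the znode is no longer at that version. The sketch below imitates that conditional write in plain Java, to show how two writers that both read the same version can collide. `VersionedNode` and its methods are hypothetical; this is not the ZooKeeper or Curator API, and it only assumes (without confirming) that the two microservice instances race in a similar way.

```java
class VersionedWriteDemo {

    // Hypothetical in-memory stand-in for a znode that carries a data version.
    static final class VersionedNode {
        private String data = "";
        private int version = 0;

        synchronized int version() {
            return version;
        }

        synchronized String data() {
            return data;
        }

        // The write succeeds only if the caller's expected version matches
        // the current one; a successful write increments the version.
        synchronized void setData(String newData, int expectedVersion) {
            if (expectedVersion != version) {
                throw new IllegalStateException(
                        "BadVersion: expected " + expectedVersion + " but node is at " + version);
            }
            data = newData;
            version++;
        }
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode();

        // Two writers (think: two service instances) read the same version...
        int seenByA = node.version();
        int seenByB = node.version();

        // ...A writes first, which moves the node to the next version.
        node.setData("region1 state change", seenByA);

        // B still holds the old version, so its write is rejected.
        try {
            node.setData("region2 state change", seenByB);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the second writer would have to re-read the node and retry with the new version before its write could succeed.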

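Questions:
1. Does the configuration above need additional settings to avoid this exception? Both state machine microservice instances were tested in two setups: with both connected to the same ZooKeeper instance (that is, the same .connectString("localhost:2181").build()), and with each connected to a different ZooKeeper instance (that is, 'localhost:2181' and 'localhost:2182').

In both cases the same BadVersion exception occurs while the state machine ensemble is processing.

2. Also, if batches run in parallel, their respective machines must be created to run in parallel on the state machine microservice side. So a new batchId technically means a new state machine, and these run at the same time. But looking at ZookeeperStateMachineEnsemble, one znode path appears to be associated with one ensemble, since the ensemble object is instantiated only once in the main configuration class ("StateMachineUMLWayConfiguration").

So can only that singleton ensemble instance be used? Or can multiple ensembles be created at runtime, each referencing a different znode path for batches running in parallel, so that each distributed state machine's state is recorded under its own znode path?

a. Batches running in parallel need separate znode paths. Because we tried to keep a separate znode path for each batch, we had to instantiate a separate ensemble for each batch's machine. But this seems to end up stuck on a lock while connecting to the znode through the Curator client.

b. The REST call fired to trigger the event does not complete, because the machine it acquires is stuck connecting to the ensemble.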
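
To make the "separate znode path per batch" idea in question 2 concrete, here is a minimal sketch of how such paths might be derived from the base path used in the configuration. `BatchZnodePaths.pathFor` is a hypothetical helper, not part of Spring Statemachine or Curator. It only builds and validates the path string, and it does not answer whether creating one ensemble per path at runtime is supported.

```java
class BatchZnodePaths {

    // Hypothetical helper: builds "<basePath>/<batchId>" for a per-batch znode.
    // ZooKeeper paths are absolute and must not end with a slash (except the root).
    static String pathFor(String basePath, String batchId) {
        if (basePath == null || !basePath.startsWith("/")
                || (basePath.length() > 1 && basePath.endsWith("/"))) {
            throw new IllegalArgumentException(
                    "base path must be absolute without a trailing slash: " + basePath);
        }
        // A slash inside the batch id would silently create nested znodes.
        if (batchId == null || batchId.isEmpty() || batchId.contains("/")) {
            throw new IllegalArgumentException(
                    "batch id must be a single non-empty path segment: " + batchId);
        }
        return basePath.equals("/") ? "/" + batchId : basePath + "/" + batchId;
    }

    public static void main(String[] args) {
        // "/batchfoo1" is the base path from the configuration; "batch-42" is made up.
        System.out.println(pathFor("/batchfoo1", "batch-42"));
    }
}
```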

Thanks in advance.

0 answers:

No answers