Custom Accumulator will not use Kryo serialization in Spark Streaming

Date: 2017-12-13 11:54:56

Tags: java apache-spark spark-streaming kryo

For a Spark Streaming service, I created a custom accumulator as follows:

public class CustomAccumulator extends AccumulatorV2<CustomClass, Set<CustomClass>> {

    private Set<CustomClass> invalidSet = new HashSet<>();

    @Override
    public boolean isZero() {
        return invalidSet.isEmpty();
    }

    @Override
    public AccumulatorV2<CustomClass, Set<CustomClass>> copy() {
        return this;
    }

    @Override
    public void reset() {
        invalidSet.clear();
    }

    @Override
    public void add(CustomClass customClass) {
        invalidSet.add(customClass);
    }

    @Override
    public void merge(AccumulatorV2<CustomClass, Set<CustomClass>> accumulatorV2) {
        invalidSet.addAll(accumulatorV2.value());
    }

    @Override
    public Set<CustomClass> value() {
        return invalidSet;
    }
}

I have been using Kryo serialization for my Spark Streaming job, and it works fine for broadcasts of my CustomClass:

final Class[] serializableClasses = {CustomAccumulator.class, CustomClass.class};
sparkConf.registerKryoClasses(serializableClasses);

However, when a worker adds a CustomClass object to the CustomAccumulator, the driver throws the following error:

[task-result-getter-1] ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0.0 in stage 4.0 (TID 25) had a not serializable result: com.project.CustomClass
2017-12-13 11:22:34,29493 Serialization stack:
2017-12-13 11:22:34,29499   - object not serializable (class: com.project.CustomClass, value: CustomClass{id=...})
2017-12-13 11:22:34,29503   - writeObject data (class: java.util.HashSet)
2017-12-13 11:22:34,29503   - object (class java.util.HashSet, [CustomClass{id=...}])
2017-12-13 11:22:34,29504   - field (class: com.project.CustomAccumulator, name: invalidSet, type: interface java.util.Set)
2017-12-13 11:22:34,29504   - object (class com.project.CustomAccumulator, CustomAccumulator(id: 0, name: Some(Custom Accumulator), value: [CustomClass{id=...}]))
2017-12-13 11:22:34,29504   - writeExternal data
2017-12-13 11:22:34,29505   - externalizable object (class org.apache.spark.scheduler.DirectTaskResult, org.apache.spark.scheduler.DirectTaskResult@2aa7e7e1); not retrying

When I force my CustomClass to implement Serializable, the problem goes away. However, I would like to keep my serialization consistent and have the accumulator use the Kryo serialization already registered on the Spark context. Does anyone know how to make a custom Spark accumulator use Kryo?
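The stack trace hints at why the Kryo registration does not help here: the accumulator's value travels back to the driver inside `DirectTaskResult`, and that path (at least in Spark versions of this era) writes accumulator updates with plain Java serialization, regardless of `spark.serializer`. So the accumulator's value type must be Java-serializable. Below is a minimal, Spark-free sketch; the `CustomClass` fields are hypothetical stand-ins for `com.project.CustomClass`. It demonstrates that the workaround mentioned above, implementing `Serializable`, survives the same Java-serialization round trip that the task-result path performs on the `HashSet`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashSet;
import java.util.Set;

public class SerializableAccumulatorDemo {

    // Hypothetical stand-in for com.project.CustomClass; its real fields are
    // unknown. The essential part is `implements Serializable`, which the
    // Java-serialized task-result path requires.
    static class CustomClass implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;

        CustomClass(String id) { this.id = id; }

        @Override public boolean equals(Object o) {
            return o instanceof CustomClass && ((CustomClass) o).id.equals(id);
        }

        @Override public int hashCode() { return id.hashCode(); }
    }

    // Round-trip a set through plain Java serialization, mimicking what
    // happens to the accumulator's HashSet inside the task result.
    static Set<CustomClass> roundTrip(Set<CustomClass> in) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(in);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            @SuppressWarnings("unchecked")
            Set<CustomClass> out = (Set<CustomClass>) ois.readObject();
            return out;
        }
    }

    public static void main(String[] args) throws Exception {
        Set<CustomClass> invalidSet = new HashSet<>();
        invalidSet.add(new CustomClass("a"));
        // Without `implements Serializable`, writeObject would throw
        // NotSerializableException, matching the driver error above.
        System.out.println(roundTrip(invalidSet).equals(invalidSet));
    }
}
```

This only illustrates the constraint; it does not make the accumulator itself go through Kryo.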

0 Answers:

There are no answers yet.