For my Spark Streaming service, I have created a custom accumulator as follows:
public class CustomAccumulator extends AccumulatorV2<CustomClass, Set<CustomClass>> {

    private final Set<CustomClass> invalidSet = new HashSet<>();

    @Override
    public boolean isZero() {
        return invalidSet.isEmpty();
    }

    @Override
    public AccumulatorV2<CustomClass, Set<CustomClass>> copy() {
        // Return a fresh instance rather than this, per the AccumulatorV2 contract
        CustomAccumulator copy = new CustomAccumulator();
        copy.invalidSet.addAll(invalidSet);
        return copy;
    }

    @Override
    public void reset() {
        invalidSet.clear();
    }

    @Override
    public void add(CustomClass customClass) {
        invalidSet.add(customClass);
    }

    @Override
    public void merge(AccumulatorV2<CustomClass, Set<CustomClass>> accumulatorV2) {
        invalidSet.addAll(accumulatorV2.value());
    }

    @Override
    public Set<CustomClass> value() {
        return invalidSet;
    }
}
I have been using Kryo serialization for my Spark Streaming job, and it works fine for broadcasts of my CustomClass:
final Class<?>[] serializableClasses = {CustomAccumulator.class, CustomClass.class};
sparkConf.registerKryoClasses(serializableClasses);
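For context, this registration usually sits in a SparkConf setup along these lines — a minimal sketch of the configuration the question implies, where the app name is hypothetical and the rest is standard SparkConf API. Note that Kryo registration governs data serialization (broadcasts, shuffles), while task results sent back to the driver take a separate serialization path:

```java
import org.apache.spark.SparkConf;

// Hypothetical SparkConf setup; "streaming-job" is an assumed app name.
SparkConf sparkConf = new SparkConf()
        .setAppName("streaming-job")
        // Use Kryo for data serialization
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

// Register the classes from the question so Kryo does not store full class names
sparkConf.registerKryoClasses(new Class<?>[]{CustomAccumulator.class, CustomClass.class});
```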
However, when a worker adds a CustomClass object to the CustomAccumulator, the driver throws the following error:
[task-result-getter-1] ERROR org.apache.spark.scheduler.TaskSetManager - Task 0.0 in stage 4.0 (TID 25) had a not serializable result: com.project.CustomClass
2017-12-13 11:22:34,29493 Serialization stack:
2017-12-13 11:22:34,29499 - object not serializable (class: com.project.CustomClass, value: CustomClass{id=...})
2017-12-13 11:22:34,29503 - writeObject data (class: java.util.HashSet)
2017-12-13 11:22:34,29503 - object (class java.util.HashSet, [CustomClass{id=...}])
2017-12-13 11:22:34,29504 - field (class: com.project.CustomAccumulator, name: invalidSet, type: interface java.util.Set)
2017-12-13 11:22:34,29504 - object (class com.project.CustomAccumulator, CustomAccumulator(id: 0, name: Some(Custom Accumulator), value: [CustomClass{id=...}]))
2017-12-13 11:22:34,29504 - writeExternal data
2017-12-13 11:22:34,29505 - externalizable object (class org.apache.spark.scheduler.DirectTaskResult, org.apache.spark.scheduler.DirectTaskResult@2aa7e7e1); not retrying
When I force my CustomClass to implement Serializable, the problem goes away. However, I would like to keep my serialization consistent and have the accumulator use the Kryo serialization already registered with the Spark context. Does anyone know how to use Kryo with a custom Spark accumulator?
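The workaround described above — making CustomClass implement Serializable — can be sketched as follows. This is a hypothetical minimal version of CustomClass (assumed here to carry only an id field), with a round-trip helper that mimics what the serialization stack in the error does to the accumulator's HashSet via writeObject. The equals/hashCode overrides matter for HashSet membership after deserialization:

```java
import java.io.*;
import java.util.*;

public class Main {

    // Hypothetical minimal version of the question's CustomClass;
    // implementing Serializable is the workaround the question describes.
    static class CustomClass implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;

        CustomClass(String id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof CustomClass && ((CustomClass) o).id.equals(id);
        }

        @Override
        public int hashCode() {
            return id.hashCode();
        }
    }

    // Round-trips a set through Java serialization, mimicking what the
    // task-result path does with the accumulator's HashSet value.
    static Set<CustomClass> roundTrip(Set<CustomClass> in) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new HashSet<>(in));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            @SuppressWarnings("unchecked")
            Set<CustomClass> out = (Set<CustomClass>) ois.readObject();
            return out;
        }
    }

    public static void main(String[] args) throws Exception {
        Set<CustomClass> set = new HashSet<>();
        set.add(new CustomClass("a"));
        Set<CustomClass> back = roundTrip(set);
        System.out.println(back.contains(new CustomClass("a"))); // true
    }
}
```

The stack trace's `writeObject data (class: java.util.HashSet)` frames show the accumulator value travelling back to the driver through plain Java serialization, which is why the Kryo registration alone does not prevent the NotSerializableException.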