Spark: serialization not working with aggregate

Asked: 2016-07-06 08:14:41

Tags: java serialization apache-spark

I have this class (in Java) that I want to use with Spark (1.6):

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

public class Aggregation {
  private Map<String, Integer> counts;

  public Aggregation() {
    counts = new HashMap<String, Integer>();
  }

  public Aggregation add(Aggregation ia) {
    String key = buildCountString(ia);
    addKey(key);
    return this;
  }

  private void addKey(String key, int cnt) {
    if(counts.containsKey(key)) {
        counts.put(key, counts.get(key) + cnt);
    }
    else {
        counts.put(key, cnt);
    }
  }

  private void addKey(String key) {
    addKey(key, 1);
  }

  public Aggregation merge(Aggregation agg) {
    for(Entry<String, Integer> e: agg.counts.entrySet()) {
        this.addKey(e.getKey(), e.getValue());
    }
    return this;
  }

  private String buildCountString(Aggregation rec) {
    ...
  }
}

Spark is started with Kryo enabled and the class registered (in Scala):

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[Aggregation]))

I want to use it with Spark's aggregate (Scala):

rdd.aggregate(new Aggregation())((agg, rec) => agg.add(rec), (a, b) => a.merge(b))

Somehow this throws a "Task not serializable" exception.

But when I use map together with reduce, everything works fine:

val rdd2 = interactionObjects.map(_ => new Aggregation())
rdd2.reduce((a, b) => a.merge(b))
println(rdd2.count())

Do you have any idea why the error occurs with aggregate but not with map/reduce?

Thanks and regards!

1 answer:

Answer 0 (score: 1):

Your Aggregation class should implement Serializable. When you call aggregate, the driver sends your (new Aggregation()) zero value to all workers, and that is what causes the serialization error.
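
A minimal sketch of the suggested fix (only the class declaration changes, plus an optional serialVersionUID; everything else stays as posted in the question):

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class Aggregation implements Serializable {
  // Optional, but recommended once the class is serializable.
  private static final long serialVersionUID = 1L;

  private Map<String, Integer> counts;

  public Aggregation() {
    counts = new HashMap<String, Integer>();
  }

  // ... add, addKey, merge and buildCountString remain exactly as in the question ...
}

With that change, the aggregate call from the question should run as written, because the zero value created on the driver can now be serialized and shipped to the executors.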