I'm using Spark with Cassandra, and I have a JavaRDD<String> of clients. For each client, I want to select its interactions from Cassandra like this:
JavaPairRDD<String, List<InteractionByMonthAndCustomer>> a = client.mapToPair(new PairFunction<String, String, List<InteractionByMonthAndCustomer>>() {
    @Override
    public Tuple2<String, List<InteractionByMonthAndCustomer>> call(String s) throws Exception {
        // sc is the JavaSparkContext; using it here is what triggers the error below
        List<InteractionByMonthAndCustomer> b = javaFunctions(sc)
                .cassandraTable(CASSANDRA_SCHEMA, "interaction_by_month_customer")
                .where("ctid =?", s)
                .map(new Function<CassandraRow, InteractionByMonthAndCustomer>() {
                    @Override
                    public InteractionByMonthAndCustomer call(CassandraRow cassandraRow) throws Exception {
                        return new InteractionByMonthAndCustomer(cassandraRow.getString("channel"),
                                cassandraRow.getString("motif"),
                                cassandraRow.getDate("start"),
                                cassandraRow.getDate("end"),
                                cassandraRow.getString("ctid"),
                                cassandraRow.getString("month")
                        );
                    }
                }).collect();
        return new Tuple2<String, List<InteractionByMonthAndCustomer>>(s, b);
    }
});
For this I use a JavaSparkContext sc. But I get this error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.map(RDD.scala:270)
at org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:99)
at org.apache.spark.api.java.JavaRDD.mapToPair(JavaRDD.scala:32)
at fr.aid.cim.spark.dao.GenrateCustumorJourney.AllCleintInteractions(GenrateCustumorJourney.java:91)
at fr.aid.cim.spark.dao.GenrateCustumorJourney.main(GenrateCustumorJourney.java:75)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 14 more
I think JavaSparkContext has to be serializable. But how can I make it serializable?
Thanks.
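Answer 0 (score: 14)
No, JavaSparkContext is not serializable, and it is not supposed to be. It can't be used inside a function that you send to remote workers. Here you are not referencing it explicitly, but a reference is serialized anyway because your anonymous inner class function is not static and therefore holds a reference to the enclosing class.
Try rewriting your code with this function as a static, standalone object.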
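To illustrate the capture described above without needing a cluster, here is a minimal sketch in plain Java. Everything in it is hypothetical: ClosureCaptureDemo stands in for the driver class, FakeContext stands in for JavaSparkContext (it is simply a class that does not implement Serializable), and SerializableFunction plays the role of Spark's function interfaces. It only shows how Java serialization treats an anonymous inner class versus a static nested class.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for the driver class from the question.
class ClosureCaptureDemo implements Serializable {

    // Hypothetical stand-in for JavaSparkContext: deliberately NOT Serializable.
    static class FakeContext { }

    // Plays the role of Spark's serializable function interfaces.
    interface SerializableFunction extends Serializable {
        String call(String input);
    }

    private final FakeContext context = new FakeContext();

    // The anonymous inner class reads a field of the enclosing instance, so it
    // holds a hidden reference to that instance, and through it to the context.
    SerializableFunction anonymousFunction() {
        return new SerializableFunction() {
            @Override
            public String call(String input) {
                return input + "@" + (context != null);
            }
        };
    }

    // A static nested class has no reference to any enclosing instance.
    static class StaticFunction implements SerializableFunction {
        @Override
        public String call(String input) {
            return input.toUpperCase();
        }
    }

    // Returns true if Java serialization accepts the object, false if it
    // fails with NotSerializableException.
    static boolean isSerializable(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Anonymous inner class: drags in the enclosing instance and its FakeContext.
        System.out.println(isSerializable(new ClosureCaptureDemo().anonymousFunction())); // false
        // Static nested class: nothing extra is captured.
        System.out.println(isSerializable(new StaticFunction())); // true
    }
}
```

The static version works only because it does not touch the context at all. In the question's code the function body itself calls javaFunctions(sc), so making the class static is not enough on its own; the Cassandra lookup also has to move out of the function.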
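Answer 1 (score: 0)
You cannot use the SparkContext to create other RDDs from within an executor (that is, inside an RDD's map function).
You have to create the Cassandra RDD (sc.cassandraTable) in the driver and then do a join between the two RDDs (the client RDD and the Cassandra table RDD).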
Answer 2 (score: 0)
Declare it with the transient keyword:
private transient JavaSparkContext sparkContext;
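A caveat worth checking before relying on this: transient tells Java serialization to skip the field, so the exception goes away, but the field is null in the deserialized copy. The sketch below shows that behaviour in plain Java. TransientFieldDemo and FakeContext are hypothetical stand-ins (FakeContext takes the place of JavaSparkContext), and the round trip through a byte array only approximates what happens when a closure is shipped to a worker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a driver class that holds a context field.
class TransientFieldDemo implements Serializable {

    // Hypothetical stand-in for JavaSparkContext: not Serializable.
    static class FakeContext { }

    // Skipped by Java serialization because of the transient modifier.
    private transient FakeContext context = new FakeContext();

    // Ordinary field, serialized as usual.
    private final String name;

    TransientFieldDemo(String name) {
        this.name = name;
    }

    boolean hasContext() {
        return context != null;
    }

    String name() {
        return name;
    }

    // Serializes the object to a byte array and reads it back, roughly what
    // happens to a captured object when a task is sent to another JVM.
    static TransientFieldDemo roundTrip(TransientFieldDemo original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TransientFieldDemo) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TransientFieldDemo original = new TransientFieldDemo("driver");
        TransientFieldDemo copy = roundTrip(original);
        System.out.println(original.hasContext()); // true
        System.out.println(copy.hasContext());     // false: the transient field was not restored
        System.out.println(copy.name());           // driver
    }
}
```

So transient is reasonable when the context is only ever used on the driver, but code that still dereferences the field inside a function running on a worker would fail with a NullPointerException instead of the serialization error.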