I have a pairRdd that continuously receives data. Every x minutes I want to write its contents to a Kafka topic and then clear it.
I have tried a few approaches, but each time I get this error:
Exception in thread "Timer-1" org.apache.spark.SparkException: Task not serializable
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:925)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
	at org.apache.spark.api.java.JavaRDDLike$class.foreachPartition(JavaRDDLike.scala:219)
	at org.apache.spark.api.java.AbstractJavaRDDLike.foreachPartition(JavaRDDLike.scala:45)
	at SparkProcess$1.run(SparkProcess.java:94)
	at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
	at java.base/java.util.TimerThread.run(Timer.java:506)
Caused by: java.io.NotSerializableException: SparkProcess$1
Serialization stack:
	- object not serializable (class: SparkProcess$1, value: SparkProcess$1@28612f9b)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 1)
	- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
	- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class SparkProcess$1, functionalInterfaceMethod=org/apache/spark/api/java/function/VoidFunction.call:(Ljava/lang/Object;)V, implementation=invokeSpecial SparkProcess$1.lambda$run$e3b46054$1:(Ljava/util/Iterator;)V, instantiatedMethodType=(Ljava/util/Iterator;)V, numCaptured=1])
	- writeReplace data (class: java.lang.invoke.SerializedLambda)
	- object (class SparkProcess$1$$Lambda$105/77425562, SparkProcess$1$$Lambda$105/77425562@346d59fc)
	- field (class: org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1, name: f$12, type: interface org.apache.spark.api.java.function.VoidFunction)
	- object (class org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1)
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
	... 14 more
My current function looks like this:
public void cacheToTopic() {
    Timer t = new Timer();
    t.scheduleAtFixedRate(new TimerTask() {
        public void run() {
            pairRdd.foreachPartition(record -> {
                Producer<String, String> producer = createKafkaProducer();
                ProducerRecord<String, String> data = new ProducerRecord<String, String>(
                        "output", DataObjectFactory.getRawJSON(record));
                producer.send(data);
            });
        }
    },
    3000,  // run first occurrence after three seconds
    3000); // run every three seconds
}
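From the serialization stack, the failing object is `SparkProcess$1`, i.e. the anonymous `TimerTask` itself: the lambda passed to `foreachPartition` is declared inside it, so it implicitly captures the enclosing anonymous class, which is not `Serializable`. A minimal, Spark-free sketch of the same mechanism (class names here are illustrative, not from the original code) shows that an inner class referencing its non-serializable outer instance fails Java serialization, while a standalone task that captures nothing serializes fine:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {
    // A serializable functional interface, playing the role of Spark's VoidFunction.
    interface SerRunnable extends Runnable, Serializable {}

    // Non-serializable outer class, analogous to the anonymous TimerTask (SparkProcess$1).
    static class Outer {
        SerRunnable capturingTask() {
            // This anonymous class references Outer.this, so it carries a hidden
            // field pointing at the non-serializable Outer instance.
            return new SerRunnable() {
                public void run() { System.out.println(Outer.this); }
            };
        }
    }

    // Standalone serializable task: no reference to any outer instance.
    static class StandaloneTask implements SerRunnable {
        public void run() { /* per-partition work would go here */ }
    }

    // Attempts Java serialization and reports whether it succeeded.
    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) { // NotSerializableException lands here
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(new Outer().capturingTask())); // false
        System.out.println(serializes(new StandaloneTask()));        // true
    }
}
```

By the same logic, one common workaround for errors like the one above is to move the per-partition logic out of the anonymous `TimerTask` into a standalone `Serializable` class (or a lambda in a serializable enclosing scope), so Spark's closure serializer never has to serialize the timer task itself.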