Output a Spark RDD to a Kafka topic

Asked: 2018-07-03 14:06:05

Tags: java apache-spark apache-kafka

I have a pairRdd that is continuously fed with data, and every x minutes I want to write its contents to a Kafka topic and then clear it.

I have tried a few approaches, but every time I get this error:

    Exception in thread "Timer-1" org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:925)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
        at org.apache.spark.api.java.JavaRDDLike$class.foreachPartition(JavaRDDLike.scala:219)
        at org.apache.spark.api.java.AbstractJavaRDDLike.foreachPartition(JavaRDDLike.scala:45)
        at SparkProcess$1.run(SparkProcess.java:94)
        at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
        at java.base/java.util.TimerThread.run(Timer.java:506)
    Caused by: java.io.NotSerializableException: SparkProcess$1
    Serialization stack:
        - object not serializable (class: SparkProcess$1, value: SparkProcess$1@28612f9b)
        - element of array (index: 0)
        - array (class [Ljava.lang.Object;, size 1)
        - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
        - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class SparkProcess$1, functionalInterfaceMethod=org/apache/spark/api/java/function/VoidFunction.call:(Ljava/lang/Object;)V, implementation=invokeSpecial SparkProcess$1.lambda$run$e3b46054$1:(Ljava/util/Iterator;)V, instantiatedMethodType=(Ljava/util/Iterator;)V, numCaptured=1])
        - writeReplace data (class: java.lang.invoke.SerializedLambda)
        - object (class SparkProcess$1$$Lambda$105/77425562, SparkProcess$1$$Lambda$105/77425562@346d59fc)
        - field (class: org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1, name: f$12, type: interface org.apache.spark.api.java.function.VoidFunction)
        - object (class org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1, <function1>)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
        ... 14 more
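Reading the serialization stack, my understanding is that the lambda I pass to foreachPartition is compiled inside the anonymous TimerTask (the SparkProcess$1 in the trace), so Spark tries to ship that enclosing instance to the executors along with the closure, and it does not implement Serializable.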

My current function looks like this:

    public void cacheToTopic() {
        Timer t = new Timer();
        t.scheduleAtFixedRate(
                new TimerTask() {
                    public void run() {
                        // foreachPartition hands the function an Iterator over
                        // the partition's records, not a single record
                        pairRdd.foreachPartition(partition -> {
                            Producer<String, String> producer = createKafkaProducer();
                            while (partition.hasNext()) {
                                producer.send(new ProducerRecord<>(
                                        "output", DataObjectFactory.getRawJSON(partition.next())));
                            }
                            producer.close(); // flush before the partition finishes
                        });
                    }
                },
                3000,   // run first occurrence after three seconds
                3000);  // then run every three seconds
    }
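
One workaround I am considering is to pull the partition handler out of the TimerTask into a static method, so that the method reference captures no enclosing instance. Below is a minimal sketch of that idea, not my actual code: it assumes the pair values are already JSON strings, and the sendPartitionToKafka helper, the static createKafkaProducer, and the broker address are placeholders of mine.

    import java.util.Iterator;
    import java.util.Properties;
    import java.util.Timer;
    import java.util.TimerTask;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.spark.api.java.JavaPairRDD;

    import scala.Tuple2;

    public class SparkProcess {

        // Static handler: the method reference below captures no enclosing
        // instance, so Spark only serializes the lambda descriptor itself.
        private static void sendPartitionToKafka(Iterator<Tuple2<String, String>> partition) {
            Producer<String, String> producer = createKafkaProducer();
            while (partition.hasNext()) {
                // Assumption: the pair value is already a JSON string.
                producer.send(new ProducerRecord<>("output", partition.next()._2));
            }
            producer.close(); // flush pending sends, release resources per partition
        }

        // Static as well, so sendPartitionToKafka needs no instance state.
        private static Producer<String, String> createKafkaProducer() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            return new KafkaProducer<>(props);
        }

        public void cacheToTopic(JavaPairRDD<String, String> pairRdd) {
            Timer t = new Timer();
            t.scheduleAtFixedRate(new TimerTask() {
                public void run() {
                    // Compiled against SparkProcess, not the anonymous
                    // TimerTask, so SparkProcess$1 is never captured.
                    pairRdd.foreachPartition(SparkProcess::sendPartitionToKafka);
                }
            }, 3000, 3000);
        }
    }

Even with that change I am still not sure how to "delete the contents" of the pairRdd afterwards, since RDDs are immutable, so pointers on that part are welcome too.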

0 Answers:

No answers yet.