I am unable to parallelize a list in Scala and get a java.lang.NullPointerException
messages.foreachRDD( rdd => {
  for (avroLine <- rdd) {
    val record = Injection.injection.invert(avroLine.getBytes).get
    val field1Value = record.get("username")
    val jsonStrings = Seq(record.toString())
    val newRow = sqlContext.sparkContext.parallelize(Seq(record.toString()))
  }
})
Output
jsonStrings...List({"username": "user_118", "tweet": "tweet_218", "timestamp": 18})
Exception
Caused by: java.lang.NullPointerException
at com.capitalone.AvroConsumer$$anonfun$main$1$$anonfun$apply$1.apply(AvroConsumer.scala:83)
at com.capitalone.AvroConsumer$$anonfun$main$1$$anonfun$apply$1.apply(AvroConsumer.scala:74)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.util.CompletionIterator.foreach(CompletionIterator.scala:26)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
Thanks in advance!!
Answer 0 (score: 0)
You are trying to create an RDD in a Spark worker context. Although foreachRDD itself runs on the driver, the foreach operation you perform on each RDD is distributed to the workers. It seems unlikely that you really want to create a new RDD for every line of your input stream.
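To illustrate where each closure executes, here is a minimal sketch reusing the messages and sqlContext names from the question (it is meant to show the problem, not to be run):

messages.foreachRDD { rdd =>      // this closure runs on the driver
  rdd.foreach { avroLine =>       // this closure is serialized and run on the executors
    // sqlContext and its SparkContext only live on the driver; dereferencing them
    // inside an executor-side closure is what typically surfaces as a NullPointerException
    sqlContext.sparkContext.parallelize(Seq(avroLine))
  }
}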
Update after comments:
It is hard to have this discussion in a comment thread, where there is no code formatting. My basic question is why you don't do something like this:
val messages: ReceiverInputDStream[String] = RabbitMQUtils.createStream(ssc, rabbitParams)
def toJsonString(message: String): String = SparkUtils.getRecordInjection(QUEUE_NAME).invert(message.getBytes()).get
val jsonStrings: DStream[String] = messages map toJsonString
I didn't bother to work out all the libraries you are using (please post an MCVE next time), so I haven't tried to compile this. But it looks like all you want is to map each input message to a JSON string. You may want to do something fancier with the resulting DStream of strings, but that is probably a different question.
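As a rough sketch of what consuming that DStream[String] on the driver could look like (assuming the same sqlContext as in the question, and a Spark version where read.json accepts an RDD[String]):

jsonStrings.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {                  // skip empty micro-batches
    val df = sqlContext.read.json(rdd)   // foreachRDD body runs on the driver, so sqlContext is available
    df.show()
  }
}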
Answer 1 (score: 0)
def toJsonString(message: String): String = {
  // invert the Avro bytes back into a record and render it as a JSON string
  val record = SparkUtils.getRecordInjection(QUEUE_NAME).invert(message.getBytes()).get
  record.toString
}

dStreams.foreachRDD( rdd => {
  val jsonStrings = rdd.map(stream => toJsonString(stream))
  val df = sqlContext.read.json(jsonStrings)
  df.write.mode("Append").csv("/Users/Documents/kafka-poc/consumer-out/def/")
})
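Note that in this version only the map with toJsonString is shipped to the executors; read.json and the csv write both happen inside the foreachRDD body on the driver, where sqlContext is available, which is why it avoids the NullPointerException from the original code.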