I am trying to read JSON records produced by Kafka in Spark using SQLContext.read(). Every time I get a NullPointerException.
SparkConf conf = new SparkConf()
        .setAppName("kafka-sandbox")
        .setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));

Set<String> topics = Collections.singleton(topicString);
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", servers);

JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(
        ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topics);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

directKafkaStream
        .map(message -> message._2)
        .foreachRDD(rdd -> {
            rdd.foreach(record -> {
                // line 73 of SparkJSONConsumer.java, where the NPE is thrown
                Dataset<Row> ds = sqlContext.read().json(rdd);
            });
        });
ssc.start();
ssc.awaitTermination();
Here is the log:
java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:112)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:110)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:535)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:595)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:504)
at SparkJSONConsumer$1.lambda$2(SparkJSONConsumer.java:73)
at SparkJSONConsumer$1$$Lambda$8/1821075039.call(Unknown Source)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:350)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:875)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:875)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I think the problem is caused by the foreachRDD clause, but I can't figure it out, so any suggestions would be great.
Also, I am using sqlContext because I plan to serialize the records to Avro format ("com.databricks.spark.avro"). If there is a way to serialize a string containing a JSON structure to Avro format without defining a schema, please do share it!
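For context, this is roughly the write path I have in mind. It is only a sketch: jsonRdd and the output path are placeholders, it assumes a properly initialized sqlContext, and it relies on the schema simply being inferred from the JSON.

JavaRDD<String> jsonRdd = ...;                        // JSON strings coming from Kafka
Dataset<Row> ds = sqlContext.read().json(jsonRdd);    // schema inferred from the JSON
ds.write().format("com.databricks.spark.avro").save("/tmp/avro-out");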
Thanks in advance.
Answer 0 (score: 1):
As described in the Spark documentation, you have to create the SparkSession using the SparkContext that the StreamingContext is using. Furthermore, it has to be done in a way that allows it to be recreated after a driver failure. This is achieved by creating a lazily instantiated singleton instance of SparkSession.
See:
http://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#dataframe-and-sql-operations
Solution:
Create the SQLContext as shown below before reading the JSON:
SQLContext sqlContext = SparkSession.builder()
        .config(rdd.context().getConf()).getOrCreate().sqlContext();
Dataset<Row> ds = sqlContext.read().json(rdd);
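Applied to your stream, a minimal sketch could look like the following (requires importing org.apache.spark.sql.SparkSession). Note that read().json(rdd) has to run on the driver, i.e. directly inside foreachRDD rather than inside rdd.foreach; the isEmpty() guard and the show() call are only illustrative.

directKafkaStream
        .map(message -> message._2())
        .foreachRDD(rdd -> {
            // runs on the driver for each micro-batch
            SQLContext sqlContext = SparkSession.builder()
                    .config(rdd.context().getConf())
                    .getOrCreate()
                    .sqlContext();
            if (!rdd.isEmpty()) {
                Dataset<Row> ds = sqlContext.read().json(rdd);
                ds.show();
            }
        });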