NullPointerException in Spark when using SQLContext.read()

Date: 2016-08-10 11:13:36

Tags: java json apache-spark avro

I am trying to use SQLContext.read() in Spark to read JSON records produced by Kafka. Every time I get a NullPointerException.

    SparkConf conf = new SparkConf()
        .setAppName("kafka-sandbox")
        .setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));

    Set<String> topics = Collections.singleton(topicString);
    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", servers);

    JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(
            ssc, String.class, String.class, StringDecoder.class, StringDecoder.class, 
            kafkaParams, topics);
    SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

    directKafkaStream
        .map(message -> message._2)
        .foreachRDD(rdd -> {
            rdd.foreach(record -> {
                Dataset<Row> ds = sqlContext.read().json(rdd);
            });
        });
    ssc.start();
    ssc.awaitTermination();

Here is the log:

java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:112)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:110)
    at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:535)
    at org.apache.spark.sql.SparkSession.read(SparkSession.scala:595)
    at org.apache.spark.sql.SQLContext.read(SQLContext.scala:504)
    at SparkJSONConsumer$1.lambda$2(SparkJSONConsumer.java:73)
    at SparkJSONConsumer$1$$Lambda$8/1821075039.call(Unknown Source)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:350)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:350)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:875)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:875)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I think the problem is caused by the foreachRDD clause, but I cannot figure it out, so any advice would be great.

Also, I am using sqlContext because I plan to serialize the records in avro format ("com.databricks.spark.avro"). If there is a way to serialize a string containing a JSON structure to avro format without defining a schema, you are very welcome to share it!

Thanks in advance.

1 Answer:

Answer 0 (score: 1)

As mentioned in the Spark documentation, you have to create the SparkSession using the SparkContext that the StreamingContext is using. Moreover, this has to be done in such a way that it can be re-created on driver failures, which is achieved by creating a lazily instantiated singleton instance of SparkSession.

See:

http://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#dataframe-and-sql-operations
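
For reference, the example in that guide wraps the session in a lazily instantiated singleton helper. A minimal Java sketch of that pattern (the class and method names follow the guide's example and are only illustrative):

    // Lazily instantiated singleton SparkSession, created from the
    // SparkConf of the RDDs coming out of the streaming context.
    class JavaSparkSessionSingleton {
        private static transient SparkSession instance = null;

        public static SparkSession getInstance(SparkConf sparkConf) {
            if (instance == null) {
                instance = SparkSession.builder()
                        .config(sparkConf)
                        .getOrCreate();
            }
            return instance;
        }
    }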

Solution:

Create the SQLContext as shown below before reading the json.

    SQLContext sqlContext = SparkSession.builder()
            .config(rdd.context().getConf())
            .getOrCreate()
            .sqlContext();

    Dataset<Row> ds = sqlContext.read().json(rdd);
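
Putting it together, the session is obtained and the json is read once per batch in the foreachRDD body (rather than inside rdd.foreach, where the stack trace shows the call currently happens), and the resulting Dataset can then be written out in avro. This is only a rough sketch: it assumes the com.databricks:spark-avro package is on the classpath, and "/tmp/output" is a placeholder path.

    directKafkaStream
        .map(message -> message._2)
        .foreachRDD(rdd -> {
            // Get (or lazily create) the session from the RDD's SparkContext.
            SQLContext sqlContext = SparkSession.builder()
                    .config(rdd.context().getConf())
                    .getOrCreate()
                    .sqlContext();

            Dataset<Row> ds = sqlContext.read().json(rdd);

            // Assumption: com.databricks:spark-avro is on the classpath;
            // the output path below is a placeholder.
            ds.write()
                .format("com.databricks.spark.avro")
                .mode("append")
                .save("/tmp/output");
        });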