SparkSQL query throws NullPointerException inside foreachPartition

Date: 2016-10-24 13:08:39

Tags: apache-spark apache-spark-sql

Please help me out.

I have 10 Parquet files (table data) in S3.

I read them into Datasets and then register each one as a temp table.

One table drives the entire flow, so I do the following. (The query fails when I trigger it from inside foreachPartition; it runs fine when I execute it outside.)

Code:

SparkSession spark = SparkSession.builder().appName("Test").getOrCreate();

Dataset<Row> citationDF = spark.read().parquet("s3://...");
...
...
citationDF.createOrReplaceTempView("citation");
...
...
// cit_num is a Dataset<Row> built by the elided code above
cit_num.javaRDD().foreachPartition(new VoidFunction<Iterator<Row>>()
{
      private static final long serialVersionUID = 1L;

      @Override
      public void call(Iterator<Row> iter)
      {
        while (iter.hasNext())
        {
          Row record = iter.next();
          int citation_num = record.getInt(0);
          String ci_query = "select queries ...."; // (I can execute this query outside of foreach)
          // The spark.sql(...) call below is where the NullPointerException is thrown
          System.out.println("citation num:" + citation_num + " count:" + spark.sql(ci_query).count());
          accum.add(1);
          System.out.println("accumulator count:" + accum);
        }
      }
});

Error:

16/10/24 09:08:12 WARN TaskSetManager: Lost task 1.0 in stage 30.0 (TID 83, ip-10-95-36-172.dev): java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:112)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:110)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
    at com.elsevier.datasearch.CitationTest$1.call(CitationTest.java:124)
    at com.elsevier.datasearch.CitationTest$1.call(CitationTest.java:1)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:218)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:218)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:883)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:883)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
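
The trace points at SparkSession.sessionState, so it looks like the spark session captured in the foreachPartition closure is not usable on the executors. For comparison, the same kind of query works for me when the loop stays on the driver. Below is a minimal sketch of that driver-side version; the column name citation_num, the query text, and the use of collectAsList() are just illustrative stand-ins for my real (elided) code:

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CitationDriverSideSketch
{
  public static void main(String[] args)
  {
    SparkSession spark = SparkSession.builder().appName("Test").getOrCreate();

    Dataset<Row> citationDF = spark.read().parquet("s3://...");
    citationDF.createOrReplaceTempView("citation");

    // Pull the driving values to the driver instead of iterating inside foreachPartition.
    // Assumes the driving table is small enough to collect; "citation_num" is an illustrative column name.
    List<Row> citNums = spark.sql("select distinct citation_num from citation").collectAsList();

    for (Row record : citNums)
    {
      int citationNum = record.getInt(0);
      // spark.sql(...) runs on the driver here, so sessionState is initialized
      long count = spark.sql("select * from citation where citation_num = " + citationNum).count();
      System.out.println("citation num:" + citationNum + " count:" + count);
    }
  }
}

If there is no supported way to call spark.sql from inside foreachPartition, collecting the driving values like this (or expressing the lookup as a join) is what I would fall back to.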

0 Answers:

No answers yet.