Spark 2.0.1 - partition discovery fails

Asked: 2016-10-16 06:10:19

Tags: partitioning, parquet, apache-spark-2.0

I have Parquet files in the following partitioned directory layout:

/files/dataset
  /id=1
       parquet.gz 
  /id=2
       parquet.gz      
  /id=3
       parquet.gz

In Spark 1.6 they can be read as follows:

val arr = sqlContext.read.parquet("/files/dataset/").collect

But in Spark 2.0.1 the same code throws an error:

val arr = spark.read.parquet("/files/dataset/").collect


java.lang.NullPointerException
    at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745) 

The individual partition directories can each be read on their own and unioned, but I'm curious what difference I should be looking for.
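For reference, a minimal sketch of that read-and-union workaround, assuming the layout above (the ids and the Long type of `id` are taken from this post, not from the original job):

    import org.apache.spark.sql.functions.lit

    // Read each partition directory on its own and union the results,
    // re-attaching the partition column by hand since the path is no
    // longer being interpreted by partition discovery.
    val arr = Seq(1L, 2L, 3L)
      .map(id => spark.read.parquet(s"/files/dataset/id=$id").withColumn("id", lit(id)))
      .reduce(_ union _)
      .collect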

UPDATE: The partition directories were produced by three separate writes, e.g. df.where(id=1).write.parquet rather than df.write.partitionBy. This appears to be the source of the problem. However, I am still actively trying to determine why the read/collect succeeded in earlier versions of Spark.
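A sketch contrasting the two write paths being described (the exact filter expression of the original job is an assumption):

    // The write path that produced the files in this post: three separate
    // writes, one per partition directory. Note that written this way the
    // `id` column also ends up inside the Parquet files themselves.
    Seq(1L, 2L, 3L).foreach { i =>
      df.where(s"id = $i").write.parquet(s"/files/dataset/id=$i")
    }

    // Versus a single write where Spark lays out the directories itself:
    df.write.partitionBy("id").parquet("/files/dataset")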

UPDATE: The 'id' column above is a Long, and when it is written explicitly (e.g. df.write.parquet("/files/dataset/id=1")) it throws an error on read. Partition discovery is apparently trying to read the partition as IntType rather than Long. See https://issues.apache.org/jira/browse/SPARK-18108
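One way to sidestep the inference (a sketch, not from the original post): supply the reader with an explicit schema so the partition column is taken as LongType instead of the inferred IntegerType. The `value` column below is a placeholder for whatever data columns the files actually contain.

    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // Hypothetical schema: `value` stands in for the real data columns;
    // `id` is pinned to LongType so partition discovery does not get to
    // guess IntegerType from the directory names.
    val schema = StructType(Seq(
      StructField("value", StringType),
      StructField("id", LongType)
    ))

    val arr = spark.read.schema(schema).parquet("/files/dataset/").collect

Alternatively, setting spark.sql.sources.partitionColumnTypeInference.enabled to false disables partition column type inference entirely and yields string-typed partition columns.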

0 Answers:

No answers yet.