Column pushdown fails with a ColumnarBatch data source

Posted: 2018-08-06 17:29:00

Tags: apache-spark

I'm writing a data source that implements SupportsScanColumnarBatch, SupportsPushDownFilters and SupportsPushDownRequiredColumns.

After populating a ColumnarBatch with the same number of ColumnVectors as the length of the requiredSchema supplied in the pruneColumns override, I get an ArrayIndexOutOfBoundsException from deep inside Spark.

I suspect Spark expects as many ColumnVectors as there are columns in the schema returned by the readSchema override, rather than the schema supplied to pruneColumns.
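
To illustrate, the schema handling in my reader looks roughly like this (simplified; fullSchema stands in for my actual 15-column schema field):

private var requiredSchema: StructType = fullSchema

override def pruneColumns(requiredSchema: StructType): Unit = {
  this.requiredSchema = requiredSchema
}

override def readSchema(): StructType = fullSchema   // still the full schema

// ...and each ColumnarBatch is then built from exactly requiredSchema.length ColumnVectors.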

Running "select * from dft" works fine, since the schema lengths match - 15 columns in my test case. Anything narrower (e.g. "select col1, col2 from dft") returns the stack trace below, where Spark is clearly looking for more columns than the batch contains.

java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.spark.sql.vectorized.ColumnarBatch.column(ColumnarBatch.java:98)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.datasourcev2scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Any clues on how to fix this? For now, to keep things moving, I'm ignoring the pruneColumns call and returning everything.
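
i.e. roughly this, as a stop-gap (fullSchema again standing in for my full 15-column schema):

// Ignore the pruned schema entirely and keep serving every column.
override def pruneColumns(requiredSchema: StructType): Unit = { }
override def readSchema(): StructType = fullSchema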

2 Answers:

Answer 0 (score: 1)

I got it working, but it feels like a bit of a kludge.

What I did was create an array of ColumnVectors the length of the original schema (rather than the pruned one), populate only the pruned columns, and leave the rest as they were allocated.

For example, if the pruned schema keeps only the columns at indices 0, 5 and 9 of the original schema, only those need to be filled (sketched below; numRows stands in for the batch's row count):

import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// Allocate a writable vector for every column of the ORIGINAL schema
// (numRows is this batch's row count), then fill only the pruned columns.
val cva = OnHeapColumnVector.allocateColumns(numRows, schema)
cva(0).putLongs(...)
cva(5).putInts(...)
cva(9).putFloats(...)
val batch = new ColumnarBatch(cva.asInstanceOf[Array[ColumnVector]])
...

Answer 1 (score: 1)

Found a more sensible way of doing it...

In your SupportsPushDownRequiredColumns implementation, have the readSchema() method return the same StructType that Spark hands you in the pruneColumns() call.

Basically, feed back to Spark what it gives you!
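
A minimal sketch of that idea, assuming the Spark 2.3-era DataSourceV2 reader API (PruneAwareReader and fullSchema are just illustrative names):

import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

trait PruneAwareReader extends DataSourceReader with SupportsPushDownRequiredColumns {

  // Complete schema of the underlying data, supplied by the concrete reader.
  def fullSchema: StructType

  private var prunedSchema: StructType = null

  override def pruneColumns(requiredSchema: StructType): Unit = {
    prunedSchema = requiredSchema
  }

  // Hand back exactly what pruneColumns() gave us, so the number of columns Spark
  // expects matches the number of ColumnVectors in each ColumnarBatch.
  override def readSchema(): StructType =
    if (prunedSchema != null) prunedSchema else fullSchema
}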

HTH