I am trying to read multiple Parquet files from an S3 bucket that contains several days of data.
S3 path: s3n://<s3path>/dt=*/*.snappy.parquet
PySpark code used to read data from the multiple Parquet files:
s="SELECT * FROM parquet.`s3n://<s3path>/dt=*/*.snappy.parquet`"
c = sqlContext.sql(s)
c.show(2)
The error is:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Working code that reads the files for a single day from the same S3 bucket:
s2="SELECT * FROM parquet.`s3n://<s3path>/dt=2016-02-02/*.snappy.parquet`"
c2 = sqlContext.sql(s2)
c2.show(2)
+------+--------+---------------+-------------+------+
|  CL1 |  CL2   |      CL3      |     CL4     | r_CD |
+------+--------+---------------+-------------+------+
|   18 | YGh4c  |    2016-02-02 |    00:32:02 |  AC  |
|   18 | YGh4c  |    2016-02-02 |    00:32:02 |  IC  |
+------+--------+---------------+-------------+------+
I can read each directory in the bucket individually with the same PySpark command. Why does scanning the whole bucket fail, and what is the correct way to do this?
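For reference, this is a sketch of the DataFrame-reader equivalent I was planning to try next. The basePath and mergeSchema options are standard Parquet reader options from the Spark documentation; I am assuming, not claiming, that they are relevant to this error:

# DataFrame-reader equivalent of the SQL query above (untested sketch).
# basePath tells Spark where partition discovery should start;
# mergeSchema asks it to reconcile schemas across the dt= partitions.
df = sqlContext.read \
    .option("basePath", "s3n://<s3path>/") \
    .option("mergeSchema", "true") \
    .parquet("s3n://<s3path>/dt=*/*.snappy.parquet")
df.show(2)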