Reading all partitioned Parquet files in PySpark

Date: 2019-12-04 09:11:59

Tags: apache-spark pyspark apache-spark-sql parquet

I want to load all Parquet files stored in a folder structure on AWS S3.

The folder structure looks like this: S3/bucket_name/folder_1/folder_2/folder_3/year=2019/month/day

What I want is to read all the Parquet files at once, so I would like PySpark to read the data for every month and day available in 2019 and put it into a single DataFrame (so I end up with one concatenated/appended DataFrame covering all of 2019).

I have been told that these are partitioned files (though I am not sure of this).

Is this possible in PySpark?

When I try spark.read.parquet('S3/bucket_name/folder_1/folder_2/folder_3/year=2019'), it works. However, when I then call spark.read.parquet('S3/bucket_name/folder_1/folder_2/folder_3/year=2019').show() to view the Spark DataFrame, it says:

An error occurred while calling o934.showString.
: org.apache.spark.SparkException: 
Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, 
most recent failure: 
Lost task 0.3 in stage 36.0 (TID 718, executor 7): 
java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary 
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
        at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I would like to be able to display the DataFrame.

2 answers:

Answer 0 (score: 1)

Answer 1 (score: 0)

In PySpark, you can simply do the following:

from pyspark.sql.functions import col
(
  spark.read
  .parquet('S3/bucket_name/folder_1/folder_2/folder_3')
  .filter(col('year') == 2019)
)

So you point at the parent folder that contains the partition subfolders and apply a filter on the partition column; Spark should then read only the data from the subfolders for the given year.
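
As a rough follow-up sketch (the s3a:// URI below is a placeholder rather than the asker's actual bucket, and spark is assumed to be an existing SparkSession): if you prefer to point at the year=2019 prefix directly, as in the question, you can keep year as a discovered column by passing the basePath option, and you can call explain() on the filter-based read above to confirm that only the 2019 subfolders are scanned.

from pyspark.sql.functions import col

base = 's3a://bucket_name/folder_1/folder_2/folder_3'  # placeholder path

# Read only the year=2019 prefix, but tell Spark where the partitioned
# table starts so that `year` is still discovered as a partition column.
df_2019 = (
  spark.read
  .option('basePath', base)
  .parquet(base + '/year=2019')
)

# Equivalent filter-based read; explain() should list the year predicate
# under PartitionFilters in the physical plan, i.e. only the 2019
# subfolders are actually scanned.
df_filtered = (
  spark.read
  .parquet(base)
  .filter(col('year') == 2019)
)
df_filtered.explain()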