Question

I have Parquet files on S3 with more than one schema, laid out as follows:

s3://my_parquet_files
|-cid=abc
    |-schema1.snappy.parquet
|-cid=xyz
    |-schema2.snappy.parquet

Their schemas are:

Schema 1:

|-- a: integer (nullable = true)
|-- b: integer (nullable = true)

Schema 2:

|-- a: integer (nullable = true)
|-- b: integer (nullable = true)
|-- c: integer (nullable = true)

On an EMR cluster, the following snippet works fine:

df = spark.read.option("mergeSchema", "true").parquet('s3://my_parquet_files')
df.show(n=5)
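
For illustration, here is a plain-Python model of the result that read is expected to produce: the columns of both schemas are unioned, and rows from files that lack a column get null for it. This is a sketch only, not Spark's implementation; the helper names `merge_schemas` and `conform` and the sample row values are made up for the example, and the `cid` partition column that Spark adds is left out.

```python
# Plain-Python model of the effect of mergeSchema on the two schemas above.
# Illustration only: this is NOT how Spark implements schema merging.

def merge_schemas(*schemas):
    """Union the fields of several schemas, keeping first-seen column order."""
    merged = {}
    for schema in schemas:
        for name, dtype in schema:
            # In this model, the same column with two different types is an error.
            if name in merged and merged[name] != dtype:
                raise ValueError(f"conflicting types for column {name!r}")
            merged.setdefault(name, dtype)
    return list(merged.items())


def conform(row, schema):
    """Pad a row (a dict) with None for every column it does not have."""
    return {name: row.get(name) for name, _ in schema}


schema1 = [("a", "int"), ("b", "int")]              # cid=abc
schema2 = [("a", "int"), ("b", "int"), ("c", "int")]  # cid=xyz

merged = merge_schemas(schema1, schema2)

# Hypothetical row from the cid=abc file, which has no column c.
row_from_abc = conform({"a": 1, "b": 2}, merged)

print(merged)
print(row_from_abc)
```

In this model a row from schema1.snappy.parquet ends up with `c` set to `None`, which corresponds to the null values `df.show()` displays for that column.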

However, reading the same data with Spark SQL through a table created in Athena raises an error:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('tmp')
    # Use the AWS Glue Data Catalog as the Hive metastore
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .config("spark.sql.parquet.mergeSchema", "true")
    .enableHiveSupport()
    .getOrCreate()
)
qry = 'select * from parquet_table'
df = spark.sql(qry)

The error thrown:

Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
...
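
The class name in the trace, PlainLongDictionary, suggests a dictionary of 64-bit values, so one thing that may be worth checking is whether the column types declared for the table match the types stored in the files. The sketch below compares two schemas given as lists of (name, type) pairs, the shape that `DataFrame.dtypes` returns in PySpark. The helper name `diff_schemas` is hypothetical, and the sample values are invented for the example; they are not taken from the actual table or files.

```python
def diff_schemas(table_dtypes, file_dtypes):
    """Return columns whose type differs or that exist on one side only.

    Both arguments are lists of (column_name, type_string) pairs,
    the shape returned by DataFrame.dtypes in PySpark.
    """
    table, files = dict(table_dtypes), dict(file_dtypes)
    return {
        name: (table.get(name), files.get(name))
        for name in sorted(set(table) | set(files))
        if table.get(name) != files.get(name)
    }


# Invented values for illustration only. On the cluster they would come from:
#   spark.table('parquet_table').dtypes
#   spark.read.option("mergeSchema", "true").parquet('s3://my_parquet_files').dtypes
table_dtypes = [("a", "int"), ("b", "int"), ("c", "int"), ("cid", "string")]
file_dtypes = [("a", "int"), ("b", "bigint"), ("c", "int"), ("cid", "string")]

mismatches = diff_schemas(table_dtypes, file_dtypes)
print(mismatches)
```

An empty result would mean the two schemas agree, in which case the cause lies elsewhere.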

The EMR cluster is configured as follows:

Hadoop distribution: Amazon 2.8.5
Applications: Spark 2.4.0, Hive 2.3.4, Livy 0.5.0, Zeppelin 0.8.0, Ganglia 3.7.2

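I would like to avoid reading the Parquet files into a DataFrame, calling createOrReplaceTempView, and running SQL against that view.

Is there some additional configuration I need to set? How can I read the data with a SQL query?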

PySpark has trouble reading Parquet files with merged schemas

0 answers: