Question

I am using Spark to generate Parquet files (Snappy-compressed, partitioned by SetId) and storing them at an HDFS location.

df.coalesce(1).write.partitionBy("SetId").
  mode(SaveMode.Overwrite).
  format("parquet").
  option("header","true").
  save(args(1))

The Parquet data files are stored under /some-hdfs-path/testsp.

I then create a Hive table on top of them as follows:

CREATE EXTERNAL TABLE DimCompany(
  CompanyCode string,
  CompanyShortName string,
  CompanyDescription string,
  BusinessDate string,
  PeriodTypeInd string,
  IrisDuplicateFlag int,
  GenTimestamp timestamp
) partitioned by (SetId int)
STORED AS PARQUET LOCATION '/some-hdfs-path/testsp'
TBLPROPERTIES ('skip.header.line.count'='1','parquet.compress'='snappy');

However, when I select from the table in Hive, it returns no results.

I have tried:

Running the msck command, for example:
```
msck repair table dimcompany;
```

Setting the following:

spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")

Neither of these worked. How can I fix this?

Answer 1

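The problem is that your partition column, SetId, contains uppercase letters.

Hive converts column names to lowercase, so the partition column is stored as setid rather than SetId. When Hive then searches a case-sensitive data store for the partition folders, it looks for setid=some_value and finds nothing, because your data folders are named SetId=some_value.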
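The mismatch described above can be sketched in a few lines of plain Scala. This is only an illustration of the naming rule, not Spark or Hive code: `PartitionDirs`, `sparkDir`, and `hiveDir` are hypothetical helper names, and the sketch assumes Hive derives the expected folder name from the lowercased column name, as this answer describes.

```scala
// Hypothetical helpers, for illustration only (not part of Spark or Hive).
object PartitionDirs {
  // Spark's partitionBy names each folder after the DataFrame column, case preserved.
  def sparkDir(column: String, value: Any): String = s"$column=$value"

  // Assumption: Hive lowercases the column name, so it expects a lowercase folder name.
  def hiveDir(column: String, value: Any): String = s"${column.toLowerCase}=$value"
}

object PartitionDirsDemo extends App {
  val written  = PartitionDirs.sparkDir("SetId", 1)
  val expected = PartitionDirs.hiveDir("SetId", 1)
  println(s"written by Spark: $written, expected by Hive: $expected")
  // On a case-sensitive file system these are two different paths.
  println(s"same folder: ${written == expected}")
}
```

On HDFS, which is case-sensitive, the two names refer to different folders, which is why the table can appear empty.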

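To fix this, convert SetId to lowercase or snake_case before writing, partition by the renamed column, and make sure the partition column in the Hive DDL has the same name. You can rename it by aliasing the column in your DataFrame: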

df.select(
... {{ your other_columns }} ...,
col("SetId").alias("set_id")
)
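As a fuller sketch of the same idea, the rename can be folded into the write from the question. The object name `WriteLowercasePartition` and its `write` method are hypothetical; the sketch uses `withColumnRenamed` instead of a `select` with an alias so the other columns do not need to be listed, and it picks the name setid so that the existing DDL (where Hive already lowercases SetId to setid) keeps matching.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical wrapper around the write from the question.
object WriteLowercasePartition {
  def write(df: DataFrame, outputPath: String): Unit = {
    df.withColumnRenamed("SetId", "setid") // lowercase name, matching what Hive stores
      .coalesce(1)
      .write
      .partitionBy("setid")                // folders become setid=<value>
      .mode(SaveMode.Overwrite)
      // The "header" option from the question is a CSV option, so it is left out here.
      .parquet(outputPath)
  }
}
```

Calling `WriteLowercasePartition.write(df, args(1))` would replace the original write. Because the mode is Overwrite, the old SetId=... folders under the target path are replaced; run msck repair table afterwards so Hive registers the new partitions.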

Based on this StackOverflow post, you may also have to set these properties before executing the create statement.

SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;

After creating the table, also try running:

msck repair table <your_schema.your_table>;

Hive external table can't see partitioned Parquet files
