I have 3 non-partitioned tables in Hive.
drop table default.test1;
CREATE EXTERNAL TABLE `default.test1`(
`c1` string,
`c2` string,
`c3` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://bucket_name/dev/sri/sri_test1/';
drop table default.test2;
CREATE EXTERNAL TABLE `default.test2`(
`c1` string,
`c2` string,
`c3` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://bucket_name/dev/sri/sri_test2/';
drop table default.test3;
CREATE EXTERNAL TABLE `default.test3`(
`c1` string,
`c2` string,
`c3` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://bucket_name/dev/sri/sri_test3/';
--INSERT:
insert into default.test1 values("a","b","c");
insert into default.test2 values("d","e","f");
insert overwrite table default.test3 select * from default.test1 union all select * from default.test2;
If I look at S3, two subfolders were created because of the union all operation:
aws s3 ls s3://bucket_name/dev/sri/sri_test3/
PRE 1/
PRE 2/
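To see exactly where the data files landed, the recursive variant of the same listing can be used (file names will differ per run):

aws s3 ls --recursive s3://bucket_name/dev/sri/sri_test3/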
The problem now is that if I try to read the default.test3 table in PySpark and build a DataFrame, it comes back empty:
df = spark.sql("select * from default.test3")
df.count()
0
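As a sanity check (a sketch, assuming Spark 3.x for the recursiveFileLookup option; on Spark 2.x a glob path like s3://bucket_name/dev/sri/sri_test3/* should behave the same), reading the directory directly instead of through the metastore should descend into the subfolders:

# Bypass the Hive metastore and read the Parquet files directly,
# recursing into the union-all subdirectories 1/ and 2/.
df_direct = (spark.read
    .option("recursiveFileLookup", "true")
    .parquet("s3://bucket_name/dev/sri/sri_test3/"))
df_direct.count()  # if this returns 2, the files are intact and only the table scan misses them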
How can I fix this issue?