Issue in Spark: unable to read the S3 subfolders of a Hive table

Date: 2017-08-05 14:05:58

Tags: amazon-s3 pyspark hiveql union-all

I have 3 non-partitioned tables in Hive.

drop table default.test1;
CREATE EXTERNAL TABLE `default.test1`(
  `c1` string,
  `c2` string,
  `c3` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket_name/dev/sri/sri_test1/';

drop table default.test2;
CREATE EXTERNAL TABLE `default.test2`(
  `c1` string,
  `c2` string,
  `c3` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket_name/dev/sri/sri_test2/';


drop table default.test3;
CREATE EXTERNAL TABLE `default.test3`(
  `c1` string,
  `c2` string,
  `c3` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket_name/dev/sri/sri_test3/';

--INSERT:
insert into default.test1 values("a","b","c");
insert into default.test2 values("d","e","f");
insert overwrite table default.test3 select * from default.test1 union all select * from default.test2;

If I look at S3, I see that two subfolders were created under the table location because of the union all operation.

aws s3 ls s3://bucket_name/dev/sri/sri_test3/

PRE 1/
PRE 2/
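
To see what the union-all branches actually wrote, I can list every object under the table location. This is an untested sketch, assuming boto3 is installed and AWS credentials are configured; bucket_name is the same placeholder used above.

import boto3

# List all objects under the table location to inspect the
# Parquet files inside the 1/ and 2/ subfolders.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="bucket_name", Prefix="dev/sri/sri_test3/")
for obj in resp.get("Contents", []):
    print(obj["Key"])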

Now the problem: if I read the default.test3 table in PySpark into a DataFrame, it comes back empty.

df = spark.sql("select * from default.test3")                               
df.count()                                                                  
0  
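
My guess is that Spark's reader does not descend into the union-all subfolders by default. Below is an untested sketch of what I am considering; the property names are standard Hadoop/Hive configs, but whether they are the right fix here is exactly my question.

# Untested: force the Hive SerDe read path instead of Spark's native
# Parquet reader, then ask it to recurse into subdirectories.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
spark.sql("SET hive.mapred.supports.subdirectories=true")
df = spark.sql("select * from default.test3")
df.count()

# Untested alternative: bypass the metastore and glob the
# union-all subfolders (1/ and 2/) directly.
df2 = spark.read.parquet("s3://bucket_name/dev/sri/sri_test3/*")
df2.count()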

How can I fix this issue?

0 Answers:

No answers yet