How do I saveAsTable to Hive so that the Hive table is a MANAGED_TABLE?

Time: 2017-03-10 19:13:45

Tags: apache-spark hive pyspark apache-spark-sql

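When I save a table without an explicit path, the Hive metastore ends up with a bogus "path" property that points to "/user/hive/warehouse" instead of "/hive/warehouse". If I set the path explicitly with .option("path", "/hive/warehouse"), everything works, but Hive creates an external table. Is there a way to save a managed table to the Hive metastore without that bogus path property, which does not match the file location recorded in Hive?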

from pyspark.sql import SparkSession

spark = SparkSession.builder.master(master_url).enableHiveSupport().getOrCreate()

df = spark.range(100)

df.write.saveAsTable("test1")
df.write.option("path", "/hive/warehouse").saveAsTable("test2")

hive> describe formatted test1;
OK
# col_name              data_type               comment             

id                      bigint                                      

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Fri Mar 10 18:53:07 UTC 2017     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               file:/hive/warehouse/test1 
Table Type:             MANAGED_TABLE            
Table Parameters:        
    spark.sql.sources.provider  parquet             
    spark.sql.sources.schema.numParts   1                   
    spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}
    transient_lastDdlTime   1489171987          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    path                    file:/user/hive/warehouse/test1
    serialization.format    1                   
Time taken: 0.423 seconds, Fetched: 30 row(s)


hive> describe formatted test2;
OK
# col_name              data_type               comment             

id                      bigint                                      

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Fri Mar 10 16:02:07 UTC 2017     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               file:/hive/warehouse/test2   
Table Type:             EXTERNAL_TABLE           
Table Parameters:        
    COLUMN_STATS_ACCURATE   false               
    EXTERNAL                TRUE                
    numFiles                2                   
    numRows                 -1                  
    rawDataSize             -1                  
    spark.sql.sources.provider  parquet             
    spark.sql.sources.schema.numParts   1                   
    spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}
    totalSize               4755                
    transient_lastDdlTime   1489161727          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    path                    file:/hive/warehouse/test2
    serialization.format    1                   
Time taken: 0.402 seconds, Fetched: 36 row(s)
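
The mismatch in the test1 output above (Location says /hive/warehouse, the "path" storage parameter says /user/hive/warehouse) can also be checked mechanically. The following is a minimal sketch, assuming the `describe formatted` output has been captured as plain text. The helper names `find_value` and `location_matches_path` are hypothetical and are not part of Hive or Spark.

```python
def find_value(text, key):
    """Return the value of the first 'key value' line whose first token is `key`."""
    for line in text.splitlines():
        parts = line.split()
        # Only two-token lines are considered, e.g. "Location: file:/hive/..."
        if len(parts) == 2 and parts[0] == key:
            return parts[1]
    return None


def location_matches_path(text):
    """True if the table Location equals the 'path' storage parameter."""
    location = find_value(text, "Location:")
    path = find_value(text, "path")
    return location is not None and location == path
```

Note that a table with a column literally named `path` would confuse this simple matcher, so it is only suitable for a quick check.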

2 Answers:

Answer 0: (score: 1)

I fixed the problem. For anyone with a similar issue, here is my fix.

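The incorrect "path" parameter only shows up when a table is saved to the default Hive database; a table saved to a newly created database gets the correct path (as shown below). This made me think that the "old" database might still be using the old configuration value (hive.metastore.warehouse.dir), while new databases pick up the new value.

So the fix was to drop the default database and recreate it. Every database now created in the Hive metastore uses the correct hive.metastore.warehouse.dir value.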

spark.sql("create database testdb")
spark.sql("use testdb")
df.write.saveAsTable("test3")

hive> describe formatted test.test3;
OK
# col_name              data_type               comment             

id                      bigint                                      

# Detailed Table Information         
Database:               testdb                   
Owner:                  root                     
CreateTime:             Fri Mar 10 22:10:10 UTC 2017     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               file:/hive/warehouse/test.db/test3   
Table Type:             MANAGED_TABLE            
Table Parameters:        
    COLUMN_STATS_ACCURATE   false               
    numFiles                1                   
    numRows                 -1                  
    rawDataSize             -1                  
    spark.sql.sources.provider  parquet             
    spark.sql.sources.schema.numParts   1                   
    spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}
    totalSize               409                 
    transient_lastDdlTime   1489183810          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    path                    file:/hive/warehouse/test.db/test3
    serialization.format    1                   
Time taken: 0.243 seconds, Fetched: 35 row(s)
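
One way to test the "old database, old value" hypothesis is to look at the location the metastore has recorded for each database. The sketch below assumes the rows come from `spark.sql("DESCRIBE DATABASE <name>").collect()` and that one of them carries the item name "Location"; that item name is an assumption about the output layout and may differ between Spark versions. The helper `database_location` is hypothetical.

```python
def database_location(rows):
    """Return the value of the 'Location' item from DESCRIBE DATABASE rows.

    `rows` is any sequence of (item, value) pairs; pyspark Row objects
    support the same positional indexing.
    """
    for row in rows:
        if row[0] == "Location":
            return row[1]
    return None


# Usage with a live session (assumes `spark` was built with enableHiveSupport()):
# rows = spark.sql("DESCRIBE DATABASE default").collect()
# print(database_location(rows))
```

If the default database reports a different location than a freshly created one, that would be consistent with the explanation in this answer.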

Answer 1: (score: 0)

  

hive.metastore.warehouse.dir

  • Default value: /user/hive/warehouse
  • Added in: Hive 0.2.0

  Location of default database for the warehouse.


https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
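
For Spark 2.0 and later, the Spark documentation describes `spark.sql.warehouse.dir` as the replacement for setting `hive.metastore.warehouse.dir` in hive-site.xml. The sketch below shows how that option could be passed when building the session. The helpers `warehouse_options` and `build_session` are hypothetical, and the `/hive/warehouse` value is taken from the question. Going by the first answer, this setting would not by itself change the location already recorded for an existing default database.

```python
def warehouse_options(warehouse_dir):
    """Spark config entries that point the SQL warehouse at `warehouse_dir`."""
    return {"spark.sql.warehouse.dir": warehouse_dir}


def build_session(master_url, warehouse_dir):
    """Build a Hive-enabled SparkSession with an explicit warehouse directory."""
    # Imported lazily so warehouse_options() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master_url).enableHiveSupport()
    for key, value in warehouse_options(warehouse_dir).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (master_url is a placeholder, as in the question):
# spark = build_session(master_url, "/hive/warehouse")
```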