How to truncate data and drop all partitions from a Hive table using Spark

Date: 2019-05-20 11:07:03

Tags: apache-spark hive apache-spark-sql hiveql

How do I delete all the data and drop all partitions from a Hive table, using Spark 2.3.0?

truncate table my_table; // Deletes all data, but keeps partitions in metastore

alter table my_table drop partition(p_col > 0) // does not work from spark

The only thing that worked for me was to iterate over the output of show partitions my_table, replace every / with a comma, and drop each partition one by one. But there has to be a cleaner way. And it doesn't even work if the partition columns are of type string. Any suggestions?
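
For reference, a minimal sketch of that loop-based workaround (the value quoting below is an assumption, and it is exactly the part the question reports breaking for string partition columns):

// Sketch: list the partitions, rewrite each spec, and drop them one by one.
spark.sql("SHOW PARTITIONS my_table").collect().foreach { row =>
  // "p_col=1/other_col=a" -> "p_col='1', other_col='a'" (quoting every value is an assumption)
  val spec = row.getString(0).split("/").map { kv =>
    val Array(k, v) = kv.split("=", 2)
    s"$k='$v'"
  }.mkString(", ")
  spark.sql(s"ALTER TABLE my_table DROP IF EXISTS PARTITION ($spec)")
}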

2 answers:

Answer 0 (score: 1)

Let's set the problem up using Spark 2.4.3:

// We create the table
spark.sql("CREATE TABLE IF NOT EXISTS potato (size INT) PARTITIONED BY (hour STRING)")

// Enable dynamic partitioning 
spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")

// Insert some dummy records
(1 to 9).map(i => spark.sql(s"INSERT INTO potato VALUES ($i, '2020-06-07T0$i')"))

// Verify inserts
spark.table("potato").count // 9 records

We use the external catalog's listPartitions and dropPartitions functions.

// Get External Catalog
val catalog = spark.sharedState.externalCatalog

// Get the spec from the list of all partitions 
val partitions = catalog.listPartitions("default", "potato").map(_.spec)

// We pass them to the Catalog's dropPartitions function.
// If you purge data, it gets deleted immediately and isn't moved to trash.
// This takes precedence over retainData, so even if you retainData but purge,
// your data is gone.
catalog.dropPartitions("default", "potato", partitions,
                   ignoreIfNotExists=true, purge=true, retainData=false)
spark.table("potato").count // 0 records
catalog.listPartitions("default", "potato").length // 0 partitions

This works fine for MANAGED tables, but what about EXTERNAL tables?

// We repeat the setup above but after creating an EXTERNAL table
// After dropping we see that the partitions appear to be gone (or are they?).
catalog.listPartitions("default", "potato").length // 0 partitions

// BUT repairing the table simply adds them again, the partitions/data 
// were NOT deleted from the underlying filesystem. This is not what we wanted!
spark.sql("MSCK REPAIR TABLE potato")
catalog.listPartitions("default", "potato").length // 9 partitions again!   

To handle this, we change the table from EXTERNAL to MANAGED before dropping the partitions.

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// Identify the table in question
val identifier = TableIdentifier("potato", Some("default"))

// Get its current metadata
val tableMetadata = catalog.getTableMetadata(identifier)

// Clone the metadata while changing the tableType to MANAGED
val alteredMetadata = tableMetadata.copy(tableType = CatalogTableType.MANAGED)

// Alter the table using the new metadata
catalog.alterTable(alteredMetadata)

// Now drop!
catalog.dropPartitions("default", "potato", partitions,
                   ignoreIfNotExists=true, purge=true, retainData=false)
spark.table("potato").count // 0 records
catalog.listPartitions("default", "potato").length // 0 partitions
spark.sql("MSCK REPAIR TABLE potato") // Won't add anything
catalog.listPartitions("default", "potato").length // Still 0 partitions!

Don't forget to change the table back to EXTERNAL using CatalogTableType.EXTERNAL.
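
A minimal sketch of that restore step, reusing the catalog and identifier values from above (this part is not in the original code):

// Clone the current metadata again, this time switching back to EXTERNAL.
val restoredMetadata = catalog.getTableMetadata(identifier)
  .copy(tableType = CatalogTableType.EXTERNAL)

// Alter the table so it is external again.
catalog.alterTable(restoredMetadata)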

Answer 1 (score: 0)

Hive has two types of tables (managed tables and external tables). Managed tables are created so that Hive manages the entire schema as well as the data; dropping a Hive managed table therefore deletes the schema, the metadata, and the data. The data of an external table, however, lives somewhere else (say an external source such as S3), so dropping the table only removes the metadata and the table itself, while the data in the source stays intact.
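
If you need to check which kind of table you are dealing with, here is a minimal sketch (assuming the my_table name from the question); the Type row of DESCRIBE FORMATTED reports MANAGED or EXTERNAL:

// Show only the "Type" row of the formatted description.
spark.sql("DESCRIBE FORMATTED my_table")
  .filter("col_name = 'Type'")
  .show(false)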

In your case, when you truncate the table, Hive is expected to keep the metastore entry, because the table still exists in Hive and only its data has been deleted. Also, the metastore does not hold the data itself; it only contains information about the schema and other related table details.
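
A quick way to see that behaviour, assuming the my_table name from the question:

spark.sql("TRUNCATE TABLE my_table")         // data files are removed
spark.table("my_table").count                // 0 records
spark.sql("SHOW PARTITIONS my_table").count  // partition entries still listed in the metastore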

I hope this answers it to some extent.

EDIT1:

Similar Post