Question

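I want to create a managed table with a location on AWS S3 through Spark SQL, but if I specify the location, Spark creates an EXTERNAL table even though I did not use that keyword.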

CREATE TABLE IF NOT EXISTS database.tableOnS3(name string)
LOCATION 's3://mybucket/';

Why is the EXTERNAL keyword implied here?

If I run this query in the Hive console, it creates a managed table. How do I do the same in Spark?

Answer 1

After creating the external table, change its tableType to MANAGED.

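import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// yourTableName and yourDatabaseName are placeholders for your own String values
val identifier = TableIdentifier(yourTableName, Some(yourDatabaseName))
// Copy the existing metadata with the table type switched to MANAGED
spark.sessionState.catalog.alterTable(
  spark.sessionState.catalog.getTableMetadata(identifier).copy(tableType = CatalogTableType.MANAGED))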

Answer 2

See the docs. Hive fundamentally knows two different types of tables:

Managed (Internal)
External

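Managed tables: A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /user/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.

Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.

External tables: An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.

Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.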

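Conclusion:

Because you are using an S3 location, which is external, the table shows up as external.

If you also want to understand how the code works, see CreateTableLikeCommand. The line val tblType = if (location.isEmpty) CatalogTableType.MANAGED else CatalogTableType.EXTERNAL is where the type is decided dynamically...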

/**
 * A command to create a table with the same definition of the given existing table.
 * In the target table definition, the table comment is always empty but the column comments
 * are identical to the ones defined in the source table.
 *
 * The CatalogTable attributes copied from the source table are storage(inputFormat, outputFormat,
 * serde, compressed, properties), schema, provider, partitionColumnNames, bucketSpec.
 *
 * The syntax of using this command in SQL is:
 * {{{
 *   CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
 *   LIKE [other_db_name.]existing_table_name [locationSpec]
 * }}}
 */
case class CreateTableLikeCommand(
    targetTable: TableIdentifier,
    sourceTable: TableIdentifier,
    location: Option[String],
    ifNotExists: Boolean) extends RunnableCommand {

  override def run(sparkSession: SparkSession): Seq[Row] = {
    val catalog = sparkSession.sessionState.catalog
    val sourceTableDesc = catalog.getTempViewOrPermanentTableMetadata(sourceTable)

    val newProvider = if (sourceTableDesc.tableType == CatalogTableType.VIEW) {
      Some(sparkSession.sessionState.conf.defaultDataSourceName)
    } else {
      sourceTableDesc.provider
    }

    // If the location is specified, we create an external table internally.
    // Otherwise create a managed table.
    val tblType = if (location.isEmpty) CatalogTableType.MANAGED else CatalogTableType.EXTERNAL

    val newTableDesc =
      CatalogTable(
        identifier = targetTable,
        tableType = tblType,
        storage = sourceTableDesc.storage.copy(
          locationUri = location.map(CatalogUtils.stringToURI(_))),
        schema = sourceTableDesc.schema,
        provider = newProvider,
        partitionColumnNames = sourceTableDesc.partitionColumnNames,
        bucketSpec = sourceTableDesc.bucketSpec)

    catalog.createTable(newTableDesc, ifNotExists)
    Seq.empty[Row]
  }
}
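
The rule on the tblType line can be isolated in a few lines of plain Scala. This is only a simplified sketch: TableTypeSketch, Managed, External and tableTypeFor are hypothetical names standing in for Spark's CatalogTableType values, and nothing here calls Spark itself.

```scala
object TableTypeSketch {
  // Hypothetical stand-ins for CatalogTableType.MANAGED / EXTERNAL; not Spark classes.
  sealed trait TableType
  case object Managed extends TableType
  case object External extends TableType

  // Mirrors the quoted rule: no location means MANAGED, any location means EXTERNAL.
  def tableTypeFor(location: Option[String]): TableType =
    if (location.isEmpty) Managed else External
}
```

Under this rule, tableTypeFor(None) gives Managed and tableTypeFor(Some("s3://mybucket/")) gives External, which matches the behaviour described in the question.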

Update: If I run this query in the Hive console, it creates a managed table. How do I do the same in Spark?

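I hope Hive and Spark are running side by side in the same local environment (not in different VPCs). If so, then set

spark.sql.warehouse.dir=hdfs:///... to the S3 location

using the Spark conf .... You may also need to set the access key and secret key credentials on the Spark config object used to create the Spark session.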
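
A configuration along these lines might look like the following sketch. It assumes the s3a connector is on the classpath; the bucket path, the application name, and the environment variable names are hypothetical placeholders, not values from the question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical values: replace the path and the credential sources with your own.
val warehouseDir = "s3a://mybucket/warehouse" // assumed warehouse location
val accessKey = sys.env.getOrElse("AWS_ACCESS_KEY_ID", "")
val secretKey = sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", "")

val spark = SparkSession.builder()
  .appName("managed-table-on-s3")
  // Default location for managed databases and tables.
  .config("spark.sql.warehouse.dir", warehouseDir)
  // The spark.hadoop. prefix forwards these keys to the Hadoop configuration.
  .config("spark.hadoop.fs.s3a.access.key", accessKey)
  .config("spark.hadoop.fs.s3a.secret.key", secretKey)
  .enableHiveSupport()
  .getOrCreate()
```

spark.sql.warehouse.dir has to be set before the first SparkSession is created. Tables created without a LOCATION clause are then placed under the warehouse directory and remain MANAGED. A database that already exists in the metastore keeps the location it was created with, so the setting may only take effect for newly created databases.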

Answer 3

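Please see the documentation in the Hive Confluence (emphasis my own).

This document lists some of the differences between the two, but the fundamental difference is that Hive assumes it owns the data for managed tables. That means the data, its properties, and the data layout will and can only be changed via Hive commands. The data still lives in a normal file system, and nothing stops you from changing it without telling Hive. If you do, it violates Hive's invariants and expectations, and you may see undefined behavior.

Essentially, EXTERNAL is assumed because you are setting the location, and therefore Hive does not own or control the data.

The way to do this (that is, to create a MANAGED table with a custom location) is to first create an EXTERNAL table with the location set, which cannot be avoided for the reasons above, and then change the table metadata to MANAGED. Note that, as the documentation states, this may lead to undefined behavior.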

-- Following your example, this statement creates an EXTERNAL table
CREATE TABLE IF NOT EXISTS database.tableOnS3(name string) LOCATION 's3://mybucket/';

-- Change the table type from within Hive, from EXTERNAL to MANAGED
ALTER TABLE database.tableOnS3 SET TBLPROPERTIES('EXTERNAL'='FALSE');

// Or from within spark
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// Get the session catalog (it exposes getTableMetadata and alterTable)
val catalog = spark.sessionState.catalog

// Identify the table in question
val identifier = TableIdentifier("tableOnS3", Some("database"))

// Get its current metadata
val tableMetadata = catalog.getTableMetadata(identifier)

// Clone the metadata while changing the tableType to MANAGED
val alteredMetadata = tableMetadata.copy(tableType = CatalogTableType.MANAGED)

// Alter the table using the new metadata
catalog.alterTable(alteredMetadata)

You now have a MANAGED table whose location was set manually.
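
To confirm the change, the table type can be read back through Spark's public Catalog API. This is a minimal sketch that assumes an active SparkSession named spark (as in spark-shell) and the database and table names used in the example above.

```scala
// Assumes an active SparkSession named `spark`, as in spark-shell.
// Catalog.getTable returns a Table whose tableType field is a String.
val tableType = spark.catalog.getTable("database", "tableOnS3").tableType
println(tableType)

// The same information, including the location, appears in the formatted description.
spark.sql("DESCRIBE FORMATTED database.tableOnS3").show(100, truncate = false)
```

If the alteration took effect, the printed type should read MANAGED rather than EXTERNAL. Keep in mind that dropping a managed table also deletes the data at its location.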

How do I create a managed Hive table at a specified location with Spark SQL?
