Question

I have a String containing Hive column names and their data types:

val hiveColums = "bill_id:bigint|item_id:bigint|update_date:timestamp|updated_by:bigint|creation_date:timestamp|created_by:bigint|common_item_id:bigint|comment:string|pending_inv:string|category:string"

I have a DataFrame, dataDF, which contains the same columns in the same order, obtained by reading a table with the following statements:

val readQuery = "select bill_id, item_id, update_date, updated_by, creation_date, created_by, common_item_id, comment, pending_inv, category from schema.rdbmstable where product_type='MACHINERY'"
val dataDF    = spark.read.format("jdbc").option("url",url).option("dbtable", s"(${readQuery}) as machineData").option("user",username).option("password",pwd).load()

I created a Hive table from hiveColums as shown below.

val cols = hiveColums.replace("|", ",").replace(":"," ")
def CREATE_TABLE(spark: SparkSession, location: String, cols: String, hiveTable: String): Unit = {
    val serde    = "org.apache.hadoop.hive.serde2.OpenCSVSerde"
    val create   = s"CREATE TABLE IF NOT EXISTS hiveSchema.productTestTable (${cols}) ROW FORMAT SERDE '${serde}' location '${location}'"
    spark.sql(create)
}

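Before inserting data into the newly created Hive table, I build a Seq of all the column names and select the columns in that order into the final DataFrame.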

val columnSeq   = hiveColums.split("\\|").map(x => x.split(":")).map(x => x(0)).toSeq
val hiveColumns = columnSeq.map(colname => org.apache.spark.sql.functions.col(colname))
val resultDF    = dataDF.select(hiveColumns:_*)
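As a minimal sketch (plain Scala, no Spark needed), the check below parses the column names back out of the DDL fragment and compares them with columnSeq. It uses a shortened, three-column version of hiveColums for brevity; the object name ColumnOrderCheck and the helper ddlNames are hypothetical names introduced only for this illustration.

```scala
object ColumnOrderCheck {
  // Shortened version of the column string from the question.
  val hiveColums = "bill_id:bigint|item_id:bigint|update_date:timestamp"

  // DDL fragment, built the same way as `cols` in the question.
  val cols: String = hiveColums.replace("|", ",").replace(":", " ")

  // Column names in the order they appear in the DDL fragment.
  val ddlNames: Seq[String] = cols.split(",").map(_.trim.split(" ")(0)).toSeq

  // Column names in the order used for the select, as in the question.
  val columnSeq: Seq[String] =
    hiveColums.split("\\|").map(x => x.split(":")).map(x => x(0)).toSeq

  def main(args: Array[String]): Unit = {
    // Both sequences come from the same string, so they should be equal.
    println(s"DDL order matches select order: ${ddlNames == columnSeq}")
  }
}
```

This only shows that the two derivations agree with each other. It says nothing about a table that already existed before CREATE TABLE IF NOT EXISTS ran, since in that case the statement leaves the existing definition untouched.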

I then insert the DataFrame resultDF into the Hive table created above:

resultDF.write.insertInto("hiveschema.productTestTable")

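Is the column order in columnSeq guaranteed to be the same as the column order in the Hive table productTestTable? If not, how can I make sure that the order in columnSeq matches the table's columns, so that I insert the DataFrame only when it does and do not run into an exception while loading the data? How can I be certain that the column order is preserved?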
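One possible guard, sketched under assumptions: DataFrameWriter.insertInto resolves columns by position rather than by name, so a mismatch in order would not necessarily raise an exception, and comparing the two column lists before writing may be safer. The helper firstMismatch and the object InsertGuard are hypothetical names, and the comparison is case-insensitive on the assumption that Hive stores column names in lower case.

```scala
object InsertGuard {
  // Returns the first position where the two column lists disagree, if any.
  // Names are lower-cased before comparing (assumption: Hive lower-cases them).
  // A list that is shorter than the other is padded with "<missing>".
  def firstMismatch(dfCols: Seq[String],
                    tableCols: Seq[String]): Option[(Int, String, String)] =
    dfCols.map(_.toLowerCase)
      .zipAll(tableCols.map(_.toLowerCase), "<missing>", "<missing>")
      .zipWithIndex
      .collectFirst { case ((d, t), i) if d != t => (i, d, t) }

  def main(args: Array[String]): Unit = {
    val dfCols    = Seq("bill_id", "item_id", "update_date")
    val tableCols = Seq("bill_id", "update_date", "item_id")

    println(firstMismatch(dfCols, dfCols))    // identical order
    println(firstMismatch(dfCols, tableCols)) // order differs from index 1
  }
}
```

In the Spark job, the two inputs could be taken from resultDF.columns.toSeq and spark.table("hiveschema.productTestTable").columns.toSeq, with insertInto called only when firstMismatch returns None. The check covers names and order only, not data types.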

How do I preserve column order when creating a Hive table and inserting a DataFrame into it?

0 answers: