I am trying to import an RDBMS (Greenplum) table into Hive. I read the table into a DataFrame as follows:
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", "(select * from schema.table where source_system_name='DB2' and period_year='2017') as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",15)
.load()
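One thing to note about the read above: in the Spark JDBC source, `numPartitions` only parallelizes the read when `partitionColumn`, `lowerBound`, and `upperBound` are also supplied; without them Spark issues a single JDBC query. A hedged sketch of the fully partitioned form (the bound values and the choice of `forecast_id` as the split column are assumptions to adjust for your data):

```scala
// Sketch: same read as above, but with the options required for a
// parallel JDBC read. connectionUrl, devUserName, devPassword as before.
val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "(select * from schema.table where source_system_name='DB2' and period_year='2017') as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "forecast_id") // assumption: a roughly uniform numeric column
  .option("lowerBound", "1")                // assumption: adjust to the column's real range
  .option("upperBound", "1000000")
  .option("numPartitions", 15)
  .load()
```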
The schema of the above DataFrame is:
forecast_id:bigint
period_year:numeric(15,0)
period_num:numeric(15,0)
period_name:character varying(15)
source_system_name:character varying(30)
source_record_type:character varying(30)
ptd_balance:numeric
xx_data_hash_id:bigint
xx_pk_id:bigint
To load the above DataFrame into Hive, I put the schema into a list and converted all the Greenplum data types into Hive-compatible ones. I have a map, dataMapper, that says which Greenplum data type should become which Hive data type:
class ChangeDataTypes(val gpColumnDetails: List[String], val dataMapper: Map[String, String]) {
  val dataMap: Map[String, String] = dataMapper

  // Build the Hive DDL column list: "name type,name type,..."
  def gpDetails(): String = {
    gpColumnDetails.map(_.split(":\\s*"))
      .map(s => s(0) + " " + dMap(s(1)))
      .mkString(",")
  }

  // Find the first regex pattern in dataMap whose full match equals the
  // Greenplum type, and return the mapped Hive type ("n/a" if none matches).
  def dMap(gpColType: String): String = {
    val patterns = dataMap.keySet
    val mkey = patterns.dropWhile { p =>
      gpColType != p.r.findFirstIn(gpColType).getOrElse("")
    }.headOption match {
      case Some(p) => p
      case None    => ""
    }
    dataMap.getOrElse(mkey, "n/a")
  }
}
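As a side note, the `dropWhile` + `headOption` lookup can be stated more directly with `find`. Here is a self-contained sketch of the same mapping logic; the `dataMapper` contents are an assumption (the question never shows the real map), chosen to reproduce the conversions listed below:

```scala
// Hypothetical mapper (assumption: your real dataMapper may differ).
val dataMapper: Map[String, String] = Map(
  "bigint" -> "bigint",
  "numeric\\(\\d+,\\d+\\)" -> "bigint",
  "character varying\\(\\d+\\)" -> "string",
  "numeric" -> "double"
)

// Same lookup as dMap, but with `find`: take the first pattern whose
// full regex match equals the whole Greenplum type string.
def dMap(gpColType: String): String =
  dataMapper.keys
    .find(p => p.r.findFirstIn(gpColType).contains(gpColType))
    .flatMap(dataMapper.get)
    .getOrElse("n/a")

val gpColumnDetails = List(
  "forecast_id:bigint",
  "period_year:numeric(15,0)",
  "period_name:character varying(15)",
  "ptd_balance:numeric"
)

// Build the "name type, name type, ..." DDL fragment, as in gpDetails().
val hiveDdl = gpColumnDetails
  .map(_.split(":\\s*"))
  .map(s => s(0) + " " + dMap(s(1)))
  .mkString(", ")
println(hiveDdl)
```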
These are the data types after the above code runs:
forecast_id:bigint
period_year:bigint
period_num:bigint
period_name:String
source_system_name:String
source_record_type:String
ptd_balance:double
xx_data_hash_id:bigint
xx_pk_id:bigint
Since my Hive table is dynamically partitioned on source_system_name and period_year, I need to move those two columns to the end of the DataFrame: when inserting with dynamic partitioning, the partition columns must come last.
Can anyone tell me how to move the columns source_system_name and period_year from their current positions to the end of the DataFrame (essentially reordering the columns)?
Answer 0 (score: 1)
Pull those columns out of the full column list, append them at the end, and run a select on the DataFrame:
val lastCols = Seq("col1","col2")
val allColOrdered = df.columns.diff(lastCols) ++ lastCols
val allCols = allColOrdered.map(cn => org.apache.spark.sql.functions.col(cn))
val result = df.select(allCols: _*)
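The reordering itself is plain collection logic on the column-name array, so it can be checked without a SparkSession. A minimal sketch using the question's actual column names (with `lastCols` set to the two partition columns):

```scala
// df.columns would return this array for the schema shown in the question.
val columns = Array(
  "forecast_id", "period_year", "period_num", "period_name",
  "source_system_name", "source_record_type", "ptd_balance",
  "xx_data_hash_id", "xx_pk_id"
)

// The Hive partition columns must end up last, in partition order.
val lastCols = Seq("source_system_name", "period_year")

// diff keeps the relative order of the remaining columns; ++ appends
// the partition columns at the end.
val reordered = columns.diff(lastCols) ++ lastCols
println(reordered.mkString(", "))
```

Passing `reordered` through `col(...)` and `df.select(allCols: _*)` as in the answer above then produces a DataFrame whose last two columns line up with the Hive partition spec.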