I am trying to import an RDBMS (Greenplum) table into Hive. I read the table into a DataFrame as follows:
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", "(select * from schema.table where source_system_name='DB2' and period_year='2017') as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",15)
.load()
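One thing to note about the read above: in the Spark JDBC source, `numPartitions` only parallelizes the read when `partitionColumn`, `lowerBound`, and `upperBound` are also supplied; without them Spark issues a single JDBC query. A hedged sketch of the fully partitioned form (the bound values and the choice of `forecast_id` as the split column are assumptions to adjust for your data):

```scala
// Sketch: same read as above, but with the options required for a
// parallel JDBC read. connectionUrl, devUserName, devPassword as before.
val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "(select * from schema.table where source_system_name='DB2' and period_year='2017') as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "forecast_id") // assumption: a roughly uniform numeric column
  .option("lowerBound", "1")                // assumption: adjust to the column's real range
  .option("upperBound", "1000000")
  .option("numPartitions", 15)
  .load()
```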
The schema of the above DataFrame is:
forecast_id:bigint
period_year:numeric(15,0)
period_num:numeric(15,0)
period_name:character varying(15)
source_system_name:character varying(30)
source_record_type:character varying(30)
ptd_balance:numeric
xx_data_hash_id:bigint
xx_pk_id:bigint
To load the above DataFrame into Hive, I put the schema into a list and converted all the Greenplum data types into Hive-compatible ones. I have a map, dataMapper, that says which Greenplum data type should become which Hive data type:
class ChangeDataTypes(val gpColumnDetails: List[String], val dataMapper: Map[String, String]) {
  val dataMap: Map[String, String] = dataMapper

  // Build the Hive DDL column list: "name type,name type,..."
  def gpDetails(): String = {
    gpColumnDetails.map(_.split(":\\s*"))
      .map(s => s(0) + " " + dMap(s(1)))
      .mkString(",")
  }

  // Find the first regex pattern in dataMap whose full match equals the
  // Greenplum type, and return the mapped Hive type ("n/a" if none matches).
  def dMap(gpColType: String): String = {
    val patterns = dataMap.keySet
    val mkey = patterns.dropWhile { p =>
      gpColType != p.r.findFirstIn(gpColType).getOrElse("")
    }.headOption match {
      case Some(p) => p
      case None    => ""
    }
    dataMap.getOrElse(mkey, "n/a")
  }
}
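As a side note, the `dropWhile` + `headOption` lookup can be stated more directly with `find`. Here is a self-contained sketch of the same mapping logic; the `dataMapper` contents are an assumption (the question never shows the real map), chosen to reproduce the conversions listed below:

```scala
// Hypothetical mapper (assumption: your real dataMapper may differ).
val dataMapper: Map[String, String] = Map(
  "bigint" -> "bigint",
  "numeric\\(\\d+,\\d+\\)" -> "bigint",
  "character varying\\(\\d+\\)" -> "string",
  "numeric" -> "double"
)

// Same lookup as dMap, but with `find`: take the first pattern whose
// full regex match equals the whole Greenplum type string.
def dMap(gpColType: String): String =
  dataMapper.keys
    .find(p => p.r.findFirstIn(gpColType).contains(gpColType))
    .flatMap(dataMapper.get)
    .getOrElse("n/a")

val gpColumnDetails = List(
  "forecast_id:bigint",
  "period_year:numeric(15,0)",
  "period_name:character varying(15)",
  "ptd_balance:numeric"
)

// Build the "name type, name type, ..." DDL fragment, as in gpDetails().
val hiveDdl = gpColumnDetails
  .map(_.split(":\\s*"))
  .map(s => s(0) + " " + dMap(s(1)))
  .mkString(", ")
println(hiveDdl)
```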
These are the data types after the above code runs:
forecast_id:bigint
period_year:bigint
period_num:bigint
period_name:String
source_system_name:String
source_record_type:String
ptd_balance:double
xx_data_hash_id:bigint
xx_pk_id:bigint
Since my Hive table is dynamically partitioned on source_system_name and period_year, I need to move those two columns to the end of the DataFrame: when inserting with dynamic partitioning, the partition columns must come last.
Can anyone tell me how to move the columns source_system_name and period_year from their current positions to the end of the DataFrame (essentially reordering the columns)?
Answer 0 (score: 1)
Pull those columns out of the full column list, append them at the end, and run a select on the DataFrame:
val lastCols = Seq("col1","col2")
val allColOrdered = df.columns.diff(lastCols) ++ lastCols
val allCols = allColOrdered.map(cn => org.apache.spark.sql.functions.col(cn))
val result = df.select(allCols: _*)
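The reordering itself is plain collection logic on the column-name array, so it can be checked without a SparkSession. A minimal sketch using the question's actual column names (with `lastCols` set to the two partition columns):

```scala
// df.columns would return this array for the schema shown in the question.
val columns = Array(
  "forecast_id", "period_year", "period_num", "period_name",
  "source_system_name", "source_record_type", "ptd_balance",
  "xx_data_hash_id", "xx_pk_id"
)

// The Hive partition columns must end up last, in partition order.
val lastCols = Seq("source_system_name", "period_year")

// diff keeps the relative order of the remaining columns; ++ appends
// the partition columns at the end.
val reordered = columns.diff(lastCols) ++ lastCols
println(reordered.mkString(", "))
```

Passing `reordered` through `col(...)` and `df.select(allCols: _*)` as in the answer above then produces a DataFrame whose last two columns line up with the Hive partition spec.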