I am trying to rename the columns of a DataFrame based on another DataFrame. How can I do this in Scala?
Basically, my data looks like this:
DataFrame1
A B C D
1 2 3 4
I have another table, DataFrame2, that looks like this:
Col1 Col2
A E
B Q
C R
D Z
I want to rename the columns of the first DataFrame according to the mapping in the second one. The expected output is:
E Q R Z
1 2 3 4
I have tried this code in PySpark (copied from this answer by user8371915), and it works well:
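# Collect the two-column mapping into a dict: {old_name: new_name}
name_dict = dataframe2.rdd.collectAsMap()
# Alias each column to its mapped name; keep the original name if no mapping exists
dataframe1.select([dataframe1[c].alias(name_dict.get(c, c)) for c in dataframe1.columns]).show()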
Now, how can I achieve the same thing in Scala?
Answer 0 (score: 2)
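Written for Spark 1.6 as requested. Note that the code below uses SparkSession, which is only available from Spark 2.0 onward; on 1.6 the entry point would be SQLContext.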
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ColumnNameChange {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkSessionExample")
      .config("spark.master", "local")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, 2, 3, 4)).toDF("A", "B", "C", "D")
    val df2 = Seq(("A", "E"), ("B", "Q"), ("C", "R"), ("D", "Z")).toDF("Col1", "Col2")

    // Collect the old-name -> new-name pairs to the driver. Going through .rdd
    // keeps collectAsMap() available, since a Dataset has no such method.
    val nameDict: scala.collection.Map[String, String] = df2.rdd
      .map(row => row.getAs[String]("Col1") -> row.getAs[String]("Col2"))
      .collectAsMap()

    // Alias every column to its mapped name; keep the original name when unmapped.
    val df3 = df1.select(df1.columns.map(c => col(c).as(nameDict.getOrElse(c, c))): _*)
    df3.show()
  }
}
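As a variation on the select-and-alias approach above, the same mapping can be applied by folding over it with `withColumnRenamed`. This is only a minimal sketch: the object name `RenameWithFold` and the helper `renameColumns` are hypothetical, and it assumes the old and new names do not overlap (a chain such as A -> B, B -> C would be applied one rename after another and could rename a column twice).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RenameWithFold {
  // Hypothetical helper: renames every column of `df` that has an entry in `mapping`.
  // withColumnRenamed is a no-op when the source column does not exist.
  def renameColumns(df: DataFrame, mapping: Map[String, String]): DataFrame =
    mapping.foldLeft(df) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RenameWithFold")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, 2, 3, 4)).toDF("A", "B", "C", "D")
    val df2 = Seq(("A", "E"), ("B", "Q"), ("C", "R"), ("D", "Z")).toDF("Col1", "Col2")

    // Assumes the mapping table is small enough to collect to the driver.
    val mapping: Map[String, String] =
      df2.collect().map(row => row.getString(0) -> row.getString(1)).toMap

    renameColumns(df1, mapping).show()
    spark.stop()
  }
}
```

The select-based version renames all columns in a single projection, so it is the safer choice when new names may collide with existing ones.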
Answer 1 (score: 2)
You can also do it this way (df1 and df2 are the same as in @AnuragSharma's answer):
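val spark: SparkSession = ???  // placeholder: supply an existing SparkSession
import spark.implicits._

// Turn df1's column names into a one-column DataFrame ("value") and join it with the mapping.
// Caveat: a join does not guarantee that rows come back in df1's column order, and
// columns without a match in df2 are dropped, so check the result on real data.
val to = df1.columns.toSeq.toDF
  .join(df2, $"value" === df2("Col1"))
  .select("Col2")
  .collect.map(row => row.getString(0)).toList
val newDF = df1.toDF(to: _*)
newDF.show()
// +---+---+---+---+
// |  E|  Q|  R|  Z|
// +---+---+---+---+
// |  1|  2|  3|  4|
// +---+---+---+---+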
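Because `toDF(names: _*)` assigns names by position, the new name list has to follow df1's own column order. The sketch below shows one way to build that list with a plain map lookup instead of relying on the row order of a join. It is pure Scala so it runs without a cluster; the object `OrderedRename` and the helper `newNames` are hypothetical, and the hard-coded values stand in for `df1.columns` and the collected rows of df2.

```scala
object OrderedRename {
  // Hypothetical helper: maps each existing column name to its replacement,
  // preserving the input order and leaving unmapped names unchanged.
  def newNames(columns: Seq[String], mapping: Map[String, String]): Seq[String] =
    columns.map(c => mapping.getOrElse(c, c))

  def main(args: Array[String]): Unit = {
    // Stands in for df1.columns.
    val columns = Seq("A", "B", "C", "D")
    // Stands in for the (Col1, Col2) pairs collected from df2.
    val mapping = Map("A" -> "E", "B" -> "Q", "C" -> "R", "D" -> "Z")

    val renamed = newNames(columns, mapping)
    println(renamed.mkString(" ")) // prints "E Q R Z"

    // With Spark, the list would then be applied as: df1.toDF(renamed: _*)
  }
}
```

Since every input column yields exactly one output name, the list always has the same length as `df1.columns`, which `toDF` requires.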