I have a DataFrame named DF1, shown below.
DF1:
+----------+----------+----------+
|srcColumnZ|srcColumnY|srcColumnR|
+----------+----------+----------+
|John      |Non Hf    |New york  |
|Steav     |Non Hf    |Mumbai    |
|Ram       |HF        |Boston    |
+----------+----------+----------+
I also have a list of maps that describes the source-to-target column mapping, shown below.
List(Map(targetColumn -> columnNameX, sourceColumn -> List(srcColumnX, srcColumnY, srcColumnZ, srcColumnP, srcColumnQ, srcColumnR)), Map(targetColumn -> columnNameY, sourceColumn -> List(srcColumnY)), Map(targetColumn -> columnNameZ, selectvalue -> 5))
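I want to create a DataFrame from this list of maps, with columnNameX, columnNameY, and columnNameZ as its columns. The values of each target column come from its sourceColumn entry: given a list such as List(srcColumnX, srcColumnY, srcColumnZ, srcColumnP, srcColumnQ, srcColumnR), the columns of DF1 are checked one by one, and as soon as the first one matches, all values of that column are copied into the target column. The same applies to the next target column. If a map has a selectvalue instead of a source column, that value is hard-coded into the entire column; in the list above, the target column columnNameZ has a selectvalue of 5.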
The expected output is below.
+-----------+-----------+-----------+
|columnNameX|columnNameY|columnNameZ|
+-----------+-----------+-----------+
|John       |Non Hf     |5          |
|Steav      |Non Hf     |5          |
|Ram        |HF         |5          |
+-----------+-----------+-----------+
Answer 0 (score: 1)
The main step here is to generate the query from the given list of maps, which you can do as shown below.
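//Imports (assumes an active SparkSession named spark, as in spark-shell)
import org.apache.spark.sql.functions.lit
import spark.implicits._
//Input DF
val df=Seq(("John","Non Hf","New york"),("Steav","Non Hf","Mumbai"),("Ram","HF","Boston")).toDF("srcColumnZ", "srcColumnY", "srcColumnR")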
//Input List
val mapList=List(Map("targetColumn" -> "columnNameX", "sourceColumn" -> List("srcColumnX", "srcColumnY", "srcColumnZ", "srcColumnP", "srcColumnQ", "srcColumnR")), Map("targetColumn" -> "columnNameY", "sourceColumn" -> List("srcColumnY")), Map("targetColumn" -> "columnNameZ", "selectvalue" -> 5))
//Get all the columns of df as list
val dfCols=df.columns.toList
//Then generate query list like below
val query = mapList.map { mp =>
  //Name of the output column; fail fast if the mapping has no targetColumn
  val target = mp.getOrElse("targetColumn", sys.error(s"No targetColumn in mapping: $mp")).toString
  mp.get("sourceColumn") match {
    case Some(candidates: List[_]) =>
      //Take the first candidate, in mapping-list order, that exists in df
      val srcCol = candidates.map(_.toString).find(c => dfCols.contains(c)).getOrElse(sys.error(s"None of $candidates exists in df"))
      df.col(srcCol).alias(target)
    case _ =>
      //No sourceColumn: hard-code selectvalue (as a string) into the whole column
      lit(mp.getOrElse("selectvalue", sys.error(s"No selectvalue in mapping: $mp")).toString).alias(target)
  }
}
//Finally, run the query
df.select(query:_*).show
//Sample output:
+-----------+-----------+-----------+
|columnNameX|columnNameY|columnNameZ|
+-----------+-----------+-----------+
| Non Hf| Non Hf| 5|
| Non Hf| Non Hf| 5|
| HF| HF| 5|
+-----------+-----------+-----------+
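Note that columnNameX in this sample output holds the srcColumnY values, while the expected output in the question shows the srcColumnZ values (John, Steav, Ram). The difference comes from the search order: the code above walks the sourceColumn list and takes its first entry that exists in df, whereas the question can be read as walking the columns of DF1 and taking the first one that appears in the list. Below is a minimal sketch of both lookups in plain Scala, assuming that second reading is what the asker wants; firstMatchInDfOrder is a hypothetical helper name, not part of Spark.

```scala
// Hypothetical helper: returns the first DataFrame column (in DataFrame order)
// that appears among the candidate source columns, or None if nothing matches.
def firstMatchInDfOrder(dfCols: Seq[String], candidates: Seq[String]): Option[String] =
  dfCols.find(c => candidates.contains(c))

// Column names taken from the question
val dfCols = List("srcColumnZ", "srcColumnY", "srcColumnR")
val candidates = List("srcColumnX", "srcColumnY", "srcColumnZ", "srcColumnP", "srcColumnQ", "srcColumnR")

// DataFrame order: srcColumnZ is the first df column and is a candidate
val byDfOrder = firstMatchInDfOrder(dfCols, candidates)

// Mapping-list order (what the answer's query does): srcColumnY comes first
val byListOrder = candidates.find(c => dfCols.contains(c))

println(s"DataFrame order: $byDfOrder, mapping-list order: $byListOrder")
```

If the DataFrame-order reading is the intended one, swapping the lookup in the query for dfCols.find(c => candidates.contains(c)) should make columnNameX come from srcColumnZ, which matches the expected output in the question.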