How to create a DataFrame from a List[Map] based on conditions

Time: 2018-08-20 09:49:59

Tags: scala apache-spark apache-spark-sql scala-collections

I have a DataFrame named DF1, shown below.

DF1:

+----------+----------+----------+
|srcColumnZ|srcColumnY|srcColumnR|
+----------+----------+----------+
|John      |Non Hf    |New york  |
|Steav     |Non Hf    |Mumbai    |
|Ram       |HF        |Boston    |
+----------+----------+----------+

I also have a list of maps that holds the source-to-target column mapping, shown below.

List(Map(targetColumn -> columnNameX, sourceColumn -> List(srcColumnX, srcColumnY, srcColumnZ, srcColumnP, srcColumnQ, srcColumnR)), Map(targetColumn -> columnNameY, sourceColumn -> List(srcColumnY)), Map(targetColumn -> columnNameZ, selectvalue -> 5))

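I want to create a DataFrame from the list of maps above, with columnNameX, columnNameY, and columnNameZ as its columns. The values of each target column come from its sourceColumn entry: when sourceColumn is a list such as List(srcColumnX, srcColumnY, srcColumnZ, srcColumnP, srcColumnQ, srcColumnR), the columns of DF1 are checked one by one, and as soon as the first one matches, all values of that column are moved to the target column. The same applies to the next target column. If a mapping has selectvalue instead of a source column, that value is hardcoded for the entire column. For example, the target column columnNameZ in the list above has a selectvalue of 5.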

The expected output is shown below.

+-----------+-----------+-----------+
|columnNameX|columnNameY|columnNameZ|
+-----------+-----------+-----------+
|John       |Non Hf     |5          |
|Steav      |Non Hf     |5          |
|Ram        |HF         |5          |
+-----------+-----------+-----------+

1 Answer:

Answer 0 (score: 1)

The main task here is to generate the query (a list of column expressions) from the given list of maps. You can do it as shown below.

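// Input DF (assumes a SparkSession named spark, as in spark-shell)
import spark.implicits._
import org.apache.spark.sql.functions.lit

val df = Seq(("John", "Non Hf", "New york"), ("Steav", "Non Hf", "Mumbai"), ("Ram", "HF", "Boston")).toDF("srcColumnZ", "srcColumnY", "srcColumnR")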

//Input List

val mapList: List[Map[String, Any]] = List(
  Map("targetColumn" -> "columnNameX", "sourceColumn" -> List("srcColumnX", "srcColumnY", "srcColumnZ", "srcColumnP", "srcColumnQ", "srcColumnR")),
  Map("targetColumn" -> "columnNameY", "sourceColumn" -> List("srcColumnY")),
  Map("targetColumn" -> "columnNameZ", "selectvalue" -> 5)
)

//Get all the columns of df as list

val dfCols=df.columns.toList

//Then generate query list like below

val query = mapList.map { mp =>
  val target = mp("targetColumn").toString
  mp.get("sourceColumn") match {
    case Some(candidates: Seq[_]) =>
      // Take the first candidate, in list order, that exists in df
      val srcCol = candidates.map(_.toString).find(c => dfCols.contains(c))
      df.col(srcCol.getOrElse(sys.error(s"No source column of $target exists in df"))).alias(target)
    case _ =>
      // No sourceColumn: fill the whole column with the constant selectvalue
      lit(mp("selectvalue")).alias(target)
  }
}
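
Because the maps hold values of mixed types, every entry has to be inspected at runtime, and a malformed entry only shows up when the query is built. As a rough sketch (not part of the original answer), the maps could first be converted into a small typed model. The names ColumnRule, FromSource, Constant, and parseRule are hypothetical and chosen only for illustration; the sketch is plain Scala and does not touch Spark.

```scala
// Hypothetical typed model of one mapping entry (illustrative names only).
sealed trait ColumnRule { def target: String }
final case class FromSource(target: String, candidates: List[String]) extends ColumnRule
final case class Constant(target: String, value: Any) extends ColumnRule

// Convert one untyped map into a rule, reporting malformed entries instead of throwing.
def parseRule(mp: Map[String, Any]): Either[String, ColumnRule] =
  mp.get("targetColumn") match {
    case Some(target: String) =>
      (mp.get("sourceColumn"), mp.get("selectvalue")) match {
        case (Some(cols: Seq[_]), _) => Right(FromSource(target, cols.map(_.toString).toList))
        case (None, Some(value))     => Right(Constant(target, value))
        case _                       => Left(s"$target: needs a sourceColumn list or a selectvalue")
      }
    case _ => Left("entry has no targetColumn")
  }

// Usage with entries shaped like the ones in the question
val rawRules: List[Map[String, Any]] = List(
  Map("targetColumn" -> "columnNameY", "sourceColumn" -> List("srcColumnY")),
  Map("targetColumn" -> "columnNameZ", "selectvalue" -> 5)
)
val parsedRules = rawRules.map(parseRule)
```

With rules in this shape, a FromSource could be turned into df.col(...).alias(target) and a Constant into lit(value).alias(target), and every Left could be reported before any Spark call is made.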

// Finally, run the query

df.select(query:_*).show

//Sample output:

+-----------+-----------+-----------+
|columnNameX|columnNameY|columnNameZ|
+-----------+-----------+-----------+
|     Non Hf|     Non Hf|          5|
|     Non Hf|     Non Hf|          5|
|         HF|         HF|          5|
+-----------+-----------+-----------+
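
Note that columnNameX in the sample output above holds the srcColumnY values, while the expected output in the question shows the srcColumnZ values (John, Steav, Ram). The difference seems to come from the lookup order: the code takes the first name in the sourceColumn list that exists in the DataFrame, whereas the question's expected output matches scanning the DataFrame's columns in their own order. A minimal plain-Scala sketch of the two orders follows. The object ResolveOrder and its method names are hypothetical and not part of the original answer.

```scala
// Hypothetical helpers contrasting the two lookup orders (illustrative names only).
object ResolveOrder {
  // First candidate, in mapping-list order, that exists in the DataFrame.
  def firstByListOrder(candidates: Seq[String], dfColumns: Seq[String]): Option[String] =
    candidates.find(c => dfColumns.contains(c))

  // First DataFrame column, in DataFrame order, that appears among the candidates.
  def firstByDfOrder(candidates: Seq[String], dfColumns: Seq[String]): Option[String] =
    dfColumns.find(c => candidates.contains(c))
}

// Column names taken from the question
val dfColumns = Seq("srcColumnZ", "srcColumnY", "srcColumnR")
val candidates = Seq("srcColumnX", "srcColumnY", "srcColumnZ", "srcColumnP", "srcColumnQ", "srcColumnR")

println(ResolveOrder.firstByListOrder(candidates, dfColumns)) // Some(srcColumnY)
println(ResolveOrder.firstByDfOrder(candidates, dfColumns))   // Some(srcColumnZ)
```

If the DataFrame order is the intended rule, using the second lookup when building the query would make columnNameX come from srcColumnZ, which is what the question's expected output shows.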