How to update a column in a Spark DataFrame based on a set of column values?

Asked: 2018-06-26 16:07:52

Tags: scala apache-spark dataframe

I have a DataFrame whose Department values must come from a given set of values.

-----------------------
Id  Name    Department
-----------------------
1   John    Sales
2   Martin  Maintenance
3   Keith   Sales
4   Rob     Unknown
5   Kevin   Unknown
6   Peter   Maintenance
------------------------

The valid values for Department are stored in a string array: ["Sales", "Maintenance", "Training"].

If a Department value in the DataFrame is not one of the allowed values, it must be replaced with "Training". The new DataFrame would then be:

-----------------------
Id  Name    Department
-----------------------
1   John    Sales
2   Martin  Maintenance
3   Keith   Sales
4   Rob     Training
5   Kevin   Training
6   Peter   Maintenance
------------------------

What is a workable solution?

1 Answer:

Answer 0: (score: 0)

You can meet your requirement with the when/otherwise, concat, and lit built-in functions, as follows:
val validDepartments = Array("Sales","Maintenance","Training")

import org.apache.spark.sql.functions._
// concat the valid names into one string and keep Department only when
// that string contains it; note this is a substring check, not an exact match
df.withColumn("Department",
    when(concat(validDepartments.map(x => lit(x)): _*).contains(col("Department")), col("Department"))
      .otherwise("Training"))
  .show(false)

which should give you

+---+------+-----------+
|Id |Name  |Department |
+---+------+-----------+
|1  |John  |Sales      |
|2  |Martin|Maintenance|
|3  |Keith |Sales      |
|4  |Rob   |Training   |
|5  |Kevin |Training   |
|6  |Peter |Maintenance|
+---+------+-----------+

A simple udf function should also meet your requirement:

val validDepartments = Array("Sales","Maintenance","Training")

import org.apache.spark.sql.functions._
// replace any department not in the allowed set with "Training"
def containsUdf = udf((department: String) =>
  if (validDepartments.contains(department)) department else "Training")

df.withColumn("Department", containsUdf(col("Department"))).show(false)

which should give you the same result.
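As a side note (not part of the original answer), Spark's Column.isin performs an exact membership test, which avoids the substring pitfall of the concatenated-literals approach (for example, a department named "Main" would wrongly match inside "SalesMaintenanceTraining"). A minimal sketch, assuming the same df and validDepartments as above:

```scala
import org.apache.spark.sql.functions._

val validDepartments = Array("Sales", "Maintenance", "Training")

// Keep the Department value only when it is an exact member of the
// allowed set; otherwise fall back to "Training".
df.withColumn("Department",
    when(col("Department").isin(validDepartments: _*), col("Department"))
      .otherwise("Training"))
  .show(false)
```

This also stays entirely within Spark's built-in expressions, so unlike the udf version it remains visible to the Catalyst optimizer.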

I hope the answer is helpful.