I have a DataFrame whose Department values must come from a given set of values.
-----------------------
Id Name Department
-----------------------
1 John Sales
2 Martin Maintenance
3 Keith Sales
4 Rob Unknown
5 Kevin Unknown
6 Peter Maintenance
------------------------
The valid values for Department are stored in a string array: ["Sales", "Maintenance", "Training"].
If a Department value in the DataFrame is not one of the allowed values, it must be replaced with "Training". The new DataFrame would then be:
-----------------------
Id Name Department
-----------------------
1 John Sales
2 Martin Maintenance
3 Keith Sales
4 Rob Training
5 Kevin Training
6 Peter Maintenance
------------------------
What would be a working solution?
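For reference, a minimal sketch of how the input DataFrame above might be constructed; the SparkSession setup and the toDF column names are assumptions, not part of the question:

import org.apache.spark.sql.SparkSession

// Assumed setup: a local SparkSession plus the sample rows from the table above.
val spark = SparkSession.builder().appName("departments").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "John",   "Sales"),
  (2, "Martin", "Maintenance"),
  (3, "Keith",  "Sales"),
  (4, "Rob",    "Unknown"),
  (5, "Kevin",  "Unknown"),
  (6, "Peter",  "Maintenance")
).toDF("Id", "Name", "Department")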
Answer 0 (score: 0)
You can do this with the when/otherwise, concat, and lit built-in functions, as in
val validDepartments = Array("Sales", "Maintenance", "Training")
import org.apache.spark.sql.functions._
// Concatenate the valid departments into one literal string; keep the original value
// only when that string contains it, otherwise replace it with "Training".
df.withColumn("Department", when(concat(validDepartments.map(x => lit(x)): _*).contains(col("Department")),
  col("Department")).otherwise("Training")).show(false)
which should give you
+---+------+-----------+
|Id |Name  |Department |
+---+------+-----------+
|1  |John  |Sales      |
|2  |Martin|Maintenance|
|3  |Keith |Sales      |
|4  |Rob   |Training   |
|5  |Kevin |Training   |
|6  |Peter |Maintenance|
+---+------+-----------+
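Not part of the original answer, but for comparison: the same check can be written with the isin column function, which avoids concatenating the valid values into one string (a sketch, assuming the same df and validDepartments as above):

// Keep the department only when it is one of the valid values; otherwise fall back to "Training".
df.withColumn("Department",
  when(col("Department").isin(validDepartments: _*), col("Department"))
    .otherwise("Training")).show(false)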
A simple udf function should also satisfy your requirement:
val validDepartments = Array("Sales", "Maintenance", "Training")
import org.apache.spark.sql.functions._
// UDF that keeps a valid department and falls back to "Training" otherwise.
def containsUdf = udf((department: String) =>
  if (validDepartments.contains(department)) department else "Training")
df.withColumn("Department", containsUdf(col("Department"))).show(false)
which should give you the same result.
I hope the answer is helpful.