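I want to reduce the values of a specific column in a DataFrame to a smaller set of predefined categories by matching each value against a known pattern.
Example: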
val df = spark.createDataFrame(Seq(
(1, "apple"),
(2, "banana"),
(3, "avocado"),
(4, "potato"))).toDF("Id", "category")
Id category
1 apple
2 banana
3 avocado
4 potato
Desired output:
val df_reduced = spark.createDataFrame(Seq(
(1, "fruit"),
(2, "fruit"),
(3, "vegetable"),
(4, "vegetable"))).toDF("Id", "category")
Id category
1 fruit
2 fruit
3 vegetable
4 vegetable
This is the solution I came up with:
df.withColumn("category", when(col("category") === "apple", regexp_replace(col("category"), "apple", "fruit"))
.otherwise(when(col("category") === "banana", regexp_replace(col("category"), "banana", "fruit"))
.otherwise(when(col("category") === "avocado", regexp_replace(col("category"), "avocado", "vegetable"))
.otherwise(when(col("category") === "potato", regexp_replace(col("category"), "potato", "vegetable"))
))))
.show
I really don't like this nested when/otherwise approach, so I'm wondering: is there a better, more idiomatic solution for this task?
Answer 0 (score: 1)
I think a map together with a udf should help here, as shown below:
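import org.apache.spark.sql.functions._
// Lookup table; keys are matched exactly, so the match is case-sensitive.
val map = Map("Apple" -> "fruit", "Mango" -> "fruit", "potato" -> "vegetable", "avocado" -> "vegetable", "Banana" -> "fruit")
// Return the mapped category, or the original value when the map has no entry for it.
val replaceUDF = udf((name: String) => map.getOrElse(name, name))
val outputdf = df.withColumn("new_category", replaceUDF(col("category")))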
Sample output:
+---+--------+------------+
| Id|category|new_category|
+---+--------+------------+
| 1| Apple| fruit|
| 2| Banana| fruit|
| 3| potato| vegetable|
| 4| avocado| vegetable|
| 5| Mango| fruit|
+---+--------+------------+
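If you would rather avoid a udf, the same map can be folded into a single when/otherwise expression, which also removes the hand-written nesting from the question. The following is only a minimal sketch under a few assumptions: it uses the lowercase data from the question, the names mapping, categoryExpr and reduced are made up for illustration, and the fifth row ("rice") is a hypothetical unmapped value added to show the fallback. It is written as top-level statements, so it is meant for spark-shell or a Scala script rather than a compiled object.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Local session for the sketch; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("category-mapping").getOrCreate()

// Data from the question, plus one hypothetical unmapped row ("rice").
val df = spark.createDataFrame(Seq(
  (1, "apple"),
  (2, "banana"),
  (3, "avocado"),
  (4, "potato"),
  (5, "rice"))).toDF("Id", "category")

// Illustrative mapping; keys are compared exactly (case-sensitive).
val mapping = Map(
  "apple" -> "fruit",
  "banana" -> "fruit",
  "avocado" -> "vegetable",
  "potato" -> "vegetable")

// Start from the original column as the fallback and wrap one
// when/otherwise around it per map entry.
val categoryExpr = mapping.foldLeft(col("category")) {
  case (fallback, (from, to)) => when(col("category") === from, to).otherwise(fallback)
}

val reduced = df.withColumn("category", categoryExpr)
reduced.show(false)
```

Because the fold starts from col("category"), values without an entry in the map keep their original value instead of becoming null. The expression stays a plain column expression, so Spark's optimizer can see through it, which is not the case for a udf.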
Answer 1 (score: 1)
You can create a lookup DataFrame as
val lookupDF = spark.createDataFrame(Seq(
("apple", "fruit"),
("banana", "fruit"),
("avocado", "vegetable"),
("potato", "vegetable"))).toDF("category", "category2")
// +--------+---------+
// |category|category2|
// +--------+---------+
// |apple |fruit |
// |banana |fruit |
// |avocado |vegetable|
// |potato |vegetable|
// +--------+---------+
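Since the lookup DataFrame is bound to be small, you can wrap it in the broadcast function when doing the join, so that it is shipped to every executor and the larger DataFrame does not need to be shuffled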
import org.apache.spark.sql.functions._
df.join(broadcast(lookupDF), Seq("category"), "left")
.select(col("Id"), col("category2").as("category"))
.show(false)
which should give you
+---+---------+
|Id |category |
+---+---------+
|1 |fruit |
|2 |fruit |
|3 |vegetable|
|4 |vegetable|
+---+---------+
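I hope the answer is helpful
Updated
You commented:
What about missing values? If I have a category in the original df that doesn't exist in the lookup df, I get null. Any suggestions on how to solve this? If no match is found in the lookup table I'd like to keep the original value, but I can't do that with a join.
To handle that case, you can use the when/otherwise functions as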
import org.apache.spark.sql.functions._
df.join(broadcast(lookupDF), Seq("category"), "left")
.select(col("Id"), when(col("category2").isNotNull, col("category2")).otherwise(col("category")).as("category"))
.show(false)
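The same fallback can also be written with coalesce, which returns its first non-null argument and so does the job of the when/otherwise pair above. This is just a sketch under assumptions: the fifth row ("rice") is a hypothetical category that is missing from the lookup table, the names joined and result are made up for illustration, and the code is written as top-level statements for spark-shell or a Scala script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, coalesce, col}

// Local session for the sketch; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("lookup-join").getOrCreate()

// Data from the question, plus one hypothetical row with no lookup entry.
val df = spark.createDataFrame(Seq(
  (1, "apple"),
  (2, "banana"),
  (3, "avocado"),
  (4, "potato"),
  (5, "rice"))).toDF("Id", "category")

val lookupDF = spark.createDataFrame(Seq(
  ("apple", "fruit"),
  ("banana", "fruit"),
  ("avocado", "vegetable"),
  ("potato", "vegetable"))).toDF("category", "category2")

// The left join keeps every row of df; category2 is null where no match exists.
val joined = df.join(broadcast(lookupDF), Seq("category"), "left")

// coalesce picks category2 when it is present, otherwise the original category.
val result = joined.select(
  col("Id"),
  coalesce(col("category2"), col("category")).as("category"))

result.show(false)
```

One caveat applies to both versions: a lookup entry whose category2 is itself null is indistinguishable from a missing match, so the original value would be kept in that case as well.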