How to reduce multiple string values in a column to predefined categories

Date: 2018-06-26 07:57:30

Tags: scala apache-spark apache-spark-sql

I want to reduce the values of a particular column in a dataframe to predefined categories, based on pattern matching.

Example:

val df = spark.createDataFrame(Seq(
  (1, "apple"),
  (2, "banana"),
  (3, "avocado"),
  (4, "potato"))).toDF("Id", "category")

Id  category
1   apple
2   banana
3   avocado
4   potato

Desired output:

val df_reduced = spark.createDataFrame(Seq(
  (1, "fruit"),
  (2, "fruit"),
  (3, "vegetable"),
  (4, "vegetable"))).toDF("Id", "category")

Id  category
1   fruit
2   fruit
3   vegetable
4   vegetable

This is the solution I came up with:

df.withColumn("category", when(col("category") === "apple", regexp_replace(col("category"), "apple", "fruit"))
              .otherwise(when(col("category") === "banana", regexp_replace(col("category"), "banana", "fruit"))
              .otherwise(when(col("category") === "avocado", regexp_replace(col("category"), "avocado", "vegetable"))
              .otherwise(when(col("category") === "potato", regexp_replace(col("category"), "potato", "vegetable"))
                         ))))
.show

I really don't like this nested "when/otherwise" approach, so I'm wondering: is there a better, more idiomatic solution for this task?

2 answers:

Answer 0 (score: 1)

I think a Map combined with a udf, as below, should help:

import org.apache.spark.sql.functions._

// lookup table from raw value to target category
val map = Map("Apple" -> "fruit", "Mango" -> "fruit", "potato" -> "vegetable", "avocado" -> "vegetable", "Banana" -> "fruit")

// fall back to the original value when no mapping exists
val replaceUDF = udf((name: String) => map.getOrElse(name, name))
val outputdf = df.withColumn("new_category", replaceUDF(col("category")))

Sample output:

+---+--------+------------+
| Id|category|new_category|
+---+--------+------------+
|  1|   Apple|       fruit|
|  2|  Banana|       fruit|
|  3|  potato|   vegetable|
|  4| avocado|   vegetable|
|  5|   Mango|       fruit|
+---+--------+------------+
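
One caveat worth noting: a Scala Map lookup is case-sensitive, so "apple" would not match the "Apple" key above. A minimal sketch of a case-insensitive variant, assuming lowercasing both sides is acceptable (the names lowerMap and replaceCI are mine, not from the answer):

// hypothetical variant (not in the original answer): case-insensitive lookup
val lowerMap = map.map { case (k, v) => k.toLowerCase -> v }
val replaceCI = udf((name: String) => lowerMap.getOrElse(name.toLowerCase, name))
val outputdfCI = df.withColumn("new_category", replaceCI(col("category")))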

Answer 1 (score: 1)

You can create a lookup dataframe as below:

val lookupDF = spark.createDataFrame(Seq(
  ("apple", "fruit"),
  ("banana", "fruit"),
  ("avocado", "vegetable"),
  ("potato", "vegetable"))).toDF("category", "category2")
//    +--------+---------+
//    |category|category2|
//    +--------+---------+
//    |apple   |fruit    |
//    |banana  |fruit    |
//    |avocado |vegetable|
//    |potato  |vegetable|
//    +--------+---------+

Since the lookup dataframe is certain to be small, you can do the join using the broadcast function:

import org.apache.spark.sql.functions._
df.join(broadcast(lookupDF), Seq("category"), "left")
  .select(col("Id"), col("category2").as("category"))
  .show(false)

which should give you

+---+---------+
|Id |category |
+---+---------+
|1  |fruit    |
|2  |fruit    |
|3  |vegetable|
|4  |vegetable|
+---+---------+

I hope the answer is helpful.

Update

You commented:

  What about missing values? What if I have a category in the original df that doesn't exist in the lookup df? I get null. Any suggestions on how to solve it? I would like to keep the original value if no match is found in the lookup table, but I can't achieve that with the join.

To handle that case, you can use the when/otherwise functions as follows:

import org.apache.spark.sql.functions._
df.join(broadcast(lookupDF), Seq("category"), "left")
  .select(col("Id"), when(col("category2").isNotNull, col("category2")).otherwise(col("category")).as("category"))
  .show(false)
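
As a small variant (my suggestion, not part of the original answer), the built-in coalesce function expresses the same fallback more compactly by returning the first non-null column:

import org.apache.spark.sql.functions._
// coalesce picks category2 when the join matched, otherwise falls back to the original category
df.join(broadcast(lookupDF), Seq("category"), "left")
  .select(col("Id"), coalesce(col("category2"), col("category")).as("category"))
  .show(false)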