Scala: conditionally replacing column values in a DataFrame

Time: 2018-08-24 16:09:02

Tags: scala apache-spark dataframe user-defined-functions

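DataFrame 1 is what I have now, and I want to write a Scala function that makes DataFrame 1 look like DataFrame 2.

Transfer is the main category; e-Transfer and IMT are subcategories.

The logic is: if the same ID (31898) is tagged with both Transfer and e-Transfer, it should be e-Transfer only; if the same ID (32614) is tagged with Transfer, IMT and e-Transfer, it should be e-Transfer + IMT; if an ID (33987) is tagged with Transfer only, it should be Other; if an ID (34193) is tagged with only e-Transfer or only IMT, it should stay e-Transfer or IMT.

I'm new to Scala and don't know how to write a good function to do this. Please help!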

DataFrame 1                        DataFrame 2
+---------+-------------+          +---------+------------------+
|   ID    | Category    |          |   ID    | Category         |
+---------+-------------+          +---------+------------------+  
|  31898  |   Transfer  |          |  31898  |  e-Transfer      |  
|  31898  |  e-Transfer |          |  32614  |  e-Transfer + IMT|
|  32614  |   Transfer  |  =====>  |  33987  |   Other          |
|  32614  |  e-Transfer |  =====>  |  34193  |  e-Transfer      |
|  32614  |     IMT     |          +---------+------------------+
|  33987  |   Transfer  |  
|  34193  |  e-Transfer |  
+---------+-------------+

1 Answer:

Answer 0 (score: 0)

You can group the DataFrame by ID, aggregate Category with collect_set to assemble an array of categories per ID, and use array_contains to create a new column based on the contents of that array:

    import org.apache.spark.sql.functions._
    // toDF and the $"..." syntax need the session implicits
    // (already in scope in spark-shell, where `spark` is the SparkSession)
    import spark.implicits._

    val df = Seq(
      (31898, "Transfer"),
      (31898, "e-Transfer"),
      (32614, "Transfer"),
      (32614, "e-Transfer"),
      (32614, "IMT"),
      (33987, "Transfer"),
      (34193, "e-Transfer")
    ).toDF("ID", "Category")

    df.groupBy("ID").agg(collect_set("Category").as("CategorySet")).
      withColumn(
        "Category",
        // Conditions are checked in order; the first match wins
        when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "IMT"),
          "e-Transfer + IMT").
        when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "Transfer"),
          "e-Transfer").
        // A single subcategory is kept as-is
        when($"CategorySet" === Array("e-Transfer") || $"CategorySet" === Array("IMT"),
          $"CategorySet"(0)).
        // Only the main category was tagged
        when($"CategorySet" === Array("Transfer"), "Other")
      ).
      show(false)

    // +-----+---------------------------+----------------+
    // |ID   |CategorySet                |Category        |
    // +-----+---------------------------+----------------+
    // |33987|[Transfer]                 |Other           |
    // |32614|[Transfer, e-Transfer, IMT]|e-Transfer + IMT|
    // |34193|[e-Transfer]               |e-Transfer      |
    // |31898|[Transfer, e-Transfer]     |e-Transfer      |
    // +-----+---------------------------+----------------+

Note that your sample data may not cover all cases (e.g. [Transfer, IMT]). The sample code above produces a null Category value for every remaining case, because the when chain has no final otherwise. If you identify other cases, simply modify or extend the conditional checks.
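If more combinations turn up, one option is to move the rule into a small pure function that can be tested without a Spark session. The following is only a sketch: the `CategoryRules` object and its `resolve` method are hypothetical names, and it assumes that any subcategory overrides Transfer, so Transfer together with IMT alone would become IMT. The question does not state that case, so adjust it if your rule differs.

```scala
// Hypothetical helper (not part of the original answer): collapses the set of
// categories tagged on one ID into a single label.
object CategoryRules {
  // Assumed fixed order of subcategories, so the combined label is deterministic.
  private val subcategories = Seq("e-Transfer", "IMT")

  def resolve(categories: Set[String]): Option[String] = {
    // Subcategories that were actually tagged, in the fixed order above.
    val present = subcategories.filter(categories.contains)
    if (present.nonEmpty) Some(present.mkString(" + ")) // any subcategory overrides Transfer
    else if (categories == Set("Transfer")) Some("Other") // only the main category was tagged
    else None // combination not covered by the rules
  }
}

object CategoryRulesDemo extends App {
  // The four cases from the question
  println(CategoryRules.resolve(Set("Transfer", "e-Transfer")))
  println(CategoryRules.resolve(Set("Transfer", "e-Transfer", "IMT")))
  println(CategoryRules.resolve(Set("Transfer")))
  println(CategoryRules.resolve(Set("e-Transfer")))
}
```

To apply it to the grouped DataFrame, a function like this could be wrapped with udf from org.apache.spark.sql.functions and called on the CategorySet column, where the array arrives as a Seq[String]. The built-in when/array_contains version is usually preferable when the rule set stays small, since Spark can optimize built-in expressions but treats a UDF as opaque.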