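DataFrame 1 is what I have now, and I want to write a Scala function that makes DataFrame 1 look like DataFrame 2.
Transfer is the parent category; e-Transfer and IMT are subcategories.
The logic is: for the same ID (31898), if both Transfer and e-Transfer are tagged, the result should be only e-Transfer; if Transfer, IMT and e-Transfer are all tagged to the same ID (32614), it should be e-Transfer + IMT; if only Transfer is tagged to an ID (33987), it should be Other; if only e-Transfer or only IMT is tagged to an ID (34193), it should be just e-Transfer or IMT respectively.
I'm new to Scala and have no idea how to write a good function to do this. Please help!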
DataFrame 1                          DataFrame 2
+---------+-------------+            +---------+------------------+
|   ID    |  Category   |            |   ID    |     Category     |
+---------+-------------+            +---------+------------------+
|  31898  | Transfer    |            |  31898  | e-Transfer       |
|  31898  | e-Transfer  |            |  32614  | e-Transfer + IMT |
|  32614  | Transfer    |   =====>   |  33987  | Other            |
|  32614  | e-Transfer  |   =====>   |  34193  | e-Transfer       |
|  32614  | IMT         |            +---------+------------------+
|  33987  | Transfer    |
|  34193  | e-Transfer  |
+---------+-------------+
Answer 0 (score: 0)
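You can group the DataFrame by ID, aggregate Category with collect_set to assemble an array of categories per ID, and then use array_contains to create a new column based on the contents of that array:

import org.apache.spark.sql.functions._
// Needed for toDF and the $"col" syntax; assumes a SparkSession named
// `spark` is in scope (as it is in spark-shell).
import spark.implicits._

val df = Seq(
  (31898, "Transfer"),
  (31898, "e-Transfer"),
  (32614, "Transfer"),
  (32614, "e-Transfer"),
  (32614, "IMT"),
  (33987, "Transfer"),
  (34193, "e-Transfer")
).toDF("ID", "Category")

// Collect the distinct categories per ID, then derive the final label.
df.groupBy("ID").agg(collect_set("Category").as("CategorySet")).
  withColumn("Category",
    // Both subcategories present (with or without Transfer)
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "IMT"),
      "e-Transfer + IMT").otherwise(
    // Transfer plus e-Transfer: keep only the subcategory
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "Transfer"),
      "e-Transfer").otherwise(
    // A single subcategory on its own: keep it as-is
    when($"CategorySet" === Array("e-Transfer") || $"CategorySet" === Array("IMT"),
      $"CategorySet"(0)).otherwise(
    // Only the parent category; anything else falls through to null
    when($"CategorySet" === Array("Transfer"), "Other")
    )))
  ).
  show(false)
// +-----+---------------------------+----------------+
// |ID   |CategorySet                |Category        |
// +-----+---------------------------+----------------+
// |33987|[Transfer]                 |Other           |
// |32614|[Transfer, e-Transfer, IMT]|e-Transfer + IMT|
// |34193|[e-Transfer]               |e-Transfer      |
// |31898|[Transfer, e-Transfer]     |e-Transfer      |
// +-----+---------------------------+----------------+

Note that your sample data might not cover all cases (e.g. [Transfer, IMT]). The existing sample code will generate a null Category value for all remaining cases. If you identify additional cases, simply modify or expand the conditional checks.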
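
To make it easier to see which cases the rules cover, here is a minimal sketch of the same decision logic in plain Scala, without Spark. The helper name resolveCategory and the rows value are hypothetical, introduced only for illustration. The sketch assumes the only category values are Transfer, e-Transfer and IMT; None stands in for the null that the Spark version produces for uncovered cases.

```scala
// Hypothetical helper (not part of the answer above): maps the set of
// categories tagged to one ID onto a single label.
// Assumes the only possible values are "Transfer", "e-Transfer" and "IMT".
def resolveCategory(categories: Set[String]): Option[String] = {
  val hasETransfer = categories.contains("e-Transfer")
  val hasImt       = categories.contains("IMT")
  val hasTransfer  = categories.contains("Transfer")

  (hasETransfer, hasImt, hasTransfer) match {
    case (true, true, _)      => Some("e-Transfer + IMT") // both subcategories
    case (true, false, _)     => Some("e-Transfer")       // alone or with Transfer
    case (false, true, false) => Some("IMT")              // IMT on its own
    case (false, false, true) => Some("Other")            // only the parent category
    case _                    => None                     // uncovered, e.g. Set("Transfer", "IMT")
  }
}

// Same sample rows as in the question, as plain (ID, Category) tuples.
val rows = Seq(
  (31898, "Transfer"), (31898, "e-Transfer"),
  (32614, "Transfer"), (32614, "e-Transfer"), (32614, "IMT"),
  (33987, "Transfer"),
  (34193, "e-Transfer")
)

// Group by ID, collect the distinct categories, and apply the rule.
val resolved: Map[Int, Option[String]] =
  rows.groupBy(_._1).map { case (id, group) =>
    id -> resolveCategory(group.map(_._2).toSet)
  }

resolved.toSeq.sortBy(_._1).foreach { case (id, category) =>
  println(s"$id -> ${category.getOrElse("null")}")
}
```

Writing the rules as one pattern match puts every combination in a single place, so a case such as [Transfer, IMT] is handled by changing one branch rather than adding another nested when.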