I have the following DataFrame:
val df = Seq(
(1, Seq("USD", "CAD")),
(2, Seq("AUD", "YEN", "USD")),
(2, Seq("GBP", "AUD", "YEN")),
(3, Seq("BRL", "AUS", "BND","BOB","BWP")),
(3, Seq("XAF", "CLP", "BRL")),
(3, Seq("XAF", "CNY", "KMF","CSK","EGP")
)
).toDF("ACC", "CCY")
+---+-------------------------+
|ACC|CCY                      |
+---+-------------------------+
|1  |[USD, CAD]               |
|2  |[AUD, YEN, USD]          |
|2  |[GBP, AUD, YEN]          |
|3  |[BRL, AUS, BND, BOB, BWP]|
|3  |[XAF, CLP, BRL]          |
|3  |[XAF, CNY, KMF, CSK, EGP]|
+---+-------------------------+
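It needs to be transformed as shown below: one row per ACC, with the CCY arrays merged and duplicates removed.
Spark version = 2.0, Scala version = 2.10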
+---+-------------------------------------------------------+
|ACC|CCY                                                    |
+---+-------------------------------------------------------+
|1  |[USD,CAD]                                              |
|2  |[AUD,YEN,USD,GBP]                                      |
|3  |[BRL,AUS,BND,BOB,BWP,XAF,CLP,CNY,KMF,CSK,EGP]          |
+---+-------------------------------------------------------+
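I tried grouping by the ACC column and aggregating CCY, but I am not sure how to proceed from there.
Can this be done without a UDF? If not, how would I do it with a UDF? Please advise.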
Answer 0 (score: 0)
The following code should return the expected result:
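import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{collect_list, udf}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell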
val df = Seq(
(1, Seq("USD", "CAD")),
(2, Seq("AUD", "YEN", "USD")),
(2, Seq("GBP", "AUD", "YEN")),
(3, Seq("BRL", "AUS", "BND", "BOB", "BWP")),
(3, Seq("XAF", "CLP", "BRL")),
(3, Seq("XAF", "CNY", "KMF", "CSK", "EGP")
)
).toDF("ACC", "CCY")
val castToArray = udf((ccy: WrappedArray[WrappedArray[String]]) => ccy.flatten.distinct.toArray)
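// Keep the DataFrame in df2 and call show separately: show returns Unit,
// so chaining it onto the assignment would leave df2 without a usable value.
val df2 = df.groupBy($"ACC")
  .agg(collect_list($"CCY").as("CCY"))
  .withColumn("CCY", castToArray($"CCY"))
df2.show(false)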
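First I group by ACC and use the collect_list aggregation to gather all of an account's arrays into a single array of arrays. The UDF then flattens that nested array and drops the duplicates.
Output: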
+---+-------------------------------------------------------+
|ACC|CCY                                                    |
+---+-------------------------------------------------------+
|1  |[USD, CAD]                                             |
|3  |[BRL, AUS, BND, BOB, BWP, XAF, CLP, CNY, KMF, CSK, EGP]|
|2  |[AUD, YEN, USD, GBP]                                   |
+---+-------------------------------------------------------+
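The flatten-and-deduplicate step that the UDF performs can be tried on plain Scala collections, without Spark. This is only a minimal sketch: mergeDistinct is a hypothetical helper name used for illustration, and the input mirrors what collect_list would produce for ACC = 2.

```scala
// Plain Scala collections, no Spark required.
// `mergeDistinct` is a hypothetical helper name, not part of any library.
def mergeDistinct(groups: Seq[Seq[String]]): Seq[String] =
  groups.flatten.distinct // flatten the nested sequences, keep the first occurrence of each value

// The two CCY arrays that belong to ACC = 2 in the question.
val acc2 = Seq(Seq("AUD", "YEN", "USD"), Seq("GBP", "AUD", "YEN"))

println(mergeDistinct(acc2)) // List(AUD, YEN, USD, GBP)
```

Because distinct keeps the first occurrence of each element, the merged array preserves the order in which the currencies were collected.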
Good luck
Update:
In Spark >= 2.4 you can use the built-in flatten and array_distinct functions and avoid the UDF:
df.groupBy($"ACC")
.agg(collect_list($"CCY").as("CCY"))
.select($"ACC", array_distinct(flatten($"CCY")).as("CCY"))
.show(false)
//Output
+---+-------------------------------------------------------+
|ACC|CCY                                                    |
+---+-------------------------------------------------------+
|1  |[USD, CAD]                                             |
|3  |[BRL, AUS, BND, BOB, BWP, XAF, CLP, CNY, KMF, CSK, EGP]|
|2  |[AUD, YEN, USD, GBP]                                   |
+---+-------------------------------------------------------+
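On Spark versions before 2.4, such as the 2.0 mentioned in the question, one possible way to avoid the UDF is to explode the arrays into one row per currency and then aggregate with collect_set. The sketch below is not part of the original answer. It assumes a local SparkSession (in spark-shell, spark already exists) and uses only a subset of the question's rows. Note that collect_set does not guarantee element order, so the arrays may come back in a different order than in the outputs above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, explode}

// Assumption: a local session created for illustration; skip this in spark-shell.
val spark = SparkSession.builder().master("local[*]").appName("ccy-dedup").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, Seq("USD", "CAD")),
  (2, Seq("AUD", "YEN", "USD")),
  (2, Seq("GBP", "AUD", "YEN"))
).toDF("ACC", "CCY")

// One row per (ACC, currency), then gather the unique currencies per account.
val deduped = df
  .select($"ACC", explode($"CCY").as("CCY"))
  .groupBy($"ACC")
  .agg(collect_set($"CCY").as("CCY"))

deduped.show(false)
```

If the original element order matters, the UDF or the flatten/array_distinct version is the safer choice.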