Spark: Create a distinct String column from multiple arrays

Date: 2018-04-12 17:25:31

Tags: apache-spark-sql spark-dataframe

I have the following DataFrame:

val df = Seq(
  (1, Seq("USD", "CAD")),
  (2, Seq("AUD", "YEN", "USD")),
  (2, Seq("GBP", "AUD", "YEN")),
  (3, Seq("BRL", "AUS", "BND","BOB","BWP")),
  (3, Seq("XAF", "CLP", "BRL")),
  (3, Seq("XAF", "CNY", "KMF","CSK","EGP")
  )
).toDF("ACC", "CCY")

+---+-------------------------+
|ACC|CCY                      |
+---+-------------------------+
|1  |[USD, CAD]               |
|2  |[AUD, YEN, USD]          |
|2  |[GBP, AUD, YEN]          |
|3  |[BRL, AUS, BND, BOB, BWP]|
|3  |[XAF, CLP, BRL]          |
|3  |[XAF, CNY, KMF, CSK, EGP]|
+---+-------------------------+

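It needs to be transformed as shown below: one row per ACC, with all CCY arrays merged and duplicates removed.

Spark version = 2.0, Scala version = 2.10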

+---+-------------------------------------------------------+
|ACC|CCY                                                    |
+---+-------------------------------------------------------+
|1  |[USD,CAD]                                              |
|2  |[AUD,YEN,USD,GBP]                                      |
|3  |[BRL,AUS,BND,BOB,BWP,XAF,CLP,CNY,KMF,CSK,EGP]          |
+---+-------------------------------------------------------+

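I tried grouping by the ACC column and aggregating CCY, but I am not sure where to go from there.

Can this be done without using a UDF? If not, how would I do it with a UDF? Please advise.

1 Answer:

Answer 0 (score: 0)

The following code should return the expected result: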

    import org.apache.spark.sql.functions.{collect_list, udf}
    import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

    val df = Seq(
      (1, Seq("USD", "CAD")),
      (2, Seq("AUD", "YEN", "USD")),
      (2, Seq("GBP", "AUD", "YEN")),
      (3, Seq("BRL", "AUS", "BND", "BOB", "BWP")),
      (3, Seq("XAF", "CLP", "BRL")),
      (3, Seq("XAF", "CNY", "KMF", "CSK", "EGP"))
    ).toDF("ACC", "CCY")

    // Flatten the array of arrays and drop duplicates, keeping first occurrences.
    val castToArray = udf((ccy: Seq[Seq[String]]) => ccy.flatten.distinct)

    val df2 = df.groupBy($"ACC")
      .agg(collect_list($"CCY").as("CCY"))
      .withColumn("CCY", castToArray($"CCY"))

    // show returns Unit, so call it separately to keep df2 a DataFrame.
    df2.show(false)

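First I group by ACC and use the collect_list aggregate to gather all of a group's arrays into a single array of arrays. The UDF then flattens that nested array and removes the duplicates.

Output: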

+---+-------------------------------------------------------+
|ACC|CCY                                                    |
+---+-------------------------------------------------------+
|1  |[USD, CAD]                                             |
|3  |[BRL, AUS, BND, BOB, BWP, XAF, CLP, CNY, KMF, CSK, EGP]|
|2  |[AUD, YEN, USD, GBP]                                   |
+---+-------------------------------------------------------+
Good luck.
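To see what the body of the UDF does for a single group, here is a minimal sketch on plain Scala collections, with no Spark involved. The value names `collected` and `merged` are illustrative only; the data is the two CCY arrays of ACC 2 from the question.

```scala
// The two CCY arrays that collect_list would gather for ACC = 2.
val collected: Seq[Seq[String]] = Seq(
  Seq("AUD", "YEN", "USD"),
  Seq("GBP", "AUD", "YEN")
)

// flatten concatenates the inner sequences; distinct keeps the first
// occurrence of each element and preserves encounter order.
val merged: Seq[String] = collected.flatten.distinct

println(merged) // List(AUD, YEN, USD, GBP)
```

Because distinct keeps first occurrences, the merged array follows the order in which the arrays were collected, which matches the row for ACC 2 in the output above.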

Update

In Spark >= 2.4, you can use the built-in flatten and array_distinct functions and avoid the UDF:

import org.apache.spark.sql.functions.{array_distinct, collect_list, flatten}

df.groupBy($"ACC")
  .agg(collect_list($"CCY").as("CCY"))
  .select($"ACC", array_distinct(flatten($"CCY")).as("CCY"))
  .show(false)

//Output
+---+-------------------------------------------------------+ 
|ACC|CCY                                                    | 
+---+-------------------------------------------------------+ 
|1  |[USD, CAD]                                             | 
|3  |[BRL, AUS, BND, BOB, BWP, XAF, CLP, CNY, KMF, CSK, EGP]| 
|2  |[AUD, YEN, USD, GBP]                                   | 
+---+-------------------------------------------------------+
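
For Spark versions before 2.4, such as the 2.0 mentioned in the question, one UDF-free possibility is to explode the arrays into one row per currency and then aggregate with collect_set. This is a sketch under assumptions rather than part of the original answer: the local SparkSession setup and the name `result` are illustrative, and collect_set does not guarantee element order, so the arrays may come back ordered differently from the outputs shown above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, explode}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ccy-distinct")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (1, Seq("USD", "CAD")),
  (2, Seq("AUD", "YEN", "USD")),
  (2, Seq("GBP", "AUD", "YEN"))
).toDF("ACC", "CCY")

// explode yields one row per (ACC, currency);
// collect_set then gathers the distinct currencies of each ACC.
val result = df
  .select($"ACC", explode($"CCY").as("CCY"))
  .groupBy($"ACC")
  .agg(collect_set($"CCY").as("CCY"))

result.show(false)
```

If the original element order matters, the UDF version or the Spark 2.4 built-ins are the safer choice.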