Original dataframe:
+-------+-------+
| col_a | col_b |
+-------+-------+
| 1     | aaa   |
| 1     | bbb   |
| 1     | ccc   |
| 1     | aaa   |
| 1     | aaa   |
| 1     | aaa   |
| 2     | eee   |
| 2     | eee   |
| 2     | ggg   |
| 2     | hhh   |
| 2     | iii   |
| 3     | 222   |
| 3     | 333   |
| 3     | 222   |
+-------+-------+
The result dataframe I need:
+----------------+---------------------+-----------+
| group_by_col_a | most_distinct_value | col_a cnt |
+----------------+---------------------+-----------+
| 1              | aaa                 | 6         |
| 2              | eee                 | 5         |
| 3              | 222                 | 3         |
+----------------+---------------------+-----------+
Here is what I have tried so far:
val DF = originalDF
  .groupBy($"col_a")
  .agg(
    max(countDistinct("col_b")),
    count("col_a").as("col_a_cnt"))
And the error message: org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
What is wrong here? And is there an efficient way to select the most distinct value?
Answer 0 (Score: 3)
You need two groupBy calls and a join to get a result like the following:
import spark.implicits._
import org.apache.spark.sql.functions._

val data = spark.sparkContext.parallelize(Seq(
  (1, "aaa"), (1, "bbb"),
  (1, "ccc"), (1, "aaa"),
  (1, "aaa"), (1, "aaa"),
  (2, "eee"), (2, "eee"),
  (2, "ggg"), (2, "hhh"),
  (2, "iii"), (3, "222"),
  (3, "333"), (3, "222")
)).toDF("a", "b")

// calculating the count for column a
val countDF = data.groupBy($"a").agg(count("a").as("col_a cnt"))

// calculating and selecting the most distinct value
val distinctDF = data.groupBy($"a", $"b").count()
  .groupBy("a").agg(max(struct("count", "b")).as("max"))
  .select($"a", $"max.b".as("most_distinct_value"))
  // joining both dataframes to get the final result
  .join(countDF, Seq("a"))

distinctDF.show()
Output:
+---+-------------------+---------+
| a|most_distinct_value|col_a cnt|
+---+-------------------+---------+
| 1| aaa| 6|
| 3| 222| 3|
| 2| eee| 5|
+---+-------------------+---------+
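As a side note, the join can be avoided in a single aggregation chain, since the per-group total is just the sum of the per (a, b) pair counts. A minimal sketch, assuming the same data dataframe as above:

// Sketch: one aggregation chain instead of groupBy + join.
// The total row count per "a" equals the sum of the (a, b) pair counts.
val singlePassDF = data.groupBy($"a", $"b").count()
  .groupBy($"a")
  .agg(
    max(struct($"count", $"b")).getField("b").as("most_distinct_value"),
    sum($"count").as("col_a cnt"))

singlePassDF.show()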
Hope this helps!
Answer 1 (Score: 1)
An alternative approach is to do the transformation at the RDD level, as RDD-level transformations can be faster than the DataFrame-level ones.
import sparkSession.implicits._
import org.apache.spark.rdd.RDD

val input = Seq((1, "aaa"), (1, "bbb"), (1, "ccc"), (1, "aaa"), (1, "aaa"),
  (1, "aaa"), (2, "eee"), (2, "eee"), (2, "ggg"), (2, "hhh"), (2, "iii"),
  (3, "222"), (3, "333"), (3, "222"))

val inputRDD: RDD[(Int, String)] = sc.parallelize(input)
The transformation:
val outputRDD: RDD[(Int, String, Int)] =
  inputRDD.groupBy(_._1)              // group the rows by col_a
    .map(row =>
      (row._1,                        // col_a
       row._2.map(_._2)               // all col_b values of the group
         .groupBy(identity)
         .maxBy(_._2.size)._1,        // most frequent col_b value
       row._2.size))                  // number of rows in the group
Now you can create a dataframe from it and show it:
val outputDf: DataFrame = outputRDD.toDF("col_a", "col_b", "col_a cnt")
outputDf.show()
Output:
+-----+-----+---------+
|col_a|col_b|col_a cnt|
+-----+-----+---------+
| 1| aaa| 6|
| 3| 222| 3|
| 2| eee| 5|
+-----+-----+---------+
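Note that groupBy on the raw RDD shuffles every (col_a, col_b) row. A hedged variant of the same idea pre-aggregates with reduceByKey first, so only per-pair counts are shuffled; this sketch assumes the same inputRDD as above:

// Count each (col_a, col_b) pair locally before the shuffle,
// then pick the most frequent col_b and the group total per col_a.
val reducedRDD: RDD[(Int, String, Int)] =
  inputRDD.map { case (a, b) => ((a, b), 1) }
    .reduceByKey(_ + _)                        // (col_a, col_b) -> pair count
    .map { case ((a, b), cnt) => (a, (b, cnt)) }
    .groupByKey()
    .map { case (a, pairs) =>
      (a,
       pairs.maxBy(_._2)._1,                   // most frequent col_b
       pairs.map(_._2).sum)                    // total rows for col_a
    }

reducedRDD.toDF("col_a", "col_b", "col_a cnt").show()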
Answer 2 (Score: 1)
You can achieve your requirement by defining a udf function, together with the collect_list function and the count function (which you have already used). In the udf function you pass in the collected list of col_b values and return the string that occurs most often:
import org.apache.spark.sql.functions._
import scala.collection.mutable

def maxCountdinstinct = udf((list: mutable.WrappedArray[String]) => {
  list.groupBy(identity)   // grouping the equal strings together
    .mapValues(_.size)     // counting each grouped string
    .maxBy(_._2)._1        // returning the string with the max count
})
You can call the udf function as:
val DF = originalDF
  .groupBy($"col_a")
  .agg(
    maxCountdinstinct(collect_list("col_b")).as("most_distinct_value"),
    count("col_a").as("col_a_cnt"))
which should give you:
+-----+-------------------+---------+
|col_a|most_distinct_value|col_a_cnt|
+-----+-------------------+---------+
|3 |222 |3 |
|1 |aaa |6 |
|2 |eee |5 |
+-----+-------------------+---------+
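For comparison, a similar result can be obtained without a udf by ranking the (col_a, col_b) pair counts with a window function. This is only a sketch, not part of the original answer, and assumes the same originalDF and spark.implicits._ as in the question:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Count each (col_a, col_b) pair, take the group total as the sum of those
// counts, and keep the most frequent col_b per col_a.
val byGroup = Window.partitionBy($"col_a")
val noUdfDF = originalDF
  .groupBy($"col_a", $"col_b").agg(count("*").as("cnt"))
  .withColumn("col_a_cnt", sum($"cnt").over(byGroup))
  .withColumn("rn", row_number().over(byGroup.orderBy($"cnt".desc)))
  .filter($"rn" === 1)
  .select($"col_a", $"col_b".as("most_distinct_value"), $"col_a_cnt")

noUdfDF.show()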