Original dataframe:
+-------+-------+
| col_a | col_b |
+-------+-------+
| 1     | aaa   |
| 1     | bbb   |
| 1     | ccc   |
| 1     | aaa   |
| 1     | aaa   |
| 1     | aaa   |
| 2     | eee   |
| 2     | eee   |
| 2     | ggg   |
| 2     | hhh   |
| 2     | iii   |
| 3     | 222   |
| 3     | 333   |
| 3     | 222   |
+-------+-------+
The result dataframe I need:
+----------------+---------------------+-----------+
| group_by_col_a | most_distinct_value | col_a cnt |
+----------------+---------------------+-----------+
| 1              | aaa                 | 6         |
| 2              | eee                 | 5         |
| 3              | 222                 | 3         |
+----------------+---------------------+-----------+
Here is what I have tried so far:
val DF = originalDF
  .groupBy($"col_a")
  .agg(
    max(countDistinct("col_b")),
    count("col_a").as("col_a_cnt"))
And the error message: org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
What is wrong here? And is there an efficient way to select the most distinct value?
Answer 0 (Score: 3)
You need two groupBy calls and a join to get a result like the following:
import spark.implicits._
import org.apache.spark.sql.functions._

val data = spark.sparkContext.parallelize(Seq(
  (1, "aaa"), (1, "bbb"),
  (1, "ccc"), (1, "aaa"),
  (1, "aaa"), (1, "aaa"),
  (2, "eee"), (2, "eee"),
  (2, "ggg"), (2, "hhh"),
  (2, "iii"), (3, "222"),
  (3, "333"), (3, "222")
)).toDF("a", "b")

// calculating the count for column a
val countDF = data.groupBy($"a").agg(count("a").as("col_a cnt"))

// calculating and selecting the most distinct value
val distinctDF = data.groupBy($"a", $"b").count()
  .groupBy("a").agg(max(struct("count", "b")).as("max"))
  .select($"a", $"max.b".as("most_distinct_value"))
  // joining both dataframes to get the final result
  .join(countDF, Seq("a"))

distinctDF.show()
Output:
+---+-------------------+---------+
| a|most_distinct_value|col_a cnt|
+---+-------------------+---------+
| 1| aaa| 6|
| 3| 222| 3|
| 2| eee| 5|
+---+-------------------+---------+
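As a side note, the join can be avoided in a single aggregation chain, since the per-group total is just the sum of the per (a, b) pair counts. A minimal sketch, assuming the same data dataframe as above:

// Sketch: one aggregation chain instead of groupBy + join.
// The total row count per "a" equals the sum of the (a, b) pair counts.
val singlePassDF = data.groupBy($"a", $"b").count()
  .groupBy($"a")
  .agg(
    max(struct($"count", $"b")).getField("b").as("most_distinct_value"),
    sum($"count").as("col_a cnt"))

singlePassDF.show()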
Hope this helps!
Answer 1 (Score: 1)
An alternative approach is to do the transformation at the RDD level, as RDD-level transformations can be faster than the DataFrame-level ones.
import sparkSession.implicits._
import org.apache.spark.rdd.RDD

val input = Seq((1, "aaa"), (1, "bbb"), (1, "ccc"), (1, "aaa"), (1, "aaa"),
  (1, "aaa"), (2, "eee"), (2, "eee"), (2, "ggg"), (2, "hhh"), (2, "iii"),
  (3, "222"), (3, "333"), (3, "222"))

val inputRDD: RDD[(Int, String)] = sc.parallelize(input)
The transformation:
val outputRDD: RDD[(Int, String, Int)] =
  inputRDD.groupBy(_._1)              // group the rows by col_a
    .map(row =>
      (row._1,                        // col_a
       row._2.map(_._2)               // all col_b values of the group
         .groupBy(identity)
         .maxBy(_._2.size)._1,        // most frequent col_b value
       row._2.size))                  // number of rows in the group
Now you can create a dataframe from it and show it:
val outputDf: DataFrame = outputRDD.toDF("col_a", "col_b", "col_a cnt")
outputDf.show()
Output:
+-----+-----+---------+
|col_a|col_b|col_a cnt|
+-----+-----+---------+
| 1| aaa| 6|
| 3| 222| 3|
| 2| eee| 5|
+-----+-----+---------+
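Note that groupBy on the raw RDD shuffles every (col_a, col_b) row. A hedged variant of the same idea pre-aggregates with reduceByKey first, so only per-pair counts are shuffled; this sketch assumes the same inputRDD as above:

// Count each (col_a, col_b) pair locally before the shuffle,
// then pick the most frequent col_b and the group total per col_a.
val reducedRDD: RDD[(Int, String, Int)] =
  inputRDD.map { case (a, b) => ((a, b), 1) }
    .reduceByKey(_ + _)                        // (col_a, col_b) -> pair count
    .map { case ((a, b), cnt) => (a, (b, cnt)) }
    .groupByKey()
    .map { case (a, pairs) =>
      (a,
       pairs.maxBy(_._2)._1,                   // most frequent col_b
       pairs.map(_._2).sum)                    // total rows for col_a
    }

reducedRDD.toDF("col_a", "col_b", "col_a cnt").show()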
Answer 2 (Score: 1)
You can achieve your requirement by defining a udf function, together with the collect_list function and the count function (which you have already used). In the udf function you pass in the collected list of col_b values and return the string that occurs most often:
import org.apache.spark.sql.functions._
import scala.collection.mutable

def maxCountdinstinct = udf((list: mutable.WrappedArray[String]) => {
  list.groupBy(identity)   // grouping the equal strings together
    .mapValues(_.size)     // counting each grouped string
    .maxBy(_._2)._1        // returning the string with the max count
})
You can call the udf function as:
val DF = originalDF
  .groupBy($"col_a")
  .agg(
    maxCountdinstinct(collect_list("col_b")).as("most_distinct_value"),
    count("col_a").as("col_a_cnt"))
which should give you:
+-----+-------------------+---------+
|col_a|most_distinct_value|col_a_cnt|
+-----+-------------------+---------+
|3 |222 |3 |
|1 |aaa |6 |
|2 |eee |5 |
+-----+-------------------+---------+
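For comparison, a similar result can be obtained without a udf by ranking the (col_a, col_b) pair counts with a window function. This is only a sketch, not part of the original answer, and assumes the same originalDF and spark.implicits._ as in the question:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Count each (col_a, col_b) pair, take the group total as the sum of those
// counts, and keep the most frequent col_b per col_a.
val byGroup = Window.partitionBy($"col_a")
val noUdfDF = originalDF
  .groupBy($"col_a", $"col_b").agg(count("*").as("cnt"))
  .withColumn("col_a_cnt", sum($"cnt").over(byGroup))
  .withColumn("rn", row_number().over(byGroup.orderBy($"cnt".desc)))
  .filter($"rn" === 1)
  .select($"col_a", $"col_b".as("most_distinct_value"), $"col_a_cnt")

noUdfDF.show()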