I'm trying to display the distinct counts of a couple of different columns in a Spark DataFrame, along with the record count after grouping by the first column.
So if I have col1, col2, and col3, I want to groupBy col1 and then show the distinct count of col2 as well as the distinct count of col3.
Then I want to show the record count after that same groupBy on col1.
And finally, do all of this in a single agg statement.
Any ideas?
Answer 0 (score: 1)
Here is the code you are looking for:
df.groupBy("COL1").agg(countDistinct("COL2"),countDistinct("COL3"),count($"*")).show
======= Tested below ============
scala> val lst = List(("a","x","d"),("b","D","s"),("ss","kk","ll"),("a","y","e"),("b","c","y"),("a","x","y"));
lst: List[(String, String, String)] = List((a,x,d), (b,D,s), (ss,kk,ll), (a,y,e), (b,c,y), (a,x,y))
scala> val rdd=sc.makeRDD(lst);
rdd: org.apache.spark.rdd.RDD[(String, String, String)] = ParallelCollectionRDD[7] at makeRDD at <console>:26
scala> val df = rdd.toDF("COL1","COL2","COL3");
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]
scala> df.printSchema
root
|-- COL1: string (nullable = true)
|-- COL2: string (nullable = true)
|-- COL3: string (nullable = true)
scala> df.groupBy("COL1").agg(countDistinct("COL2"),countDistinct("COL3"),count($"*")).show
+----+--------------------+--------------------+--------+
|COL1|count(DISTINCT COL2)|count(DISTINCT COL3)|count(1)|
+----+--------------------+--------------------+--------+
| ss| 1| 1| 1|
| b| 2| 2| 2|
| a| 2| 3| 3|
+----+--------------------+--------------------+--------+
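As a sanity check, the same per-group numbers can be reproduced with plain Scala collections (no Spark needed), using the same sample rows as above. This is just an illustrative sketch mirroring the `agg` call: `distinct.size` plays the role of `countDistinct`, and `rows.size` plays the role of `count($"*")`.

```scala
// Same sample rows as the lst used to build the DataFrame above
val lst = List(("a", "x", "d"), ("b", "D", "s"), ("ss", "kk", "ll"),
               ("a", "y", "e"), ("b", "c", "y"), ("a", "x", "y"))

// Group by COL1, then for each group compute: distinct COL2 values,
// distinct COL3 values, and the total number of rows in the group
val summary = lst.groupBy(_._1).map { case (col1, rows) =>
  (col1, rows.map(_._2).distinct.size, rows.map(_._3).distinct.size, rows.size)
}

summary.toList.sortBy(_._1).foreach(println)
```

The results match the `show` output: group `a` has 2 distinct COL2 values, 3 distinct COL3 values, and 3 rows; `b` has 2/2/2; `ss` has 1/1/1.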