How to use countDistinct in Spark / Scala?

Asked: 2017-07-03 13:27:51

Tags: scala apache-spark dataframe

I am trying to aggregate a column of a Spark DataFrame using Scala, as follows:

import org.apache.spark.sql._

dfNew.agg(countDistinct("filtered"))

But I get this error:

 error: value agg is not a member of Unit

Can anyone explain why?
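The error itself says that dfNew has type Unit, not DataFrame, so agg cannot be found on it. That typically happens when the variable was assigned from a call that returns Unit, such as show(). A minimal sketch of that pitfall, using hypothetical names:

val dfNew = someDf.select("filtered").show() // show() returns Unit, so dfNew: Unit
// dfNew.agg(countDistinct("filtered"))      // fails: value agg is not a member of Unit

// Assign the DataFrame itself and call show() separately:
val dfFixed = someDf.select("filtered")      // dfFixed is a DataFrame
dfFixed.show()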

Edit: to clarify what I am trying to do: I have a column of string arrays, and I want to count the distinct elements across all rows; I am not interested in any other column. The data:

+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|racist|filtered                                                                                                                                                      |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, , https://time.com/sxp3onz1w8]                                                                      |
|false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay]                                                                            |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+

I want counts over the elements of filtered, giving:

rt:2, @dope_promo:1, crew:1, ...frog:2 etc

1 Answer:

Answer 0 (score: 2):

You need to explode the array before counting occurrences. To see the count of each element:

import org.apache.spark.sql.functions.explode
// the $-syntax needs the implicits in scope (automatic in spark-shell):
// import spark.implicits._

dfNew
  .withColumn("filtered", explode($"filtered")) // one output row per array element
  .groupBy($"filtered")
  .count
  .orderBy($"count".desc)
  .show
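For reference, a self-contained sketch that reproduces this on the two sample rows from the question (the session setup and data construction here are assumptions, not part of the original):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder.appName("explode-count").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data reconstructed from the rows shown in the question
val dfNew = Seq(
  (false, Seq("rt", "@dope_promo:", "crew", "beat", "high", "scores", "fugly", "frog", "", "https://time.com/sxp3onz1w8")),
  (false, Seq("rt", "@axolrose:", "yall", "call", "kermit", "frog", "lizard?", "", "https://time.com/wdaeaer1ay"))
).toDF("racist", "filtered")

dfNew
  .withColumn("filtered", explode($"filtered"))
  .groupBy($"filtered")
  .count
  .orderBy($"count".desc)
  .show(false)
// On these two rows, rt, frog, and the empty token each appear twice; every other token once.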

Or, to get just the number of distinct elements:

val count = dfNew
  .withColumn("filtered", explode($"filtered"))
  .select($"filtered")
  .distinct // deduplicate elements across all rows
  .count    // returns the number of distinct elements as a Long
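And since the question originally asked about countDistinct: the same count can be expressed as an aggregate once the array is exploded. A sketch, assuming the same dfNew (the token and distinct_elements column names are made up here):

import org.apache.spark.sql.functions.{countDistinct, explode}

dfNew
  .select(explode($"filtered").as("token"))             // one token per row
  .agg(countDistinct($"token").as("distinct_elements")) // same number as the distinct/count above
  .show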