I am trying to get the frequency of distinct values in a Spark DataFrame column, like `value_counts` in Python pandas. By frequency I mean the count of the most common value in a column (i.e. the rank-1 value, then rank 2, rank 3, and so on). In the expected output below, the value 1 appears 9 times in column a, so it has the highest frequency.
I am using Spark SQL, but it is not working, probably because the reduce operation I wrote is wrong.
**Pandas Example**
value_counts().index[1]
**Current Code in Spark**
val x = parquetRDD_subset.schema.fieldNames
val dfs = x.map(field => spark.sql(
  s"""select 'ParquetRDD' as TableName,
      '$field' as column,
      min($field) as min, max($field) as max,
      SELECT number_cnt FROM (SELECT $field as value,
      approx_count_distinct($field) as number_cnt FROM peopleRDDtable
      group by $field) as frequency from peopleRDDtable"""))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
withSum.show()
The problem area is the query below.
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field)
**Expected output**
TableName   | column | min | max | frequency1
------------+--------+-----+-----+-----------
ParquetRDD  | a      | 1   | 30  | 9
ParquetRDD  | b      | 2   | 21  | 5
How can I fix this? Please help.
**Answer** (score: 0)
I was able to solve this by using count($field) instead of approx_count_distinct($field). I then used the Rank analytic function to get the rank-1 value (the frequency of the most common value), and it worked.
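Below is a minimal sketch of that count-plus-rank idea, written against the DataFrame API rather than raw SQL. It assumes `parquetRDD_subset` and `spark` from the question are available; intermediate names such as `counts`, `topFrequency` and `minMax` are illustrative only.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val dfs = parquetRDD_subset.schema.fieldNames.map { field =>
  // Count how often each distinct value occurs in the column, then rank the
  // counts so that rank 1 corresponds to the most frequent value.
  val counts = parquetRDD_subset
    .groupBy(col(field).as("value"))
    .agg(count(col(field)).as("cnt"))
    .withColumn("rnk", rank().over(Window.orderBy(col("cnt").desc)))

  // Frequency of the most common value (the "frequency1" column in the expected output).
  val topFrequency = counts
    .filter(col("rnk") === 1)
    .select(col("cnt").as("frequency1"))
    .limit(1)

  // Min/max of the column plus the table and column labels.
  val minMax = parquetRDD_subset
    .agg(min(col(field)).as("min"), max(col(field)).as("max"))
    .withColumn("TableName", lit("ParquetRDD"))
    .withColumn("column", lit(field))
    .select("TableName", "column", "min", "max")

  // Both sides are single-row DataFrames, so a cross join yields one row per column.
  minMax.crossJoin(topFrequency)
}

// Assumes the per-column results have union-compatible types, as in the original code.
val withSum = dfs.reduce(_ union _)
withSum.show()

The same idea can be expressed in pure SQL by putting the COUNT/RANK query in a subquery and cross joining it with the min/max aggregation; either way avoids the nested SELECT inside the column list, which is what made the original query fail.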