I want to compute an average over a grouped field, similar to the SQL queries below:
select count(*) as total_count
from tbl1
where col2 is NULL;
select col3, count(*)/total_count as avg_count
from tbl1
where col2 is NULL
group by col3;
Please see the Spark statements I have gone through. I already have total_count.
import org.apache.spark.sql.functions.{col, lit}

val df = sqlContext.read.parquet("/user/hive/warehouse/xxx.db/fff")
// keep the rows where col2 is blank (the stand-in for the SQL "col2 is NULL")
val badDF = df.filter("col2 = ' '").withColumn("INVALID_COL_NAME", lit("XXX"))
// group the filtered rows by col3 and count each group
val badGrp1 = badDF.groupBy("col3").count()
val badGrp2 = badGrp1.select(col("col3"), col("count").as("CNT"))
Now I need the average as CNT/total_count. How do I proceed?
I tried map and Row, but it did not work:
val badGrp3 = badGrp2.map(row => Row(row._1, row._2/20)) // for now I am assuming 20 as total_count
Can someone suggest how to proceed? Thanks.
Answer 0 (score: 1)
I don't know much about Scala, but from your code I think you have treated the Row as a Scala Tuple in this line:
val badGrp3 = badGrp2.map(row => Row(row._1, row._2/20))
To get data out of a Row in Spark, you should use Row's accessor methods instead, like:
// suppose you are getting the 1st and 2nd value of row
// where the 2nd value (count) is a Long type value
row => Row(row.get(0), row.getLong(1)/20)
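Beyond fixing the accessors, you can stay in the DataFrame API and avoid map/Row entirely. Here is a minimal sketch, assuming the Spark 1.x sqlContext setup from the question, and assuming total_count should be the row count of the filtered badDF rather than the hard-coded 20:

import org.apache.spark.sql.functions.col

// compute the actual total_count instead of hard-coding 20
val totalCount = badDF.count()

// divide each group's CNT by the total; cast to double so the
// division is not truncated to an integer
val badGrp3 = badGrp2.withColumn("avg_count", col("CNT").cast("double") / totalCount)

badGrp3.show()

This keeps the result as a DataFrame (no conversion to an RDD of Row), and the output columns line up with the col3/avg_count shape of the SQL in the question.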