Spark org.apache.spark.sql.DataFrame: how to calculate avg on a second column

Date: 2016-06-13 02:52:21

Tags: apache-spark-sql

I want to calculate the average of a grouped field, similar to the following SQL queries:

select count(*) as total_count
from tbl1
where col2 is NULL;

select col3, count(*)/total_count as avg_count
from tbl1
where col2 is NULL 
group by col3;
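
For reference, a rough equivalent of these two queries run through Spark SQL itself (a minimal sketch, assuming the df loaded below has been registered as a temp table named tbl1 under the Spark 1.x API; the table and column names come from the queries above):

// sketch: df is the DataFrame loaded further down
df.registerTempTable("tbl1")
// count(*) comes back as a bigint, so read it as a Long
val totalCount = sqlContext
  .sql("select count(*) as total_count from tbl1 where col2 is null")
  .first()
  .getLong(0)
// substitute the scalar total into the second query
val avgCountDF = sqlContext.sql(
  s"select col3, count(*) / $totalCount as avg_count " +
  "from tbl1 where col2 is null group by col3")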

Here are the Spark statements I have worked through so far. I already have total_count:

import org.apache.spark.sql.functions.{col, lit}

// load the source table and tag the bad rows
val df = sqlContext.read.parquet("/user/hive/warehouse/xxx.db/fff")
val badDF = df.filter("col2 = ' '").withColumn("INVALID_COL_NAME", lit("XXX"))
// count the bad rows per col3 group, renaming the count column to CNT
val badGrp1 = badDF.groupBy("col3").count()
val badGrp2 = badGrp1.select(col("col3"), col("count").as("CNT"))

Now, to find the avg as CNT/total_count, how do I proceed?

I tried map and Row, but it didn't work:

val badGrp3 = badGrp2.map(row => Row(row._1, row._2/20)) // for now I am assuming 20 as total_count

Can anyone suggest how to proceed?

Thanks.

1 Answer:

Answer 0 (score: 1)

I don't know much about Scala, but from your code I think you are treating Row as a Scala Tuple in this line:

val badGrp3 = badGrp2.map(row => Row(row._1, row._2/20))

To get data out of a Row in Spark, you can use Row's accessor methods, like:

// suppose you are getting the 1st and 2nd values of the row,
// where the 2nd value (the count) is a Long
val badGrp3 = badGrp2.map(row => Row(row.get(0), row.getLong(1) / 20))
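
Alternatively, the division can stay in the DataFrame API, so no map over Rows is needed at all. A minimal sketch, assuming total_count should come from counting the same filtered DataFrame (badDF) rather than being hard-coded as 20:

import org.apache.spark.sql.functions.col

// count() is an action that brings the scalar total back to the driver
val totalCount = badDF.count()
// divide each group's CNT by the total as a new column
val avgDF = badGrp2.withColumn("avg_count", col("CNT") / totalCount)
avgDF.show()

Since total_count is a single scalar, pulling it to the driver with count() and reusing it as a literal keeps the rest of the pipeline in DataFrame operations, which Spark can optimize, instead of dropping down to an RDD of Rows.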