Spark org.apache.spark.sql.DataFrame: how to calculate avg on a second column

Date: 2016-06-13 02:52:21

Tags: apache-spark-sql

I want to calculate the average of a grouped field, similar to the following SQL queries:

select count(*) as total_count
from tbl1
where col2 is NULL;

select col3, count(*)/total_count as avg_count
from tbl1
where col2 is NULL 
group by col3;
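
For reference, a rough equivalent of these two queries run through Spark SQL itself (a minimal sketch, assuming the df loaded below has been registered as a temp table named tbl1 under the Spark 1.x API; the table and column names come from the queries above):

// sketch: df is the DataFrame loaded further down
df.registerTempTable("tbl1")
// count(*) comes back as a bigint, so read it as a Long
val totalCount = sqlContext
  .sql("select count(*) as total_count from tbl1 where col2 is null")
  .first()
  .getLong(0)
// substitute the scalar total into the second query
val avgCountDF = sqlContext.sql(
  s"select col3, count(*) / $totalCount as avg_count " +
  "from tbl1 where col2 is null group by col3")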

Here are the Spark statements I have worked through so far. I already have total_count:

import org.apache.spark.sql.functions.{col, lit}

// load the source table and tag the bad rows
val df = sqlContext.read.parquet("/user/hive/warehouse/xxx.db/fff")
val badDF = df.filter("col2 = ' '").withColumn("INVALID_COL_NAME", lit("XXX"))
// count the bad rows per col3 group, renaming the count column to CNT
val badGrp1 = badDF.groupBy("col3").count()
val badGrp2 = badGrp1.select(col("col3"), col("count").as("CNT"))

Now, to find the avg as CNT/total_count, how do I proceed?

I tried map and Row, but it didn't work:

val badGrp3 = badGrp2.map(row => Row(row._1, row._2/20)) // for now I am assuming 20 as total_count

Can anyone suggest how to proceed?

Thanks.

1 Answer:

Answer 0 (score: 1)

I don't know much about Scala, but from your code I think you are treating Row as a Scala Tuple in this line:

val badGrp3 = badGrp2.map(row => Row(row._1, row._2/20))

To get data out of a Row in Spark, you can use Row's accessor methods, like:

// suppose you are getting the 1st and 2nd values of the row,
// where the 2nd value (the count) is a Long
val badGrp3 = badGrp2.map(row => Row(row.get(0), row.getLong(1) / 20))
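
Alternatively, the division can stay in the DataFrame API, so no map over Rows is needed at all. A minimal sketch, assuming total_count should come from counting the same filtered DataFrame (badDF) rather than being hard-coded as 20:

import org.apache.spark.sql.functions.col

// count() is an action that brings the scalar total back to the driver
val totalCount = badDF.count()
// divide each group's CNT by the total as a new column
val avgDF = badGrp2.withColumn("avg_count", col("CNT") / totalCount)
avgDF.show()

Since total_count is a single scalar, pulling it to the driver with count() and reusing it as a literal keeps the rest of the pipeline in DataFrame operations, which Spark can optimize, instead of dropping down to an RDD of Rows.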