Scala Spark DataFrame - sum the contents of an Array[Double] for each row

Asked: 2017-09-08 16:02:51

Tags: scala apache-spark apache-spark-sql spark-dataframe

Here is my base DataFrame:

root |-- user_id: string (nullable = true) 
     |-- review_id: string (nullable = true) 
     |-- review_influence: double (nullable = false)

The goal is to get the sum of review_influence for each user_id. So I tried to aggregate the data and sum it like this:

val review_influence_listDF = review_with_influenceDF
  .groupBy("user_id")
  .agg(collect_list("review_id").as("list_review_id"),
       collect_list("review_influence").as("list_review_influence"))
  .agg(sum($"list_review_influence"))

But I got this error:

org.apache.spark.sql.AnalysisException: cannot resolve 'sum(`list_review_influence`)' due to data type mismatch: function sum requires numeric types, not ArrayType(DoubleType,true);;

What should I do?

1 Answer:

Answer 0 (score: 1)

You can sum the column directly inside the agg function, instead of collecting it into an array first:

review_with_influenceDF
    .groupBy("user_id")
    .agg(collect_list($"review_id").as("list_review_id"), 
         sum($"review_influence").as("sum_review_influence"))
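The semantics of this fix can be illustrated without a Spark session, using plain Scala collections: group the rows by user, collect the review ids, and sum the influences in one pass. This is only a sketch of the grouping logic; the `Review` case class and sample rows are hypothetical, not taken from the question's data.

```scala
// Minimal sketch of groupBy("user_id").agg(collect_list(...), sum(...))
// using plain Scala collections. Sample data is hypothetical.
case class Review(userId: String, reviewId: String, influence: Double)

val reviews = Seq(
  Review("u1", "r1", 0.5),
  Review("u1", "r2", 1.5),
  Review("u2", "r3", 2.0)
)

// For each user: (list of review ids, sum of review influences)
val aggregated: Map[String, (Seq[String], Double)] =
  reviews.groupBy(_.userId).map { case (user, rows) =>
    user -> (rows.map(_.reviewId), rows.map(_.influence).sum)
  }

// aggregated("u1") == (Seq("r1", "r2"), 2.0)
```

The key point is the same as in the Spark answer: the sum is computed over the numeric values while grouping, so no intermediate `ArrayType` column is ever created, which is what triggered the `data type mismatch` error.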