How to get a sum in Spark

Time: 2018-03-23 15:13:41

Tags: apache-spark hadoop dataframe apache-spark-sql

I am new to Spark and am using Spark version 2.2. I have input data in the following JSON format:

{"a_id":6336,"b_sum":10.0,"m_cd":["abc00053"],"td_cnt":[10.0]}
{"a_id":6336,"b_sum":10.0,"m_cd":["abc00053"],"td_cnt":[5.0]}
{"a_id":6336,"b_sum":10.0,"m_cd":["abc00054"],"td_cnt":[20.0]}
{"a_id":6336,"b_sum":10.0,"m_cd":["abc00056"],"td_cnt":[30.0]}
{"a_id":6339,"b_sum":10.0,"m_cd":["abc00051"],"td_cnt":[12.0]}
{"a_id":6339,"b_sum":10.0,"m_cd":["abc00057"],"td_cnt":[7.0]}
{"a_id":6339,"b_sum":10.0,"m_cd":["abc00055"],"td_cnt":[10.0]}
{"a_id":6339,"b_sum":10.0,"m_cd":["abc00058"],"td_cnt":[20.0]}
{"a_id":6339,"b_sum":10.0,"m_cd":["null"],"td_cnt":[null]}

I want to group the records by the a_id and b_sum columns, collect the m_cd values together with their corresponding td_cnt values into something like

array(["abc00053":15.0,"abc00054":20.0,"abc00056":30.0])

and then add the sum of those td_cnt values as a new td_cnt_sum column in the dataframe.

Expected output:

{"a_id":6336,"b_sum":10.0,"td_cnt":["abc00053":15.0,"abc00054":20.0,"abc00056":30.0],"td_cnt_sum":65}
{"a_id":6339,"b_sum":10.0,"td_cnt":["abc00051":12,"abc00057":7.0,"abc00055":10.0,'abc00058":20.0],"td_cnt_sum":49}

Any help is appreciated.

1 Answer:

Answer 0 (score: 0):

You can use groupBy/agg with sum and collect_list together.

Here is an example.

You can read the json file as:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read azure storage").master("local[*]").getOrCreate()

import spark.implicits._
import org.apache.spark.sql.functions._

val data = spark.read.json("path to json file")
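
To confirm that m_cd and td_cnt really come in as array columns, a quick schema check helps; the inferred schema should look roughly like this (this check is not part of the original answer):

data.printSchema()
// root
//  |-- a_id: long (nullable = true)
//  |-- b_sum: double (nullable = true)
//  |-- m_cd: array (nullable = true)
//  |    |-- element: string (containsNull = true)
//  |-- td_cnt: array (nullable = true)
//  |    |-- element: double (containsNull = true)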

Since your data loads m_cd and td_cnt as arrays, select the first element if each array always holds a single value; otherwise, explode them to add one row per value in the array (see the sketch after the next code block):

val df = data.select(
  $"a_id", $"b_sum",
  $"m_cd"(0).as("m_cd"),
  $"td_cnt"(0).as("td_cnt"))
  .na.drop() // also drops the rows that have nulls in these columns
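
If the arrays can hold more than one value, a posexplode-based variant along these lines keeps the two arrays aligned by position (a sketch, assuming m_cd and td_cnt always have the same length; dfExploded is just an illustrative name):

import org.apache.spark.sql.functions.{posexplode, expr}

val dfExploded = data
  .select($"a_id", $"b_sum", $"td_cnt", posexplode($"m_cd").as(Seq("pos", "m_cd")))
  .withColumn("td_cnt", expr("td_cnt[pos]")) // pick the td_cnt at the same position
  .drop("pos")
  .na.drop()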

Then use groupBy twice: first to sum td_cnt for each (a_id, m_cd) pair, then to collect the list and the overall total per a_id (include b_sum in both groupBy calls if you also need it in the result):

val df1 = df.groupBy("a_id", "m_cd")
  .agg(sum("td_cnt").as("td_cnt_sum"))
  .groupBy("a_id")
  .agg(
    collect_list(struct("m_cd", "td_cnt_sum")).as("td_cnt"),
    sum("td_cnt_sum").as("td_cnt_sum"))

df1.show(false)
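
If you need td_cnt as a map from m_cd to its sum, which is closer to the JSON shape in the question, one option on Spark 2.2 is a small UDF (a sketch; entriesToMap is an illustrative name, and map_from_entries only arrived in later Spark versions):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// turn the collected array of (m_cd, td_cnt_sum) structs into a Map[String, Double]
val entriesToMap = udf((entries: Seq[Row]) =>
  entries.map(r => r.getString(0) -> r.getDouble(1)).toMap)

val df2 = df1.withColumn("td_cnt", entriesToMap($"td_cnt"))
df2.toJSON.show(false)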

This produces the output below, which you can reshape further to match your required format:

+----+-------------------------------------------------------------------+----------+
|a_id|td_cnt                                                             |td_cnt_sum|
+----+-------------------------------------------------------------------+----------+
|6336|[[abc00056,30.0], [abc00053,15.0], [abc00054,20.0]]                |65.0      |
|6339|[[abc00055,10.0], [abc00051,12.0], [abc00057,7.0], [abc00058,20.0]]|49.0      |
+----+-------------------------------------------------------------------+----------+
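
If you also need to persist the result in the same JSON-lines format as the input, you can write it back out (the path is a placeholder):

df1.write.mode("overwrite").json("path to output directory")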