I am new to Spark and am using Spark 2.2. I have the following input data in JSON format:
{"a_id":6336,"b_sum":10.0,"m_cd":["abc00053"],"td_cnt":[10.0]}
{"a_id":6336,"b_sum":10.0,"m_cd":["abc00053"],"td_cnt":[5.0]}
{"a_id":6336,"b_sum":10.0,"m_cd":["abc00054"],"td_cnt":[20.0]}
{"a_id":6336,"b_sum":10.0,"m_cd":["abc00056"],"td_cnt":[30.0]}
{"a_id":6339,"b_sum":10.0,"m_cd":["abc00051"],"td_cnt":[12.0]}
{"a_id":6339,"b_sum":10.0,"m_cd":["abc00057"],"td_cnt":[7.0]}
{"a_id":6339,"b_sum":10.0,"m_cd":["abc00055"],"td_cnt":[10.0]}
{"a_id":6339,"b_sum":10.0,"m_cd":["abc00058"],"td_cnt":[20.0]}
{"a_id":6339,"b_sum":10.0,"m_cd":["null"],"td_cnt":[null]}
I want to group the records by the a_id and b_sum columns, collect each m_cd together with its corresponding (summed) td_cnt into an array, e.g. ["abc00053":15.0,"abc00054":20.0,"abc00056":30.0], and then add the total of the td_cnt values as a new column in the DataFrame.
Expected output:
{"a_id":6336,"b_sum":10.0,"td_cnt":["abc00053":15.0,"abc00054":20.0,"abc00056":30.0],"td_cnt_sum":65}
{"a_id":6339,"b_sum":10.0,"td_cnt":["abc00051":12,"abc00057":7.0,"abc00055":10.0,'abc00058":20.0],"td_cnt_sum":49}
Please help.
Answer 0 (score: 0)
You can use groupBy/agg with sum and collect_list at the same time. Here is an example.
You can read the JSON file as:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read azure storage").master("local[*]").getOrCreate()
import spark.implicits._
val data = spark.read.json("path to json file")
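It can help to check what Spark inferred from the JSON before reshaping it; with the sample records above, the two list fields should come back as array columns:

// Expected (roughly): a_id: long, b_sum: double,
//                     m_cd: array<string>, td_cnt: array<double>
data.printSchema()
data.show(false)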
Since your m_cd and td_cnt fields are loaded as array columns, select the first element if they always hold a single value; otherwise explode them so that every array element gets its own row (a sketch of the explode variant follows the code below).
// Take the first element of each array and drop rows containing nulls.
val df = data.select(
    $"a_id", $"b_sum",
    $"m_cd"(0).as("m_cd"),
    $"td_cnt"(0).as("td_cnt"))
  .na.drop()
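If the arrays can hold more than one value, here is a minimal sketch of the explode variant (assuming each m_cd entry pairs with the td_cnt entry at the same position; dfExploded is just an illustrative name):

import org.apache.spark.sql.functions._

// One row per array element; td_cnt is picked by the matching position.
val dfExploded = data
  .select($"a_id", $"b_sum", posexplode($"m_cd").as(Seq("pos", "m_cd")), $"td_cnt")
  .withColumn("td_cnt", expr("td_cnt[pos]"))
  .drop("pos")
  .na.drop()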
Then use groupBy twice to get the output:
import org.apache.spark.sql.functions._

// Sum td_cnt per (a_id, m_cd), then collect the pairs and the total per a_id.
val df1 = df.groupBy("a_id", "m_cd")
  .agg(sum("td_cnt").as("td_cnt_sum"))
  .groupBy("a_id")
  .agg(
    collect_list(struct("m_cd", "td_cnt_sum")).as("td_cnt"),
    sum("td_cnt_sum").as("td_cnt_sum"))

df1.show(false)
Now you have the output and can adapt it further to your requirement:
+----+-------------------------------------------------------------------+----------+
|a_id|td_cnt |td_cnt_sum|
+----+-------------------------------------------------------------------+----------+
|6336|[[abc00056,30.0], [abc00053,15.0], [abc00054,20.0]] |65.0 |
|6339|[[abc00055,10.0], [abc00051,12.0], [abc00057,7.0], [abc00058,20.0]]|49.0 |
+----+-------------------------------------------------------------------+----------+
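Your expected output also keeps b_sum, so one option (a sketch, assuming b_sum is constant within each a_id group, as in your sample; df2 is just an illustrative name) is to carry it through both groupBy calls:

val df2 = df.groupBy("a_id", "b_sum", "m_cd")
  .agg(sum("td_cnt").as("td_cnt_sum"))
  .groupBy("a_id", "b_sum")
  .agg(
    collect_list(struct("m_cd", "td_cnt_sum")).as("td_cnt"),
    sum("td_cnt_sum").as("td_cnt_sum"))

df2.show(false)

From there, df2.toJSON or df2.write.json with an output path would give you JSON lines close to your expected output.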