Sum only distinct values per group

Time: 2019-05-11 06:42:06

Tags: apache-spark

I have a DataFrame that looks like this:

Region   State  Volume   Hour   Price
South    GA     23       1      35
South    GA     23       2      50
South    FL     35       3      60
South    FL     35       4      22

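For a given Region and State, the Volume is always the same. What I want is to sum the distinct volumes across the whole region, counting each state's volume only once. For example, the resulting DataFrame should look like this: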

Region   State  Volume   Hour   Price  TotalVolumeInRegion
South    GA     23       1      35     58
South    GA     23       2      50     58
South    FL     35       3      60     58
South    FL     35       4      22     58

Note that we only add 23 + 35. How can we do this?

1 Answer:

Answer 0: (score: 1)

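Since distinct window functions are not supported, you can achieve this with a join: compute the per-region total on the de-duplicated rows, then join it back onto the original DataFrame.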

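// Already in scope in spark-shell; elsewhere `spark` is assumed to be your SparkSession.
import org.apache.spark.sql.functions.sum
import spark.implicits._

val df = Seq(
  ("South", "GA", 23, 1, 35),
  ("South", "GA", 23, 2, 50),
  ("South", "FL", 35, 3, 60),
  ("South", "FL", 35, 4, 22)
).toDF("Region", "State", "Volume", "Hour", "Price")

// Keep one row per (Region, State, Volume), then sum Volume per Region.
val totals = df
  .select($"Region", $"State", $"Volume")
  .distinct()
  .groupBy($"Region")
  .agg(sum($"Volume") as "TotalVolumeInRegion")

// Attach the regional total to every original row.
df.join(totals, usingColumn = "Region").show()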

Output:

+------+-----+------+----+-----+-------------------+
|Region|State|Volume|Hour|Price|TotalVolumeInRegion|
+------+-----+------+----+-----+-------------------+
| South|   GA|    23|   1|   35|                 58|
| South|   GA|    23|   2|   50|                 58|
| South|   FL|    35|   3|   60|                 58|
| South|   FL|    35|   4|   22|                 58|
+------+-----+------+----+-----+-------------------+
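To make the three steps of the join approach easier to follow, here is a minimal sketch of the same logic using plain Scala collections, with no Spark involved. It is only an illustration: the `Sale` case class and the names `distinctVolumes`, `totals` and `withTotals` are made up for this example, and it is written script-style (as you would paste into a REPL), like the snippet above.

```scala
// Hypothetical record type for illustration; not part of the original answer.
final case class Sale(region: String, state: String, volume: Int, hour: Int, price: Int)

val rows = Seq(
  Sale("South", "GA", 23, 1, 35),
  Sale("South", "GA", 23, 2, 50),
  Sale("South", "FL", 35, 3, 60),
  Sale("South", "FL", 35, 4, 22)
)

// Step 1: one (region, state, volume) triple per state,
// mirroring select(Region, State, Volume).distinct().
val distinctVolumes = rows.map(r => (r.region, r.state, r.volume)).distinct

// Step 2: sum the de-duplicated volumes per region,
// mirroring groupBy(Region).agg(sum(Volume)).
val totals: Map[String, Int] =
  distinctVolumes.groupBy(_._1).map { case (region, triples) =>
    region -> triples.map(_._3).sum
  }

// Step 3: attach the regional total to every original row,
// mirroring the join on Region.
val withTotals: Seq[(Sale, Int)] = rows.map(r => (r, totals(r.region)))

withTotals.foreach { case (r, total) =>
  println(s"${r.region} ${r.state} ${r.volume} ${r.hour} ${r.price} $total")
}
```

De-duplicating on the (Region, State, Volume) triple rather than on Volume alone matters: if two different states in a region happened to have the same volume, both would still be counted.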