I have a data frame that looks like this:
Region State Volume Hour Price
South GA 23 1 35
South GA 23 2 50
South FL 35 3 60
South FL 35 4 22
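Within a given Region, a State always has the same Volume. What I want to do is sum the distinct Volume values across the whole Region, counting each State once. So, for example, the resulting data frame should look like this: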
Region State Volume Hour Price TotalVolumeInRegion
South GA 23 1 35 58
South GA 23 2 50 58
South FL 35 3 60 58
South FL 35 4 22 58
Note that we only add 23 + 35. How can we do this?
Answer 0 (score: 1)
Since distinct window functions are not supported, you can do this with a join instead.
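import org.apache.spark.sql.functions.sum
// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import spark.implicits._

val df = Seq(
  ("South", "GA", 23, 1, 35),
  ("South", "GA", 23, 2, 50),
  ("South", "FL", 35, 3, 60),
  ("South", "FL", 35, 4, 22)
).toDF("Region", "State", "Volume", "Hour", "Price")

// Keep one row per (Region, State, Volume), then sum Volume within each Region.
val totals = df
  .select($"Region", $"State", $"Volume")
  .distinct()
  .groupBy($"Region")
  .agg(sum($"Volume") as "TotalVolumeInRegion")

// Join the per-region totals back onto every original row.
df.join(totals, usingColumn = "Region").show()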
Output:
+------+-----+------+----+-----+-------------------+
|Region|State|Volume|Hour|Price|TotalVolumeInRegion|
+------+-----+------+----+-----+-------------------+
| South| GA| 23| 1| 35| 58|
| South| GA| 23| 2| 50| 58|
| South| FL| 35| 3| 60| 58|
| South| FL| 35| 4| 22| 58|
+------+-----+------+----+-----+-------------------+
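If it helps to see the logic without a Spark session, here is a minimal sketch of the same distinct, sum, then join steps on plain Scala collections. The `Sale` case class and the `RegionTotals` helper are hypothetical names made up for illustration; they are not part of Spark or of the answer above.

```scala
// Hypothetical row type mirroring the columns of the example data frame.
final case class Sale(region: String, state: String, volume: Int, hour: Int, price: Int)

object RegionTotals {
  // Count each (region, state, volume) combination once, then sum per region.
  def totalsByRegion(rows: Seq[Sale]): Map[String, Int] =
    rows
      .map(r => (r.region, r.state, r.volume))
      .distinct
      .groupBy(_._1)
      .map { case (region, triples) => region -> triples.map(_._3).sum }

  // Attach the regional total to every original row (the "join" step).
  def withTotals(rows: Seq[Sale]): Seq[(Sale, Int)] = {
    val totals = totalsByRegion(rows)
    rows.map(r => (r, totals(r.region)))
  }
}
```

Deduplicating on (Region, State, Volume) rather than on Volume alone matters: two different states that happen to share the same volume would each still be counted once, whereas deduplicating on Volume alone would collapse them into a single value.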