How do I compute a row sum while pivoting a DataFrame?
val someDF = Seq(
("user1", "math","algebra-1","90"),
("user1", "physics","gravity","70"),
("user3", "biology","health","50"),
("user2", "biology","health","100"),
("user1", "math","algebra-1","40"),
("user2", "physics","gravity-2","20")
).toDF("user_id", "course_id","lesson_name","score")
someDF.show()
+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
| user1| math| algebra-1| 90|
| user1| physics| gravity| 70|
| user3| biology| health| 50|
| user2| biology| health| 100|
| user1| math| algebra-1| 40|
| user2| physics| gravity-2| 20|
+-------+---------+-----------+-----+
val result = someDF.groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
result.show()
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
|user3 |biology |null |null |null |50 |
|user1 |math |90 |null |null |null |
|user2 |biology |null |null |null |100 |
|user2 |physics |null |null |20 |null |
|user1 |physics |null |70 |null |null |
+-------+---------+---------+-------+---------+------+
Expected output: a sum column over all the lesson-name columns
+-------+---------+---------+-------+---------+------+----+
|user_id|course_id|algebra-1|gravity|gravity-2|health|sum |
+-------+---------+---------+-------+---------+------+----+
|user3 |biology |null |null |null |50 |50 |
|user1 |math |90 |null |null |null |90 |
|user2 |biology |null |null |null |100 |100 |
|user2 |physics |null |null |20 |null |20 |
|user1 |physics |null |70 |null |null |70 |
+-------+---------+---------+-------+---------+------+----+
But how do I get the sum of the score values across all lesson_name columns for a given course_id and batch_id?
Any suggestions?
Answer 0 (score: 1)
I achieved this using Window.partitionBy; it may be useful to someone.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, sum}
val someDF = Seq(
("user1", "math","algebra-1","90"),
("user1", "physics","gravity","70"),
("user3", "biology","health","50"),
("user2", "biology","health","100"),
("user1", "math","algebra-1","40"),
("user2", "physics","gravity-2","20")
).toDF("user_id", "course_id","lesson_name","score")
val assessmentAggDf = Window.partitionBy("user_id","course_id")
val aggregatedDF = someDF.withColumn("total_sum_score", sum("score") over assessmentAggDf)
val result = aggregatedDF.groupBy("user_id", "course_id","total_sum_score").pivot("lesson_name").agg(first("score"))
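The effect of `sum("score") over Window.partitionBy("user_id", "course_id")` can be modeled with plain Scala collections (a standalone sketch, not the Spark API): every raw row in a partition contributes to that partition's total.

```scala
// Raw rows: (user_id, course_id, lesson_name, score), same data as someDF
// but with scores as Int instead of String.
val rows = Seq(
  ("user1", "math", "algebra-1", 90),
  ("user1", "physics", "gravity", 70),
  ("user3", "biology", "health", 50),
  ("user2", "biology", "health", 100),
  ("user1", "math", "algebra-1", 40),
  ("user2", "physics", "gravity-2", 20)
)

// Partition by (user_id, course_id) and sum the scores within each partition,
// mirroring sum("score") over Window.partitionBy("user_id", "course_id").
val totals = rows
  .groupBy(r => (r._1, r._2))
  .map { case (key, grp) => key -> grp.map(_._4).sum }

println(totals(("user1", "math")))    // 130: both algebra-1 rows are summed
println(totals(("user2", "physics"))) // 20
```

Note one subtlety this sketch exposes: the window sum counts every raw row, so user1/math totals 90 + 40 = 130, while the pivot keeps only first("score") = 90. When duplicate (user, course, lesson) rows exist, total_sum_score can therefore differ from the sum of the pivoted columns shown in the expected output.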
Answer 1 (score: 0)
@manju I am replying to you again, but only for this question.
Spark 2.4.3
scala> result.show
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
| user3| biology| null| null| null| 50|
| user1| math| 90| null| null| null|
| user2| biology| null| null| null| 100|
| user2| physics| null| null| 20| null|
| user1| physics| null| 70| null| null|
+-------+---------+---------+-------+---------+------+
Replace "-" with "_" in all column names, since hyphens cause errors when referencing DataFrame columns inside selectExpr().
scala> val new_cols = result.columns.map(x => x.replaceAll("-", "_"))
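The renaming step can be checked in isolation with plain Scala string operations, without a Spark session (the column list below is taken from the pivoted result):

```scala
// The pivoted column names, including the hyphenated lesson names.
val cols = Array("user_id", "course_id", "algebra-1", "gravity", "gravity-2", "health")

// Same transformation as new_cols above: swap every "-" for "_".
val newCols = cols.map(_.replaceAll("-", "_"))

println(newCols.mkString(", "))
// user_id, course_id, algebra_1, gravity, gravity_2, health
```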
Normally adding null to an Integer yields null, but we can use the coalesce function to get the desired output.
scala> result.toDF(new_cols : _*).selectExpr("*","coalesce(algebra_1, 0) +coalesce(gravity, 0)+coalesce(gravity_2, 0)+coalesce(health,0) sum ").show
+-------+---------+---------+-------+---------+------+-----+
|user_id|course_id|algebra_1|gravity|gravity_2|health| sum|
+-------+---------+---------+-------+---------+------+-----+
| user3| biology| null| null| null| 50| 50.0|
| user1| math| 90| null| null| null| 90.0|
| user2| biology| null| null| null| 100|100.0|
| user2| physics| null| null| 20| null| 20.0|
| user1| physics| null| 70| null| null| 70.0|
+-------+---------+---------+-------+---------+------+-----+
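As a standalone sketch (plain Scala, not the Spark API), the null handling that coalesce performs here can be modeled with Option, where None plays the role of SQL null:

```scala
// Plain-Scala stand-in for SQL coalesce(x, 0): a missing score becomes 0.
def coalesceToZero(score: Option[Int]): Int = score.getOrElse(0)

// One pivoted row, e.g. user2/physics across (algebra_1, gravity, gravity_2, health).
val scores = Seq[Option[Int]](None, None, Some(20), None)

// Coalescing each value before adding reproduces the selectExpr arithmetic;
// adding the raw nulls would instead propagate null, as in SQL.
val rowSum = scores.map(coalesceToZero).sum
println(rowSum)  // 20
```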
Read more about coalesce. Let me know if you have any further questions; if this solves your problem, please accept the answer. Happy Hadoop!