Pivot-reshaping a Spark DataFrame with an additional column

Date: 2019-09-04 09:54:03

Tags: scala dataframe apache-spark

How can I compute a sum while reshaping the DataFrame?

val someDF = Seq(
  ("user1", "math","algebra-1","90"),
  ("user1", "physics","gravity","70"),
  ("user3", "biology","health","50"),
  ("user2", "biology","health","100"),
  ("user1", "math","algebra-1","40"),
  ("user2", "physics","gravity-2","20")
).toDF("user_id", "course_id","lesson_name","score")

someDF.show()
+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
|  user1|     math|  algebra-1|   90|
|  user1|  physics|    gravity|   70|
|  user3|  biology|     health|   50|
|  user2|  biology|     health|  100|
|  user1|     math|  algebra-1|   40|
|  user2|  physics|  gravity-2|   20|
+-------+---------+-----------+-----+


val result = someDF.groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
result.show()
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
|user3  |biology  |null     |null   |null     |50    |
|user1  |math     |90       |null   |null     |null  |
|user2  |biology  |null     |null   |null     |100   |
|user2  |physics  |null     |null   |20       |null  |
|user1  |physics  |null     |70     |null     |null  |
+-------+---------+---------+-------+---------+------+

Expected output: I should get the sum across all the lesson names.

+-------+---------+---------+-------+---------+------+----+
|user_id|course_id|algebra-1|gravity|gravity-2|health|sum |
+-------+---------+---------+-------+---------+------+----+
|user3  |biology  |null     |null   |null     |50    |50  |
|user1  |math     |90       |null   |null     |null  |90  |
|user2  |biology  |null     |null   |null     |100   |100 | 
|user2  |physics  |null     |null   |20       |null  |20  | 
|user1  |physics  |null     |70     |null     |null  |70  | 
+-------+---------+---------+-------+---------+------+----+

But how do I get the sum of the score values across all lesson_name fields for a particular course_id and batch_id?

Any suggestions?

2 Answers:

Answer 0 (score: 1)

I achieved this with Window.partitionBy as shown below; it may be useful to someone.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, sum}

val someDF = Seq(
  ("user1", "math","algebra-1","90"),
  ("user1", "physics","gravity","70"),
  ("user3", "biology","health","50"),
  ("user2", "biology","health","100"),
  ("user1", "math","algebra-1","40"),
  ("user2", "physics","gravity-2","20")
).toDF("user_id", "course_id","lesson_name","score")

val assessmentAggDf = Window.partitionBy("user_id", "course_id")

// total score per user/course, attached to every row before the pivot
val aggregatedDF = someDF.withColumn("total_sum_score", sum("score") over assessmentAggDf)

// keep the window total in the group-by key so it survives the pivot
val result = aggregatedDF.groupBy("user_id", "course_id", "total_sum_score").pivot("lesson_name").agg(first("score"))
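
Since score is a string column in the sample data, Spark casts it implicitly when the window sums it. Below is a minimal sketch of the same approach with an explicit cast; the cast and the byUserCourse/pivoted names are my own additions, the column names are as above.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, sum}

val byUserCourse = Window.partitionBy("user_id", "course_id")

val pivoted = someDF
  .withColumn("score", col("score").cast("int"))                   // make score numeric before aggregating
  .withColumn("total_sum_score", sum("score").over(byUserCourse))  // per user/course total over all raw rows
  .groupBy("user_id", "course_id", "total_sum_score")
  .pivot("lesson_name")
  .agg(first("score"))

pivoted.show()

Note that the window sum counts every raw score in the partition (for example both algebra-1 rows of user1), while the pivot keeps only the first score per lesson.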

Answer 1 (score: 0)

@manju I am writing to you again, but only for this question.

Spark 2.4.3

scala> result.show
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
|  user3|  biology|     null|   null|     null|    50|
|  user1|     math|       90|   null|     null|  null|
|  user2|  biology|     null|   null|     null|   100|
|  user2|  physics|     null|   null|       20|  null|
|  user1|  physics|     null|     70|     null|  null|
+-------+---------+---------+-------+---------+------+

Replace the "-" in every column name with "_", since the hyphen causes an error when those columns are referenced in selectExpr().

scala> val new_cols =  result.columns.map(x => x.replaceAll("-", "_"))

Normally you cannot add null and an integer, but we can use the coalesce function to get the desired output.

scala> result.toDF(new_cols : _*).selectExpr("*", "coalesce(algebra_1, 0) + coalesce(gravity, 0) + coalesce(gravity_2, 0) + coalesce(health, 0) as sum").show
+-------+---------+---------+-------+---------+------+-----+
|user_id|course_id|algebra_1|gravity|gravity_2|health|  sum|
+-------+---------+---------+-------+---------+------+-----+
|  user3|  biology|     null|   null|     null|    50| 50.0|
|  user1|     math|       90|   null|     null|  null| 90.0|
|  user2|  biology|     null|   null|     null|   100|100.0|
|  user2|  physics|     null|   null|       20|  null| 20.0|
|  user1|  physics|     null|     70|     null|  null| 70.0|
+-------+---------+---------+-------+---------+------+-----+
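
If the lesson names are not known in advance, the same null-safe row sum can be built programmatically over whatever columns the pivot produced. A minimal sketch, assuming the renamed DataFrame from above and that user_id and course_id are the only key columns:

import org.apache.spark.sql.functions.{coalesce, col, lit}

val renamed = result.toDF(new_cols : _*)

// every column that is not a grouping key came from the pivot
val lessonCols = renamed.columns.filterNot(c => Seq("user_id", "course_id").contains(c))

// treat missing lesson scores as 0 and add the columns row-wise
val rowSum = lessonCols
  .map(c => coalesce(col(c).cast("double"), lit(0.0)))
  .reduce(_ + _)

renamed.withColumn("sum", rowSum).show()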

Read more about coalesce. Let me know if you have any other questions, and if this solves your problem, please accept the answer. Happy Hadoop