Using a Spark DataFrame, I need to convert row values into columns, partition the data, and create one CSV file per course.
val someDF = Seq(
("user1", "math","algebra-1","90"),
("user1", "physics","gravity","70"),
("user3", "biology","health","50"),
("user2", "biology","health","100"),
("user1", "math","algebra-1","40"),
("user2", "physics","gravity-2","20")
).toDF("user_id", "course_id","lesson_name","score")
someDF.show(false)
+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
| user1| math| algebra-1| 90|
| user1| physics| gravity| 70|
| user3| biology| health| 50|
| user2| biology| health| 100|
| user1| math| algebra-1| 40|
| user2| physics| gravity-2| 20|
+-------+---------+-----------+-----+
val result = someDF.groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
result.show(false)
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
| user3| biology| null| null| null| 50|
| user1| math| 90| null| null| null|
| user2| biology| null| null| null| 100|
| user2| physics| null| null| 20| null|
| user1| physics| null| 70| null| null|
+-------+---------+---------+-------+---------+------+
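Note that `pivot("lesson_name")` creates one column per distinct `lesson_name` found in the *whole* DataFrame, which is why all four lesson columns appear for every course. Where that column set comes from can be mirrored in plain Scala over the same tuples (no Spark needed, purely illustrative):

```scala
// the same rows as someDF, as plain tuples
val rows = Seq(
  ("user1", "math", "algebra-1", "90"),
  ("user1", "physics", "gravity", "70"),
  ("user3", "biology", "health", "50"),
  ("user2", "biology", "health", "100"),
  ("user1", "math", "algebra-1", "40"),
  ("user2", "physics", "gravity-2", "20")
)

// pivot collects the distinct lesson names across ALL rows, not per course
val pivotColumns = rows.map(_._3).distinct.sorted
println(pivotColumns)  // List(algebra-1, gravity, gravity-2, health)
```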
With the code above, I can convert the row values (lesson_name) into column names.
But I need the output course-wise.
The expected format of the CSV files is shown below.
biology.csv // Expected Output
+-------+---------+------+
|user_id|course_id|health|
+-------+---------+------+
| user3| biology| 50 |
| user2| biology| 100 |
+-------+---------+------+
physics.csv // Expected Output
+-------+---------+---------+-------+
|user_id|course_id|gravity-2|gravity|
+-------+---------+---------+-------+
| user2| physics| 20| null|
| user1| physics| null| 70|
+-------+---------+---------+-------+
**Note: each course's CSV should contain only that course's lesson names, and should not include any columns for unrelated lessons.
Currently, I can write the CSV in the following format:**
result.write
.partitionBy("course_id")
.mode("overwrite")
.format("com.databricks.spark.csv")
.option("header", "true")
.save(somepath)
For example:
biology.csv // Wrong output: it contains lesson columns from other courses (algebra-1, gravity, gravity-2)
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
| user3| biology| null| null| null| 50|
| user2| biology| null| null| null| 100|
+-------+---------+---------+-------+---------+------+
Can anyone help solve this problem?
Answer 0 (score: 0)
Just filter the course first, then pivot:
val result = someDF.filter($"course_id" === "physics").groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
+-------+---------+-------+---------+
|user_id|course_id|gravity|gravity-2|
+-------+---------+-------+---------+
|user2 |physics |null |20 |
|user1 |physics |70 |null |
+-------+---------+-------+---------+
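Filtering before the pivot works because the pivot then only sees the lesson names that survive the filter, so no columns are created for other courses' lessons. The same effect can be shown in plain Scala over the example tuples (illustrative only, not the Spark API):

```scala
// the same rows as someDF, as plain tuples
val rows = Seq(
  ("user1", "math", "algebra-1", "90"),
  ("user1", "physics", "gravity", "70"),
  ("user3", "biology", "health", "50"),
  ("user2", "biology", "health", "100"),
  ("user1", "math", "algebra-1", "40"),
  ("user2", "physics", "gravity-2", "20")
)

// restrict to one course first, then derive the pivot columns
val physicsLessons = rows.filter(_._2 == "physics").map(_._3).distinct.sorted
println(physicsLessons)  // List(gravity, gravity-2)

// algebra-1 and health never appear in the filtered rows,
// so no irrelevant columns would be created by the pivot
```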
Answer 1 (score: 0)
I assume you mean you want to save the data into separate directories by course_id. You can use the following approach.
import org.apache.spark.sql.functions._

val someDF = Seq(
("user1", "math","algebra-1","90"),
("user1", "physics","gravity","70"),
("user3", "biology","health","50"),
("user2", "biology","health","100"),
("user1", "math","algebra-1","40"),
("user2", "physics","gravity-2","20")
).toDF("user_id", "course_id","lesson_name","score")

val result = someDF.groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))

// collect the distinct course ids to loop over
val eventNames = result.select($"course_id").distinct().collect()
val eventlist = eventNames.map(x => x(0).toString)

for (eventName <- eventlist) {
  val course = result.where($"course_id" === lit(eventName))

  // build a single row flagging each column: 1 if it has any non-null value, 0 otherwise
  val row = course
    .select(course.columns.map(c => when(col(c).isNull, 0).otherwise(1).as(c)): _*)
    .groupBy().max(course.columns.map(c => c): _*)
    .first

  // keep only the columns whose flag is 1 (i.e. columns that are not entirely null)
  val colKeep = row.getValuesMap[Int](row.schema.fieldNames)
    .map { c => if (c._2 == 1) Some(c._1) else None }
    .flatten.toArray

  // the aggregate renames columns to "max(colName)";
  // drop(4).dropRight(1) strips that alias back to the original name
  val final_df = course.select(row.schema.fieldNames.intersect(colKeep)
    .map(c => col(c.drop(4).dropRight(1))): _*)

  final_df.show()
  final_df.coalesce(1).write.mode("overwrite").format("csv").option("header", "true").save(s"${eventName}")
}
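The `drop(4).dropRight(1)` step above undoes Spark's default aggregate alias: a column aggregated with `max` comes back named `max(colName)`, and stripping the leading `max(` (4 characters) and the trailing `)` recovers the original name. The string manipulation itself is plain Scala:

```scala
// Spark names the result of max("score") as "max(score)";
// drop(4) removes the leading "max(" and dropRight(1) the trailing ")"
val aggregated = Seq("max(user_id)", "max(course_id)", "max(health)")
val originalNames = aggregated.map(c => c.drop(4).dropRight(1))
println(originalNames)  // List(user_id, course_id, health)
```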
+-------+---------+------+
|user_id|course_id|health|
+-------+---------+------+
| user3| biology| 50|
| user2| biology| 100|
+-------+---------+------+
+-------+---------+-------+---------+
|user_id|course_id|gravity|gravity-2|
+-------+---------+-------+---------+
| user2| physics| null| 20|
| user1| physics| 70| null|
+-------+---------+-------+---------+
+-------+---------+---------+
|user_id|course_id|algebra-1|
+-------+---------+---------+
| user1| math| 90|
+-------+---------+---------+
If this serves your purpose, please accept the answer. Happy Hadoop!