I'm sharing the code I have:
// define a case class
case class Zone(id: Int, team: String, members: Int, name: String, lastname: String)

import org.apache.spark.sql.functions._ // to_json, struct, col
import spark.implicits._                // assuming a SparkSession named spark, as in spark-shell

val df = Seq(
  (1, "team1", 3, "John", "Doe"),
  (1, "team2", 4, "John", "Doe"),
  (1, "team3", 5, "David", "Luis"),
  (2, "team4", 6, "Michael", "Larson"))
  .toDF("id", "team", "members", "name", "lastname").as[Zone]
val df_grouped = df
.withColumn("team_info", to_json(struct(col("team"), col("members"))))
.withColumn("users", to_json(struct(col("name"), col("lastname"))))
.groupBy("id")
.agg(collect_list($"team_info").alias("team_info"), collect_list($"users").alias("users"))
df_grouped.show
+---+--------------------+--------------------+
| id| team_info| users|
+---+--------------------+--------------------+
| 1|[{"team":"team1",...|[{"name":"John","...|
| 2|[{"team":"team4",...|[{"name":"Michael...|
+---+--------------------+--------------------+
I need to remove the duplicates in the users column, because in my case entries are duplicates whenever the JSON strings inside the array are exactly identical. Is there a way to change that column's values with df.withColumn or any other method?
Answer 0 (score: 0)
This may not be the most elegant solution, but it should work:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._ // for from_json / to_json
import org.apache.spark.sql.Encoders
val df = sc.parallelize(
Array("[{\"name\":\"John\",\"lastName\":\"Doe\"},{\"name\":\"John\",\"lastName\":\"Doe\"},{\"name\":\"David\",\"lastName\":\"Luis\"}]")
).toDF("users")
case class Users(name: String, lastName: String)
val schema = ArrayType(Encoders.product[Users].schema)
df.withColumn("u", from_json($"users", schema)) // parse the JSON array into typed structs
  .select("u")
  .as[Array[Users]]
  .map(_.distinct)                              // case-class equality removes exact duplicates
  .toDF("u")
  .withColumn("users", to_json($"u"))           // serialize the deduplicated array back to JSON
  .select("users")
Assuming your users have more attributes than in the example, just add those attributes to the case class. As long as the types are simple, the Encoder should infer the schema automatically.
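
For the question's df_grouped specifically, the users column is already an array of JSON strings, and the asker defines a duplicate as an exactly identical JSON string, so a typed distinct over the array works without parsing at all. A minimal sketch (the deduped name is made up here), assuming the df_grouped from the question and spark.implicits._ in scope:

val deduped = df_grouped
  .as[(Int, Seq[String], Seq[String])] // columns: id, team_info, users
  .map { case (id, teams, users) => (id, teams, users.distinct) } // drop identical JSON strings
  .toDF("id", "team_info", "users")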
Answer 1 (score: 0)
You can use the explode and dropDuplicates built-in functions.
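
As an illustration, here is a minimal sketch of that approach, reusing the question's df_grouped (the answer itself gives no code, so the exact shape is an assumption) with org.apache.spark.sql.functions._ imported:

val dedupedUsers = df_grouped
  .select($"id", $"team_info", explode($"users").as("user")) // one row per users element
  .dropDuplicates("id", "user")                              // identical JSON strings collapse
  .groupBy("id")
  .agg(
    first($"team_info").as("team_info"), // team_info holds a single value per id after the groupBy
    collect_list($"user").as("users"))   // rebuild the array without duplicates

On Spark 2.4 or later, the built-in array_distinct should reach the same result in one step: df_grouped.withColumn("users", array_distinct($"users")).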