I have two datasets, AccountData and CustomerData, with the corresponding case classes:
case class AccountData(customerId: String, accountId: String, balance: Long)
+----------+---------+-------+
|customerId|accountId|balance|
+----------+---------+-------+
|   IND0002|  ACC0002|    200|
|   IND0002|  ACC0022|    300|
|   IND0003|  ACC0003|    400|
+----------+---------+-------+
case class CustomerData(customerId: String, forename: String, surname: String)
+----------+-----------+-------+
|customerId|   forename|surname|
+----------+-----------+-------+
|   IND0001|Christopher|  Black|
|   IND0002|  Madeleine|   Kerr|
|   IND0003|      Sarah|Skinner|
+----------+-----------+-------+
How can I derive the following dataset, which adds a column accounts containing a Seq[AccountData] for each customerId?
+----------+-----------+-------+---------------------------------------------+
|customerId|forename   |surname|accounts                                     |
+----------+-----------+-------+---------------------------------------------+
|IND0001   |Christopher|Black  |[]                                           |
|IND0002   |Madeleine  |Kerr   |[[IND0002,ACC0002,200],[IND0002,ACC0022,300]]|
|IND0003   |Sarah      |Skinner|[[IND0003,ACC0003,400]]                      |
+----------+-----------+-------+---------------------------------------------+
I tried:
val joinCustomerAndAccount = accountDS.joinWith(customerDS, customerDS("customerId") === accountDS("customerId")).drop(col("_2"))
which gives me the following DataFrame:
+---------------------+
|_1                   |
+---------------------+
|[IND0002,ACC0002,200]|
|[IND0002,ACC0022,300]|
|[IND0003,ACC0003,400]|
+---------------------+
If I then do:
val result = customerDS.withColumn("accounts", joinCustomerAndAccount("_1")(0))
I get the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Field name should be String Literal, but it's 0;
Answer 0 (score: 1)
You can group the accounts by "customerId" and then left-join the grouped result to the customers:
// imports used below; assumes a SparkSession named `spark` is in scope
import org.apache.spark.sql.functions.{collect_set, struct}
import spark.implicits._
// data
val accountDS = Seq(
  AccountData("IND0002", "ACC0002", 200),
  AccountData("IND0002", "ACC0022", 300),
  AccountData("IND0003", "ACC0003", 400)
).toDS()
val customerDS = Seq(
  CustomerData("IND0001", "Christopher", "Black"),
  CustomerData("IND0002", "Madeleine", "Kerr"),
  CustomerData("IND0003", "Sarah", "Skinner")
).toDS()
// action: collect each customer's accounts into a single array column
val accountsGroupedDF = accountDS.toDF
  .groupBy("customerId")
  .agg(collect_set(struct("accountId", "balance")).as("accounts"))
// the left join keeps customers without accounts; their accounts column is null
val result = customerDS.toDF.alias("c")
  .join(accountsGroupedDF.alias("a"), $"c.customerId" === $"a.customerId", "left")
  .select("c.*", "accounts")
result.show(false)
Output:
+----------+-----------+-------+--------------------------------+
|customerId|forename   |surname|accounts                        |
+----------+-----------+-------+--------------------------------+
|IND0001   |Christopher|Black  |null                            |
|IND0002   |Madeleine  |Kerr   |[[ACC0002, 200], [ACC0022, 300]]|
|IND0003   |Sarah      |Skinner|[[ACC0003, 400]]                |
+----------+-----------+-------+--------------------------------+
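The output above has null for IND0001 and the structs omit customerId, while the question asked for an empty list and full AccountData values. One possible way to get closer to that shape is the typed API: group both datasets by customerId and combine them with cogroup. The following is only a sketch under assumptions: the case class CustomerAccounts, the object CustomerAccountsSketch and the helper attachAccounts are hypothetical names that do not appear in the original post, and AccountData and CustomerData are assumed to have the fields used in the answer's sample data.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class AccountData(customerId: String, accountId: String, balance: Long)
case class CustomerData(customerId: String, forename: String, surname: String)
// Hypothetical result type (not from the original post)
case class CustomerAccounts(customerId: String, forename: String, surname: String,
                            accounts: Seq[AccountData])

object CustomerAccountsSketch {
  // Hypothetical helper: pairs every customer with all accounts sharing its customerId.
  def attachAccounts(spark: SparkSession,
                     customers: Dataset[CustomerData],
                     accounts: Dataset[AccountData]): Dataset[CustomerAccounts] = {
    import spark.implicits._
    customers.groupByKey(_.customerId)
      .cogroup(accounts.groupByKey(_.customerId)) { (_, cs, accs) =>
        // Materialize the accounts once; the list is empty when a customer has none.
        val accountList = accs.toList
        cs.map(c => CustomerAccounts(c.customerId, c.forename, c.surname, accountList))
      }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("customer-accounts").getOrCreate()
    import spark.implicits._

    val accountDS = Seq(
      AccountData("IND0002", "ACC0002", 200),
      AccountData("IND0002", "ACC0022", 300),
      AccountData("IND0003", "ACC0003", 400)
    ).toDS()
    val customerDS = Seq(
      CustomerData("IND0001", "Christopher", "Black"),
      CustomerData("IND0002", "Madeleine", "Kerr"),
      CustomerData("IND0003", "Sarah", "Skinner")
    ).toDS()

    attachAccounts(spark, customerDS, accountDS).orderBy("customerId").show(false)
    spark.stop()
  }
}
```

Because cogroup also invokes the function for keys that exist only on the customer side, IND0001 should come out with an empty accounts list rather than null. The order of accounts within each list is not guaranteed.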