Let's work through the following toy problem. I have the following case classes:
case class Order(id: String, name: String, status: String)
case class TruncatedOrder(id: String)
case class Org(name: String, ord: Seq[TruncatedOrder])
I now have the following variables defined:
val ordersDF = Seq(Order("or1", "stuff", "shipped"), Order("or2", "thigns", "delivered") , Order("or3", "thingamabobs", "never received"), Order("or4", "???", "what?")).toDS()
val orgsDF = Seq(Org("tupper", Seq(TruncatedOrder("or1"), TruncatedOrder("or2"), TruncatedOrder("or3"))), Org("ware", Seq(TruncatedOrder("or3"), TruncatedOrder("or4")))).toDS()
What I want is a result with data points like the following:
Ord("tupper", Array(Joined("or1", "stuff", "shipped"), Joined("or2", "things", "delivered"), ...))
I'd like to know how to write my join and filter statements.
Answer 0 (score: 3)
Here is how I got the data into the format I wanted. This answer was inspired by the answers provided by @ulrich and @Mariusz.
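import org.apache.spark.sql.functions.{collect_set, explode, udf}
// Pack the three order fields into a single array-valued column
val ud = udf((col: String, name: String, status: String) => Seq(col, name, status))
orgsDF
  // One row per (org, order id); explode names its output column "col"
  .select($"name".as("ordName"), explode($"ord.id"))
  .join(ordersDF, $"col" === $"id").drop($"id")
  // Alias the UDF output so the aggregation below can refer to it by name
  .select($"ordName", ud($"col", $"name", $"status").as("order"))
  .groupBy($"ordName")
  .agg(collect_set($"order").as("orders"))
  .show(false)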
+-------+--------------------------------------------------------------------------------------------------------------------------+
|ordName|orders |
+-------+--------------------------------------------------------------------------------------------------------------------------+
|ware |[WrappedArray(or4, ???, what?), WrappedArray(or3, thingamabobs, never received)] |
|tupper |[WrappedArray(or1, stuff, shipped), WrappedArray(or2, thigns, delivered), WrappedArray(or3, thingamabobs, never received)]|
+-------+--------------------------------------------------------------------------------------------------------------------------+
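The UDF above flattens each order into an untyped array of strings. As a possible variation, here is a minimal sketch that builds a struct per order and reads the result back as a typed Dataset. The `Joined` and `OrgWithOrders` case classes, the object name, and the local `SparkSession` setup are assumptions made for illustration; they are not part of the original question or answer. Note that `collect_list` keeps duplicates and does not guarantee element order.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, explode, struct}

// Case classes from the question
case class Order(id: String, name: String, status: String)
case class TruncatedOrder(id: String)
case class Org(name: String, ord: Seq[TruncatedOrder])

// Hypothetical result types, introduced only for this sketch
case class Joined(id: String, name: String, status: String)
case class OrgWithOrders(name: String, orders: Seq[Joined])

object StructJoinSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a toy example
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("struct-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val ordersDS = Seq(
      Order("or1", "stuff", "shipped"),
      Order("or2", "thigns", "delivered"),
      Order("or3", "thingamabobs", "never received"),
      Order("or4", "???", "what?")
    ).toDS()

    val orgsDS = Seq(
      Org("tupper", Seq(TruncatedOrder("or1"), TruncatedOrder("or2"), TruncatedOrder("or3"))),
      Org("ware", Seq(TruncatedOrder("or3"), TruncatedOrder("or4")))
    ).toDS()

    val result = orgsDS
      // One row per (org, order id); rename the org name to avoid clashing with Order.name
      .select($"name".as("orgName"), explode($"ord.id").as("ordId"))
      .join(ordersDS, $"ordId" === $"id")
      .groupBy($"orgName")
      // Keep the field names by collecting structs instead of string arrays
      .agg(collect_list(struct($"id", $"name", $"status")).as("orders"))
      .select($"orgName".as("name"), $"orders")
      .as[OrgWithOrders]

    result.show(false)
    spark.stop()
  }
}
```

Collecting structs keeps `id`, `name`, and `status` addressable by name downstream, which the `Seq`-returning UDF does not.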
Answer 1 (score: 1)
How about this?
spark.conf.set("HiveSupport.enabled", true)
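import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{collect_set, concat, explode, lit}
orgsDF.select('name, explode('ord))
  // Turn each exploded struct into a plain string and strip the bracket characters
  .map { case row: Row => (row(0).toString, row(1).toString.filterNot("[]()".contains(_))) }.toDF("name", "ord")
  .join(ordersDF.select('id, 'status, 'name.as("name2")), 'ord === 'id).drop("id")
  .select('name, concat('ord, lit(","), 'status, lit(","), 'name2).as("info"))
  .groupBy('name)
  .agg(collect_set('info))
  .show()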
This returns:
+------+--------------------+
| name| collect_set(info)|
+------+--------------------+
| ware|[[or3,never recei...|
|tupper|[[or1,shipped,stu...|
+------+--------------------+
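The string round trip through `Row.toString` can probably be avoided. The following is a rough sketch of the same idea that explodes `ord.id` directly and joins the fields with `concat_ws`; the object name, the `ordersDS`/`orgsDS` names, and the local `SparkSession` setup are assumptions for illustration only, and the output format is not guaranteed to match the table above exactly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, concat_ws, explode}

// Case classes from the question
case class Order(id: String, name: String, status: String)
case class TruncatedOrder(id: String)
case class Org(name: String, ord: Seq[TruncatedOrder])

object ConcatJoinSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a toy example
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("concat-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val ordersDS = Seq(
      Order("or1", "stuff", "shipped"),
      Order("or2", "thigns", "delivered"),
      Order("or3", "thingamabobs", "never received"),
      Order("or4", "???", "what?")
    ).toDS()

    val orgsDS = Seq(
      Org("tupper", Seq(TruncatedOrder("or1"), TruncatedOrder("or2"), TruncatedOrder("or3"))),
      Org("ware", Seq(TruncatedOrder("or3"), TruncatedOrder("or4")))
    ).toDS()

    orgsDS
      // ord.id is already an array of strings, so no Row-to-string conversion is needed
      .select($"name", explode($"ord.id").as("ord"))
      .join(ordersDS.select($"id", $"status", $"name".as("name2")), $"ord" === $"id")
      // concat_ws inserts the separator between the fields
      .select($"name", concat_ws(",", $"ord", $"status", $"name2").as("info"))
      .groupBy($"name")
      .agg(collect_set($"info").as("info"))
      .show(false)

    spark.stop()
  }
}
```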
Answer 2 (score: 0)
One-to-many is easy to write if you follow these steps:
orgsDF
tuppler
- thing