I have two DataFrames and would like to join them on the date, time, mid, and binImbalance fields, collecting the corresponding values of timeB and midB into lists.

I tried the following code:
val d1: DataFrame
val d3: DataFrame

val d2 = d3
  .withColumnRenamed("date", "dateC")
  .withColumnRenamed("milliSec", "milliSecC")
  .withColumnRenamed("mid", "midC")
  .withColumnRenamed("time", "timeC")
  .withColumnRenamed("binImbalance", "binImbalanceC")

d1.join(d2, d1("date") === d2("dateC") and
            d1("time") === d2("timeC") and
            d1("mid") === d2("midC"))
  .groupBy("date", "time", "mid", "binImbalance")
  .agg(collect_list("timeB"), collect_list("midB"))

But this does not work, because I get the error:

Reference 'timeB' is ambiguous, could be: timeB#16, timeB#35

Meanwhile, if I rename one of the columns, I can no longer collect the values from both DataFrames into a single list.
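The ambiguity arises because both inputs carry identically named timeB/midB columns through the join. One way around it (a sketch, not from the original post) is to stack the two DataFrames and group once, so no join and no duplicate names are involved; the semantics can be illustrated with plain Scala collections (the Row case class here is illustrative):

```scala
// Hypothetical row type mirroring the example tables.
case class Row(date: Int, time: Int, mid: Int, binImbalance: Int, timeB: Int, midB: Int)

val d1 = Seq(Row(1, 1, 10, 1, 4, 10), Row(2, 2, 20, 2, 5, 11), Row(3, 3, 30, 3, 6, 12))
val d3 = Seq(Row(1, 1, 10, 1, 7, 13), Row(2, 2, 20, 2, 8, 14), Row(3, 3, 30, 3, 9, 15))

// Stack the rows, then group by the key columns and collect both value columns.
val grouped = (d1 ++ d3)
  .groupBy(r => (r.date, r.time, r.mid, r.binImbalance))
  .map { case (key, rows) => key -> (rows.map(_.timeB), rows.map(_.midB)) }
// e.g. key (1, 1, 10, 1) maps to (Seq(4, 7), Seq(10, 13))
```

In Spark the equivalent would be selecting the same columns from both frames, then d1.union(d3).groupBy("date", "time", "mid", "binImbalance").agg(collect_list("timeB"), collect_list("midB")), which produces the expected ListTime/ListMid output for any number of matching rows.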
Minimal, Complete, and Verifiable example

The two input DataFrames:
d1:
+-----+---------+------+------------+---------+------+
| date| time| mid|binImbalance| timeB| midB|
+-----+---------+------+------------+---------+------+
| 1 | 1 | 10 | 1| 4 | 10 |
| 2 | 2 | 20 | 2| 5 | 11 |
| 3 | 3 | 30 | 3| 6 | 12 |
+-----+---------+------+------------+---------+------+

d3:
+-----+---------+------+------------+---------+------+
| date| time| mid|binImbalance| timeB| midB|
+-----+---------+------+------------+---------+------+
| 1 | 1 | 10 | 1| 7 | 13 |
| 2 | 2 | 20 | 2| 8 | 14 |
| 3 | 3 | 30 | 3| 9 | 15 |
+-----+---------+------+------------+---------+------+
The expected result should be:
+-----+---------+------+------------+---------+-----------+
| date| time| mid|binImbalance| ListTime| ListMid |
+-----+---------+------+------------+---------+-----------+
| 1 | 1 | 10 | 1| [4,7] | [10,13] |
| 2 | 2 | 20 | 2| [5,8] | [11,14] |
| 3 | 3 | 30 | 3| [6,9] | [12,15] |
+-----+---------+------+------------+---------+-----------+
Answer 0 (score: 1)
A solution for the minimal example:
import org.apache.spark.sql.functions.udf

// Wrap the two matching values into a single list column.
val aggregateDataFrames = udf((x: Double, y: Double) => Seq(x, y))

// Rename the second frame's columns so the join condition is unambiguous.
val d3 = d2.withColumnRenamed("id", "id3")
           .withColumnRenamed("data", "data3")

val joined = d1.join(d3, d1("id") === d3("id3"))

val result = joined
  .withColumn("list", aggregateDataFrames(joined("data"), joined("data3")))
  .select("id", "list")
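Note that the answer's udf does nothing more than pair its two arguments into a sequence; the same logic as a plain Scala function (the name pairUp is illustrative, not from the answer):

```scala
// The pairing logic behind the answer's udf, as an ordinary Scala function.
val pairUp: (Double, Double) => Seq[Double] = (x, y) => Seq(x, y)

val list = pairUp(4.0, 7.0)  // Seq(4.0, 7.0)
```

In Spark itself, the built-in org.apache.spark.sql.functions.array can replace such a udf, e.g. array(joined("data"), joined("data3")). Either way, this pairs exactly two columns, so it only works when each key matches exactly one row per DataFrame; when a key can match a variable number of rows, collect_list over a union of the two DataFrames is the more general approach.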