Given the following two Spark Datasets, flights and capitals, what is the most efficient way to return the combined (i.e. "joined") result, without first converting to a DataFrame or writing out all of the columns by name with the .select() method? For example, I know I can access either tuple with (e.g. .map(x => x._1)), or use the * operator with:

result.select("_1.*", "_2.*")

But the latter may result in duplicate column names, and I'm hoping for a cleaner solution.

Thank you for your help.
import spark.implicits._  // needed for .toDF and .as[T] (assumes a SparkSession named spark)

case class Flights(tripNumber: Int, destination: String)
case class Capitals(state: String, capital: String)

val flights = Seq(
  (55, "New York"),
  (3, "Georgia"),
  (12, "Oregon")
).toDF("tripNumber", "destination").as[Flights]

val capitals = Seq(
  ("New York", "Albany"),
  ("Georgia", "Atlanta"),
  ("Oregon", "Salem")
).toDF("state", "capital").as[Capitals]

// joinWith produces a Dataset[(Flights, Capitals)] of tuples
val result = flights.joinWith(capitals, flights.col("destination") === capitals.col("state"))
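One way to stay in the typed Dataset API is to map the joined tuples into a single combined case class. This is only a sketch; the Trip case class (its name and fields) is my own illustration, not part of the original question:

// Hypothetical case class combining the fields of both sides
case class Trip(tripNumber: Int, destination: String, capital: String)

// Flatten the Dataset[(Flights, Capitals)] returned by joinWith
val trips = result.map { case (f, c) => Trip(f.tripNumber, f.destination, c.capital) }

Because each field is picked explicitly and exactly once, this avoids the duplicate column names that select("_1.*", "_2.*") can produce.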
Answer 0 (score: 0):
There are two options, but you must use join instead of joinWith:
// Option 1: drop the duplicate join column coming from capitals
val result = flights.join(capitals, flights("destination") === capitals("state")).drop(capitals("state"))

// Option 2: rename the join column so the join on Seq("destination") keeps a single copy of it
val result = flights.join(capitals.withColumnRenamed("state", "destination"), Seq("destination"))
Output:
result.show
+-----------+----------+-------+
|destination|tripNumber|capital|
+-----------+----------+-------+
| New York| 55| Albany|
| Georgia| 3|Atlanta|
| Oregon| 12| Salem|
+-----------+----------+-------+
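Note that join (unlike joinWith) returns an untyped DataFrame. If a typed Dataset is needed afterwards, one option is to cast the result back with .as[...]; the FlightWithCapital case class below is a hypothetical name I am introducing for illustration:

// Hypothetical case class matching the joined columns by name (an assumption, not from the answer)
case class FlightWithCapital(destination: String, tripNumber: Int, capital: String)
val typed = result.as[FlightWithCapital]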