Say I have multiple Spark dataframes df1, df2, df3 with the following schema:
--- X (float)
--- Y (float)
--- id (String)
Now I want to merge all of them so that whenever
df1.X == df2.X and df1.Y == df2.Y, then concat(df1.id, df2.id)
becomes a row in the merged df.
Is there a way to do this in PySpark using joins or a lambda?
Answer 0 (score: 0)
Try this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Sample frame: ids 0..3 with x = the row number as a float
val df1 = spark.range(4).withColumn("x", row_number().over(Window.orderBy("id")) * lit(1f))
df1.show(false)
/**
* +---+---+
* |id |x |
* +---+---+
* |0 |1.0|
* |1 |2.0|
* |2 |3.0|
* |3 |4.0|
* +---+---+
*/
// Smaller sample frame: only x = 1.0 and 2.0 overlap with df1
val df2 = spark.range(2).withColumn("x", row_number().over(Window.orderBy("id")) * lit(1f))
df2.show(false)
/**
* +---+---+
* |id |x |
* +---+---+
* |0 |1.0|
* |1 |2.0|
* +---+---+
*/
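// Inner join on x: rows present in both frames get their ids concatenated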
val inner = df1.join(df2, Seq("x"))
  .select(
    $"x", concat(df1("id"), df2("id")).as("id")
  )
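// Left-anti joins keep the rows unique to each side (with their single id);
// union them with the inner rows to cover both common and uncommon keys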
val commonPlusUncommon =
  df1.join(df2, Seq("x"), "leftanti")
    .unionByName(
      df2.join(df1, Seq("x"), "leftanti")
    ).unionByName(inner)
commonPlusUncommon.show(false)
/**
* +---+---+
* |x |id |
* +---+---+
* |3.0|2 |
* |4.0|3 |
* |1.0|00 |
* |2.0|11 |
* +---+---+
*/
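// Alternative: a single full outer join; coalesce substitutes an empty string
// for the missing side's id so concat never returns null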
df1.join(df2, Seq("x"), "full")
.select(
$"x",
concat(coalesce(df1("id"), lit("")), coalesce(df2("id"), lit(""))).as("id")
)
.show(false)
/**
* +---+---+
* |x |id |
* +---+---+
* |1.0|00 |
* |2.0|11 |
* |3.0|2 |
* |4.0|3 |
* +---+---+
*/
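Since the question asks for PySpark, here is a minimal sketch of the same full-outer-join idea in PySpark. The two input frames below are made-up examples using the question's (X, Y, id) schema; joining on both X and Y implements the df1.X == df2.X and df1.Y == df2.Y condition:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs matching the question's schema (X, Y, id)
df1 = spark.createDataFrame([(1.0, 1.0, "a"), (2.0, 2.0, "b")], ["X", "Y", "id"])
df2 = spark.createDataFrame([(1.0, 1.0, "c"), (3.0, 3.0, "d")], ["X", "Y", "id"])

# Full outer join on the (X, Y) key; coalesce turns the missing side's id
# into an empty string so concat does not return null
merged = (
    df1.alias("a")
    .join(df2.alias("b"), on=["X", "Y"], how="full")
    .select(
        "X",
        "Y",
        F.concat(
            F.coalesce(F.col("a.id"), F.lit("")),
            F.coalesce(F.col("b.id"), F.lit("")),
        ).alias("id"),
    )
)
merged.show()

Because the join keys are passed as a list of column names, Spark returns a single X and Y column, so the select is unambiguous. For the sample data this should yield (1.0, 1.0, "ac") for the matched key and (2.0, 2.0, "b"), (3.0, 3.0, "d") for the unmatched ones. To fold in df3, apply the same join/concat step to merged and df3.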