PySpark: join two or more DataFrames with conditions

Asked: 2020-07-03 11:15:21

Tags: python dataframe pyspark apache-spark-sql

Say I have multiple Spark DataFrames df1, df2, df3 with the following schema:

--- X (float)
--- Y (float)
--- id (String)

Now I want to merge all of them so that:

  • if df1.X == df2.X and df1.Y == df2.Y, then concat(df1.id, df2.id)
  • keep this as a single row in the resulting df
  • otherwise keep both as separate rows in the resulting df

Is there a way to do this in PySpark using joins or a lambda?

1 answer:

Answer 0: (score: 0)

Try this (the snippets below are in Scala; a rough PySpark equivalent follows each step) -

Load the test data

    // imports needed to run this snippet (spark is an existing SparkSession)
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import spark.implicits._

    val df1 = spark.range(4).withColumn("x", row_number().over(Window.orderBy("id")) * lit(1f))
    df1.show(false)
    /**
      * +---+---+
      * |id |x  |
      * +---+---+
      * |0  |1.0|
      * |1  |2.0|
      * |2  |3.0|
      * |3  |4.0|
      * +---+---+
      */
    val df2 = spark.range(2).withColumn("x", row_number().over(Window.orderBy("id")) * lit(1f))
    df2.show(false)
    /**
      * +---+---+
      * |id |x  |
      * +---+---+
      * |0  |1.0|
      * |1  |2.0|
      * +---+---+
      */
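
Since the question is tagged pyspark, a rough PySpark equivalent of this test-data setup might look like the sketch below (untested; it assumes an existing Spark session and simply mirrors the `id`/`x` columns of the Scala snippet above):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # ids 0..3 and 0..1, plus a float-like column x = row_number * 1.0
    df1 = spark.range(4).withColumn("x", F.row_number().over(Window.orderBy("id")) * F.lit(1.0))
    df2 = spark.range(2).withColumn("x", F.row_number().over(Window.orderBy("id")) * F.lit(1.0))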

Combine the common and uncommon records

    val inner = df1.join(df2, Seq("x"))
      .select(
        $"x", concat(df1("id"), df2("id")).as("id")
      )
    val commonPlusUncommon =
      df1.join(df2, Seq("x"), "leftanti")
        .unionByName(
          df2.join(df1, Seq("x"), "leftanti")
        ).unionByName(inner)
    commonPlusUncommon.show(false)

    /**
      * +---+---+
      * |x  |id |
      * +---+---+
      * |3.0|2  |
      * |4.0|3  |
      * |1.0|00 |
      * |2.0|11 |
      * +---+---+
      */
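
A PySpark sketch of the same anti-join + union approach (again untested, using the df1/df2 built above; the ids are cast to string explicitly rather than relying on implicit coercion):

    # rows whose x matches in both frames: concatenate the two ids into one row
    inner = (df1.join(df2, ["x"])
                .select("x", F.concat(df1["id"].cast("string"),
                                      df2["id"].cast("string")).alias("id")))

    # rows present in only one frame stay as separate rows
    only_df1 = df1.join(df2, ["x"], "leftanti").select("x", F.col("id").cast("string").alias("id"))
    only_df2 = df2.join(df1, ["x"], "leftanti").select("x", F.col("id").cast("string").alias("id"))

    common_plus_uncommon = only_df1.unionByName(only_df2).unionByName(inner)
    common_plus_uncommon.show()

unionByName is used rather than union because the anti-join results and the inner-join result do not list their columns in the same order.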

You could also use a full outer join

 df1.join(df2, Seq("x"), "full")
      .select(
        $"x",
        concat(coalesce(df1("id"), lit("")), coalesce(df2("id"), lit(""))).as("id")
      )
      .show(false)

    /**
      * +---+---+
      * |x  |id |
      * +---+---+
      * |1.0|00 |
      * |2.0|11 |
      * |3.0|2  |
      * |4.0|3  |
      * +---+---+
      */
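
The full-outer-join variant translates to PySpark in much the same way (only a sketch, reusing the df1/df2 and imports from above):

    result = (df1.join(df2, ["x"], "full")
                 .select("x",
                         F.concat(F.coalesce(df1["id"].cast("string"), F.lit("")),
                                  F.coalesce(df2["id"].cast("string"), F.lit(""))).alias("id")))
    result.show()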