How do I compare a dataframe column with another dataframe's column in PySpark?

Asked: 2019-11-25 05:45:00

Tags: python dataframe apache-spark pyspark col

# DataframeA and DataframeB match:
DataframeA:
col: Name "Ali", "Bilal", "Ahsan"

DataframeB:
col: Name "Ali", "Bilal", "Ahsan"

# DataframeC and DataframeD DO NOT match:  
DataframeC:
col: Name "Ali", "Ahsan", "Bilal"

DataframeD:
col: Name "Ali", "Bilal", "Ahsan"

I want to check whether the column values match, including their order (as shown above). Any help would be appreciated.
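For reference, the example frames above could be built like this (a minimal sketch; an active SparkSession named spark is assumed):

    DataframeA = spark.createDataFrame([("Ali",), ("Bilal",), ("Ahsan",)], ["Name"])
    DataframeB = spark.createDataFrame([("Ali",), ("Bilal",), ("Ahsan",)], ["Name"])

    DataframeC = spark.createDataFrame([("Ali",), ("Ahsan",), ("Bilal",)], ["Name"])
    DataframeD = spark.createDataFrame([("Ali",), ("Bilal",), ("Ahsan",)], ["Name"])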

2 Answers:

Answer 0 (score: 0)

Use the Scala code below as a reference and convert it to Python. It numbers the rows of both frames, joins them by row number, and counts the positions where the names differ; a count of 0 means the columns match. Replace the dataframe names in the val check line with your own.

    scala> import org.apache.spark.sql.expressions.Window
    scala> import org.apache.spark.sql.functions._

    scala> val w = Window.orderBy(lit(1))
    scala> val check = dfA.withColumn("rn", row_number().over(w)).alias("A").join(dfB.withColumn("rn", row_number().over(w)).alias("B"), List("rn"), "left").withColumn("check", when(col("A.name") === col("B.name"), lit("match")).otherwise(lit("not match"))).filter(col("check") === "not match").count

    scala> if (check == 0) println("matched") else println("not matched")
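A rough PySpark translation of the same idea might look like the sketch below. The frame names dfA and dfB and the column name Name are assumptions taken from the question; adjust them to your data.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Number the rows of each frame so they can be joined by position.
    w = Window.orderBy(F.lit(1))
    a = dfA.withColumn("rn", F.row_number().over(w)).alias("A")
    b = dfB.withColumn("rn", F.row_number().over(w)).alias("B")

    # Count positions where the Name values differ (or dfB has no row at all).
    mismatches = (a.join(b, ["rn"], "left")
                   .filter((F.col("A.Name") != F.col("B.Name")) | F.col("B.Name").isNull())
                   .count())

    print("matched" if mismatches == 0 else "not matched")

Note that ordering the window by a constant pulls all rows into a single partition, so this positional trick is only practical for reasonably small DataFrames.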

Answer 1 (score: 0)

Use a Python set for the comparison:

    >>> DataframeC.columns
    ['Ali', 'Ahsan', 'Bilal']
    >>> DataframeD.columns
    ['Ali', 'Bilal', 'Ahsan']

    >>> DataframeC.columns == DataframeD.columns
    False

    >>> set(DataframeC.columns) == set(DataframeD.columns)
    True
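If the intent is to compare the values of the Name column (as in the question) rather than the column names, the same list/set comparison can be applied after collecting the values to the driver. This is a sketch only and assumes the frames are small enough to collect; the column name Name comes from the question.

    # Pull the Name values into plain Python lists (small data only).
    vals_c = [row["Name"] for row in DataframeC.select("Name").collect()]
    vals_d = [row["Name"] for row in DataframeD.select("Name").collect()]

    vals_c == vals_d            # order-sensitive comparison -> False for C vs D
    set(vals_c) == set(vals_d)  # order-insensitive comparison -> True for C vs D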