Python Spark: find column differences by primary key

Time: 2018-12-27 05:04:35

Tags: python apache-spark pyspark

I have a DataFrame, DF1:

df1 = sc.parallelize([(1, "book1", 1), (2, "book2", 2), (3, "book3", 3), (4, "book4", 4)]).toDF(["primary_key", "book", "number"])


and DF2:

df2 = sc.parallelize([(1, "book1", 1), (2, "book8", 8), (3, "book3", 7), (5, "book5", 5)]).toDF(["primary_key", "book", "number"])


from pyspark.sql import functions
columlist = sc.parallelize(["book", "number"])

The result should be (laid out vertically):


How can I achieve this in Python Spark?
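To make the target output concrete, the diff logic itself can be sketched in plain Python (no Spark), treating each DataFrame as a dict keyed by `primary_key`. The `diff` helper below is illustrative only, not part of the original question; the answers below implement the same idea in Spark:

```python
# Stand-ins for df1/df2, keyed by primary_key (data copied from the question).
df1 = {1: {"book": "book1", "number": 1}, 2: {"book": "book2", "number": 2},
       3: {"book": "book3", "number": 3}, 4: {"book": "book4", "number": 4}}
df2 = {1: {"book": "book1", "number": 1}, 2: {"book": "book8", "number": 8},
       3: {"book": "book3", "number": 7}, 5: {"book": "book5", "number": 5}}
columns = ["book", "number"]

def diff(df1, df2, columns):
    """One output row per (primary_key, column) pair whose values differ."""
    rows = []
    for pk in sorted(set(df1) | set(df2)):  # full outer join on primary_key
        for col in columns:
            v1 = df1.get(pk, {}).get(col)   # None when pk is absent on one side
            v2 = df2.get(pk, {}).get(col)
            if v1 != v2:
                rows.append((pk, col, v1, v2))
    return rows

for row in diff(df1, df2, columns):
    print(row)
```

Keys present on only one side surface as `None` on the other, matching the `null` cells in the expected result.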

2 answers:

Answer 0: (score: 0)

Here is a solution in PySpark. Please note that I had to cast number to String, because in the resulting DataFrame the columns dataframe1 and dataframe2 cannot have two different datatypes.

Answer 1: (score: -1)

I have done it in Scala. Hope this helps.

import org.apache.spark.sql.functions._

// Full outer join keeps keys present on either side; pick whichever
// primary_key is non-null, then pivot the compared columns into maps
// so each column becomes its own row.
val joinDF = df1.join(df2, df1("primary_key") === df2("primary_key"), "full")
  .select(when(df1("primary_key").isNotNull, df1("primary_key")).otherwise(df2("primary_key")).as("primary_key"),
    explode(array(
      map(lit("book"), array(df1("book"), df2("book"))),
      map(lit("number"), array(df1("number").cast("string"), df2("number").cast("string")))
    )).as("item")
  )
  // Exploding the map yields `key`/`value` columns.
  .select(col("primary_key"), explode(col("item")))
  .select(col("primary_key"),
    col("key").as("diff_column_name"),
    col("value").getItem(0).as("dataframe1"),
    col("value").getItem(1).as("dataframe2")
  )
  // Keep only rows where the two sides differ (or one side is missing).
  .filter(col("dataframe1").isNull.or(col("dataframe2").isNull).or(col("dataframe1") =!= col("dataframe2")))

Here is the result:

+-----------+----------------+----------+----------+
|primary_key|diff_column_name|dataframe1|dataframe2|
+-----------+----------------+----------+----------+
|2          |book            |book2     |book8     |
|2          |number          |2         |8         |
|3          |number          |3         |7         |
|4          |book            |book4     |null      |
|4          |number          |4         |null      |
|5          |book            |null      |book5     |
|5          |number          |null      |5         |
+-----------+----------------+----------+----------+