How to compare 2 dataframes in pyspark based on dynamic columns

Time: 2019-03-28 03:08:49

Tags: apache-spark pyspark apache-spark-sql

I have 2 dataframes in pyspark, loaded from different sources, that I am processing. The two dataframes have a few columns in common. Here is what I need to do:

  • Compare the 2 dataframes on a dynamically generated set of key columns
  • For the rows that match, do further processing based on column values from both the left and the right side
  • I also need the records from both dataframes that did not match

Here is what I mean:

from pyspark.sql import Row

df1 = spark.createDataFrame([
    Row(name='Bob', sub_id=1, id=1, age=5, status='active', version=0),
    Row(name='Rob', sub_id=1, id=1, age=5, status='active', version=1),
    Row(name='Tom', sub_id=2, id=3, age=50, status='active', version=0)])

df2 = spark.createDataFrame([
    Row(name='Bobbie', sub_id=1, age=5),
    Row(name='Tom', sub_id=2, age=51),
    Row(name='Carter', sub_id=3, age=70)])
"""
my join keys depend on sub_id; say they are as below:
  sub_id = 1, keys = [sub_id]
  sub_id = 2, keys = [sub_id, age]
  sub_id = 3, keys = [sub_id]
"""
#matched records, expected results
#note that only sub_id=1 has a match based on the keys. After the match and further processing, the version 0 record in df1 was copied as version 2 and updated per df2
df_matched = [
    Row(name='Bobbie', sub_id=1, id=1, age=5, status='active', version=0),  #updated per df2
    Row(name='Rob', sub_id=1, id=1, age=5, status='active', version=1),
    Row(name='Bob', sub_id=1, id=1, age=5, status='active', version=2)  #new insert
]

#unmatched from df1 (Tom fails the age comparison for sub_id=2)
df_unmatched_left = [Row(name='Tom', sub_id=2, id=3, age=50, status='active', version=0)]

#unmatched from df2
df_unmatched_right = [Row(name='Tom', sub_id=2, age=51),
                      Row(name='Carter', sub_id=3, age=70)]
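
To make the further-processing step on the matched side concrete, here is a rough sketch of the version handling I have in mind. It assumes a joined frame df_joined (built in the join sketch after my attempt below) in which each df1 row is paired with its df2 match and the df2 columns carry an '_r' suffix; all the column names are from my examples above:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('sub_id', 'id')
cols = ['name', 'sub_id', 'id', 'age', 'status', 'version']

#the old version-0 values get copied as a new row at max(version) + 1
new_rows = (df_joined
            .withColumn('max_v', F.max('version').over(w))
            .filter(F.col('version') == 0)
            .withColumn('version', F.col('max_v') + 1)
            .select(cols))

#the version-0 row itself is updated from df2 (here: take df2's name)
updated_v0 = (df_joined.filter(F.col('version') == 0)
              .withColumn('name', F.col('name_r'))
              .select(cols))

#higher versions pass through unchanged
others = df_joined.filter(F.col('version') != 0).select(cols)

df_matched = new_rows.unionByName(updated_v0).unionByName(others)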

#Here is what I have tried so far:
#created temp views of df1 and df2, say df1_table and df2_table,
#but how do I make the joins dynamic, and how do I do the further processing, for which I need both df1 and df2?
df_matched = spark.sql("sql to join df1_table and df2_table").select("columns from both dfs")  #use some function with map for further processing?

#To build the join condition, would using map to add a temp column that holds the join condition, and then using it in the query, be a good approach?
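
One direction I am considering, just a sketch: since each sub_id has its own key list, I could split both dataframes by sub_id, join each slice on that sub_id's keys, and union the slices back together; a left_anti join per slice would give the unmatched rows. This reuses the keys_by_sub_id dict from above and suffixes the right-hand columns with '_r' so the joined frame keeps both sides' values for the further processing:

from functools import reduce
from pyspark.sql import functions as F

def suffixed(df, suffix):
    #rename every column so the two sides stay unambiguous after the join
    return df.select([F.col(c).alias(c + suffix) for c in df.columns])

def join_slice(left, right_sfx, sub_id, keys, how):
    #join the rows of one sub_id slice on that sub_id's own key list
    cond = reduce(lambda a, b: a & b,
                  [F.col(k) == F.col(k + '_r') for k in keys])
    return (left.filter(F.col('sub_id') == sub_id)
                .join(right_sfx.filter(F.col('sub_id_r') == sub_id), cond, how))

df2_r = suffixed(df2, '_r')

#matched pairs: inner-join each slice, then union; every part has the same
#schema (df1 columns plus df2 columns suffixed '_r'), so unionByName is safe
df_joined = reduce(lambda a, b: a.unionByName(b),
                   [join_slice(df1, df2_r, s, k, 'inner')
                    for s, k in keys_by_sub_id.items()])

#unmatched from df1: left_anti keeps only the rows without a match
df_unmatched_left = reduce(lambda a, b: a.unionByName(b),
                           [join_slice(df1, df2_r, s, k, 'left_anti')
                            for s, k in keys_by_sub_id.items()])

#unmatched from df2: the same helper with the sides swapped
df1_r = suffixed(df1, '_r')
df_unmatched_right = reduce(lambda a, b: a.unionByName(b),
                            [join_slice(df2, df1_r, s, k, 'left_anti')
                             for s, k in keys_by_sub_id.items()])

On the sample data this matches only sub_id=1 (Tom fails the age comparison for sub_id=2), which is what I expect. If there are many distinct sub_id values, the per-slice filter-and-union grows the query plan; a single join on one big OR of per-sub_id conditions would be an alternative, but the sliced version seems easier to reason about.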

I am using pyspark 2.3.x and Python 3.5.x.

0 Answers:

There are no answers yet.