Comparing two different columns from two different PySpark dataframes

Time: 2021-07-21 05:23:18

Tags: python pyspark apache-spark-sql

I am trying to compare two different columns from two different dataframes; if a match is found, I return the value 1, otherwise None -

df1 = (image)

df2 = (image)

df1 (Expected_Output) = (image)

I have tried the code below -

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def getImpact(row):
    # Look for rows in df2 whose second_key matches the incoming value
    match = df2.filter(df2.second_key == row)
    if match.count() > 0:
        return 1
    return None

udf_sol = udf(lambda x: getImpact(x), IntegerType())
df1 = df1.withColumn('impact', udf_sol(df1.first_key))

But I get the error below - TypeError: cannot pickle '_thread.RLock' object
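This error occurs because the lambda's closure captures df2: Spark pickles a UDF's closure to ship it to the executors, and a DataFrame holds a reference to the SparkContext, whose internal locks cannot be pickled. A minimal sketch of a UDF-based workaround, assuming the distinct keys of df2 fit in driver memory, closes over a plain Python set instead:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Collect the distinct keys once on the driver (assumption: they fit in memory)
second_keys = {row.second_key for row in df2.select('second_key').distinct().collect()}

def getImpact(key):
    # A plain Python set is picklable, unlike a DataFrame
    return 1 if key in second_keys else None

udf_sol = udf(getImpact, IntegerType())
df1 = df1.withColumn('impact', udf_sol(df1.first_key))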

Can anyone help me achieve the expected output shown above?

Thanks

2 Answers:

Answer 0: (score: 0)

import numpy as np

df1['final']= np.where(df1['first_key']==df2['second_key'],'1','None')
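Note that np.where works element-wise on aligned, array-like inputs, so this snippet only applies when df1 and df2 are pandas DataFrames of equal length with rows already aligned by position; it will not run on PySpark DataFrames. A minimal pandas sketch under that assumption, with hypothetical data purely for illustration:

import numpy as np
import pandas as pd

# Hypothetical row-aligned frames (assumption: equal length, aligned rows)
df1 = pd.DataFrame({'first_key': ['Key1', 'Key2', 'Key3']})
df2 = pd.DataFrame({'second_key': ['Key1', 'Key9', 'Key3']})

# Element-wise comparison of the two aligned columns
df1['final'] = np.where(df1['first_key'] == df2['second_key'], '1', 'None')
print(df1)
#   first_key final
# 0      Key1     1
# 1      Key2  None
# 2      Key3     1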

Answer 1: (score: 0)

Assuming first_key and second_key are unique, you can opt for a join across the dataframes -

More examples and explanations can be found here.

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F

# Create the SQLContext that sql.createDataFrame below relies on
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)


data_list1 = [
    ("abcd","Key1")
    ,("jkasd","Key2")
    ,("oigoa","Key3")
    ,("ad","Key4")
    ,("bas","Key5")
    ,("lkalsjf","Key6")
    ,("bsawva","Key7")
]

data_list2 = [
    ("cashj","Key1",10)
    ,("ax","Key11",12)
    ,("safa","Key5",21)
    ,("safasf","Key6",78)
    ,("vasv","Key3",4)
    ,("wgaga","Key8",0)
    ,("saasfas","Key7",10)
]

sparkDF1 = sql.createDataFrame(data_list1,['data','first_key'])
sparkDF2 = sql.createDataFrame(data_list2,['temp_data','second_key','frinks'])


>>> sparkDF1.show()
+-------+---------+
|   data|first_key|
+-------+---------+
|   abcd|     Key1|
|  jkasd|     Key2|
|  oigoa|     Key3|
|     ad|     Key4|
|    bas|     Key5|
|lkalsjf|     Key6|
| bsawva|     Key7|
+-------+---------+

>>> sparkDF2.show()
+---------+----------+------+
|temp_data|second_key|frinks|
+---------+----------+------+
|    cashj|      Key1|    10|
|       ax|     Key11|    12|
|     safa|      Key5|    21|
|   safasf|      Key6|    78|
|     vasv|      Key3|     4|
|    wgaga|      Key8|     0|
|  saasfas|      Key7|    10|
+---------+----------+------+

#### Joining the dataframes on the key columns
finalDF = sparkDF1.join(
    sparkDF2,
    sparkDF1['first_key'] == sparkDF2['second_key'],
    'left'
).select(sparkDF1['*'], sparkDF2['frinks']).orderBy('frinks')


### Set impact to 0 when frinks is null (no match), otherwise 1
finalDF = finalDF.withColumn('impact',F.when(F.col('frinks').isNull(),0).otherwise(1))

>>> finalDF.show()

+-------+---------+------+------+
|   data|first_key|frinks|impact|
+-------+---------+------+------+
|  jkasd|     Key2|  null|     0|
|     ad|     Key4|  null|     0|
|  oigoa|     Key3|     4|     1|
|   abcd|     Key1|    10|     1|
| bsawva|     Key7|    10|     1|
|    bas|     Key5|    21|     1|
|lkalsjf|     Key6|    78|     1|
+-------+---------+------+------+
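To match the expected output, which only adds an impact column to df1, the helper frinks column can simply be dropped afterwards:

# Drop the join helper so only df1's columns plus `impact` remain
finalDF = finalDF.drop('frinks')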

