I have the following two DataFrames:
DF1
uid text frequency
11 a 1
12 a 2
12 b 1
DF2
text
a
b
c
d
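I want to create a DataFrame like this, with one row for every combination of uid and text, and a frequency of 0 where df1 has no matching row:
Output df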
uid text frequency
11 a 1
11 b 0
11 c 0
11 d 0
12 a 2
12 b 1
12 c 0
12 d 0
I have been using Spark SQL to write a join like this:
sqlContext.sql("Select uid,df2.text,frequency from df1 right outer join df2 on df1.text= df2.text")
but it does not return the correct result.
Any suggestions on how to do this?
Answer 0 (score: 4)
You have to do something like this:
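import org.apache.spark.sql.functions.{coalesce, lit}
// The $"..." column syntax also needs the session implicits in scope,
// e.g. import sqlContext.implicits._

// Find unique combinations of uid and text (a Cartesian product)
df1.select("uid").distinct.join(df2.distinct)
  // Left join with df1 to pick up the known frequencies
  .join(df1, Seq("uid", "text"), "leftouter")
  // Replace missing values with 0
  .withColumn("frequency", coalesce($"frequency", lit(0)))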
This is roughly equivalent to the following SQL:
WITH tmp AS (SELECT DISTINCT df1.uid, df2.text FROM df1 JOIN df2)
SELECT tmp.uid, tmp.text, COALESCE(df1.frequency, 0) AS frequency
FROM tmp LEFT OUTER JOIN df1
ON tmp.uid = df1.uid AND tmp.text = df1.text
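
For anyone who wants to try the approach end to end, here is a minimal sketch that rebuilds the two DataFrames from the question and applies the same three steps. It is not part of the original answer: it assumes Spark 2.1 or later (for SparkSession and the explicit crossJoin, swapped in for the condition-less join above), and the object name, app name, and local master setting are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit}

// Hypothetical wrapper object, for illustration only
object FillMissingCombinations {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session on Spark 2.1+ (crossJoin was added in 2.1)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-missing-combinations")
      .getOrCreate()
    import spark.implicits._ // enables toDF and the $"..." syntax

    // Sample data copied from the question
    val df1 = Seq((11, "a", 1), (12, "a", 2), (12, "b", 1))
      .toDF("uid", "text", "frequency")
    val df2 = Seq("a", "b", "c", "d").toDF("text")

    val result = df1.select("uid").distinct
      // Every (uid, text) pair, stated as an explicit Cartesian product
      .crossJoin(df2.distinct)
      // Attach the frequencies that exist in df1
      .join(df1, Seq("uid", "text"), "leftouter")
      // Pairs absent from df1 come back as null; turn them into 0
      .withColumn("frequency", coalesce($"frequency", lit(0)))
      .orderBy("uid", "text")

    result.show()
    spark.stop()
  }
}
```

With the sample data this should produce the eight rows listed in the question's desired output. The explicit crossJoin is a design choice rather than a requirement: on Spark 2.x a join without a condition can be rejected as an accidental Cartesian product unless spark.sql.crossJoin.enabled is set, and crossJoin states the intent directly.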