How to join data frames in Apache Spark

Date: 2016-05-19 15:50:20

Tags: sql scala apache-spark hive apache-spark-sql

I have the following two data frames:

DF1

uid   text   frequency
11    a      1
12    a      2
12    b      1

DF2

text
a
b
c
d

I want to create a data frame like this:

Output DF

uid  text  frequency
11   a     1
11   b     0
11   c     0
11   d     0
12   a     2
12   b     1
12   c     0
12   d     0

I have been using Spark SQL to write the join like this:

 sqlContext.sql("SELECT uid, df2.text, frequency FROM df1 RIGHT OUTER JOIN df2 ON df1.text = df2.text")

but it does not return the correct results: the right outer join keeps every text from df2, but texts with no match in df1 come back with a NULL uid instead of a zero-frequency row for each uid.

Any suggestions on how to do this?
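To see why the right outer join alone falls short, its behavior can be simulated in plain Python (without Spark); the data literals mirror the example tables above:

```python
# Simulate: SELECT uid, df2.text, frequency FROM df1 RIGHT OUTER JOIN df2 ON df1.text = df2.text
df1 = [(11, "a", 1), (12, "a", 2), (12, "b", 1)]  # (uid, text, frequency)
df2 = ["a", "b", "c", "d"]

result = []
for text in df2:
    matches = [(uid, t, f) for uid, t, f in df1 if t == text]
    if matches:
        result.extend(matches)
    else:
        # Unmatched texts yield a single row with NULL uid and frequency,
        # not one zero-frequency row per uid.
        result.append((None, text, None))

for row in result:
    print(row)
```

Texts "c" and "d" each produce one row with a NULL uid, which is why the desired per-uid zero rows never appear.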

1 Answer:

Answer 0 (score: 4)

You have to do something like this:

import org.apache.spark.sql.functions.{coalesce, lit}
import sqlContext.implicits._  // for the $ column syntax

// Build every (uid, text) combination with a cross join
df1.select("uid").distinct.join(df2.distinct)
  // Left outer join back to df1 to pick up the known frequencies
  .join(df1, Seq("uid", "text"), "leftouter")
  // Replace missing frequencies with 0
  .withColumn("frequency", coalesce($"frequency", lit(0)))

This is roughly equivalent to the following SQL:

WITH tmp AS (SELECT DISTINCT df1.uid, df2.text FROM df1 CROSS JOIN df2)
SELECT tmp.uid, tmp.text, COALESCE(df1.frequency, 0) AS frequency
FROM tmp LEFT OUTER JOIN df1
ON tmp.uid = df1.uid AND tmp.text = df1.text
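The logic above (cross join to build all (uid, text) pairs, left outer join, then coalesce missing frequencies to 0) can also be sketched in plain Python without Spark, using the example data:

```python
# Plain-Python sketch of the cross-join + left-outer-join + coalesce(0) logic.
df1 = [(11, "a", 1), (12, "a", 2), (12, "b", 1)]  # (uid, text, frequency)
df2 = ["a", "b", "c", "d"]

# Known frequencies keyed by (uid, text) -- plays the role of the left join
freq = {(uid, text): f for uid, text, f in df1}
uids = sorted({uid for uid, _, _ in df1})

# Cross join: every (uid, text) pair; missing pairs coalesce to 0
output = [(uid, text, freq.get((uid, text), 0)) for uid in uids for text in df2]
for row in output:
    print(row)
```

This reproduces the desired output table, including the zero-frequency rows for each uid.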