Given DataFrames df_a and df_b, how can I get the same result as this left excluding join:
SELECT df_a.*
FROM df_a
LEFT JOIN df_b
ON df_a.id = df_b.id
WHERE df_b.id is NULL
I tried:
df_a.join(df_b, df_a("id")===df_b("id"), "left")
.select($"df_a.*")
.where(df_b.col("id").isNull)
The code above throws an exception:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
Answer 0 (score: 2)
If you want to do this with the DataFrame API, try the following example:
import sqlContext.implicits._
val df1 = sc.parallelize(List("a", "b", "c")).toDF("key1")
val df2 = sc.parallelize(List("a", "b")).toDF("key2")
import org.apache.spark.sql.functions._
df1.join(df2,
    df1.col("key1") <=> df2.col("key2"),
    "left")
  .filter(col("key2").isNull)
  .show
You will get this output:
+----+----+
|key1|key2|
+----+----+
|   c|null|
+----+----+
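As an alternative to filtering a left join on nulls, Spark 2.0 and later also accept "left_anti" as a join type, which keeps only the left-hand rows that have no match and returns only the left-hand columns. The following is a minimal sketch under a few assumptions: a Spark 2.x+ SparkSession running in local mode, both DataFrames having a column named id as in the question, and made-up sample data. The names LeftAntiJoinSketch and leftExcluding are hypothetical and used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LeftAntiJoinSketch {
  // Returns the rows of `left` whose `key` value has no equal value in `right`.
  // A left_anti join outputs only the columns of `left`.
  def leftExcluding(left: DataFrame, right: DataFrame, key: String): DataFrame =
    left.join(right, left(key) === right(key), "left_anti")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("left-anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for df_a and df_b.
    val dfA = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val dfB = Seq(1, 2).toDF("id")

    leftExcluding(dfA, dfB, "id").show()

    spark.stop()
  }
}
```

Because the result contains only the left-hand columns, no extra select or null filter is needed afterwards.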
Answer 1 (score: 0)
You can also just run the SQL query itself, which keeps things simple:
df_a.registerTempTable("TableA")
df_b.registerTempTable("TableB")
result = sqlContext.sql("SELECT A.* FROM TableA A \
LEFT JOIN TableB B \
ON A.id = B.id \
WHERE B.id IS NULL ")
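Since the question is written in Scala, the same SQL approach might look roughly like the sketch below. It assumes Spark 2.0 or later, where createOrReplaceTempView is available on DataFrames, and it selects A.* so that only the columns of the left table come back, as in the original query. The object and function names, the local-mode session, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LeftExcludingSqlSketch {
  // Registers both DataFrames as temporary views and runs the left excluding join in SQL.
  def leftExcludingSql(spark: SparkSession, dfA: DataFrame, dfB: DataFrame): DataFrame = {
    dfA.createOrReplaceTempView("TableA")
    dfB.createOrReplaceTempView("TableB")
    spark.sql(
      """SELECT A.*
        |FROM TableA A
        |LEFT JOIN TableB B
        |  ON A.id = B.id
        |WHERE B.id IS NULL""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("left-excluding-sql-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for df_a and df_b.
    val dfA = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val dfB = Seq(1, 2).toDF("id")

    leftExcludingSql(spark, dfA, dfB).show()

    spark.stop()
  }
}
```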