Spark SQL 1.5.2: left excluding join

Time: 2017-04-10 23:10:53

Tags: left-join apache-spark-sql apache-spark-1.5

Given DataFrames df_a and df_b, how can I get the same result as this left excluding join:

SELECT df_a.*
FROM df_a
  LEFT JOIN df_b
    ON df_a.id = df_b.id
WHERE df_b.id is NULL

I tried:

df_a.join(df_b, df_a("id")===df_b("id"), "left")
  .select($"df_a.*")
  .where(df_b.col("id").isNull)

The code above throws an exception:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()

2 answers:

Answer 0: (score: 2)

If you want to do this with the DataFrame API, try the following example:

  import sqlContext.implicits._
  val df1 = sc.parallelize(List("a", "b", "c")).toDF("key1")
  val df2 = sc.parallelize(List("a", "b")).toDF("key2")

  import org.apache.spark.sql.functions._

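  // Left outer join; <=> is null-safe equality, so two null keys would also match
  df1.join(df2,
    df1.col("key1") <=> df2.col("key2"),
    "left")
    // keep only the rows of df1 that found no match in df2
    .filter(col("key2").isNull)
    .show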

You will get this output:

+----+----+
|key1|key2|
+----+----+
|   c|null|
+----+----+
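
The output above still carries the all-null key2 column, whereas the query in the question selects only df_a.*. Below is a minimal sketch of one way to return just the left-hand columns when both frames use the same key column name, as in the question. The object name LeftExcludingJoin, the helper leftExcluding, and the sample data are hypothetical. The code targets the Spark 1.x SQLContext API and has not been verified against 1.5.2 specifically.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object LeftExcludingJoin {
  // Hypothetical helper: rows of `left` whose `key` value has no match in `right`.
  def leftExcluding(left: DataFrame, right: DataFrame, key: String): DataFrame =
    left.join(right, left(key) === right(key), "left_outer")
      .where(right(key).isNull)                    // filter first: unmatched left rows only
      .select(left.columns.map(c => left(c)): _*)  // then project, like SELECT df_a.*

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("left-excluding-join").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Illustrative sample data, not taken from the question
    val df_a = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("id", "value")
    val df_b = sc.parallelize(Seq(1, 2)).toDF("id")

    leftExcluding(df_a, df_b, "id").show()
    sc.stop()
  }
}
```

The filter is applied before the projection, mirroring the SQL evaluation order (WHERE before SELECT), so the right-hand id column is still available when the null check runs.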

Answer 1: (score: 0)

You could also run the SQL query directly, which keeps things simple:

df_a.registerTempTable("TableA")
df_b.registerTempTable("TableB")
result = sqlContext.sql("SELECT * FROM TableA A \
                          LEFT JOIN TableB B \
                          ON A.id = B.id \
                          WHERE B.id is NULL ")
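
The snippet above reads as PySpark. For a Scala program like the one in the question, the same idea might look like the sketch below. It selects A.* so that only the left table's columns come back, as in the question's query; with SELECT * the result also includes TableB's all-null columns. The object name LeftExcludingSql, the helper leftExcludingSql, and the sample data are hypothetical, and the code assumes the Spark 1.x SQLContext API without having been run against 1.5.2.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object LeftExcludingSql {
  // Hypothetical helper: registers both frames as temp tables and runs the query.
  // Both frames must have been created through the given sqlContext.
  def leftExcludingSql(sqlContext: SQLContext, dfA: DataFrame, dfB: DataFrame): DataFrame = {
    dfA.registerTempTable("TableA")
    dfB.registerTempTable("TableB")
    sqlContext.sql(
      """SELECT A.*
        |FROM TableA A
        |LEFT JOIN TableB B
        |  ON A.id = B.id
        |WHERE B.id IS NULL""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("left-excluding-sql").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Illustrative sample data, not taken from the question
    val df_a = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("id", "value")
    val df_b = sc.parallelize(Seq(1, 2)).toDF("id")

    leftExcludingSql(sqlContext, df_a, df_b).show()
    sc.stop()
  }
}
```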