Comparing Spark 2 DataFrames

Time: 2018-01-25 22:58:39

Tags: scala apache-spark apache-spark-sql

I am new to Spark/Scala. I have a master DataFrame with more than 100 million records:

+--------+
|  ttm_id|
+--------+
|39622109|
|39622178|
|39578322|
+--------+

and a changelist DataFrame with roughly 40 million records:

+----------+--------+
|__change__|  ttm_id|
+----------+--------+
|    DELETE|18001570|
|    DELETE|   50520|
|    DELETE|  144440|
|    DELETE|   93130|
|    DELETE|   93140|
+----------+--------+

How can I compare these two DataFrames so that:

if __change__ = DELETE and masterlist.ttm_id = changeset.ttm_id, the matching ttm_id record is removed from the master list?

Thanks!

3 answers:

Answer 0 (score: 1)

I like @MaxU's solution using except. Here is another approach using a left_anti join:

master.join( changelist.where($"__change__" === "DELETE"),
  Seq("ttm_id"), "left_anti"
)

Note that this approach can be expensive for large DataFrames.
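For anyone who wants to try the anti join on small data first, here is a minimal sketch. It assumes a local SparkSession and rebuilds the sample rows from the demo in the other answer; the names master, changelist and remaining are only illustrative. In spark-shell the session lines are redundant, because spark already exists there.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("left-anti-sketch").getOrCreate()
import spark.implicits._

// Small stand-ins for the real 100M-row and 40M-row DataFrames.
val master = Seq(39622109, 39622178, 39578322).toDF("ttm_id")
val changelist = Seq(("DELETE", 39622109), ("DELETE", 50520)).toDF("__change__", "ttm_id")

// left_anti keeps only the master rows that have no match among the DELETE rows,
// and the result contains only the columns of the left side.
val remaining = master.join(
  changelist.where($"__change__" === "DELETE"),
  Seq("ttm_id"),
  "left_anti"
)

remaining.show()
```

Because the result has only the left side's columns, there is no __change__ column to drop afterwards.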

Answer 1 (score: 0)

IIUC, you can do this with the following query:

select * from masterlist
where not exists (select 1 from changeset
                  where masterlist.ttm_id = changeset.ttm_id
                    and changeset.__change__='DELETE');

Demo:

scala> m.show
+--------+
|  ttm_id|
+--------+
|39622109|
|39622178|
|39578322|
+--------+


scala> c.show
+----------+--------+
|__change__|  ttm_id|
+----------+--------+
|    DELETE|39622109|
|    DELETE|   50520|
+----------+--------+


scala> val q="""
     | select * from masterlist
     | where not exists (select ttm_id from changeset
     |                   where masterlist.ttm_id = changeset.ttm_id
     |                     and changeset.__change__='DELETE')
     | """
q: String =
"
select * from masterlist
where not exists (select ttm_id from changeset
                  where masterlist.ttm_id = changeset.ttm_id
                    and changeset.__change__='DELETE')
"

scala> val res = spark.sql(q)
res: org.apache.spark.sql.DataFrame = [ttm_id: int]

scala> res.show
+--------+
|  ttm_id|
+--------+
|39622178|
|39578322|
+--------+
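The demo above does not show how m and c become visible to SQL as masterlist and changeset. A minimal sketch of that missing step, assuming the DataFrames were registered as temporary views with createOrReplaceTempView (the view names are taken from the query; the session setup is an assumption and is redundant inside spark-shell):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("not-exists-sketch").getOrCreate()
import spark.implicits._

// Same sample rows as in the demo.
val m = Seq(39622109, 39622178, 39578322).toDF("ttm_id")
val c = Seq(("DELETE", 39622109), ("DELETE", 50520)).toDF("__change__", "ttm_id")

// Register both DataFrames under the names the SQL text refers to.
m.createOrReplaceTempView("masterlist")
c.createOrReplaceTempView("changeset")

// Keep every master row for which no DELETE row with the same ttm_id exists.
val res = spark.sql("""
  select * from masterlist
  where not exists (select 1 from changeset
                    where masterlist.ttm_id = changeset.ttm_id
                      and changeset.__change__ = 'DELETE')
""")

res.show()
```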

Another solution:

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> m.withColumn("__change__", lit("DELETE")).except(c.select("ttm_id","__change__")).select("ttm_id").show
+--------+
|  ttm_id|
+--------+
|39578322|
|39622178|
+--------+

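Answer 2 (score: 0)

Broadcasting the smaller DataFrame should help reduce the shuffling needed to join the DataFrames.

After the broadcast join, you can use filter and drop to discard the deleted rows and the __change__ column and get the desired result: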

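import org.apache.spark.sql.functions.broadcast

// broadcast() marks changeset as the broadcast side of the join;
// sc.broadcast is for plain values and adds no join hint to a DataFrame
masterlist.join(broadcast(changeset), Seq("ttm_id"), "left")
  // keep rows with no matching change, or whose change is not a DELETE
  .filter($"__change__".isNull || $"__change__" =!= "DELETE")
  .drop("__change__")
  .show(false)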
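The broadcast idea can also be combined with an anti join, which avoids the separate filter and drop steps. The sketch below is one possible variant, not the answer's own code: it uses the broadcast function from org.apache.spark.sql.functions (a join hint, distinct from sc.broadcast), a local session, and the small sample rows from the demo. Whether a changeset of around 40 million rows can actually be broadcast depends on the memory available, so treat the hint as something to test rather than a guaranteed win.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Assumption: a local session for experimenting; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("broadcast-anti-sketch").getOrCreate()
import spark.implicits._

val masterlist = Seq(39622109, 39622178, 39578322).toDF("ttm_id")
val changeset = Seq(("DELETE", 39622109), ("DELETE", 50520)).toDF("__change__", "ttm_id")

// Only the keys of DELETE rows are needed on the broadcast side,
// which keeps the broadcast payload as small as possible.
val deletes = changeset.where($"__change__" === "DELETE").select("ttm_id")

// Hint that `deletes` should be broadcast, then drop matching master rows.
val kept = masterlist.join(broadcast(deletes), Seq("ttm_id"), "left_anti")

// Print the physical plan to check which join strategy Spark chose.
kept.explain()
kept.show()
```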

I hope this answer is helpful.