I'm new to Spark/Scala. I have a master DataFrame with more than 100 million records:
+--------+
|  ttm_id|
+--------+
|39622109|
|39622178|
|39578322|
+--------+
and a changelist DataFrame with about 40 million records:
+----------+--------+
|__change__|  ttm_id|
+----------+--------+
|    DELETE|18001570|
|    DELETE|   50520|
|    DELETE|  144440|
|    DELETE|   93130|
|    DELETE|   93140|
+----------+--------+
How can I compare the two DataFrames so that:
if __change__ = DELETE and masterlist.ttm_id = changeset.ttm_id, the matching ttm_id record is removed from the master list?
Thanks!
Answer 0 (score: 1)
I like @MaxU's solution using except. Here is another approach using a left_anti join:
master.join( changelist.where($"__change__" === "DELETE"),
Seq("ttm_id"), "left_anti"
)
Note that this can be expensive for large DataFrames, since a join of this size typically has to shuffle both sides by ttm_id.
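For reference, here is a minimal, self-contained sketch of the left_anti approach on a few sample rows. The object name `LeftAntiSketch`, the helper `removeDeleted`, and the local SparkSession settings are assumptions made for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LeftAntiSketch {
  // Hypothetical helper: keep only master rows whose ttm_id has
  // no DELETE entry in the changelist.
  def removeDeleted(master: DataFrame, changelist: DataFrame): DataFrame = {
    val deletes = changelist.where(col("__change__") === "DELETE").select("ttm_id")
    master.join(deletes, Seq("ttm_id"), "left_anti")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .appName("left-anti-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val master = Seq(39622109, 39622178, 39578322).toDF("ttm_id")
    val changelist = Seq(("DELETE", 39622109), ("DELETE", 50520)).toDF("__change__", "ttm_id")

    // 39622109 is dropped; 39622178 and 39578322 remain.
    removeDeleted(master, changelist).show()

    spark.stop()
  }
}
```

A left_anti join returns only the columns of the left side, so no extra drop or select is needed afterwards.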
Answer 1 (score: 0)
IIUC, you can do this with the following query:
select * from masterlist
where not exists (select 1 from changeset
                  where masterlist.ttm_id = changeset.ttm_id
                  and changeset.__change__='DELETE');
Demo:
scala> m.show
+--------+
|  ttm_id|
+--------+
|39622109|
|39622178|
|39578322|
+--------+
scala> c.show
+----------+--------+
|__change__|  ttm_id|
+----------+--------+
|    DELETE|39622109|
|    DELETE|   50520|
+----------+--------+
scala> val q="""
| select * from masterlist
| where not exists (select ttm_id from changeset
| where masterlist.ttm_id = changeset.ttm_id
| and changeset.__change__='DELETE')
| """
q: String =
"
select * from masterlist
where not exists (select ttm_id from changeset
where masterlist.ttm_id = changeset.ttm_id
and changeset.__change__='DELETE')
"
scala> val res = spark.sql(q)
res: org.apache.spark.sql.DataFrame = [ttm_id: int]
scala> res.show
+--------+
|  ttm_id|
+--------+
|39622178|
|39578322|
+--------+
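The demo queries tables named masterlist and changeset while the DataFrames are called m and c, so they were presumably registered as temporary views beforehand. The sketch below shows that assumed setup step together with the query; the object name `NotExistsSketch`, the helper `removeDeleted`, and the local session are illustrative assumptions, not something shown in the original transcript.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NotExistsSketch {
  // Hypothetical helper: register both DataFrames as temp views
  // (assumed setup) and run the NOT EXISTS query against them.
  def removeDeleted(spark: SparkSession, m: DataFrame, c: DataFrame): DataFrame = {
    m.createOrReplaceTempView("masterlist")
    c.createOrReplaceTempView("changeset")
    spark.sql(
      """select * from masterlist
        |where not exists (select 1 from changeset
        |                  where masterlist.ttm_id = changeset.ttm_id
        |                    and changeset.__change__ = 'DELETE')""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .appName("not-exists-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val m = Seq(39622109, 39622178, 39578322).toDF("ttm_id")
    val c = Seq(("DELETE", 39622109), ("DELETE", 50520)).toDF("__change__", "ttm_id")

    removeDeleted(spark, m, c).show()

    spark.stop()
  }
}
```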
Another solution:
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> m.withColumn("__change__", lit("DELETE")).except(c.select("ttm_id","__change__")).select("ttm_id").show
+--------+
|  ttm_id|
+--------+
|39578322|
|39622178|
+--------+
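One thing to keep in mind with except: it behaves like SQL's EXCEPT DISTINCT, so duplicate rows in the master list are collapsed in the result, whereas a left_anti join keeps them. The sketch below uses made-up ids (not data from the question) to show the difference; the object and method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object ExceptVsAntiSketch {
  // The except-based approach from the answer above.
  def viaExcept(m: DataFrame, c: DataFrame): DataFrame =
    m.withColumn("__change__", lit("DELETE"))
      .except(c.select("ttm_id", "__change__"))
      .select("ttm_id")

  // The left_anti approach, for comparison.
  def viaAnti(m: DataFrame, c: DataFrame): DataFrame =
    m.join(c.where(col("__change__") === "DELETE"), Seq("ttm_id"), "left_anti")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("except-vs-anti-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: ttm_id 1 appears twice in the master list.
    val m = Seq(1, 1, 2).toDF("ttm_id")
    val c = Seq(("DELETE", 2)).toDF("__change__", "ttm_id")

    println(viaExcept(m, c).count()) // 1: the duplicate row for id 1 is collapsed
    println(viaAnti(m, c).count())   // 2: both rows for id 1 are kept

    spark.stop()
  }
}
```

If the master list is already unique on ttm_id, the two approaches return the same rows. On Spark 2.4 and later, exceptAll is an alternative that preserves duplicates.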
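Answer 2 (score: 0)
Broadcasting the smaller DataFrame should help reduce the shuffle needed to join the two DataFrames.
After the broadcast join, you can use filter and drop on the joined result to get the desired output: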
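import org.apache.spark.sql.functions.broadcast

// broadcast() marks the changeset for a broadcast join; sc.broadcast on a
// DataFrame does not change how the join is executed.
masterlist.join(broadcast(changeset), Seq("ttm_id"), "left")
  .filter($"__change__".isNull || $"__change__" =!= "DELETE")
  .drop("__change__")
  .show(false)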
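As a variation on the broadcast idea, the sketch below first reduces the changeset to the distinct ids marked DELETE and only then broadcasts it, using a left_anti join so that no filter or drop is needed afterwards. The object name `BroadcastAntiSketch` and the helper `removeDeleted` are hypothetical. Whether tens of millions of ids actually fit in memory for a broadcast depends on your cluster, so treat this as something to measure rather than a guaranteed improvement.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object BroadcastAntiSketch {
  // Hypothetical helper: broadcast only the distinct ids marked DELETE.
  def removeDeleted(masterlist: DataFrame, changeset: DataFrame): DataFrame = {
    // Shrink the changeset to one narrow column before broadcasting it.
    val deleteIds = changeset
      .where(col("__change__") === "DELETE")
      .select("ttm_id")
      .distinct()
    // broadcast() marks deleteIds as the side to ship to every executor.
    masterlist.join(broadcast(deleteIds), Seq("ttm_id"), "left_anti")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .appName("broadcast-anti-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val masterlist = Seq(39622109, 39622178, 39578322).toDF("ttm_id")
    val changeset = Seq(("DELETE", 39622109), ("DELETE", 50520)).toDF("__change__", "ttm_id")

    val result = removeDeleted(masterlist, changeset)
    result.explain() // inspect the physical plan to see which join strategy was chosen
    result.show(false)

    spark.stop()
  }
}
```

Filtering before the join also avoids a side effect of the plain left join: if a ttm_id has several rows in the changeset, the left join emits one output row per match.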
I hope this answer helps.