Remove rows from a DataFrame that do not contain a string from another DataFrame

Time: 2019-07-24 16:06:39

Tags: scala apache-spark

Suppose I have two Spark DataFrames, df1 and df2:

Text:            Date:

LongStringID1    2019-01-01
LongStringID2    2019-01-01
LongStringID3    2019-01-01
LongID4String    2019-01-01


ID:

ID2
ID4

In this case, I want to get a new DataFrame that keeps only the df1 records whose Text contains one of the IDs from df2:

Text:            Date:

LongStringID2    2019-01-01
LongID4String    2019-01-01

How can I implement this in Scala?

1 answer:

Answer 0: (score: 1)

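df1 setup:

import spark.implicits._ // needed for toDF; already in scope inside spark-shell

val df1 = Seq(("LongStringID1","2019-01-01"),("LongStringID2","2019-02-01"), ("LongID4String","2019-01-01"),("LongID39String","2019-02-01")).toDF("text","dt")

// registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
df1.createOrReplaceTempView("tbl_df1")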

df2 setup:

val df2 = Seq("ID2", "ID3").toDF("id")

df2.createOrReplaceTempView("tbl_df2")

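Logic (extract the ID token from text with a regex, then join on it):

// 'ID[0-9]+' matches "ID" followed by one or more digits, so "ID39" is extracted whole and does not match "ID3"
spark.sql("select t1.* from tbl_df1 t1 inner join tbl_df2 t2 on t2.id = regexp_extract(t1.text, 'ID[0-9]+', 0)").show()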
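
The same filter can also be sketched with the DataFrame API instead of SQL. This is only an illustration, not the answerer's method: it uses the data from the question, a local SparkSession created just for the example, and a plain substring test (`contains`) in place of the regex. A `left_semi` join keeps each df1 row at most once and returns only df1's columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell the existing session is reused
val spark = SparkSession.builder().master("local[*]").appName("filter-by-substring").getOrCreate()
import spark.implicits._

// Data taken from the question
val df1 = Seq(
  ("LongStringID1", "2019-01-01"),
  ("LongStringID2", "2019-01-01"),
  ("LongStringID3", "2019-01-01"),
  ("LongID4String", "2019-01-01")
).toDF("Text", "Date")

val df2 = Seq("ID2", "ID4").toDF("ID")

// Keep the df1 rows whose Text contains at least one ID from df2
val result = df1.join(df2, col("Text").contains(col("ID")), "left_semi")

result.show()
```

With the question's data this should keep only the LongStringID2 and LongID4String rows. Note that a substring test is looser than the regex approach: an ID such as ID3 would also match a text like LongID39String, so the regex version is the safer choice when one ID can be a prefix of another.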