Suppose I have two Spark DataFrames, df1 and df2.

df1:
Text           Date
LongStringID1  2019-01-01
LongStringID2  2019-01-01
LongStringID3  2019-01-01
LongID4String  2019-01-01

df2:
ID
ID2
ID4
Given these, I want a new DataFrame that keeps only the df1 records whose Text contains one of the IDs in df2:

Text           Date
LongStringID2  2019-01-01
LongID4String  2019-01-01

How can I do this in Scala?
Answer 0 (score: 1)
df1 setup:
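// toDF needs import spark.implicits._ (spark-shell imports it automatically).
val df1 = Seq(("LongStringID1", "2019-01-01"), ("LongStringID2", "2019-02-01"), ("LongID4String", "2019-01-01"), ("LongID39String", "2019-02-01")).toDF("text", "dt")
df1.createOrReplaceTempView("tbl_df1") // registerTempTable is deprecated since Spark 2.0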
df2 setup:
val df2 = Seq("ID2", "ID3").toDF("id")
df2.createOrReplaceTempView("tbl_df2") // registerTempTable is deprecated since Spark 2.0
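Logic:
// Join each df1 row to df2 on the ID token (the literal "ID" followed by digits) extracted from its text.
spark.sql("select t1.* from tbl_df1 t1 inner join tbl_df2 t2 on t2.id = regexp_extract(t1.text, 'ID\\\\d+', 0)").show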
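
The same join can also be written with the DataFrame API instead of SQL. The following is a minimal sketch, not a definitive implementation: it assumes a local SparkSession and reuses the sample data from this answer, and the names extracted_id, matched, and bySubstring are made up for illustration. It also shows a plain substring join using Column.contains, which is closer to the wording of the question but matches more loosely.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_extract

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-on-extracted-id")
  .getOrCreate()
import spark.implicits._

// Sample data copied from the answer above.
val df1 = Seq(
  ("LongStringID1", "2019-01-01"),
  ("LongStringID2", "2019-02-01"),
  ("LongID4String", "2019-01-01"),
  ("LongID39String", "2019-02-01")
).toDF("text", "dt")
val df2 = Seq("ID2", "ID3").toDF("id")

// Variant 1: extract the first "ID<digits>" token (group 0 = whole match)
// into a helper column (hypothetical name) and join on exact equality.
val withKey = df1.withColumn("extracted_id", regexp_extract($"text", "ID\\d+", 0))
val matched = withKey
  .join(df2, withKey("extracted_id") === df2("id"), "inner")
  .select("text", "dt")

// Variant 2: keep rows whose text contains the id as a substring.
// This is a non-equi join, so Spark compares every pair of rows.
val bySubstring = df1
  .join(df2, df1("text").contains(df2("id")), "inner")
  .select("text", "dt")

matched.show()
bySubstring.show()
```

With this sample data, matched holds only the LongStringID2 row, while bySubstring also returns LongID39String because "ID3" is a substring of "ID39". The equality join on the extracted token avoids that kind of false positive and lets Spark use an ordinary equi-join; the substring variant is simpler but only safe when no ID is a prefix of another and df2 is small.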