I am doing an inner join and trying to reproduce the following SQL query with Spark DataFrames:
SELECT DISTINCT a.aid, a.DId, a.BM, a.BY, b.TO FROM GetRaw a
INNER JOIN DF_SD b ON a.aid = b.aid AND a.DId = b.DId AND a.BM = b.BM AND a.BY = b.BY
I converted it to:
val Pr = DF_SD.select("aid","DId","BM","BY","TO").distinct()
  .join(GetRaw, GetRaw("aid") <=> DF_SD("aid")
    && GetRaw("DId") <=> DF_SD("DId")
    && GetRaw("BM") <=> DF_SD("BM")
    && GetRaw("BY") <=> DF_SD("BY"))
My output table contains the columns
"aid","DId","BM","BY","TO","aid","DId","BM","BY"
Can anyone point out what I am doing wrong?
Answer 0 (score: 1)
Do the join first, then select the columns and apply distinct:
val Pr = DF_SD.join(GetRaw,Seq("aid","DId","BM","BY"))
.select("aid","DId","BM","BY","TO").distinct
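To try this end to end, here is a minimal sketch that runs the same join on made-up data. The sample rows, the local SparkSession, and the names JoinOnSeqSketch and joinDistinct are assumptions for illustration; only the join, select, and distinct calls come from the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinOnSeqSketch {
  // Join on the shared key columns by name, then keep the wanted columns and deduplicate.
  def joinDistinct(dfSd: DataFrame, getRaw: DataFrame): DataFrame =
    dfSd.join(getRaw, Seq("aid", "DId", "BM", "BY"))
      .select("aid", "DId", "BM", "BY", "TO")
      .distinct()

  // Builds hypothetical sample data and returns the joined result.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // GetRaw contains one duplicated row so that distinct() has something to remove.
    val GetRaw = Seq((1, 10, 1, 2020), (1, 10, 1, 2020), (2, 20, 2, 2021))
      .toDF("aid", "DId", "BM", "BY")
    val DF_SD = Seq((1, 10, 1, 2020, "x"), (3, 30, 3, 2022, "y"))
      .toDF("aid", "DId", "BM", "BY", "TO")
    joinDistinct(DF_SD, GetRaw)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for experimentation only
      .appName("join-on-seq-sketch")
      .getOrCreate()
    val Pr = build(spark)
    Pr.printSchema() // each key column is listed once, followed by TO
    Pr.show()
    spark.stop()
  }
}
```

With this sample data the result has the five columns aid, DId, BM, BY, TO and a single row, because only one key combination exists in both DataFrames.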
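Answer 1 (score: 1)
You can pass the join column names as a Seq; this is the right way to handle this problem. When the join is expressed this way, each join column appears only once in the output.
See https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html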
val Pr = DF_SD.join(GetRaw,Seq("aid","DId","BM","BY"))
  .dropDuplicates() // optional: drops duplicate rows from the resulting DataFrame
Pr.show()
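One difference is worth keeping in mind: joining on a Seq of column names uses plain equality, so rows whose key is null never match, whereas the `<=>` operator in the question is null-safe. If the null-safe behaviour is actually wanted, one possible approach is to keep the expression join and drop GetRaw's copies of the key columns afterwards. The sketch below assumes made-up sample data, and the names NullSafeJoinSketch and joinNullSafe are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NullSafeJoinSketch {
  val keys: Seq[String] = Seq("aid", "DId", "BM", "BY")

  // Join with null-safe equality on every key, then remove the right-hand key columns.
  def joinNullSafe(dfSd: DataFrame, getRaw: DataFrame): DataFrame = {
    val cond = keys.map(k => getRaw(k) <=> dfSd(k)).reduce(_ && _)
    val joined = dfSd.join(getRaw, cond)
    // drop(Column) removes the column that belongs to getRaw, not the one from dfSd.
    keys.foldLeft(joined)((df, k) => df.drop(getRaw(k))).distinct()
  }

  // Hypothetical sample data: one key row has a null aid on both sides.
  def build(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val GetRaw = Seq[(Option[Int], Int, Int, Int)](
      (Some(1), 10, 1, 2020), (None, 20, 2, 2021)
    ).toDF("aid", "DId", "BM", "BY")
    val DF_SD = Seq[(Option[Int], Int, Int, Int, String)](
      (Some(1), 10, 1, 2020, "x"), (None, 20, 2, 2021, "y")
    ).toDF("aid", "DId", "BM", "BY", "TO")
    (DF_SD, GetRaw)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for experimentation only
      .appName("null-safe-join-sketch")
      .getOrCreate()
    val (dfSd, getRaw) = build(spark)
    joinNullSafe(dfSd, getRaw).show() // includes the row whose aid is null
    dfSd.join(getRaw, keys).show()    // plain equality: the null-key row is absent
    spark.stop()
  }
}
```

If null keys cannot occur in the data, the Seq form from the answer is simpler and gives the same rows.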