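I have the two datasets below and want to filter dataset1 (ds1) based on column values in dataset2 (ds2). Specifically, I need to remove every row of ds1 whose key combination (ID, NAME, ACCID, CURR) appears in ds2 with an AMT of zero.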
val ds1 = Seq(
("T1001","WELLINGTON","991","CAD",150),("T1002","WELLINGTON","8787","CAD",1450),("T1002","WELLINGTON","8787","YEN",3000),
("T1002","WELLINGTON","8787","USD",500),("T1003","WELLINGTON","654","USD",200),("T1003","WELLINGTON","654","INR",3500),
("T1003","WELLINGTON","654","YEN",3500),("T1004","WELLINGTON","7567","YEN",9765),("T1005","WELLINGTON","1456","AUD",8888),
("T1005","WELLINGTON","1456","GBP",666),("T1005","WELLINGTON","1456","YEN",5500)
).toDF("ID","NAME","ACCID","CURR","AMT")
+-----+----------+-----+----+----+
|   ID|      NAME|ACCID|CURR| AMT|
+-----+----------+-----+----+----+
|T1001|WELLINGTON|  991| CAD| 150|
|T1002|WELLINGTON| 8787| CAD|1450|
|T1002|WELLINGTON| 8787| YEN|3000|
|T1002|WELLINGTON| 8787| USD| 500|
|T1003|WELLINGTON|  654| USD| 200|
|T1003|WELLINGTON|  654| INR|3500|
|T1003|WELLINGTON|  654| YEN|3500|
|T1004|WELLINGTON| 7567| YEN|9765|
|T1005|WELLINGTON| 1456| AUD|8888|
|T1005|WELLINGTON| 1456| GBP| 666|
|T1005|WELLINGTON| 1456| YEN|5500|
+-----+----------+-----+----+----+
val ds2 = Seq(
("T1001","WELLINGTON","991","CAD",150),("T1002","WELLINGTON","8787","CAD",1450),("T1002","WELLINGTON","8787","YEN",0),
("T1002","WELLINGTON","8787","USD",500),("T1003","WELLINGTON","654","USD",200),("T1003","WELLINGTON","654","GBP",0),
("T1003","WELLINGTON","654","YEN",0),("T1004","WELLINGTON","7567","YEN",0),("T1005","WELLINGTON","1456","AUD",8888),
("T1005","WELLINGTON","1456","GBP",666),("T1005","WELLINGTON","1456","YEN",0)
).toDF("ID","NAME","ACCID","CURR","AMT")
+-----+----------+-----+----+----+
|   ID|      NAME|ACCID|CURR| AMT|
+-----+----------+-----+----+----+
|T1001|WELLINGTON|  991| CAD| 150|
|T1002|WELLINGTON| 8787| CAD|1450|
|T1002|WELLINGTON| 8787| YEN|   0|
|T1002|WELLINGTON| 8787| USD| 500|
|T1003|WELLINGTON|  654| USD| 200|
|T1003|WELLINGTON|  654| GBP|   0|
|T1003|WELLINGTON|  654| YEN|   0|
|T1004|WELLINGTON| 7567| YEN|   0|
|T1005|WELLINGTON| 1456| AUD|8888|
|T1005|WELLINGTON| 1456| GBP| 666|
|T1005|WELLINGTON| 1456| YEN|   0|
+-----+----------+-----+----+----+
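I am currently using a JOIN followed by a FILTER, but I would like to know whether there is a better approach, since I only need to filter out a few thousand records. The zero-AMT records in dataset2 are likely to be few (about 5,000 rows or fewer), while dataset1 may have millions of rows.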
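import org.apache.spark.sql.functions.expr

// Left-join ds2 onto ds1 by key, exposing ds2's AMT as AMT2 (null when the key is absent from ds2).
// NVL maps those nulls to -99 so unmatched rows survive; only rows with AMT2 = 0 are dropped.
ds1.join(ds2, Seq("ID","NAME","ACCID","CURR"), "leftouter")
.select(ds1("ID"),ds1("NAME"),ds1("ACCID"),ds1("CURR"),ds1("AMT"),ds2("AMT").name("AMT2"))
.filter(expr("NVL(AMT2,-99) != 0")).drop("AMT2")
.show()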
+-----+----------+-----+----+----+
|   ID|      NAME|ACCID|CURR| AMT|
+-----+----------+-----+----+----+
|T1001|WELLINGTON|  991| CAD| 150|
|T1002|WELLINGTON| 8787| CAD|1450|
|T1002|WELLINGTON| 8787| USD| 500|
|T1003|WELLINGTON|  654| USD| 200|
|T1003|WELLINGTON|  654| INR|3500|
|T1005|WELLINGTON| 1456| AUD|8888|
|T1005|WELLINGTON| 1456| GBP| 666|
+-----+----------+-----+----+----+
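
For comparison, one alternative I could imagine is a left anti join against only the zero-AMT keys of ds2, with a broadcast hint on that small side. This is only a sketch, not a measured improvement: the names `keys`, `zeroKeys` and `result` are made up for illustration, the local `SparkSession` setup is an assumption so the snippet runs on its own, and whether broadcasting actually helps depends on the real data sizes and cluster settings.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

// Assumption: a local session for a self-contained run; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("anti-join-sketch").getOrCreate()
import spark.implicits._

// Same sample data as in the question.
val ds1 = Seq(
  ("T1001","WELLINGTON","991","CAD",150),("T1002","WELLINGTON","8787","CAD",1450),("T1002","WELLINGTON","8787","YEN",3000),
  ("T1002","WELLINGTON","8787","USD",500),("T1003","WELLINGTON","654","USD",200),("T1003","WELLINGTON","654","INR",3500),
  ("T1003","WELLINGTON","654","YEN",3500),("T1004","WELLINGTON","7567","YEN",9765),("T1005","WELLINGTON","1456","AUD",8888),
  ("T1005","WELLINGTON","1456","GBP",666),("T1005","WELLINGTON","1456","YEN",5500)
).toDF("ID","NAME","ACCID","CURR","AMT")

val ds2 = Seq(
  ("T1001","WELLINGTON","991","CAD",150),("T1002","WELLINGTON","8787","CAD",1450),("T1002","WELLINGTON","8787","YEN",0),
  ("T1002","WELLINGTON","8787","USD",500),("T1003","WELLINGTON","654","USD",200),("T1003","WELLINGTON","654","GBP",0),
  ("T1003","WELLINGTON","654","YEN",0),("T1004","WELLINGTON","7567","YEN",0),("T1005","WELLINGTON","1456","AUD",8888),
  ("T1005","WELLINGTON","1456","GBP",666),("T1005","WELLINGTON","1456","YEN",0)
).toDF("ID","NAME","ACCID","CURR","AMT")

// Hypothetical helper name: the columns that identify a record.
val keys = Seq("ID", "NAME", "ACCID", "CURR")

// Reduce ds2 to the distinct keys whose AMT is zero; this is the small side of the join.
val zeroKeys = ds2.filter(col("AMT") === 0).select(keys.map(col): _*).distinct()

// left_anti keeps only the ds1 rows that have no matching key in zeroKeys.
// broadcast(...) is a hint asking Spark to ship the small side to every executor.
val result = ds1.join(broadcast(zeroKeys), keys, "left_anti")

result.show()
```

On the sample data this should keep the same seven rows as the JOIN and FILTER version (row order is not guaranteed). Because ds2 is filtered before the join, the extra AMT2 column and the NVL expression are no longer needed.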