Spark: Filtering based on multiple columns

Date: 2018-03-21 22:46:33

Tags: scala apache-spark apache-spark-sql spark-dataframe apache-spark-dataset

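I have two datasets, shown below, and want to filter dataset1 (ds1) based on column values in dataset2 (ds2). That is, I need to match rows on the key combination (ID, NAME, ACCID, CURR) and drop from dataset1 every row whose matching row in dataset2 has an AMT of zero.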

import spark.implicits._  // needed for toDF; assumes a SparkSession named spark, as in spark-shell

val ds1 = Seq(
    ("T1001","WELLINGTON","991","CAD",150),("T1002","WELLINGTON","8787","CAD",1450),("T1002","WELLINGTON","8787","YEN",3000),
    ("T1002","WELLINGTON","8787","USD",500),("T1003","WELLINGTON","654","USD",200),("T1003","WELLINGTON","654","INR",3500),
    ("T1003","WELLINGTON","654","YEN",3500),("T1004","WELLINGTON","7567","YEN",9765),("T1005","WELLINGTON","1456","AUD",8888),
    ("T1005","WELLINGTON","1456","GBP",666),("T1005","WELLINGTON","1456","YEN",5500)
).toDF("ID","NAME","ACCID","CURR","AMT")

+-----+----------+-----+----+----+
|   ID|      NAME|ACCID|CURR| AMT|
+-----+----------+-----+----+----+
|T1001|WELLINGTON|  991| CAD| 150|
|T1002|WELLINGTON| 8787| CAD|1450|
|T1002|WELLINGTON| 8787| YEN|3000|
|T1002|WELLINGTON| 8787| USD| 500|
|T1003|WELLINGTON|  654| USD| 200|
|T1003|WELLINGTON|  654| INR|3500|
|T1003|WELLINGTON|  654| YEN|3500|
|T1004|WELLINGTON| 7567| YEN|9765|
|T1005|WELLINGTON| 1456| AUD|8888|
|T1005|WELLINGTON| 1456| GBP| 666|
|T1005|WELLINGTON| 1456| YEN|5500|
+-----+----------+-----+----+----+

val ds2 = Seq(
    ("T1001","WELLINGTON","991","CAD",150),("T1002","WELLINGTON","8787","CAD",1450),("T1002","WELLINGTON","8787","YEN",0),
    ("T1002","WELLINGTON","8787","USD",500),("T1003","WELLINGTON","654","USD",200),("T1003","WELLINGTON","654","GBP",0),
    ("T1003","WELLINGTON","654","YEN",0),("T1004","WELLINGTON","7567","YEN",0),("T1005","WELLINGTON","1456","AUD",8888),
    ("T1005","WELLINGTON","1456","GBP",666),("T1005","WELLINGTON","1456","YEN",0)
).toDF("ID","NAME","ACCID","CURR","AMT")

+-----+----------+-----+----+----+
|   ID|      NAME|ACCID|CURR| AMT|
+-----+----------+-----+----+----+
|T1001|WELLINGTON|  991| CAD| 150|
|T1002|WELLINGTON| 8787| CAD|1450|
|T1002|WELLINGTON| 8787| YEN|   0|
|T1002|WELLINGTON| 8787| USD| 500|
|T1003|WELLINGTON|  654| USD| 200|
|T1003|WELLINGTON|  654| GBP|   0|
|T1003|WELLINGTON|  654| YEN|   0|
|T1004|WELLINGTON| 7567| YEN|   0|
|T1005|WELLINGTON| 1456| AUD|8888|
|T1005|WELLINGTON| 1456| GBP| 666|
|T1005|WELLINGTON| 1456| YEN|   0|
+-----+----------+-----+----+----+

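I am currently using a JOIN followed by a FILTER, but I would like to know whether there is a better approach, since I only need to filter out a few thousand records. The zero-AMT records in dataset2 are likely to be few (about 5,000 rows or fewer), whereas dataset1 may have millions of rows.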

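import org.apache.spark.sql.functions.expr

// Left outer join on the composite key, then keep rows whose ds2 AMT is not 0.
// NVL maps unmatched rows (AMT2 is null) to -99 so that they are kept as well.
ds1.join(ds2, Seq("ID","NAME","ACCID","CURR"), "leftouter")
  .select(ds1("ID"),ds1("NAME"),ds1("ACCID"),ds1("CURR"),ds1("AMT"),ds2("AMT").name("AMT2"))
  .filter(expr("NVL(AMT2,-99) != 0")).drop("AMT2")
  .show()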

+-----+----------+-----+----+----+
|   ID|      NAME|ACCID|CURR| AMT|
+-----+----------+-----+----+----+
|T1001|WELLINGTON|  991| CAD| 150|
|T1002|WELLINGTON| 8787| CAD|1450|
|T1002|WELLINGTON| 8787| USD| 500|
|T1003|WELLINGTON|  654| USD| 200|
|T1003|WELLINGTON|  654| INR|3500|
|T1005|WELLINGTON| 1456| AUD|8888|
|T1005|WELLINGTON| 1456| GBP| 666|
+-----+----------+-----+----+----+
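
One alternative that might fit the sizes described above is to reduce dataset2 to its zero-AMT keys first and then remove those keys from dataset1 with a broadcast left anti join. The following is only a sketch under assumptions, not a benchmarked answer: the object name `AntiJoinSketch`, the helper `dropZeroAmtKeys`, and the local `SparkSession` setup are hypothetical, the sample rows are a small subset of the data above, and it assumes the zero-AMT key set is small enough to broadcast to every executor.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

// Hypothetical wrapper object, used only for illustration.
object AntiJoinSketch {
  // Composite key taken from the question.
  val keys: Seq[String] = Seq("ID", "NAME", "ACCID", "CURR")

  // Hypothetical helper: keep the rows of `left` whose key does NOT appear
  // in `right` with AMT == 0. Rows without any match in `right` are kept.
  def dropZeroAmtKeys(left: DataFrame, right: DataFrame): DataFrame = {
    // Shrink the right side to the zero-AMT keys before joining.
    val zeroKeys = right.filter(col("AMT") === 0).select(keys.map(col): _*)
    // The broadcast hint asks Spark to ship the small key set to each executor,
    // so the large left side should not need to be shuffled for this join.
    left.join(broadcast(zeroKeys), keys, "left_anti")
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch; a real job would use its own configuration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Subset of ds1 from the question.
    val ds1 = Seq(
      ("T1001", "WELLINGTON", "991", "CAD", 150),
      ("T1002", "WELLINGTON", "8787", "YEN", 3000),
      ("T1003", "WELLINGTON", "654", "INR", 3500),
      ("T1004", "WELLINGTON", "7567", "YEN", 9765)
    ).toDF("ID", "NAME", "ACCID", "CURR", "AMT")

    // Subset of ds2 from the question.
    val ds2 = Seq(
      ("T1001", "WELLINGTON", "991", "CAD", 150),
      ("T1002", "WELLINGTON", "8787", "YEN", 0),
      ("T1003", "WELLINGTON", "654", "GBP", 0),
      ("T1004", "WELLINGTON", "7567", "YEN", 0)
    ).toDF("ID", "NAME", "ACCID", "CURR", "AMT")

    dropZeroAmtKeys(ds1, ds2).show()
    spark.stop()
  }
}
```

On this subset the T1002/YEN and T1004/YEN rows are dropped, while T1001/CAD and T1003/INR are kept (the latter because it has no match in ds2). Because a left anti join returns only the left side's columns, the extra AMT2 column and the NVL filter are not needed. Whether this is actually faster than the left outer join depends on the data and cluster, so it is worth comparing the two plans with `explain()`.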

0 Answers:

No answers