Scala filter on IDs from a string list not working

Time: 2018-10-25 14:19:39

Tags: scala apache-spark

I have a df with an id (bigint) column, and I need to filter out those IDs using a list (of strings).

+-----------+
|id         |
+-----------+
|       1231|
|       1331|
|       1431|
|       1531|
|       9431|
+-----------+

val a = List(1231, 5031, 1331, 1441, 1531)

Expected output:
+-----------+
|id         |
+-----------+
|       1431|
|       9431|
+-----------+

I tried the following:

df.filter(!col("id").isin(a: _*))

But it is not filtering out those IDs. Any idea what is going on here?

1 Answer:

Answer 0 (score: 0)

You can use a udf. Check this out:

scala> val df = Seq(1231,1331,1431,1531,9431).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]

scala> val a= List(1231,5031,1331,1441,1531)
a: List[Int] = List(1231, 5031, 1331, 1441, 1531)

scala> def udf_contain(x:Int)={
     | ! a.contains(x)
     | }
udf_contain: (x: Int)Boolean

scala> val myudf_contain = udf ( udf_contain(_:Int):Boolean )
myudf_contain: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(IntegerType)))

scala> df.filter(myudf_contain('id)).show
+----+
|  id|
+----+
|1431|
|9431|
+----+

scala>
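The predicate wrapped by the udf is plain Scala, so it can be checked outside Spark first. A minimal sketch using the same list names as the transcript above (the `ids` list is just the column values from the example df):

```scala
object ContainCheck {
  def main(args: Array[String]): Unit = {
    val a   = List(1231, 5031, 1331, 1441, 1531)  // values to exclude
    val ids = List(1231, 1331, 1431, 1531, 9431)  // the id column values

    // same logic the udf wraps: keep ids NOT present in the exclusion list
    def udf_contain(x: Int): Boolean = !a.contains(x)

    val kept = ids.filter(udf_contain)
    println(kept)  // List(1431, 9431)
  }
}
```

If this prints the expected survivors, any remaining problem lies in how the predicate is applied to the DataFrame, not in the predicate itself.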

Or the RDD way:

scala> val rdd = Seq(1231,1331,1431,1531,9431).toDF("id").rdd
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[14] at rdd at <console>:32

scala> val a= List(1231,5031,1331,1441,1531)
a: List[Int] = List(1231, 5031, 1331, 1441, 1531)

scala> def udf_contain(x:Int)={
     | ! a.contains(x)
     | }
udf_contain: (x: Int)Boolean

scala>

scala> rdd.filter(x=>udf_contain(Row(x(0)).mkString.toInt)).collect
res29: Array[org.apache.spark.sql.Row] = Array([1431], [9431])
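For completeness, the built-in `Column.isin` from the question can also work without a udf. Since the question mentions an id column of type bigint while the list holds Int values, one thing worth ruling out is a type mismatch; mapping the list to Long makes the types line up explicitly. This is a sketch, not part of the original answer, and it assumes a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// minimal local session for illustration (hypothetical setup, not from the answer)
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val a = List(1231, 5031, 1331, 1441, 1531)
val df = Seq(1231L, 1331L, 1431L, 1531L, 9431L).toDF("id")  // id: bigint

// negated isin; _.toLong casts the Int list to match the bigint column
df.filter(!col("id").isin(a.map(_.toLong): _*)).show()

spark.stop()
```

If this version still returns no rows, the next thing to check is the actual schema of `df` with `df.printSchema()`.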