I have a df with an id (bigint) column, and I need to filter out the IDs that appear in a list(string).
+-----------+
|id |
+-----------+
| 1231|
| 1331|
| 1431|
| 1531|
| 9431|
+-----------+
val a= List(1231,5031,1331,1441,1531)
Expected output:
+-----------+
|id |
+-----------+
| 1431|
| 9431|
+-----------+
I tried the following:
df.filter(!col(("id")).isin(a : _*))
But it is not filtering out these IDs. Any idea what is going on?
Answer 0 (score: 0)
You can use a udf for this. Check this out:
scala> val df = Seq(1231,1331,1431,1531,9431).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> val a= List(1231,5031,1331,1441,1531)
a: List[Int] = List(1231, 5031, 1331, 1441, 1531)
scala> def udf_contain(x:Int)={
| ! a.contains(x)
| }
udf_contain: (x: Int)Boolean
scala> val myudf_contain = udf ( udf_contain(_:Int):Boolean )
myudf_contain: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(IntegerType)))
scala> df.filter(myudf_contain('id)).show
+----+
| id|
+----+
|1431|
|9431|
+----+
scala>
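For comparison, the built-in `isin` from the question can also produce the expected result without a udf. Below is a minimal sketch, assuming the list really holds strings (as the question says) and the column is bigint; the object name `IsinFilterSketch`, the helper `excludeIds`, and the local SparkSession settings are illustrative assumptions, not part of the original post. Converting the strings to `Long` first keeps the comparison in the column's own type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object IsinFilterSketch {
  // Hypothetical helper: keeps only the rows whose id is NOT in `ids`.
  def excludeIds(df: DataFrame, ids: Seq[Long]): DataFrame =
    df.filter(!col("id").isin(ids: _*))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder().master("local[1]").appName("isin-sketch").getOrCreate()
    import spark.implicits._

    // Long values give a bigint column, as in the question.
    val df = Seq(1231L, 1331L, 1431L, 1531L, 9431L).toDF("id")

    // The question mentions a list of strings, so parse them to Long first.
    val a = List("1231", "5031", "1331", "1441", "1531")
    val ids = a.map(_.trim.toLong)

    excludeIds(df, ids).show() // rows left: 1431 and 9431
    spark.stop()
  }
}
```

If the list contains values that are not valid numbers, `toLong` will throw, so real input may need validation first.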
Or the RDD way:
scala> val rdd = Seq(1231,1331,1431,1531,9431).toDF("id").rdd
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[14] at rdd at <console>:32
scala> val a= List(1231,5031,1331,1441,1531)
a: List[Int] = List(1231, 5031, 1331, 1441, 1531)
scala> def udf_contain(x:Int)={
| ! a.contains(x)
| }
udf_contain: (x: Int)Boolean
scala>
scala> rdd.filter(x => udf_contain(x.getInt(0))).collect
res29: Array[org.apache.spark.sql.Row] = Array([1431], [9431])
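The same lambda-style filter can also stay in the Dataset API instead of dropping down to an RDD. This is a rough sketch under the same data as above; the object name `TypedFilterSketch`, the helper `dropIds`, and the local SparkSession settings are hypothetical. It uses a `Set` for the lookup, which is a swapped-in choice rather than what the answer does with `List.contains`.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object TypedFilterSketch {
  // Hypothetical helper: keeps only the ids that are not in `excluded`.
  def dropIds(ids: Dataset[Int], excluded: Set[Int]): Dataset[Int] =
    ids.filter((x: Int) => !excluded.contains(x))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder().master("local[1]").appName("typed-filter-sketch").getOrCreate()
    import spark.implicits._

    // View the single int column as a typed Dataset[Int].
    val ds = Seq(1231, 1331, 1431, 1531, 9431).toDF("id").as[Int]
    val a = List(1231, 5031, 1331, 1441, 1531)

    dropIds(ds, a.toSet).show() // values left: 1431 and 9431
    spark.stop()
  }
}
```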