How to filter rows based on whether a column value is in a set of strings in a Spark DataFrame

Asked: 2015-07-14 01:37:41

Tags: scala apache-spark apache-spark-sql

Is there a more elegant way of filtering based on values in a Set of strings?

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// Keep only the rows whose "action" value is contained in the given set,
// using a UDF that checks set membership.
def myFilter(actions: Set[String], myDF: DataFrame): DataFrame = {
  val containsAction = udf((action: String) => actions.contains(action))

  // 'action resolves to a Column via the SQL implicits (import sqlContext.implicits._)
  myDF.filter(containsAction('action))
}
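
For illustration, calling it with the values from the SQL query below would look like this (a minimal sketch; the set contents are only taken from that example):

val filtered = myFilter(Set("action1", "action2", "action3"), myDF)
filtered.show()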

In SQL you can write:

select * from myTable where action in ('action1', 'action2', 'action3')
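
For reference, the same query can be run against the DataFrame by registering it as a temporary table (a minimal sketch for the Spark 1.x API; the SQLContext instance sqlContext and the table name myTable are assumptions):

// Expose the DataFrame under a table name, then query it with plain SQL.
myDF.registerTempTable("myTable")
val filtered = sqlContext.sql(
  "select * from myTable where action in ('action1', 'action2', 'action3')")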

1 Answer:

Answer (score: 25):

How about this:

myDF.filter("action in (1,2)")

OR

import org.apache.spark.sql.functions.lit       
myDF.where($"action".in(Seq(1,2).map(lit(_)):_*))

OR

import org.apache.spark.sql.functions.lit       
myDF.where($"action".in(Seq(lit(1),lit(2)):_*))
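
Applied to the Set[String] from the question, the same pattern would look roughly like this (a sketch; it assumes actions is the set from the question and that the SQL implicits for the $ column syntax are in scope):

import org.apache.spark.sql.functions.lit

// Turn each string in the set into a literal Column and pass them varargs-style.
myDF.where($"action".in(actions.toSeq.map(lit(_)): _*))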

Additional support will be added to make this cleaner in Spark 1.5.
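
In Spark 1.5 and later, Column has an isin method that takes the values directly, with no lit wrapping (a sketch under the same assumptions about actions and the $ implicits):

// isin accepts plain values and builds the IN predicate for you.
myDF.filter($"action".isin(actions.toSeq: _*))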