Filtering a column on multiple conditions: Scala Spark

Date: 2018-05-09 14:17:34

Tags: scala apache-spark spark-dataframe

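I'm having trouble filtering the rows of a DataFrame when a column has to match one of several values. Basically, I have the values stored in an array and I want to filter the column against all of them. However, I keep getting an error. Can anyone suggest a way to solve this? Below is some sample code showing what I'm trying to do: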

// Now let's filter through the ADM1 codes to select all 50 US States
val stateArray = Array("USAL", "USMD", "USCA", "USME", "USND", "USSD", "USWY", "USAK", "USWA", "USFL",
  "USGA", "USSC", "USNC", "USMA", "USNH", "USVT", "USAR", "USAZ", "USTX", "USLA", "USIL", "USOR", "USNV",
  "USID", "USMN", "USNM", "USNE", "USNJ", "USDE", "USVA", "USWV", "USTN", "USKY", "USNY", "USPA", "USIN",
  "USOH", "USHI", "USOK", "USIA", "USMI", "USMS", "USMO", "USCO", "USKS", "USUT", "USWI", "USMT", "USRI",
  "USCT")

// Let's filter through all of these conditions
val tmpDf3 = tmpDf1.filter(tmpDf1("Actor1Geo_ADM1Code") === stateArray)

// I can do this with a for loop, but I want everything in one data frame
for (n <- stateArray) {
  val tmpDf2 = tmpDf1
    .filter(tmpDf1("Actor1Geo_ADM1Code") === n)
  tmpDf2.show(false)
  tmpDf2.printSchema()
}

1 answer:

Answer 0 (score: 3)

Use isin and pass the array as varargs:

tmpDf1.filter(tmpDf1("Actor1Geo_ADM1Code").isin(stateArray: _*))
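
The : _* after stateArray matters because isin is declared with a repeated parameter (isin(list: Any*)). A minimal plain-Scala sketch of the difference, with no Spark involved; countArgs is a hypothetical stand-in for any varargs method, not part of the Spark API:

```scala
object VarargsSketch {
  // Hypothetical stand-in for a varargs method such as Column.isin(list: Any*).
  // It only reports how many arguments it received.
  def countArgs(values: Any*): Int = values.length

  def main(args: Array[String]): Unit = {
    val codes = Array("USAL", "USMD", "USCA")

    // Without ": _*" the whole array is passed as ONE argument.
    println(countArgs(codes))      // 1

    // With ": _*" each element becomes its own argument.
    println(countArgs(codes: _*))  // 3
  }
}
```

With a Column, the first form would hand isin a single array value instead of the individual state codes, which is presumably not what the question intends.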

Example

val states = Array("USAL", "USMD")
// states: Array[String] = Array(USAL, USMD)

val df = Seq((1, "USAL"), (2, "USMD"), (3, "USGA")).toDF("id", "Actor1Geo_ADM1Code")
// df: org.apache.spark.sql.DataFrame = [id: int, Actor1Geo_ADM1Code: string]

df.show
+---+------------------+
| id|Actor1Geo_ADM1Code|
+---+------------------+
|  1|              USAL|
|  2|              USMD|
|  3|              USGA|
+---+------------------+


df.filter(df("Actor1Geo_ADM1Code").isin(states: _*)).show
+---+------------------+
| id|Actor1Geo_ADM1Code|
+---+------------------+
|  1|              USAL|
|  2|              USMD|
+---+------------------+
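
For use outside the REPL, the same idea can be wrapped into a small standalone program. This is only a sketch under a few assumptions: Spark SQL is on the classpath, a local SparkSession is acceptable, and the object name IsinSketch and the helpers keepCodes, dropCodes and keepCodesViaOr are hypothetical names chosen for illustration. It also shows the negated filter and an equivalent built by OR-ing equality tests, which is roughly what the for loop in the question does one value at a time.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object IsinSketch {
  // Hypothetical helper: keep rows whose `column` value appears in `codes`.
  def keepCodes(df: DataFrame, column: String, codes: Seq[String]): DataFrame =
    df.filter(col(column).isin(codes: _*))

  // Hypothetical helper: keep rows whose `column` value does NOT appear in `codes`.
  def dropCodes(df: DataFrame, column: String, codes: Seq[String]): DataFrame =
    df.filter(!col(column).isin(codes: _*))

  // Same result as keepCodes, built by OR-ing one equality test per code.
  // reduce throws on an empty collection, so `codes` must be non-empty here.
  def keepCodesViaOr(df: DataFrame, column: String, codes: Seq[String]): DataFrame =
    df.filter(codes.map(c => col(column) === c).reduce(_ || _))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is fine for this small demonstration.
    val spark = SparkSession.builder()
      .appName("isin-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val states = Seq("USAL", "USMD")
    val df = Seq((1, "USAL"), (2, "USMD"), (3, "USGA"))
      .toDF("id", "Actor1Geo_ADM1Code")

    keepCodes(df, "Actor1Geo_ADM1Code", states).show()      // rows with id 1 and 2
    dropCodes(df, "Actor1Geo_ADM1Code", states).show()      // row with id 3
    keepCodesViaOr(df, "Actor1Geo_ADM1Code", states).show() // rows with id 1 and 2

    spark.stop()
  }
}
```

All three helpers return a single DataFrame, so the result can be shown, joined or written out in one step instead of once per state.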