When condition in groupBy in Spark SQL

Date: 2018-12-20 21:49:41

Tags: scala apache-spark

I want to apply a condition in the groupBy operation of a Spark DataFrame. If the first condition is satisfied, select column "A"; otherwise, select column "B" of the given DataFrame.

Returning a single column as the groupBy column based on a condition is fairly easy.

For example:

df.groupBy(when(col("name") === "a",col("city")).otherwise(col("country"))).agg(lit("Individual").alias("level")).show

The above code gives me the result. But it fails when I want to return multiple columns based on the if condition.

My code:

val df = Seq(
  ("a", "abcdef", "123" ,"def", "uyhiu"),
  ("a", "7yjbb", "345" ,"hgh", "hjjhj"),
  ("d", "sbkbnn", "456","gyu", "hghj" )
).toDF("name", "email", "phone", "city", "country")

val list1 = Array("phone", "city")
val list2 = Array("phone", "country")

df.groupBy(when(col("name") === "a",list1.map(col): _*).otherwise(list2.map(col):_*)).agg(lit("Individual").alias("level")).show

But I am getting this error:

<console>:52: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
       df.groupBy(when(col("name") === "a",list1.map(col): _*).otherwise(list2.map(col):_*)).agg(lit("Individual").alias("level")).show
                                                           ^
<console>:52: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
       df.groupBy(when(col("name") === "a",list1.map(col): _*).otherwise(list2.map(col):_*)).agg(lit("Individual").alias("level")).show
                                                                                    ^

2 Answers:

Answer 0 (score: 0):

You have to apply the when expression to both columns:

df.groupBy(
  when(col("name") === "a", col("phone")).otherwise(col("city")),
  when(col("name") === "a", col("phone")).otherwise(col("country"))
)
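For completeness, here is a sketch of the full pipeline with the aggregation from the question attached (assuming the same sample df):

df.groupBy(
    when(col("name") === "a", col("phone")).otherwise(col("city")),
    when(col("name") === "a", col("phone")).otherwise(col("country"))
  )
  .agg(lit("Individual").alias("level"))
  .show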

Of course, you can pre-build them with some collection operations:

val names = Vector(("phone", "city"), ("phone", "country"))

val columns = names.map {
  case (ifTrue, ifFalse) =>
    when(col("name") === "a", col(ifTrue)).otherwise(col(ifFalse))
}

df.groupBy(columns: _*)
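Because each when expression picks between two different source columns, the grouping columns come out with auto-generated names. A sketch that aliases them to something readable (the "_or_" naming is an assumption, not part of the original answer):

val aliased = names.zip(columns).map {
  // Alias each conditional column, e.g. "phone_or_city", so the output headers stay legible.
  case ((ifTrue, ifFalse), c) => c.alias(s"${ifTrue}_or_${ifFalse}")
}

df.groupBy(aliased: _*).agg(lit("Individual").alias("level")).show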

Answer 1 (score: 0):

In my opinion, the approach you are using is not correct. You cannot dynamically change the columns of the groupBy clause for each record. They can be the result of some expression, but you cannot manipulate the column names themselves. You can use filters and do a union afterwards.

scala> val df = Seq(
     |   ("a", "abcdef", "123" ,"def", "uyhiu"),
     |   ("a", "7yjbb", "345" ,"hgh", "hjjhj"),
     |   ("d", "sbkbnn", "456","gyu", "hghj" )
     | ).toDF("name", "email", "phone", "city", "country")
df: org.apache.spark.sql.DataFrame = [name: string, email: string ... 3 more fields]

scala>  val list1 = Array("phone", "city")
list1: Array[String] = Array(phone, city)

scala> val list2 = Array("phone", "country")
list2: Array[String] = Array(phone, country)

scala> val df1 = df.filter("name='a'").groupBy(list1.map(col(_)):_*).agg(lit("Individual").alias("level"))
df1: org.apache.spark.sql.DataFrame = [phone: string, city: string ... 1 more field]

scala> val df2 = df.filter("name!='a'").groupBy(list2.map(col(_)):_*).agg(lit("Individual").alias("level"))
df2: org.apache.spark.sql.DataFrame = [phone: string, country: string ... 1 more field]

scala> df1.union(df2).show
+-----+----+----------+
|phone|city|     level|
+-----+----+----------+
|  345| hgh|Individual|
|  123| def|Individual|
|  456|hghj|Individual|
+-----+----+----------+
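Note that DataFrame.union matches columns by position, not by name, which is why the country value "hghj" from df2 ends up under the city header above. If the mixed header matters, a sketch that aliases both grouping columns to a common name before the union (the name "location" is an assumption):

// Give both branches the same neutral column name so the union headers are honest.
val dfA = df.filter("name='a'")
  .groupBy(col("phone"), col("city").alias("location"))
  .agg(lit("Individual").alias("level"))

val dfB = df.filter("name!='a'")
  .groupBy(col("phone"), col("country").alias("location"))
  .agg(lit("Individual").alias("level"))

dfA.union(dfB).show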

