Question

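I want to apply a condition inside the groupBy operation of a Spark DataFrame: if the condition is satisfied, group by column "A", otherwise group by column "B" of the given DataFrame.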

Returning a single column from the condition inside groupBy is easy.

For example:

df.groupBy(when(col("name") === "a",col("city")).otherwise(col("country"))).agg(lit("Individual").alias("level")).show

The code above gives me the expected result. But if I try to return multiple columns from the condition, it fails.

My code:

val df = Seq(
  ("a", "abcdef", "123" ,"def", "uyhiu"),
  ("a", "7yjbb", "345" ,"hgh", "hjjhj"),
  ("d", "sbkbnn", "456","gyu", "hghj" )
).toDF("name", "email", "phone", "city", "country")

val list1 = Array("phone", "city")
val list2 = Array("phone", "country")

df.groupBy(when(col("name") === "a",list1.map(col): _*).otherwise(list2.map(col):_*)).agg(lit("Individual").alias("level")).show

But I get the following error:

:52: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
       df.groupBy(when(col("name") === "a",list1.map(col): _*).otherwise(list2.map(col):_*)).agg(lit("Individual").alias("level")).show
:52: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
       df.groupBy(when(col("name") === "a",list1.map(col): _*).otherwise(list2.map(col):_*)).agg(lit("Individual").alias("level")).show
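
The compiler rejects the call because `: _*` can only expand a sequence into a varargs (`*`) parameter. `groupBy` takes `cols: Column*`, but `when(condition, value)` and `otherwise(value)` each take exactly one value. A minimal plain-Scala sketch of the same rule, with no Spark involved; `takesMany` and `takesOne` are hypothetical stand-ins, not Spark functions:

```scala
object VarargsSketch {
  // Hypothetical stand-in for a varargs method such as groupBy(cols: Column*)
  def takesMany(xs: String*): Int = xs.length

  // Hypothetical stand-in for a single-value method such as when(condition, value)
  def takesOne(x: Any): String = x.toString

  def main(args: Array[String]): Unit = {
    val names = Array("phone", "city")

    // Compiles: the array is expanded into the *-parameter
    println(takesMany(names: _*))

    // Does not compile: takesOne has no *-parameter to expand into
    // takesOne(names: _*)
  }
}
```

Uncommenting the last call reproduces the "no `: _*' annotation allowed here" error.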

Answer 1

You have to apply the when expression to both columns:

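df.groupBy(
  // first grouping key: phone in both branches
  when(col("name") === "a", col("phone")).otherwise(col("phone")),
  // second grouping key: city when name is "a", otherwise country
  when(col("name") === "a", col("city")).otherwise(col("country"))
)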

Of course, you can pre-build them with some collection operations:

// one (column when name is "a", column otherwise) pair per grouping key
val names = Vector(("phone", "phone"), ("city", "country"))

val columns = names.map {
  case (ifTrue, ifFalse) =>
    when(col("name") === "a", col(ifTrue)).otherwise(col(ifFalse))
}

df.groupBy(columns: _*)
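
For completeness, here is a sketch of how this approach might look as a standalone program that starts from the `list1` and `list2` arrays in the question and pairs them position by position. The object name, the `groupingColumns` helper, the `key0`/`key1` aliases, and the local `SparkSession` settings are assumptions made for illustration. It also assumes both lists have the same length, since `zip` silently drops unmatched trailing elements.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object ConditionalGroupBy {
  // Hypothetical helper: one when/otherwise expression per grouping position
  def groupingColumns(ifTrue: Array[String], ifFalse: Array[String]): Array[Column] = {
    val isA = col("name") === "a"
    ifTrue.zip(ifFalse).zipWithIndex.map { case ((t, f), i) =>
      // alias gives each key a neutral name instead of a generated CASE WHEN label
      when(isA, col(t)).otherwise(col(f)).alias(s"key$i")
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session; spark-shell already provides `spark`
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("conditional-groupBy")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("a", "abcdef", "123", "def", "uyhiu"),
      ("a", "7yjbb", "345", "hgh", "hjjhj"),
      ("d", "sbkbnn", "456", "gyu", "hghj")
    ).toDF("name", "email", "phone", "city", "country")

    val list1 = Array("phone", "city")    // keys when name is "a"
    val list2 = Array("phone", "country") // keys otherwise

    df.groupBy(groupingColumns(list1, list2): _*)
      .agg(lit("Individual").alias("level"))
      .show()

    spark.stop()
  }
}
```

Here `: _*` is legal because it expands the array into `groupBy`, which does take a varargs parameter.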

Answer 2

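It seems to me that your approach is not correct. You cannot dynamically change the column names of the groupBy clause for each record. A grouping key can be the result of some expression, but you cannot manipulate the column names themselves. You can use filters and do a union afterwards.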

scala> val df = Seq(
     |   ("a", "abcdef", "123" ,"def", "uyhiu"),
     |   ("a", "7yjbb", "345" ,"hgh", "hjjhj"),
     |   ("d", "sbkbnn", "456","gyu", "hghj" )
     | ).toDF("name", "email", "phone", "city", "country")
df: org.apache.spark.sql.DataFrame = [name: string, email: string ... 3 more fields]

scala>  val list1 = Array("phone", "city")
list1: Array[String] = Array(phone, city)

scala> val list2 = Array("phone", "country")
list2: Array[String] = Array(phone, country)

scala> val df1 = df.filter("name='a'").groupBy(list1.map(col(_)):_*).agg(lit("Individual").alias("level"))
df1: org.apache.spark.sql.DataFrame = [phone: string, city: string ... 1 more field]

scala> val df2 = df.filter("name!='a'").groupBy(list2.map(col(_)):_*).agg(lit("Individual").alias("level"))
df2: org.apache.spark.sql.DataFrame = [phone: string, country: string ... 1 more field]

scala> df1.union(df2).show
+-----+----+----------+
|phone|city|     level|
+-----+----+----------+
|  345| hgh|Individual|
|  123| def|Individual|
|  456|hghj|Individual|
+-----+----+----------+


scala>
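
One thing to keep in mind with this approach: `union` matches columns by position, not by name, so in the output above the column labelled `city` holds a country value (`hghj`) for the row that came from `df2`. Also, the SQL filter `name!='a'` drops rows whose `name` is null. A sketch of the same filter-and-union idea with neutral key names, where the object name, the `keyed` helper, the `key0`/`key1` aliases, and the local session setup are assumptions for illustration:

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object FilterThenUnion {
  // Hypothetical helper: rename grouping keys by position so both branches agree
  def keyed(names: Array[String]): Array[Column] =
    names.zipWithIndex.map { case (n, i) => col(n).alias(s"key$i") }

  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df = Seq(
      ("a", "abcdef", "123", "def", "uyhiu"),
      ("a", "7yjbb", "345", "hgh", "hjjhj"),
      ("d", "sbkbnn", "456", "gyu", "hghj")
    ).toDF("name", "email", "phone", "city", "country")

    val list1 = Array("phone", "city")
    val list2 = Array("phone", "country")

    val df1 = df.filter(col("name") === "a")
      .groupBy(keyed(list1): _*)
      .agg(lit("Individual").alias("level"))
    val df2 = df.filter(col("name") =!= "a")
      .groupBy(keyed(list2): _*)
      .agg(lit("Individual").alias("level"))

    // Both sides now expose key0, key1, level in the same order
    df1.union(df2)
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session; spark-shell already provides `spark`
    val spark = SparkSession.builder().master("local[*]").appName("filter-union").getOrCreate()
    run(spark).show()
    spark.stop()
  }
}
```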
