Question

My data is stored in a Spark DataFrame in (roughly) the following form:

Col1 Col2

A1   -5
B1   -20
C1   7
A2   3
B2   -4
C2   17

I want to turn it into:

Col3 Col4

A    -2
B    -24
C    24

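(That is, summing the numbers for A and merging X1 and X2 into X.)

How can I do this using the DataFrame API?

Edit:

The col1 values are actually arbitrary strings (endpoints) that I want to combine into a single column (span), perhaps in the form "A1-A2". I plan to map endpoints to other endpoints in a Map and query it in my UDF. Can my UDF return None? Say I don't want to include A in col3 at all, but I do want to include B and C: could I add another case to your example so that the A rows are skipped when mapping col1 to col3?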

Answer 1

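You can simply extract the group column and use it as the grouping key for the aggregation. Assuming your data follows the pattern in the example:

With raw SQL: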

case class Record(Col1: String, Col2: Int)

val df = sqlContext.createDataFrame(Seq(
    Record("A1", -5),
    Record("B1", -20),
    Record("C1", 7),
    Record("A2", 3),
    Record("B2", -4),
    Record("C2", 17)))

df.registerTempTable("df")

sqlContext.sql(
    """SELECT col3, sum(col2) AS col4 FROM (
        SELECT col2, SUBSTR(Col1, 1, 1) AS col3 FROM df
    ) tmp GROUP BY col3""").show

+----+----+
|col3|col4|
+----+----+
|   A|  -2|
|   B| -24|
|   C|  24|
+----+----+

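With the Scala API:

import org.apache.spark.sql.functions.{udf, sum}
import sqlContext.implicits._ // enables the $"..." column syntax

// Group key: the first character of col1
val getGroup = udf((s: String) => s.substring(0, 1))

df
  .select(getGroup($"col1").alias("col3"), $"col2")
  .groupBy($"col3")
  .agg(sum($"col2").alias("col4"))
  .show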

+----+----+
|col3|col4|
+----+----+
|   A|  -2|
|   B| -24|
|   C|  24|
+----+----+
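For a fixed-width prefix like this one, a built-in column function may be enough and a UDF is not strictly needed. The following is a minimal sketch, assuming Spark 2.x or later (it uses `SparkSession` rather than the `sqlContext` shown above) and a local master chosen purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{substring, sum}

// Assumption: Spark 2.x+ running locally; the app name is arbitrary.
val spark = SparkSession.builder.master("local[*]").appName("computed-groups").getOrCreate()
import spark.implicits._

// Same sample data as in the question.
val df = Seq(("A1", -5), ("B1", -20), ("C1", 7), ("A2", 3), ("B2", -4), ("C2", 17))
  .toDF("col1", "col2")

// substring is 1-based: take one character starting at position 1.
val result = df
  .select(substring($"col1", 1, 1).alias("col3"), $"col2")
  .groupBy($"col3")
  .agg(sum($"col2").alias("col4"))

result.orderBy($"col3").show()
```

Built-in functions are visible to the query optimizer, whereas a UDF is opaque to it, so this form is usually preferable when the grouping rule is simple enough to express this way.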

If the group pattern is more complex, you can simply adjust SUBSTR or the getGroup function. For example:

val getGroup = {
  val pattern = "^[A-Z]+".r
  udf((s: String) => pattern.findFirstIn(s) match {
    case Some(g) => g
    case None => "Unknown"
  })
}
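To see what this pattern extracts, the matching logic can be tried outside Spark. In the sketch below, `groupOf` is a hypothetical helper that mirrors the body of the UDF above, and the sample inputs are made up for illustration:

```scala
// Same regular expression as in the UDF: one or more leading uppercase letters.
val pattern = "^[A-Z]+".r

// Hypothetical helper mirroring the UDF body, without any Spark dependency.
def groupOf(s: String): String = pattern.findFirstIn(s).getOrElse("Unknown")

// Made-up inputs: a single-letter prefix, a multi-letter prefix, and no match.
Seq("A1", "AB12", "x9").foreach(s => println(s"$s -> ${groupOf(s)}"))
// A1 -> A
// AB12 -> AB
// x9 -> Unknown
```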

Edit:

If you want to ignore some groups, just add a WHERE clause. With raw SQL this is simple, but with the Scala API it takes a little more effort:

import org.apache.spark.sql.functions.{not, lit}

df
  .select(...) // As before
  .where(not($"col3".in(lit("A"))))
  .groupBy(...).agg(...) // As before

If you want to discard multiple groups, you can use varargs:

val toDiscard = List("A", "B").map(lit(_))

df
    .select(...)
    .where(not($"col3".in(toDiscard: _*)))
    .groupBy(...).agg(...) // As before
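As a hedged sketch of both filtering variants on a newer release: `Column.isin` (available since Spark 1.5) accepts plain values as varargs, and the raw SQL version only needs a `NOT IN` clause. The example assumes Spark 2.x or later with a local `SparkSession`, and uses the built-in `substring` function in place of the `getGroup` UDF:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{substring, sum}

// Assumption: Spark 2.x+ running locally; the app name is arbitrary.
val spark = SparkSession.builder.master("local[*]").appName("skip-groups").getOrCreate()
import spark.implicits._

val df = Seq(("A1", -5), ("B1", -20), ("C1", 7), ("A2", 3), ("B2", -4), ("C2", 17))
  .toDF("col1", "col2")

val toDiscard = Seq("A", "B")

// DataFrame API: isin takes plain values, so no lit(...) wrapping is needed.
val viaApi = df
  .select(substring($"col1", 1, 1).alias("col3"), $"col2")
  .where(!$"col3".isin(toDiscard: _*))
  .groupBy($"col3")
  .agg(sum($"col2").alias("col4"))

// Raw SQL: the same filter expressed as a WHERE clause on the subquery result.
df.createOrReplaceTempView("df")
val viaSql = spark.sql(
  """SELECT col3, sum(col2) AS col4 FROM (
    |  SELECT col2, SUBSTR(col1, 1, 1) AS col3 FROM df
    |) tmp WHERE col3 NOT IN ('A', 'B') GROUP BY col3""".stripMargin)

viaApi.show()
viaSql.show()
```

With the sample data, both queries should leave only group C.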

Can my UDF return None?

It cannot, but it can return null:

val getGroup2 = udf((s: String) => s.substring(0, 1) match {
    case x if x != "A" => x
    case _ => null: String
})

df
  .select(getGroup2($"col1").alias("col3"), $"col2")
  .where($"col3".isNotNull)
  .groupBy(...).agg(...) // As before

+----+----+
|col3|col4|
+----+----+
|   B| -24|
|   C|  24|
+----+----+
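The same null-returning approach could be combined with the Map lookup described in the question. This is only a sketch: the `spans` map and its contents are hypothetical, and the code assumes Spark 2.x or later with a local `SparkSession`. Endpoints missing from the map produce null and are filtered out before grouping:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, udf}

// Assumption: Spark 2.x+ running locally; the app name is arbitrary.
val spark = SparkSession.builder.master("local[*]").appName("span-lookup").getOrCreate()
import spark.implicits._

val df = Seq(("A1", -5), ("B1", -20), ("C1", 7), ("A2", 3), ("B2", -4), ("C2", 17))
  .toDF("col1", "col2")

// Hypothetical endpoint-to-span mapping; A1 and A2 are deliberately left out.
val spans = Map(
  "B1" -> "B1-B2", "B2" -> "B1-B2",
  "C1" -> "C1-C2", "C2" -> "C1-C2")

// Returns the span for a known endpoint, or null when the endpoint is not mapped.
val lookup: String => String = s => spans.get(s).orNull
val toSpan = udf(lookup)

val result = df
  .select(toSpan($"col1").alias("span"), $"col2")
  .where($"span".isNotNull)
  .groupBy($"span")
  .agg(sum($"col2").alias("total"))

result.orderBy($"span").show()
```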

How to aggregate data with computed groups
