Question

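So I have a DataFrame of values that I need to sum and then save to Cassandra in Map[String,Long] format.

The code below works, but I'd like to know whether the map can be created from an abstracted list of columns. (Looking at the source code for the functions only confused me further.)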

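// spark-shell normally has these imports in scope already; elsewhere they are
// required (spark is assumed to be the active SparkSession).
import org.apache.spark.sql.functions._
import spark.implicits._
val cols = Array("key", "v1", "v2")
val df = Seq(("a",1,0),("b",1,0),("a",1,1),("b",0,0)).toDF(cols: _*)
val df1 = df.groupBy(col(cols(0))).
  agg(map(lit(cols(1)), sum(col(cols(1))), lit(cols(2)), sum(col(cols(2)))) as "map")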

This is the DataFrame format I expect, and it is also what the code above currently produces:

scala> df1.show(false)
+---+---------------------+
|key|map                  |
+---+---------------------+
|b  |Map(v1 -> 1, v2 -> 0)|
|a  |Map(v1 -> 2, v2 -> 1)|
+---+---------------------+

I'd like to see a function that returns the same result as above but lets me supply the columns programmatically by name. For example:

var columnNames = Array("v1", "v2")
df.groupBy(col(cols(0))).agg(create_sum_map(columnNames) as "map")

Is this even remotely possible in Spark?

Answer 1

There's no need for slow UDFs; you can achieve this with pure built-in Spark functions and varargs, see e.g. Spark SQL: apply aggregate functions to a list of columns. This solution requires building a list of the columns to which the aggregations are applied. Here it is a bit more complicated because you want a map in the final output, which requires an extra step.

First create the expressions (columns) to use in the aggregation:

val exprs = cols.tail.flatMap(c => Seq(lit(c), sum(col(c))))

Apply the group by and use the created exprs:

val df2 = df.groupBy(col(cols.head)).agg(exprs.head, exprs.tail:_*)
  .select(col(cols.head), map(cols.tail.flatMap(c => Seq(col(c), col(s"sum($c)"))):_*).as("map"))

An extra select is needed before creating the map, and cols.tail.flatMap(c => Seq(col(c), col(s"sum($c)"))) is simply the list of new columns that should be added to the map.

The resulting output is the same as the table shown in the question.
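To make the flatMap step in this answer easier to follow, here is a minimal Spark-free sketch of the same interleaving pattern. It uses plain strings in place of Column objects, and the names InterleaveSketch and interleave are made up for illustration; they are not part of Spark or of the original answer.

```scala
object InterleaveSketch {
  // Hypothetical helper: for each column name, emit a "key" entry followed by
  // a "value" entry, mirroring Seq(lit(c), sum(col(c))) in the answer.
  def interleave(names: Seq[String]): Seq[String] =
    names.flatMap(c => Seq(s"lit($c)", s"sum($c)"))

  def main(args: Array[String]): Unit = {
    // flatMap flattens the per-column pairs into one sequence, which is the
    // key1, value1, key2, value2, ... order that map(...) expects as varargs.
    println(interleave(Seq("v1", "v2")).mkString(", "))
    // prints: lit(v1), sum(v1), lit(v2), sum(v2)
  }
}
```

The flattened sequence is what gets expanded with :_* when it is passed to a varargs function such as map or agg.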

Answer 2

So I figured out how to produce the result I wanted, based on @Shaido's answer.

def create_sum_map(cols: Array[String]): Column = 
  map(cols.flatMap(c => Seq(lit(c), sum(col(c)))):_*)

df.groupBy(col(cols.head)).agg(create_sum_map(columnNames) as "map")

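I assume this works because each affected column is wrapped in the aggregate sum(Column) inside create_sum_map(), so the map passed to .agg() is built entirely from aggregate expressions and literals.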
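For anyone who wants to try this outside spark-shell, the following is a rough self-contained sketch of the same idea. The object name SumMapSketch, the camel-cased createSumMap, and the local[*] session are my own assumptions for illustration; only map, lit, sum, col, groupBy and agg come from Spark itself.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, map, sum}

object SumMapSketch {
  // Builds map(lit("v1"), sum(col("v1")), lit("v2"), sum(col("v2")), ...)
  // from a list of column names.
  def createSumMap(names: Seq[String]): Column =
    map(names.flatMap(c => Seq(lit(c), sum(col(c)))): _*)

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in spark-shell, spark already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sum-map-sketch")
      .getOrCreate()
    import spark.implicits._

    val cols = Seq("key", "v1", "v2")
    val df = Seq(("a", 1, 0), ("b", 1, 0), ("a", 1, 1), ("b", 0, 0)).toDF(cols: _*)

    // Group by the first column and sum every remaining column into one map.
    df.groupBy(col(cols.head))
      .agg(createSumMap(cols.tail).as("map"))
      .show(false)

    spark.stop()
  }
}
```

Because sum over an integer column yields a long, the resulting column should have type map<string,bigint>, which lines up with the Map[String,Long] the question asks for.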

Using a UDF on an array of column names to merge columns into a single map