Question

My input dataset looks like ds[(T, U)]. T and U are as below.

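import org.apache.spark.sql.functions.sum

// Group by the composite key, then sum each value column under a new name.
ds.groupBy("key1", "key2", ...)
      .agg(
        sum("value1").alias("value11"),
        sum("value2").alias("value22"),
        ...)
      .select("key1", "key2", ..., "value11", "value22", "fileId", ...)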

The aggregation looks like:

{{1}}

This is the final output. Is there a better way, performance-wise, to achieve the same output using groupByKey / reduceGroups or some other approach?

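The input dataset is generated by processing rows. Each row contains nested objects, and we iterate over those objects to extract the keys and values from each row. What is an efficient way to combine these two steps? Would a custom UDAF be a better fit for this case?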

Answer 1

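In terms of performance, this is as good as it gets. Switching to a statically typed Dataset with groupByKey / reduceGroups can only degrade performance or, at best, provide no improvement whatsoever. The untyped groupBy / agg with built-in functions such as sum is fully visible to Spark's optimizer, whereas typed lambdas are opaque to it and require each row to be deserialized into an object first.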
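To make the comparison concrete, here is a minimal sketch of both forms side by side. The question does not show T and U, so the case classes Rec and Agg, their fields, and the sample rows are assumptions made up for illustration. Both methods should return the same sums; comparing the two explain() plans is one way to see what each variant costs on your own data.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical record types: the question's T and U are not shown.
case class Rec(key1: String, key2: String, value1: Long, value2: Long)
case class Agg(key1: String, key2: String, value11: Long, value22: Long)

object CompositeKeyAggregation {

  // Untyped form from the question: built-in sum over a composite key.
  def untyped(ds: Dataset[Rec]): Dataset[Agg] = {
    val spark = ds.sparkSession
    import spark.implicits._
    ds.groupBy("key1", "key2")
      .agg(sum("value1").alias("value11"), sum("value2").alias("value22"))
      .as[Agg]
  }

  // Typed alternative: the key function and the reducer are plain Scala
  // lambdas, so Spark has to materialize Rec objects to run them.
  def typed(ds: Dataset[Rec]): Dataset[Agg] = {
    val spark = ds.sparkSession
    import spark.implicits._
    ds.groupByKey(r => (r.key1, r.key2))
      .reduceGroups((a: Rec, b: Rec) =>
        a.copy(value1 = a.value1 + b.value1, value2 = a.value2 + b.value2))
      .map { case (_, r) => Agg(r.key1, r.key2, r.value1, r.value2) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("composite-key-agg")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val ds = Seq(
      Rec("a", "x", 1L, 10L),
      Rec("a", "x", 2L, 20L),
      Rec("b", "y", 3L, 30L)
    ).toDS()

    // Print the physical plans so the two variants can be compared.
    untyped(ds).explain()
    typed(ds).explain()

    spark.stop()
  }
}
```

The typed version is kept here only as a point of comparison; per the answer above, the groupBy / agg form is the one to prefer.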

Aggregation on a Dataset with a composite key
