Question

I have a large Spark DataFrame. After a groupBy-count operation, I get a summary of my dataset that looks like this:

myResultDF

+---+---+-----+----------+
|SEX|AGE|count|    result|
+---+---+-----+----------+
|  1|  4| 1420| 0.2665724|
|  2|  8|  801|0.32442601|
|  1|  1| 2123| 0.2259348|
|  2|  3| 1329| 0.2732647|
|  2|  2| 1224|0.28158098|
|  1|  2| 1295|0.27588340|
|  2|  6| 1063| 0.2958312|
+---+---+-----+----------+

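Now I want to generate a histogram of result over the range 0 to 1 that takes the count column into account.

So far, following this link, I can create a histogram of result without taking the count column into account: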

// Bucket edges 0.0, 0.1, ..., 1.0
val histogramX = (0 to 10).map(_.toDouble / 10).toArray

// Counts rows per bucket; the count column is ignored here
val histogramY = myResultDF
      .select("result")
      .rdd
      .map(row => row.getDouble(0))
      .histogram(histogramX, true)

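This code only yields a histogram with six occurrences between 0.2 and 0.3 and one occurrence between 0.3 and 0.4.

What I want instead is (1420 + 2123 + 1329 + 1224 + 1295 + 1063) occurrences between 0.2 and 0.3, and 801 occurrences between 0.3 and 0.4.

Any suggestions for this calculation would be greatly appreciated :)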

Answer 1

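I did not use rdd.histogram to solve my problem. Since the values in my histogram lie between 0 and 1 and the bins are uniformly spaced (on the order of 10 bins), I simply truncated the result column with (floor($"result" * histoBins) / histoBins) and performed a groupBy-count on the original DataFrame.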
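The truncation idea can be sketched as follows. This is a variation, not the exact code from the answer: it works on the summarized DataFrame from the question and sums the count column per bin instead of counting rows of the original data. The object name WeightedHistogram, the helper weightedHistogram, the output column names bin and total, and the local SparkSession are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, floor, sum}

object WeightedHistogram {

  // Hypothetical helper: truncates "result" to the lower edge of its bin
  // and sums the "count" column within each bin.
  def weightedHistogram(df: DataFrame, histoBins: Int): DataFrame =
    df.withColumn("bin", floor(col("result") * histoBins) / histoBins)
      .groupBy("bin")
      .agg(sum("count").as("total"))
      .orderBy("bin")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("weighted-histogram")
      .getOrCreate()
    import spark.implicits._

    // The summary rows shown in the question
    val myResultDF = Seq(
      (1, 4, 1420L, 0.2665724),
      (2, 8, 801L, 0.32442601),
      (1, 1, 2123L, 0.2259348),
      (2, 3, 1329L, 0.2732647),
      (2, 2, 1224L, 0.28158098),
      (1, 2, 1295L, 0.27588340),
      (2, 6, 1063L, 0.2958312)
    ).toDF("SEX", "AGE", "count", "result")

    weightedHistogram(myResultDF, 10).show()

    spark.stop()
  }
}
```

With the sample rows, the bin starting at 0.2 should total 8454 (the sum of the six counts listed in the question) and the bin starting at 0.3 should total 801. One difference from rdd.histogram is worth noting: a result of exactly 1.0 lands in its own bin at 1.0 here, whereas rdd.histogram includes the upper edge in the last bucket.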
