Question

I have been counting "games" using spark-sql. The first way is like this:

val gamesByVersion = dataframe.groupBy("game_version", "server").count().withColumnRenamed("count", "patch_games")

val games_count1 = gamesByVersion.where($"game_version" === 1 && $"server" === 1)

The second way is like this:

val gamesDf = dataframe.
  groupBy($"hero_id", $"position", $"game_version", $"server").count().
  withColumnRenamed("count", "hero_games")

val games_count2 = gamesDf.where($"game_version" === 1 && $"server" === 1).agg(sum("hero_games"))

For all intents and purposes, dataframe has only the columns hero_id, position, game_version, and server.

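However, games_count1 comes out at about 10, while games_count2 comes out at 50. Clearly the two counting methods are not equivalent, or something else is going on, but I would like to understand: what causes the difference?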

Answer 1

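My guess is that it is because the first query groups by only two columns, while the second groups by four. Grouping on just two columns may therefore give you fewer distinct groups.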
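
To make that guess concrete, here is a minimal sketch on made-up toy data. The rows, the object name GroupCountSketch, and the helper run are hypothetical and not taken from the question; it also assumes Spark is on the classpath and uses a local session. It compares how many groups each grouping produces for game_version = 1 and server = 1, and what the counts add up to.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical object and helper names; not part of the original question.
object GroupCountSketch {
  // Returns, for game_version = 1 and server = 1:
  // (groups in the 2-column grouping, groups in the 4-column grouping,
  //  patch_games, summed hero_games).
  def run(): (Long, Long, Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("group-count-sketch")
      .getOrCreate()
    import spark.implicits._

    try {
      // Assumed toy rows: (hero_id, position, game_version, server).
      val dataframe = Seq(
        (1, 1, 1, 1),
        (1, 2, 1, 1),
        (2, 1, 1, 1),
        (2, 1, 1, 1),
        (3, 1, 2, 1)
      ).toDF("hero_id", "position", "game_version", "server")

      // Same two groupings as in the question.
      val gamesByVersion = dataframe
        .groupBy("game_version", "server").count()
        .withColumnRenamed("count", "patch_games")
      val gamesDf = dataframe
        .groupBy($"hero_id", $"position", $"game_version", $"server").count()
        .withColumnRenamed("count", "hero_games")

      val wanted = $"game_version" === 1 && $"server" === 1
      val coarse = gamesByVersion.where(wanted)
      val fine = gamesDf.where(wanted)

      // Number of result rows, i.e. distinct groups, in each grouping.
      val coarseGroups = coarse.count()
      val fineGroups = fine.count()

      // Number of underlying games according to each grouping.
      val patchGames = coarse.select("patch_games").as[Long].head()
      val heroGamesTotal = fine.agg(sum("hero_games")).as[Long].head()

      (coarseGroups, fineGroups, patchGames, heroGamesTotal)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (coarseGroups, fineGroups, patchGames, heroGamesTotal) = run()
    println(s"groups: $coarseGroups vs $fineGroups")
    println(s"games:  $patchGames vs $heroGamesTotal")
  }
}
```

On this toy data the two-column grouping yields 1 group and the four-column grouping yields 3, while both report 4 games in total. If the real numbers still disagree, it may be worth checking whether one of the values being compared is a number of groups rather than a sum, or whether both DataFrames were built from the same input.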

Difference between these two count methods in Spark
