Question

I have a DataFrame X in SparkR. X contains a column ID = 1 2 3 1 2 3 9 ... and a score for each entry: score = 1241 233 20100 ....

So, to find all the scores for a given ID:

s=filter(X, X$ID==1)

This gives us all the scores for ID 1, and we can then compute their sum.

I want to know how many rows in X have ID = 1, so I use the 'count' function in SparkR:

count(s)

But this takes a long time to compute. Is there a better way?

Suppose we have arranged or sorted X so that ID = 1 1 1 2 3 3 3 4 ..... Then maybe there is a better option that avoids doing the count.

Answer 1

Aggregating on ID and counting the items in each group gives you the result for all IDs at once. That said, with only 100000 rows it shouldn't take long at all!

countedData <- agg(groupBy(X, "ID"), count = n(X[["score"]]))
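Since the question also wants the sum of the scores, the same grouped aggregation can return both the count and the total in one pass. The following is only a sketch under assumptions: it assumes Spark 2.0 or later (for `sparkR.session()`), the sample values in `localX` are made up for illustration, and the names `localX`, `perId`, `total` and `id1` are hypothetical.

```r
library(SparkR)
sparkR.session()  # assumption: Spark 2.0+ with a working local installation

# Made-up sample data standing in for X (not the asker's real values)
localX <- data.frame(ID    = c(1, 2, 3, 1, 2, 3, 9),
                     score = c(1241, 233, 20100, 10, 20, 30, 40))
X <- createDataFrame(localX)

# One grouped aggregation: row count and score total for every ID
perId <- agg(groupBy(X, "ID"),
             count = n(X$score),
             total = sum(X$score))

# The aggregated result is small, so bring the row for ID 1 to the driver
id1 <- collect(filter(perId, perId$ID == 1))
print(id1)
```

Once `perId` has been computed, looking up any other ID is a filter on a table with one row per distinct ID, rather than another scan of X.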
