Question

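I created a DataFrame from mtcars, grouped it by gear and cyl, and then computed the maximum of hp and disp. Something is wrong with the grouping: there should be 8 groups, but I only get 6.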

library(SparkR)
xx=as.DataFrame(sqlContext, data = mtcars)

head(agg(groupBy(xx, "gear", "cyl"), hp = 'max'))
  gear cyl max(hp)
1    3   8     245
2    5   4     113
3    3   4      97
4    4   4     109
5    5   6     175
6    3   6     110

Update 1:

I have another question. The documentation for groupBy contains this example:

## Examples

## Not run: 
  # Compute the average for all numeric columns grouped by department.
  avg(groupBy(df, "department"))

  # Compute the max age and average salary, grouped by department and gender.
  agg(groupBy(df, "department", "gender"), salary="avg", "age" -> "max")

## End(Not run)

Following the same pattern for mtcars, I came up with:

agg(groupBy(xx, "gear", "cyl"), qsec ="avg", "disp" -> "max")

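First, my understanding is that this should return the maximum of disp, but the code does not work and gives the error below. Second, the example uses = for one argument and -> for the other. Is that a typo?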

unable to find an inherited method for function ‘groupBy’ for signature ‘"function"’

My SparkR version is SparkR_1.6.1.

Answer 1

Your aggregation is fine, but you are wrapping it in 'head', which shows only the first 6 rows. Replace it with collect, like this:

df <- as.DataFrame(mtcars)
gp = agg(groupBy(df, df$gear, df$cyl), hp = 'max')
collect(gp)
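
To see why head appears to drop two groups, the same grouping can be checked in plain R without a Spark session. This is only a minimal sketch using base R's aggregate on the local mtcars data; the names local_max, n_groups and shown are illustrative and not part of the original code.

```r
# Base R check (no Spark needed): group mtcars by gear and cyl, take max hp.
local_max <- aggregate(hp ~ gear + cyl, data = mtcars, FUN = max)

# Number of gear/cyl combinations actually present in mtcars.
n_groups <- nrow(local_max)

# head() returns at most 6 rows by default, so some groups are not displayed.
shown <- head(local_max)

print(n_groups)
print(nrow(shown))
```

In SparkR, collect(gp) brings every row back to R. Passing a larger row count, as in head(gp, 8), or checking nrow(gp) should also confirm the number of groups.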

Just a note: I am using Spark 2.0.2.
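
Regarding the error in the update: as a hedged guess, the signature "function" suggests that the first argument passed to groupBy was an R function rather than a DataFrame. That would happen if the documentation example were run without defining df, because df is also a function in R's stats package. Separately, -> in R is the rightward assignment operator, not a name/value separator. The sketch below checks both facts in base R only; the variable names are illustrative, and the commented SparkR call is an assumption modelled on the hp = 'max' form used in the question.

```r
# With no user-defined `df` in scope, the name resolves to stats::df
# (the F-distribution density), which is a function, not a DataFrame.
df_is_function <- is.function(stats::df)

# `"disp" -> "max"` is parsed as the assignment `"max" <- "disp"`.
parsed <- quote("disp" -> "max")
assign_op <- as.character(parsed[[1]])

print(df_is_function)
print(assign_op)

# Assumption: naming every aggregation with `=` avoids the ambiguity, e.g.
# agg(groupBy(xx, "gear", "cyl"), qsec = "avg", disp = "max")
```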

Groupby in SparkR does not give the desired result
