I have this Spark table, named xydata:
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...

and a handle named xy_df connected to that table.
I want to invoke the selectExpr function to subtract each column's mean from the column, for example:

xy_centered <- xy_df %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("( y0-mean(y0) ) AS y0mean"))

and likewise for all the other columns. But when I run it, this error appears:

Error: org.apache.spark.sql.AnalysisException: expression 'y0' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

I know this is because, under ordinary SQL rules, I did not add a GROUP BY clause for the column that appears inside the aggregate function. How can I put the mean into this method without running into the GROUP BY requirement?
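For illustration, the rule the error refers to can be reproduced with the same invoke pattern. This is a minimal sketch, assuming the xy_df handle from above and a live Spark connection; it is not a fix, just a demonstration of which projections Spark accepts:

```r
library(sparklyr)
library(dplyr)

# Accepted without GROUP BY: every selected expression is an
# aggregate, so the query collapses to a single row.
y0_mean_sdf <- xy_df %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("mean(y0) AS y0_mean"))

# Rejected: y0 appears both bare and inside an aggregate, and is
# neither grouped nor aggregated -- this raises the
# AnalysisException shown above.
# xy_df %>%
#   spark_dataframe() %>%
#   invoke("selectExpr", list("( y0 - mean(y0) ) AS y0_centered"))
```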
Previously, I managed to accomplish this another way, namely by:

1. calculating the mean of each column using summarize_all,
2. collecting the means into R, and
3. applying the means back with invoke and selectExpr,

as explained in this answer. But now I am trying to speed up execution by keeping all the operations inside Spark itself, without retrieving anything back to R.
My Spark version is 1.6.0.