Question

How do I calculate a cumulative sum in sparklyr?

With dplyr on a local data frame:

iris %>% group_by(Species) %>% mutate(col = cumsum(Sepal.Length))

cumsum does not appear to be a function included in sparklyr. How can I reproduce this behaviour in sparklyr?

I think the Spark SQL would be something like the following?

SELECT
    *,
    sum(Sepal.Length) OVER (PARTITION BY Species ORDER BY index) as col
FROM
iris

Update: cumsum is a function that can be used in sparklyr. It just needs an arrange verb to be called first (which is not necessary in local R):

iris %>% 
  sdf_copy_to %>% 
  group_by(Species) %>% 
  arrange(Sepal.Length) %>%
  mutate(col = cumsum(Sepal.Length))

Answer 1

If you know the right syntax, you can write SQL from within sparklyr. In this case the raw SQL (assuming the rows are ordered by Sepal_Length) would be:

SELECT * 
  , SUM(Sepal_Length) OVER (PARTITION BY Species ORDER BY Sepal_Length) AS CumSum
FROM iris
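
The raw SQL above can also be sent to Spark directly through the DBI interface. A minimal sketch, assuming `sc` is an open sparklyr connection and the DBI package is installed; the function name `cumsum_via_sql` is made up for illustration:

```r
# Sketch: run the window-function SQL as a plain query.
# Assumes `sc` is an open sparklyr connection, e.g. from
# sparklyr::spark_connect(master = "local").
cumsum_via_sql <- function(sc) {
  # Register the local data frame as a Spark table named "iris".
  # sparklyr replaces "." in column names with "_",
  # so Sepal.Length becomes Sepal_Length.
  sparklyr::sdf_copy_to(sc, iris, name = "iris", overwrite = TRUE)

  # dbGetQuery() collects the result into a local data frame.
  DBI::dbGetQuery(sc, "
    SELECT *,
           SUM(Sepal_Length) OVER (
             PARTITION BY Species
             ORDER BY Sepal_Length
           ) AS CumSum
    FROM iris
  ")
}

# Usage (needs a running Spark connection):
# sc <- sparklyr::spark_connect(master = "local")
# head(cumsum_via_sql(sc))
```

Because the result is collected locally, this is only suitable when the output is small enough to fit in R's memory.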

If you want to do this in sparklyr, just do the following:

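# Note: `iris` here must be a Spark table (tbl_spark), for example the
# result of sdf_copy_to(sc, iris), not the local data frame.
iris2 <- iris %>%
  mutate(CumSum = sql("
    SUM(Sepal_Length) OVER (PARTITION BY Species ORDER BY Sepal_Length)
  "))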

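One detail worth checking with the window in this answer: when ORDER BY is given without an explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on Sepal_Length all receive the same running total. Local cumsum() advances row by row instead. A small base R sketch of the difference, using made-up values rather than iris:

```r
# Made-up values with a tie on 2, already sorted by the ordering column.
x <- c(1, 2, 2, 3)

# Row-by-row running total, like local cumsum() or an explicit
# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame.
rows_total <- cumsum(x)

# Default RANGE frame: every row tied on the ordering value sees the
# total up to and including the last of its peers.
range_total <- ave(rows_total, x, FUN = max)

print(rows_total)   # [1] 1 3 5 8
print(range_total)  # [1] 1 5 5 8
```

If the row-by-row result is wanted in Spark, adding ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW after the ORDER BY clause inside OVER (...) should give it, although the order among tied rows is then not guaranteed.
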
Answer 2

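The example code in the question's update is not valid syntax, and using SQL is cumbersome. I believe the following is the true sparklyr way: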

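library(tidyverse)
library(sparklyr)

data("iris")

# Assumes a local Spark installation; adjust `master` for your cluster.
sc <- spark_connect(master = "local")

iris %>%
  sdf_copy_to(sc = sc, overwrite = TRUE) %>%
  group_by(Species) %>%
  arrange(Sepal_Length) %>%   # cumsum() on Spark needs an explicit ordering
  mutate(col = cumsum(Sepal_Length)) %>%
  ungroup()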
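
To see what the pipeline above turns into, the generated SQL can be printed with show_query(). The sketch below also uses window_order() from dbplyr, which sets the ordering for window functions without sorting the output, as a possible alternative to arrange(). It assumes `sc` is an open sparklyr connection; the function name `cumsum_via_dplyr` is made up for illustration:

```r
# Sketch: cumulative sum per group with window_order(), plus a look at
# the SQL that dplyr generates. Assumes `sc` is an open sparklyr connection.
cumsum_via_dplyr <- function(sc) {
  # Bind the pipe locally so the sketch does not depend on library() calls.
  `%>%` <- dplyr::`%>%`

  iris_tbl <- sparklyr::sdf_copy_to(sc, iris, name = "iris", overwrite = TRUE)

  result <- iris_tbl %>%
    dplyr::group_by(Species) %>%
    dbplyr::window_order(Sepal_Length) %>%  # ordering for window functions only
    dplyr::mutate(col = cumsum(Sepal_Length)) %>%
    dplyr::ungroup()

  dplyr::show_query(result)  # prints the translated SQL
  result                     # still a lazy Spark table; nothing is collected
}

# Usage (needs a running Spark connection):
# sc <- sparklyr::spark_connect(master = "local")
# cumsum_via_dplyr(sc)
```

The printed query is a convenient way to check which PARTITION BY, ORDER BY, and frame clauses the translation actually emits.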
