我想按特定列对 sparklyr 表的行进行分组,并计算满足特定条件的行。
例如,在以下钻石表中,我想group_by颜色,并计算价格> 400的行数。
> library(sparklyr)
> library(tidyverse)
> con = spark_connect(....)
> diamonds = copy_to(con, diamonds)
> diamonds
# Source: table<diamonds> [?? x 10]
# Database: spark_connection
carat cut color clarity depth table price x y z
<dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.230 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
2 0.210 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
3 0.230 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
5 0.310 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
6 0.240 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
7 0.240 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
8 0.260 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
9 0.220 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
10 0.230 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
这是我在普通R中会以多种方式完成的任务。但是没有一个在sparklyr中起作用。
例如:
> diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
> diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=length(price[price>400]))
这适用于传统的数据框架:
# A tibble: 7 x 3
color n n_expensive
<ord> <int> <int>
1 D 6775 6756
2 E 9797 9758
3 F 9542 9517
4 G 11292 11257
5 H 8304 8274
6 I 5422 5379
7 J 2808 2748
但不是火花:
diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'sum((CAST(diamonds.`price` AS BIGINT) > 400L))' due to data type mismatch: function sum requires numeric types, not BooleanType; l
ine 1 pos 33;
Error in eval_bare(call, env) : object 'price' not found
答案 0 :(得分:2)
可能与该类型存在冲突。将逻辑转换为integer
可以解决问题
library(sparklyr)
library(dplyr)
con <- spark_connect(master = "local")
library(ggplot2)
data(diamonds)
diamonds1 = copy_to(con, diamonds)
diamonds1 %>%
group_by(color) %>%
summarise(n=n(), n_expensive = sum(as.integer(price > 400)))
-output
答案 1 :(得分:2)
你必须在这里考虑SQL表达式,例如if_else
:
diamonds_sdl %>% group_by(color) %>%
summarise(n=n(), n_expensive=sum(if_else(price > 400, 1, 0)))
带投射的 sum
:
diamonds_sdl %>% group_by(color) %>%
summarise(n=n(), n_expensive=sum(as.numeric(price > 400)))