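Count the number of elements in a group that satisfy a condition

Time: 2018-02-15 10:40:23

Tags: r apache-spark group-by tidyverse sparklyr

I want to group the rows of a sparklyr table by a specific column and count the rows that satisfy a given condition.

For example, in the diamonds table below, I want to group_by color and count the rows where price > 400.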

> library(sparklyr)
> library(tidyverse)
> con = spark_connect(....)

> diamonds_sdl = copy_to(con, diamonds)
> diamonds_sdl
# Source:   table<diamonds> [?? x 10]
# Database: spark_connection
   carat cut       color clarity depth table price     x     y     z
   <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.230 Ideal     E     SI2      61.5  55.0   326  3.95  3.98  2.43
 2 0.210 Premium   E     SI1      59.8  61.0   326  3.89  3.84  2.31
 3 0.230 Good      E     VS1      56.9  65.0   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4  58.0   334  4.20  4.23  2.63
 5 0.310 Good      J     SI2      63.3  58.0   335  4.34  4.35  2.75
 6 0.240 Very Good J     VVS2     62.8  57.0   336  3.94  3.96  2.48
 7 0.240 Very Good I     VVS1     62.3  57.0   336  3.95  3.98  2.47
 8 0.260 Very Good H     SI1      61.9  55.0   337  4.07  4.11  2.53
 9 0.220 Fair      E     VS2      65.1  61.0   337  3.87  3.78  2.49
10 0.230 Very Good H     VS1      59.4  61.0   338  4.00  4.05  2.39

This is a task I could do in several ways in plain R, but none of them works in sparklyr.

For example:

 > diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
 > diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=length(price[price>400]))

Both work on an ordinary data frame:

# A tibble: 7 x 3
  color     n n_expensive
  <ord> <int>       <int>
1 D      6775        6756
2 E      9797        9758
3 F      9542        9517
4 G     11292       11257
5 H      8304        8274
6 I      5422        5379
7 J      2808        2748

But not on Spark:

diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'sum((CAST(diamonds.`price` AS BIGINT) > 400L))' due to data type mismatch: function sum requires numeric types, not BooleanType; line 1 pos 33;

diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=length(price[price>400]))
Error in eval_bare(call, env) : object 'price' not found

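2 answers:

Answer 0 (score: 2)

This looks like a type conflict: as the error message says, Spark SQL's sum requires a numeric type, not a boolean. Casting the logical to integer solves the problem.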

library(sparklyr)
library(dplyr)
con <- spark_connect(master = "local")
library(ggplot2)
data(diamonds)
diamonds1 = copy_to(con, diamonds)
diamonds1 %>% 
         group_by(color) %>%
         summarise(n=n(), n_expensive = sum(as.integer(price > 400)))
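
One way to see why the cast matters is to look at the SQL that dplyr generates. The sketch below is not part of the original answer. It assumes the dbplyr package is installed and uses lazy_frame() with simulate_hive() as a stand-in for a real Spark connection, so the SQL that sparklyr actually emits may differ in detail. The names diamonds_sim, query and sql_text are introduced here for illustration only.

```r
library(dplyr)
library(dbplyr)

# Assumption: a simulated Hive backend stands in for Spark, so no cluster is needed.
# The two columns mirror color and price from the diamonds table.
diamonds_sim <- lazy_frame(color = "E", price = 326L, con = simulate_hive())

query <- diamonds_sim %>%
  group_by(color) %>%
  summarise(n = n(), n_expensive = sum(as.integer(price > 400), na.rm = TRUE))

# Render the translated SQL as text. The comparison should appear inside a
# cast before the sum is applied, so the sum no longer receives a boolean.
sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```

On a live connection, calling show_query() on the same pipeline built from diamonds1 should print the query that is actually sent to Spark.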


Answer 1 (score: 2)

You have to think in terms of SQL expressions here, for example if_else:

diamonds_sdl %>% group_by(color) %>% 
  summarise(n=n(), n_expensive=sum(if_else(price > 400, 1, 0)))

or sum with a cast:

diamonds_sdl %>% group_by(color) %>%
   summarise(n=n(), n_expensive=sum(as.numeric(price > 400)))
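
To sanity-check either rewrite, the same expressions can be run on the local data frame, where the logical sum from the question also works. This is a minimal sketch, assuming dplyr and ggplot2 (for the diamonds data) are installed. The name local_counts and the three column names are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Compute the per-color count three ways on the local (non-Spark) data frame.
local_counts <- diamonds %>%
  group_by(color) %>%
  summarise(
    n         = n(),
    n_logical = sum(price > 400),                   # plain R: sums TRUE as 1
    n_cast    = sum(as.integer(price > 400)),       # explicit cast to integer
    n_if_else = sum(if_else(price > 400, 1, 0))     # SQL-style conditional
  )

print(local_counts)
```

If the three count columns agree locally, the n_expensive column returned by Spark can be compared against them after calling collect() on the sparklyr result.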