R: Group by with sparklyr ("sum", "count distinct", "mean")

Date: 2019-01-14 14:39:50

Tags: r apache-spark sparklyr

Suppose we have the following data in the working directory:

 > library(sparklyr)
 > library(dplyr)

 > f <- data.frame(category = c("e","EE","W","S","Q","e","Q","S"),
                   DD = c(33.2,33.2,14.55,12,13.4,45,7,3),
                   CC = c(2,44,4,44,9,2,2.2,4),
                   FF = c("A","A","A","A","A","A","B","A"))

 > write.csv(f, "D.csv")  ## Write to the working directory

We read the file from the working directory with the following Spark commands:

>sc <- spark_connect(master = "local", spark_home = "/home/tomas/spark-2.1.0-bin-hadoop2.7/", version = "2.1.0")


>df <- spark_read_csv(sc, name = "data", path = "D.csv", header = TRUE, delimiter = ",")
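
As a quick sanity check (a minimal sketch, assuming the connection and the read above succeeded), you can list the tables registered on the Spark connection and preview the data with standard dplyr/sparklyr calls:

 > src_tbls(sc)   ## should include "data"
 > head(df)       ## previews the first rows from Spark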

I would like to obtain a result like the one below, grouped by category: the sum of DD, the mean of CC, and the count of distinct values in FF.

It would look like this:

  category SumDD MeanCC CountDistinctFF
  e        78.2    2                  1
  EE       33.2   44                  1
  W        14.55   4                  1
  S        15     24                  1
  Q        20.4    5.6                2

3 Answers:

Answer 0 (score: 1):

To manipulate a Spark DF you need to use dplyr functions. In a Spark environment, Naveen's answer will work except for the last variable: instead of unique, try n_distinct from dplyr.

df0 <- df %>%
  group_by(category) %>%
  summarize(sumDD = sum(DD, na.rm = TRUE),
            MeanCC = mean(CC, na.rm = TRUE),
            CountDistinctFF = n_distinct(FF))
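
This pipeline is lazy: sparklyr translates the dplyr verbs into Spark SQL, and nothing is computed until results are requested. As a side sketch (assuming the dbplyr backend that sparklyr uses), you can inspect the generated SQL with:

 > df0 %>% show_query()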

To check the result while it is still a Spark DF, you can use:

> glimpse(df0)
Observations: ??
Variables: 4
$ category        <chr> "e", "EE", "S", "Q", "W"
$ sumDD           <dbl> 78.20, 33.20, 15.00, 20.40, 14.55
$ MeanCC          <dbl> 2.0, 44.0, 24.0, 5.6, 4.0
$ CountDistinctFF <dbl> 1, 1, 1, 2, 1

Or you can collect it back to the local system and work with it like any other R data frame:

    > df0%>%collect
# A tibble: 5 x 4
  category sumDD MeanCC CountDistinctFF
  <chr>    <dbl>  <dbl>           <dbl>
1 e         78.2    2                 1
2 EE        33.2   44                 1
3 S         15     24                 1
4 Q         20.4    5.6               2
5 W         14.6    4                 1
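
Once collected, the result is an ordinary tibble, so any local dplyr or base R manipulation works on it. For example (an illustrative sketch, not part of the original answer):

 > local_df <- df0 %>% collect()
 > local_df %>% arrange(desc(sumDD))   ## sort categories by their DD totals locally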

Answer 1 (score: 0):

Not sure whether you are looking for a solution from a specific package; this can be achieved with the dplyr package, where we group_by the category column and summarise the results as needed.

Here is the sample code.

Code:

    f %>% group_by(category) %>%
      summarise(sumDD = sum(DD), MeanCC = mean(CC), CountDistinctFF = length(unique(FF)))
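
Note that length(unique(FF)) is the base-R way of counting distinct values. On a local data frame it gives the same result as dplyr's n_distinct(FF), which is also the form that translates cleanly when the pipeline is run against a Spark table (a sketch, not part of the original answer):

    f %>% group_by(category) %>%
      summarise(sumDD = sum(DD), MeanCC = mean(CC), CountDistinctFF = n_distinct(FF))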

Answer 2 (score: 0):

As a complementary note to Antonis's answer: an error appeared later on. After investigating, I found a conflict between packages, specifically between dplyr and SparkR.

This can be solved by installing the tidyverse package and calling the commands as shown below:

 > library(tidyverse)

 > df0 <- df %>%
     dplyr::group_by(category) %>%
     dplyr::summarize(sumDD = sum(DD, na.rm = TRUE),
                      MeanCC = mean(CC, na.rm = TRUE),
                      CountDistinctFF = n_distinct(FF))



>glimpse(df0)
 Observations: ??
 Variables: 4
 $ category        <chr> "e", "EE", "S", "Q", "W"
 $ sumDD           <dbl> 78.20, 33.20, 15.00, 20.40, 14.55
 $ MeanCC          <dbl> 2.0, 44.0, 24.0, 5.6, 4.0
 $ CountDistinctFF <dbl> 1, 1, 1, 2, 1
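
If you want to see which functions are actually masked between the attached packages (a quick diagnostic sketch, assuming both dplyr and SparkR end up on the search path), base R can list them:

 > conflicts(detail = TRUE)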