How to sum a variable by group

Asked: 2009-11-02 09:01:29

Tags: r sorting r-faq

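Suppose I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second contains numbers representing how many times I saw the category in that row.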

For example:

Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3

I want to sort the data by Category and sum the Frequency values within each category:

Category     Frequency
First        30
Second       5
Third        34

How would I do this in R?

15 answers:

Answer 0 (score: 328)

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

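In the example above, multiple grouping dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be combined via cbind: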

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
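As a minimal sketch of the cbind form, assuming the data has a second numeric column (Metric2 is an invented name used only for illustration, not part of the question's data), the call could be completed like this:

```r
# Hypothetical data: Metric2 is a made-up second numeric column.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# cbind() builds a two-column matrix; naming the arguments keeps the
# column names in the result. Both columns are summed within each Category.
res <- aggregate(cbind(Frequency = x$Frequency, Metric2 = x$Metric2),
                 by = list(Category = x$Category),
                 FUN = sum)
print(res)
```

Naming the arguments inside cbind is a design choice: without the names, the result columns would be labelled V1 and V2.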

(embedding @thelatemail's comment) aggregate has a formula interface too:

aggregate(Frequency ~ Category, x, sum)

Or, if you want to aggregate multiple columns, you can use the . notation (it works for a single column too):

aggregate(. ~ Category, x, sum)

Or using tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34 

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")), 
                    Frequency=c(10,15,5,2,14,20,3))

Answer 1 (score: 171)

More recently, you can also use the dplyr package for this purpose:

library(dplyr)
x %>% 
  group_by(Category) %>% 
  summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34

Or, for multiple summary columns (works with one column too):

x %>% 
  group_by(Category) %>% 
  summarise_each(funs(sum))

Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.
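A minimal sketch of the newer form, assuming dplyr 0.5 or later is installed. Metric2 is a made-up second column, added only so that more than one column gets summarised:

```r
library(dplyr)

# Hypothetical data: Metric2 is an invented second numeric column.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# summarise_all() applies sum to every non-grouping column.
res <- x %>%
  group_by(Category) %>%
  summarise_all(sum)
print(res)
```

In dplyr 1.0.0 and later these scoped verbs are themselves superseded by across(), which a later answer in this thread demonstrates.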

Or, if you have multiple columns to group by, you can list all of them in group_by, separated by commas:

mtcars %>% 
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

For more information, including the %>% operator, see the introduction to dplyr.

Answer 2 (score: 61)

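The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative: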

library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), 
                  Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed 
# 0.008     0.001     0.009 

Let's compare that with the same operation on a data.frame, using the aggregate approach from above:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user    system   elapsed 
# 0.008     0.000     0.015 

And if you want to keep the column name, this is the syntax:

data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

The difference becomes more noticeable with larger datasets, as the code below demonstrates:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(300000))  # one value per row, no recycling
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed 
# 0.055     0.004     0.059 
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000), 
                  Frequency=rnorm(300000))  # one value per row, no recycling
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user    system   elapsed 
# 0.287     0.010     0.296 

Also, for multiple aggregations, you can combine lapply and .SD as follows:

data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
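By default .SD holds every non-grouping column, so lapply(.SD, sum) fails as soon as one of them is not numeric. A sketch of restricting it with .SDcols, assuming an extra numeric column Metric2 and a character column Note (both invented here for illustration):

```r
library(data.table)

# Hypothetical data: Metric2 and Note are made-up columns.
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7),
  Note      = letters[1:7]
)

# .SDcols limits .SD to the named columns, so the character column
# Note is never passed to sum().
res <- dt[, lapply(.SD, sum), by = Category, .SDcols = c("Frequency", "Metric2")]
print(res)
```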

Answer 3 (score: 35)

This is somewhat related to this question.

You can also just use the by() function:

x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it is worth being familiar with by() since it is a base function.

Answer 4 (score: 24)

library(plyr)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))

Answer 5 (score: 16)

If x is a data frame containing your data, then the following will do what you want:

require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)

Answer 6 (score: 16)

Just to add a third option:

require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)

EDIT: this is a very old answer. Now I would recommend using group_by and summarise from dplyr, as in @docendo's answer.

Answer 7 (score: 16)

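While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.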

Here is an example of how this question can be answered with sqldf:

library(sqldf)
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")), 
                Frequency=c(10,15,5,2,14,20,3))

sqldf("select 
          Category
          ,sum(Frequency) as Frequency 
       from x 
       group by 
          Category")

##   Category Frequency
## 1    First        30
## 2   Second         5
## 3    Third        34

Answer 8 (score: 5)

Another solution, short and fast, that returns the sums by group as a matrix or a data frame:

rowsum(x$Frequency, x$Category)
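To make the shape of the result concrete, here is a small sketch using the question's data. Called on a vector, rowsum returns a one-column matrix whose row names are the groups; the conversion to a data frame at the end is just one possible way to tidy it up:

```r
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# One row per group (sorted by group name), a single unnamed column of sums.
m <- rowsum(x$Frequency, x$Category)
print(m)

# Optional: turn the matrix into a two-column data frame.
res <- data.frame(Category = rownames(m), Frequency = m[, 1], row.names = NULL)
print(res)
```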

Answer 9 (score: 4)

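The recently added dplyr::tally() now makes this easier than ever. Group by Category and pass Frequency as the weight:

tally(group_by(x, Category), wt = Frequency)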

Category     n
First        30
Second       5
Third        34

Answer 10 (score: 4)

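When you need to apply different aggregation functions to different columns (and you must or want to stick to base R), I have found ave very helpful (and efficient).

For example, given this input: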

DF <-                
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
           Categ2=factor(c('X','Y','X','X','X','Y','Y')),
           Samples=c(1,2,4,3,5,6,7),
           Freq=c(10,30,45,55,80,65,50))

> DF
  Categ1 Categ2 Samples Freq
1      A      X       1   10
2      A      Y       2   30
3      B      X       4   45
4      B      X       3   55
5      A      X       5   80
6      B      Y       6   65
7      A      Y       7   50

We want to group by Categ1 and Categ2 and compute the sum of Samples and the mean of Freq.
Here is a possible solution using ave:

# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]

# add sum of Samples by Categ1,Categ2 to DF2 
# (ave repeats the sum of the group for each row in the same group)
DF2$GroupTotSamples <- ave(DF$Samples,DF2,FUN=sum)

# add mean of Freq by Categ1,Categ2 to DF2 
# (ave repeats the mean of the group for each row in the same group)
DF2$GroupAvgFreq <- ave(DF$Freq,DF2,FUN=mean)

# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]

Result:

> DF2
  Categ1 Categ2 GroupTotSamples GroupAvgFreq
1      A      X               6           45
2      A      Y               9           40
3      B      X               7           50
6      B      Y               6           65

Answer 11 (score: 4)

Since dplyr 1.0.0, the across() function can be used:

df %>%
 group_by(Category) %>%
 summarise(across(Frequency, sum))

  Category Frequency
  <chr>        <int>
1 First           30
2 Second           5
3 Third           34

If you are interested in multiple variables:

df %>%
 group_by(Category) %>%
 summarise(across(c(Frequency, Frequency2), sum))

  Category Frequency Frequency2
  <chr>        <int>      <int>
1 First           30         55
2 Second           5         29
3 Third           34        190

And selecting the variables with select helpers:

df %>%
 group_by(Category) %>%
 summarise(across(starts_with("Freq"), sum))

  Category Frequency Frequency2 Frequency3
  <chr>        <int>      <int>      <dbl>
1 First           30         55        110
2 Second           5         29         58
3 Third           34        190        380

Sample data:

df <- read.table(text = "Category Frequency Frequency2 Frequency3
                 1    First        10         10         20
                 2    First        15         30         60
                 3    First         5         15         30
                 4   Second         2          8         16
                 5    Third        14         70        140
                 6    Third        20        120        240
                 7   Second         3         21         42",
                 header = TRUE,
                 stringsAsFactors = FALSE)

Answer 12 (score: 3)

You could use the function group.sum from the package Rfast.

Category <- Rfast::as_integer(Category,result.sort=FALSE) # convert character to numeric. R's as.numeric produce NAs.
result <- Rfast::group.sum(Frequency,Category)
names(result) <- Rfast::Sort(unique(Category))
# 30 5 34

Rfast has many group functions, and group.sum is one of them.

Answer 13 (score: 2)

Using cast instead of recast (note that 'Frequency' is now 'value'):

df  <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
                  , value = c(10,15,5,2,14,20,3))

install.packages("reshape")
library(reshape)

result<-cast(df, Category ~ . ,fun.aggregate=sum)

This gives:

Category (all)
First     30
Second    5
Third     34

Answer 14 (score: 0)

library(tidyverse)

x <- data.frame(Category= c('First', 'First', 'First', 'Second', 'Third', 'Third', 'Second'), 
           Frequency = c(10, 15, 5, 2, 14, 20, 3))

count(x, Category, wt = Frequency)