Question

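In R (which I am relatively new to), I have a data frame with many columns, including a numeric column that I need to aggregate by groups determined by another column.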

 SessionID   Price
 '1',       '624.99'
 '1',       '697.99'
 '1',       '649.00'
 '7',       '779.00'
 '7',       '710.00'
 '7',       '2679.50'

I need to group by SessionID and add each group's minimum and maximum Price back onto the original data frame, e.g.:

 SessionID   Price     Min     Max
 '1',       '624.99'   624.99  697.99
 '1',       '697.99'   624.99  697.99
 '1',       '649.00'   624.99  697.99
 '7',       '779.00'   710.00  2679.50
 '7',       '710.00'   710.00  2679.50
 '7',       '2679.50'  710.00  2679.50

How can I do this efficiently?

Answer 1

Using base R:

df <- transform(df, Min = ave(Price, SessionID, FUN = min),
                    Max = ave(Price, SessionID, FUN = max))
df
#  SessionID   Price    Min     Max
#1         1  624.99 624.99  697.99
#2         1  697.99 624.99  697.99
#3         1  649.00 624.99  697.99
#4         7  779.00 710.00 2679.50
#5         7  710.00 710.00 2679.50
#6         7 2679.50 710.00 2679.50

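Since your desired result is not aggregated but is just the original data with two extra columns, you want to use ave in base R instead of aggregate, which you would typically use if you wanted to aggregate the data by SessionID. (NB: AEBilgrau shows that you could also do it with aggregate plus some additional matching.)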

Similarly, with dplyr you want to use mutate instead of summarise, since you don't want to aggregate/summarise the data.

Using dplyr:

library(dplyr)
df <- df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
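To make the ave versus aggregate distinction concrete, here is a minimal base R sketch. It assumes a toy data frame that mirrors the question's data with Price stored as numeric; the names per_row and per_group are only illustrative.

```r
# Toy data mirroring the question (assumption: Price is already numeric)
df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

# ave() returns one value per input row, so the result can be
# attached to the original data frame as a new column directly
per_row <- ave(df$Price, df$SessionID, FUN = min)
length(per_row)    # same length as nrow(df)

# aggregate() collapses the data to one row per group
per_group <- aggregate(Price ~ SessionID, data = df, FUN = min)
nrow(per_group)    # one row per distinct SessionID
```

This is why the aggregate route needs an extra matching step to get the group values back onto every row.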

Answer 2

Using the data.table package:

library(data.table)

dt = data.table(SessionID=c(1,1,1,7,7,7), Price=c(624,697,649,779,710,2679))

dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
dt
#   SessionID Price Min  Max
#1:         1   624 624  697
#2:         1   697 624  697
#3:         1   649 624  697
#4:         7   779 710 2679
#5:         7   710 710 2679
#6:         7  2679 710 2679

If you have a data.frame df, run dt = as.data.table(df) and use the code above.

I was curious how the solutions benchmark on an average-sized data.frame:

df = data.frame(SessionID=rep(1:1000, each=100), Price=runif(100000, 1, 2000))
dt = as.data.table(df)

algo1 <- function() 
{
    df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
}

algo2 <- function()
{
    dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
}

algo3 <- function()
{
    tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
    cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
}

algo4 <- function()
{
    transform(df, Min = ave(Price, SessionID, FUN = min), Max = ave(Price, SessionID, FUN = max))
}   



#> system.time(algo1())
#   user  system elapsed 
#   0.03    0.00    0.19 

#> system.time(algo2())
#   user  system elapsed 
#   0.01    0.00    0.01 

#> system.time(algo3())
#   user  system elapsed 
#   0.77    0.01    0.78 

#> system.time(algo4())
#   user  system elapsed 
#   0.02    0.01    0.03
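
Before reading much into the timings, it may be worth confirming that the approaches return the same values. The sketch below does this for the two base R variants only (ave and aggregate plus match), on a smaller data set than the benchmark; the seed, the sizes, and the names via_ave and via_agg are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the check is reproducible
df <- data.frame(SessionID = rep(1:100, each = 10),
                 Price = runif(1000, 1, 2000))

# ave-based variant (algo4 above)
via_ave <- transform(df,
                     Min = ave(Price, SessionID, FUN = min),
                     Max = ave(Price, SessionID, FUN = max))

# aggregate + match variant (algo3 above)
tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
via_agg <- cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])

# Both should carry identical per-group minima and maxima
all.equal(via_ave$Min, via_agg$Min)
all.equal(via_ave$Max, via_agg$Max)
```

Note also that a single system.time call is a noisy measurement, so small differences between the faster variants should not be over-interpreted.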

Answer 3

Here is my solution using aggregate.

First, load the data:

df <- read.table(text = 
"SessionID   Price
'1'       '624.99'
'1'       '697.99'
'1'       '649.00'
'7'       '779.00'
'7'       '710.00'
'7'       '2679.50'", header = TRUE)

Then aggregate and match back to the original data.frame:

tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
df <- cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
print(df)
#  SessionID   Price    Min     Max
#1         1  624.99 624.99  697.99
#2         1  697.99 624.99  697.99
#3         1  649.00 624.99  697.99
#4         7  779.00 710.00 2679.50
#5         7  710.00 710.00 2679.50
#6         7 2679.50 710.00 2679.50

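EDIT: As per the comment below, you might wonder why this works. It is indeed somewhat odd. But remember that a data.frame is just a fancy list. Try calling str(tmp) and you will see that the Price column itself is a 2 by 2 numeric matrix. It gets confusing because print.data.frame knows how to handle this, so print(tmp) looks like it has 3 columns. In any case, tmp[2] simply accesses the second column/entry of the data.frame/list and returns a 1-column data.frame, while tmp[,2] accesses the second column and returns the stored data type.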
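
The matrix-column behaviour described above can be inspected directly. This is a small sketch that rebuilds the question's data with data.frame (an assumption; the answer itself uses read.table) and then looks at the structure of tmp.

```r
df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))
tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))

str(tmp)             # two columns; Price is a numeric matrix with columns Min and Max
ncol(tmp)            # 2, even though print(tmp) displays three columns
class(tmp[2])        # "data.frame": list-style indexing keeps the data.frame wrapper
is.matrix(tmp[, 2])  # TRUE: matrix-style indexing returns the stored matrix itself
```

Because tmp[, 2] is a matrix with named columns, cbind splits it into separate Min and Max columns when it is bound to df.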

R group by aggregate
