Selecting the subset of rows with the maximum value of a variable

Time: 2015-10-16 17:55:31

Tags: r

I have a data frame:

ID:    YearMon:   Var: Count:
1      012007      H            1
1      012007      D            2
1      022007                   NA
1      032007      H            1
2      012007      H            1
2      022007                   NA
2      022007      D            1
2      032007                   NA

How do I get the row with the maximum Count for each unique ID within each YearMon? Ideally it would return:

1      012007      D            2
1      022007                   NA
1      032007      H            1
2      012007      H            1
2      022007      D            1
2      032007                   NA

5 answers:

Answer 0: (score: 1)

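This should be easy to do with plyr. The following groups the data by ID and YearMon and returns a data frame holding the maximum Count for each group, together with ID, YearMon, and the matching Var.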

library(plyr)

ddply( dat1 , .(ID,YearMon)  ,function(x) {
Count = max( x$Count )
data.frame( Count=Count , Var=x[x$Count == Count,"Var"] )
})

To return all columns:

df[ is.na( df$Count ) , "Count" ] <- -9999

df2 <- ddply(df, .(ID, YearMon), function(x) {
  # Keep every row whose Count equals the group maximum (ties included).
  x[which(x$Count == max(x$Count)), ]
})

df2[ df2$Count == -9999, "Count" ] <- NA

The last line also restores the placeholder values (-9999) to NA.
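The same placeholder idea can be sketched in base R without plyr. This is only an illustration under assumptions: it rebuilds the question's data as `df`, uses `-Inf` rather than -9999 as the placeholder, and keeps the placeholder in a temporary vector (the name `sentinel` is made up here) so the Count column itself is never modified. Like the plyr version, it keeps ties.

```r
# Rebuild the question's data (assumed layout: ID, YearMon, Var, Count).
df <- data.frame(
  ID      = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  YearMon = c(12007L, 12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L),
  Var     = c("H", "D", "", "H", "H", "", "D", ""),
  Count   = c(1L, 2L, NA, 1L, 1L, NA, 1L, NA),
  stringsAsFactors = FALSE
)

# Temporary copy of Count in which NA is replaced by -Inf,
# so that NA never wins against a real value.
sentinel <- replace(df$Count, is.na(df$Count), -Inf)

# ave() applies the function within each (ID, YearMon) group and returns
# a vector aligned with the rows of df; the logical result is stored as 0/1.
is_max <- ave(sentinel, df$ID, df$YearMon,
              FUN = function(x) x == max(x)) == 1

# Rows at the group maximum; groups that are entirely NA keep their rows.
df2 <- df[is_max, ]
df2
```

Because the placeholder lives only in `sentinel`, no second step is needed to turn it back into NA.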

Answer 1: (score: 1)

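With data.table, if you have a data table named dt, you can first compute the maximum Count per group and then keep only the rows whose Count equals that group's maximum. Note that := adds the max.count column to dt by reference, and that rows whose Count is NA do not pass the Count==max.count filter: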

newdt <- dt[, max.count := max(Count), by=.(ID, YearMon)][Count==max.count,.(ID, YearMon, Var, Count)]
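If the all-NA groups should survive, as in the output the question asks for, the same two-step idea can be adapted. The following is a sketch rather than the answer's own method: `group_max` is a hypothetical helper introduced here, and `dt` is rebuilt from the question's data.

```r
library(data.table)

# Rebuild the question's data as a data.table.
dt <- data.table(
  ID      = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  YearMon = c(12007L, 12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L),
  Var     = c("H", "D", "", "H", "H", "", "D", ""),
  Count   = c(1L, 2L, NA, 1L, 1L, NA, 1L, NA)
)

# Hypothetical helper: group maximum ignoring NA,
# or NA when the whole group is NA.
group_max <- function(x) {
  if (all(is.na(x))) NA_integer_ else max(x, na.rm = TRUE)
}

# Step 1: attach the group maximum to every row (modifies dt by reference).
dt[, max.count := group_max(Count), by = .(ID, YearMon)]

# Step 2: keep rows at the group maximum, plus the rows of all-NA groups.
newdt <- dt[(!is.na(Count) & Count == max.count) | is.na(max.count),
            .(ID, YearMon, Var, Count)]
newdt
```

The explicit `!is.na(Count)` guard keeps the filter free of NA values, so the condition for each row is plainly TRUE or FALSE.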

Answer 2: (score: 1)

library(dplyr)

dt %>%
  group_by(ID, YearMon) %>%
  slice(Count %>% which.max)
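One caveat: `which.max` returns a zero-length result when every value in a group is NA, so the pipeline above drops those groups. A possible variant that keeps them is sketched below; it assumes the question's data is available as `df1` (rebuilt here) and borrows the `replace(..., -Inf)` idea, and the name `res` is made up for the illustration.

```r
library(dplyr)

# Rebuild the question's data.
df1 <- data.frame(
  ID      = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  YearMon = c(12007L, 12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L),
  Var     = c("H", "D", "", "H", "H", "", "D", ""),
  Count   = c(1L, 2L, NA, 1L, 1L, NA, 1L, NA),
  stringsAsFactors = FALSE
)

res <- df1 %>%
  group_by(ID, YearMon) %>%
  # Treat NA as -Inf only for ranking; the Count column itself is untouched.
  slice(which.max(replace(Count, is.na(Count), -Inf))) %>%
  ungroup()

res
```

As with the original pipeline, only the first row is kept when several rows in a group tie for the maximum.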

Answer 3: (score: 0)

Let's not forget aggregate!

  #####Clean up data. You need to change your grouping variables to factors and data needs to be numeric####

dat1$Var.[dat1$Var.==1]=""
dat1$Count.<-as.numeric(dat1$Count.)
dat1$ID.<-as.factor(dat1$ID.)
dat1$YearMon.<-as.factor(dat1$YearMon.)
dat1<-dat1[,-3] ### Let's get rid of the Var column as you're not using it.


aggregate(. ~ ID.+YearMon.,data = dat1,FUN=max ) #### Use aggregate. Simple and short code
  ID. YearMon. Count.
1   1    12007      2
2   2    12007      1
3   2    22007      1
4   1    32007      1
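The output above has only four rows because the formula interface of `aggregate` omits rows containing NA by default (`na.action = na.omit`). A sketch that keeps the all-NA groups is shown below. It assumes plain column names (ID, YearMon, Count) as in the Data section rather than the dotted names used above, and `max_or_na` is a hypothetical helper.

```r
# Rebuild the question's data with plain column names.
df1 <- data.frame(
  ID      = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  YearMon = c(12007L, 12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L),
  Var     = c("H", "D", "", "H", "H", "", "D", ""),
  Count   = c(1L, 2L, NA, 1L, 1L, NA, 1L, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: maximum ignoring NA, or NA if nothing else is there.
max_or_na <- function(x) {
  if (all(is.na(x))) NA_integer_ else max(x, na.rm = TRUE)
}

# na.pass hands the NA rows to FUN instead of dropping them beforehand.
res <- aggregate(Count ~ ID + YearMon, data = df1,
                 FUN = max_or_na, na.action = na.pass)
res
```

Like the original answer, this returns only the grouping columns and Count; the Var column is not carried along.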

Answer 4: (score: 0)

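Another option is if/else with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and group by 'ID' and 'YearMon'. If all of the 'Count' values in a group are 'NA', we return the subset of the data.table for that group (.SD); otherwise we get the index of the maximum 'Count' and subset with it (.SD[which.max(Count)]).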

library(data.table)
setDT(df1)[, if(all(is.na(Count))) .SD else .SD[which.max(Count)],.(ID, YearMon)]
 #   ID YearMon Var Count
 #1:  1   12007   D     2
 #2:  1   22007        NA
 #3:  1   32007   H     1
 #4:  2   12007   H     1
 #5:  2   22007   D     1
 #6:  2   32007        NA

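Another option is to concatenate, within each group, the index from which.max with the indicator all(is.na(Count)) (which acts as index 1 for an all-'NA' group and as 0 otherwise), use that to get the row index (.I), and then subset the 'data.table' with it.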

setDT(df1)[df1[, .I[c(which.max(Count), all(is.na(Count)))], .(ID, YearMon)]$V1]
#   ID YearMon Var Count
#1:  1   12007   D     2
#2:  1   22007        NA
#3:  1   32007   H     1
#4:  2   12007   H     1
#5:  2   22007   D     1
#6:  2   32007        NA

Or we replace the 'NA' values with a very small number (-Inf here), then use which.max and subset:

setDT(df1)[, .SD[which.max(replace(Count, is.na(Count),-Inf ))], .(ID, YearMon)]

Data

df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
 YearMon = c(12007L, 
12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L), Var = c("H", 
"D", "", "H", "H", "", "D", ""), Count = c(1L, 2L, NA, 1L, 1L, 
NA, 1L, NA)), .Names = c("ID", "YearMon", "Var", "Count"),
class = "data.frame", row.names = c(NA, -8L))