So I have a data frame:
ID: YearMon: Var: Count:
1 012007 H 1
1 012007 D 2
1 022007 NA
1 032007 H 1
2 012007 H 1
2 022007 NA
2 022007 D 1
2 032007 NA
How can I get the maximum Count for each unique ID within a given YearMon? Ideally it would return:
1 012007 D 2
1 022007 NA
1 032007 H 1
2 012007 H 1
2 022007 D 1
2 032007 NA
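Answer 0: (score: 1)
Using plyr, this should be easy to achieve. The code below groups the data by ID and YearMon and returns a data frame with the maximum Count, together with ID, YearMon and the matching Var.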
library(plyr)
ddply( dat1 , .(ID,YearMon) ,function(x) {
Count = max( x$Count )
data.frame( Count=Count , Var=x[x$Count == Count,"Var"] )
})
To return all columns:
df[ is.na( df$Count ) , "Count" ] <- -9999
df2 <- ddply(df, .(ID,YearMon) , function(x){
Count = max( x$Count )
index = which( x$Count == max( x$Count ))
y <- x[ index ,]
data.frame( y )
})
df2[ df2$Count == -9999, "Count" ] <- NA
The last line also sets the placeholder values back to NA.
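To illustrate why the -9999 placeholder is used, here is a small base R sketch. The vector `counts` is made up for illustration and is not part of the original answer. `max()` returns NA as soon as a group contains a missing value, so comparing against it cannot pick out a row.

```r
# Hypothetical Count values for one (ID, YearMon) group
counts <- c(NA, 1L)

plain_max <- max(counts)                # NA: a missing value poisons max()
narm_max  <- max(counts, na.rm = TRUE)  # 1: missing values are skipped

# Placeholder approach: swap NA for a value below any real count
filled <- replace(counts, is.na(counts), -9999L)
idx    <- which(filled == max(filled))  # position of the maximum: 2
```

`na.rm = TRUE` alone is not enough for groups where every Count is NA, which is presumably why the answer falls back on a placeholder.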
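Answer 1: (score: 1)
Using data.table, if you have a data table called dt, you can first compute the maximum of Count by group and then keep only the rows where Count equals that group's maximum: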
newdt <- dt[, max.count := max(Count), by=.(ID, YearMon)][Count==max.count,.(ID, YearMon, Var, Count)]
Answer 2: (score: 1)
library(dplyr)
dt %>%
group_by(ID, YearMon) %>%
slice(Count %>% which.max)
Answer 3: (score: 0)
Let's not forget aggregate!
#####Clean up data. You need to change your grouping variables to factors and data needs to be numeric####
dat1$Var.[dat1$Var.==1]=""
dat1$Count.<-as.numeric(dat1$Count.)
dat1$ID.<-as.factor(dat1$ID.)
dat1$YearMon.<-as.factor(dat1$YearMon.)
dat1<-dat1[,-3] ###Lets get rid of the Var column as you're not using it.
aggregate(. ~ ID.+YearMon.,data = dat1,FUN=max ) #### Use aggregate. Simple and short code
ID. YearMon. Count.
1 1 12007 2
2 2 12007 1
3 2 22007 1
4 1 32007 1
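The output above has only four rows because the formula interface of aggregate drops rows containing NA by default (na.action = na.omit). If the all-NA groups should be kept, one possible variant is sketched below. It assumes the question's data with plain column names (ID, YearMon, Count); `max_or_na` is a hypothetical helper, not part of the original answer.

```r
# Question's data with plain column names (assumed for this sketch)
df1 <- data.frame(
  ID      = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  YearMon = c(12007L, 12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L),
  Count   = c(1L, 2L, NA, 1L, 1L, NA, 1L, NA)
)

# Hypothetical helper: NA when the whole group is missing, else the max
max_or_na <- function(x) {
  if (all(is.na(x))) NA_integer_ else max(x, na.rm = TRUE)
}

# na.pass keeps the NA rows so that every (ID, YearMon) group survives
out <- aggregate(Count ~ ID + YearMon, data = df1,
                 FUN = max_or_na, na.action = na.pass)
out <- out[order(out$ID, out$YearMon), ]
out
```

Like the original aggregate call, this returns only the grouping columns and Count; the Var column is not carried along.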
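Answer 4: (score: 0)
Another option, using if/else with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and group by 'ID' and 'YearMon'. If all of the 'Count' values in a group are NA, we return the Subset of Data.table (.SD); otherwise we get the index of the maximum 'Count' and subset the data.table with it (.SD[which.max(Count)]).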
library(data.table)
setDT(df1)[, if(all(is.na(Count))) .SD else .SD[which.max(Count)],.(ID, YearMon)]
# ID YearMon Var Count
#1: 1 12007 D 2
#2: 1 22007 NA
#3: 1 32007 H 1
#4: 2 12007 H 1
#5: 2 22007 D 1
#6: 2 32007 NA
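Or, as another option, we concatenate the index from which.max with the result of all(is.na(Count)) (which selects the first row of a group whose 'Count' values are all NA), grouped by the same variables, get the row index (.I), and use it to subset the 'data.table'.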
setDT(df1)[df1[, .I[c(which.max(Count), all(is.na(Count)))], .(ID, YearMon)]$V1]
# ID YearMon Var Count
#1: 1 12007 D 2
#2: 1 22007 NA
#3: 1 32007 H 1
#4: 2 12007 H 1
#5: 2 22007 D 1
#6: 2 32007 NA
Or we replace the NA values with a very small number (-Inf), use which.max, and subset:
setDT(df1)[, .SD[which.max(replace(Count, is.na(Count),-Inf ))], .(ID, YearMon)]
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
YearMon = c(12007L,
12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L), Var = c("H",
"D", "", "H", "H", "", "D", ""), Count = c(1L, 2L, NA, 1L, 1L,
NA, 1L, NA)), .Names = c("ID", "YearMon", "Var", "Count"),
class = "data.frame", row.names = c(NA, -8L))
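For readers without data.table installed, the same "replace NA with -Inf, then which.max" idea can be sketched in base R. This is only an illustration under the assumption that the data looks like df1 above; `pick_max` is a hypothetical helper name, and the split/lapply approach is swapped in for data.table's grouping.

```r
# Same example data as df1 above
df1 <- data.frame(
  ID      = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  YearMon = c(12007L, 12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L),
  Var     = c("H", "D", "", "H", "H", "", "D", ""),
  Count   = c(1L, 2L, NA, 1L, 1L, NA, 1L, NA),
  stringsAsFactors = FALSE
)

# One data frame per (ID, YearMon) group; drop combinations that never occur
groups <- split(df1, list(df1$ID, df1$YearMon), drop = TRUE)

# Hypothetical helper: treat NA as -Inf so which.max() always finds a row
pick_max <- function(g) {
  g[which.max(replace(g$Count, is.na(g$Count), -Inf)), ]
}

res <- do.call(rbind, lapply(groups, pick_max))
res <- res[order(res$ID, res$YearMon), ]
rownames(res) <- NULL
res
```

Because the -Inf substitution is only used to find the row index, the returned Count column keeps its original NA values.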