I want to make my df smaller by keeping one observation per person per date, chosen by the largest quantity for that date.
Here is my df:
names dates quantity
1 tom 2010-02-01 28
3 tom 2010-03-01 7
2 mary 2010-05-01 30
6 tom 2010-06-01 21
4 john 2010-07-01 45
5 mary 2010-07-01 30
8 mary 2010-07-01 28
11 tom 2010-08-01 28
7 john 2010-09-01 28
10 john 2010-09-01 30
9 john 2010-07-01 45
12 mary 2010-11-01 28
13 john 2010-12-01 7
14 john 2010-12-01 14
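I first find the largest quantity per person per date. This works, but as you can see, when a person has several rows with the same quantity on the same date, all of those rows are kept.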
merge(df, aggregate(quantity ~ names+dates, df, max))
names dates quantity
1 john 2010-07-01 45
2 john 2010-07-01 45
3 john 2010-09-01 30
4 john 2010-12-01 14
5 mary 2010-05-01 30
6 mary 2010-07-01 30
7 mary 2010-11-01 28
8 tom 2010-02-01 28
9 tom 2010-03-01 7
10 tom 2010-06-01 21
11 tom 2010-08-01 28
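So my next step would be to take the first obs per person per date (given that I have already selected the largest quantity). I can't get the code right. This is what I tried: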
merge(l, aggregate(names ~ dates, l, FUN=function(z) z[1]))->m ##doesn't get rid of one obs for john
And a data.table option:
l[, .SD[1], by=c(names,dates)] ##doesn't work at all
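I like the aggregate and data.table options because they are fast, and the df has ~100k rows.
Thanks in advance!
Solution
I posted too quickly, apologies! A simple way to solve this is to find the duplicated rows and then remove them. For example: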
merge(df, aggregate(quantity ~ names+dates, df, max))->toy
toy$dup<-duplicated(toy)
toy<-toy[toy$dup!=TRUE,]
Here are the system times:
system.time(dt2[, max(new_quan), by = list(hai_dispense_number, date_of_claim)]->method1)
user system elapsed
20.04 0.04 20.07
system.time(aggregate(new_quan ~ hai_dispense_number+date_of_claim, dt2, max)->rpp)
user system elapsed
19.129 0.028 19.148
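Note that `duplicated(toy)` compares whole rows, so it only removes ties when every column matches. As a rough base R sketch of the "take the first obs per person per date" idea, assuming ties may be broken arbitrarily, one could sort by quantity and then drop repeats of the `names`/`dates` pair. The names `ord` and `res` are made up for illustration:

```r
# Sample data copied from the question
df <- data.frame(
  names = c("tom", "tom", "mary", "tom", "john", "mary", "mary",
            "tom", "john", "john", "john", "mary", "john", "john"),
  dates = c("2010-02-01", "2010-03-01", "2010-05-01", "2010-06-01",
            "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01",
            "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01",
            "2010-12-01", "2010-12-01"),
  quantity = c(28, 7, 30, 21, 45, 30, 28, 28, 28, 30, 45, 28, 7, 14),
  stringsAsFactors = FALSE
)

# Sort so that the largest quantity comes first within each names/dates pair
ord <- df[order(df$names, df$dates, -df$quantity), ]

# Keep only the first row of each names/dates pair
res <- ord[!duplicated(ord[c("names", "dates")]), ]
```

Because the duplicate check looks only at `names` and `dates`, this should still leave one row per pair if the real data has extra columns that differ between tied rows.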
Answer 0 (score: 2)
I'm not sure this will give you the output you want, but it certainly takes care of the "duplicated rows":
# Replicating your dataframe
df <- data.frame(names = c("tom", "tom", "mary", "tom", "john", "mary", "mary", "tom", "john", "john", "john", "mary", "john", "john"), dates = c("2010-02-01","2010-03-01", "2010-05-01", "2010-06-01", "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01", "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01", "2010-12-01", "2010-12-01"), quantity = c(28,7,30,21,45,30,28,28,28,30,45,28,7,14))
temp = merge(df, aggregate(quantity ~ names+dates, df, max))
df.unique = unique(temp)
Answer 1 (score: 2)
Here is a data.table solution:
dt[, max(quantity), by = list(names, dates)]
Benchmark:
library(data.table)
library(microbenchmark)
N = 1e6
dt = data.table(names = sample(letters, N, T), dates = sample(LETTERS, N, T), quantity = rnorm(N))
df = data.frame(dt)
op = function(df) aggregate(quantity ~ names+dates, df, max)
eddi = function(dt) dt[, max(quantity), by = list(names, dates)]
microbenchmark(op(df), eddi(dt), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# op(df) 2535.241 3025.1485 3195.078 3398.4404 3533.209 10
# eddi(dt) 148.088 162.8073 198.222 220.1217 286.058 10
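The `max(quantity)` call returns only the grouping columns and the maximum. If the full row is needed, in the spirit of the `.SD[1]` attempt in the question, a minimal sketch might look like the following. It assumes `by` is given as `.(names, dates)` (the character vector `c("names", "dates")` is also accepted), and the `id` column is a hypothetical extra column added only to show that whole rows are kept:

```r
library(data.table)

# Small made-up table; `id` is a hypothetical extra column
dt <- data.table(
  names = c("john", "john", "john", "mary"),
  dates = c("2010-07-01", "2010-07-01", "2010-09-01", "2010-07-01"),
  quantity = c(45, 45, 30, 28),
  id = 1:4
)

# For each names/dates group, keep the first row holding the largest quantity
res <- dt[, .SD[which.max(quantity)], by = .(names, dates)]
```

`which.max` returns the position of the first maximum, so tied rows collapse to a single row per group. On large tables `.SD` subsetting per group can be slower than the plain `max(quantity)` aggregation, so it is worth timing both on the real data.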
Answer 2 (score: 1)
If you are using a data.frame:
library(plyr)
ddply(mydata,.(names,dates),summarize, maxquantity=max(quantity))
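Answer 3 (score: 1)
# Split into one piece per names/dates pair (drop = TRUE skips empty pairs),
# keep the first row with the largest quantity in each piece, then re-stack
do.call( rbind,
lapply( split(df, df[,c("names","dates") ], drop = TRUE), function(d){
d[which.max(d$quantity), ] } )
)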