Question

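I want to make my df smaller by keeping one observation per person per date, chosen by the largest quantity for that person and date.

Here is my df: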

   names      dates quantity
1    tom 2010-02-01       28
3    tom 2010-03-01        7
2   mary 2010-05-01       30
6    tom 2010-06-01       21
4   john 2010-07-01       45
5   mary 2010-07-01       30
8   mary 2010-07-01       28
11   tom 2010-08-01       28
7   john 2010-09-01       28
10  john 2010-09-01       30
9   john 2010-07-01       45
12  mary 2010-11-01       28
13  john 2010-12-01        7
14  john 2010-12-01       14

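I first find the maximum quantity per person per date. This works, but as you can see, if a person has tied quantities on a date, all of the tied obs for that date are kept.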

merge(df, aggregate(quantity ~ names+dates, df, max))



 names      dates quantity
1   john 2010-07-01       45
2   john 2010-07-01       45
3   john 2010-09-01       30
4   john 2010-12-01       14
5   mary 2010-05-01       30
6   mary 2010-07-01       30
7   mary 2010-11-01       28
8    tom 2010-02-01       28
9    tom 2010-03-01        7
10   tom 2010-06-01       21
11   tom 2010-08-01       28

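So my next step would be to take the first obs per date (assuming I have already selected the maximum quantity). I can't get the code right. This is what I tried: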

merge(l, aggregate(names ~ dates, l, FUN=function(z) z[1]))->m  ##doesn't get rid of one obs for john

and a data.table option:

l[, .SD[1], by=c(names,dates)]  ##doesn't work at all

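I like the aggregate and data.table options because they are fast, and the df has ~100k rows.

Thanks in advance!

Solution

I posted too quickly, apologies!! A simple way to solve this is to find the duplicates and then remove them. For example: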

merge(df, aggregate(quantity ~ names+dates, df, max))->toy
toy$dup<-duplicated(toy)
toy<-toy[toy$dup!=TRUE,]
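As a side note, if df really has only the three columns shown (an assumption based on the printed data), the aggregate() call on its own should already return one row per names/dates pair, so the merge() and duplicated() steps may not be needed. A minimal sketch on the toy data from the question:

```r
# Toy data copied from the question
df <- data.frame(
  names = c("tom", "tom", "mary", "tom", "john", "mary", "mary",
            "tom", "john", "john", "john", "mary", "john", "john"),
  dates = c("2010-02-01", "2010-03-01", "2010-05-01", "2010-06-01",
            "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01",
            "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01",
            "2010-12-01", "2010-12-01"),
  quantity = c(28, 7, 30, 21, 45, 30, 28, 28, 28, 30, 45, 28, 7, 14)
)

# aggregate() collapses each names/dates group to a single row
maxima <- aggregate(quantity ~ names + dates, df, max)

# The merge() + duplicated() route from above, for comparison
toy <- merge(df, maxima)
toy <- toy[!duplicated(toy), ]

# Both routes leave the same number of rows
nrow(maxima)
nrow(toy)
```

The merge() only becomes useful when df carries extra columns that should be kept alongside the maximum.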

Here are the system times:

 system.time(dt2[, max(new_quan), by = list(hai_dispense_number, date_of_claim)]->method1)
   user  system elapsed 
  20.04    0.04   20.07 



 system.time(aggregate(new_quan ~ hai_dispense_number+date_of_claim, dt2, max)->rpp)
   user  system elapsed 
 19.129   0.028  19.148

Answer 1

I'm not sure this will give you the output you want, but it certainly takes care of the "duplicate rows":

 # Replicating your dataframe
 df <- data.frame(names = c("tom", "tom", "mary", "tom", "john", "mary", "mary", "tom", "john", "john", "john", "mary", "john", "john"), dates = c("2010-02-01","2010-03-01", "2010-05-01", "2010-06-01", "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01", "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01", "2010-12-01", "2010-12-01"), quantity = c(28,7,30,21,45,30,28,28,28,30,45,28,7,14)) 

 temp = merge(df, aggregate(quantity ~ names+dates, df, max))
 df.unique = unique(temp)

Answer 2

Here is a data.table solution:

dt[, max(quantity), by = list(names, dates)]
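The `by=c(names,dates)` attempt in the question most likely fails because `by` expects a list of columns or a character vector of column names, not a bare `c()` of columns. Below is a hedged sketch of two variants: one that names the result column, and one that keeps the whole first row holding the maximum (useful if the table has other columns). The small `dt` here is just a few rows taken from the question's data for illustration.

```r
library(data.table)

# A few rows from the question's data, including a tie for john
dt <- data.table(
  names = c("john", "john", "john", "john", "mary", "mary"),
  dates = c("2010-07-01", "2010-07-01", "2010-09-01", "2010-09-01",
            "2010-07-01", "2010-07-01"),
  quantity = c(45, 45, 28, 30, 30, 28)
)

# Max per group, with a named result column instead of the default V1
maxima <- dt[, .(quantity = max(quantity)), by = .(names, dates)]

# Equivalent grouping using a character vector of column names
maxima_chr <- dt[, .(quantity = max(quantity)), by = c("names", "dates")]

# Keep one full row per group: .I gives row numbers, which.max() picks
# the first row with the maximum, so ties collapse to a single row
first_max <- dt[dt[, .I[which.max(quantity)], by = .(names, dates)]$V1]
```

Because which.max() returns only the first maximum, tied rows within a group are reduced to one without a separate de-duplication step.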

Benchmark:

library(data.table)
library(microbenchmark)

N = 1e6

dt = data.table(names = sample(letters, N, T), dates = sample(LETTERS, N, T), quantity = rnorm(N))
df = data.frame(dt)

op = function(df) aggregate(quantity ~ names+dates, df, max) 
eddi = function(dt) dt[, max(quantity), by = list(names, dates)]

microbenchmark(op(df), eddi(dt), times = 10)
#Unit: milliseconds
#     expr      min        lq   median        uq      max neval
#   op(df) 2535.241 3025.1485 3195.078 3398.4404 3533.209    10
# eddi(dt)  148.088  162.8073  198.222  220.1217  286.058    10

Answer 3

If you are using a data.frame,

library(plyr)
ddply(mydata, .(names, dates), summarize, maxquantity = max(quantity))

Answer 4

do.call( rbind, 
        lapply( split(df, df[,c("names","dates") ]), function(d){
                                         d[which.max(d$quantity), ] } )
        )
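A possible base R alternative to the split/lapply approach above, shown only as a sketch: sort so the largest quantity comes first within each names/dates group, then drop every later row of the group with duplicated(). The data frame is the toy data from the question; `sorted` and `one_per_group` are illustrative names.

```r
# Toy data copied from the question
df <- data.frame(
  names = c("tom", "tom", "mary", "tom", "john", "mary", "mary",
            "tom", "john", "john", "john", "mary", "john", "john"),
  dates = c("2010-02-01", "2010-03-01", "2010-05-01", "2010-06-01",
            "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01",
            "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01",
            "2010-12-01", "2010-12-01"),
  quantity = c(28, 7, 30, 21, 45, 30, 28, 28, 28, 30, 45, 28, 7, 14)
)

# Sort by group, with the largest quantity first inside each group
sorted <- df[order(df$names, df$dates, -df$quantity), ]

# Keep only the first row of each names/dates group
one_per_group <- sorted[!duplicated(sorted[c("names", "dates")]), ]
```

This keeps any other columns of the winning row and avoids building one small data frame per group.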

Get the max for each id, then keep only one obs per id (R)
