Keep observations based on another column

Time: 2017-08-03 02:49:01

Tags: r duplicates data.table data-manipulation

This question is an extension of the one here. Suppose my data has a column named Remark:

ID    Name    Type    Date          Amount   Remark
1     AAAA    First   2009/7/20     100      Not want
1     AAAA    First   2010/2/3      200      want ya
2     BBBB    First   2015/3/10     250      
2     CCC     Second  2009/2/23     300      good
2     CCC     Second  2010/1/25     400      OK Right123
2     CCC     Third   2015/4/9      500      
2     CCC     Third   2016/6/25     700      Stackoverflow is awesome

Within each group I want to keep the observation with the latest Date. If I ignore the Remark column, I can use max() to get this:

dt[,.(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]
   ID Name   Type       Date  Amount
1:  1 AAAA  First 2010-02-03     300
2:  2 BBBB  First 2015-03-10     250
3:  2  CCC Second 2010-01-25     700
4:  2  CCC  Third 2016-06-25    1200

However, how can I keep Remark as well? The desired result is:

   ID Name   Type       Date  Amount      Remark
1:  1 AAAA  First 2010-02-03     300      want ya
2:  2 BBBB  First 2015-03-10     250      
3:  2  CCC Second 2010-01-25     700      OK Right123
4:  2  CCC  Third 2016-06-25    1200      Stackoverflow is awesome

Here is my data:

dt <- fread("
        ID    Name    Type    Date          Amount   Remark
        1     AAAA    First   2009/7/20     100      Not.want
        1     AAAA    First   2010/2/3      200      want.ya
        2     BBBB    First   2015/3/10     250      
        2     CCC     Second  2009/2/23     300      good
        2     CCC     Second  2010/1/25     400      OK.Right123
        2     CCC     Third   2015/4/9      500      
        2     CCC     Third   2016/6/25     700      Stackoverflow.is.awesome
        ")
dt$Date <- as.Date(dt$Date)

2 answers:

Answer 0 (score: 1)

We can use a join

setcolorder(
  dt[, setdiff(names(dt), "Amount"), with = FALSE][
    dt[, .(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)],
    on = .(ID, Name, Type, Date)],
  names(dt))[]
#   ID Name   Type       Date Amount                   Remark
#1:  1 AAAA  First 2010-02-03    300                  want ya
#2:  2 BBBB  First 2015-03-10    250                         
#3:  2  CCC Second 2010-01-25    700              OK Right123
#4:  2  CCC  Third 2016-06-25   1200 Stackoverflow is awesome
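
The join may be easier to follow when split into two steps. This is only a minimal sketch, assuming the same columns as in the question; the table is rebuilt with data.table() instead of fread(), and the names agg and res are illustrative, not part of the original answer.

```r
library(data.table)

# Sample data from the question, rebuilt by hand (assumption: Date already parsed).
dt <- data.table(
  ID     = c(1L, 1L, 2L, 2L, 2L, 2L, 2L),
  Name   = c("AAAA", "AAAA", "BBBB", "CCC", "CCC", "CCC", "CCC"),
  Type   = c("First", "First", "First", "Second", "Second", "Third", "Third"),
  Date   = as.Date(c("2009-07-20", "2010-02-03", "2015-03-10", "2009-02-23",
                     "2010-01-25", "2015-04-09", "2016-06-25")),
  Amount = c(100, 200, 250, 300, 400, 500, 700),
  Remark = c("Not want", "want ya", "", "good", "OK Right123", "",
             "Stackoverflow is awesome")
)

# Step 1: one row per group with the latest Date and the total Amount.
agg <- dt[, .(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]

# Step 2: look up Remark in the original rows that match on group and Date.
# i.Amount is the summed Amount coming from agg, not the per-row Amount in dt.
res <- dt[agg, on = .(ID, Name, Type, Date),
          .(ID, Name, Type, Date, Amount = i.Amount, Remark)]
res
```

If several rows in a group share the latest Date, this join returns all of them, so the result can have more than one row per group.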

Or without a join

dt1 <- dt[, c(Amount = sum(.SD[["Amount"]]), .SD[which.max(Date), 
  setdiff(names(.SD), "Amount"), with = FALSE]), .(ID, Name, Type)]

setcolorder(dt1, names(dt))
dt1
#   ID Name   Type       Date Amount                   Remark
#1:  1 AAAA  First 2010-02-03    300                  want ya
#2:  2 BBBB  First 2015-03-10    250                         
#3:  2  CCC Second 2010-01-25    700              OK Right123
#4:  2  CCC  Third 2016-06-25   1200 Stackoverflow is awesome
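
When only a few columns need to be carried along, the same idea can be written with an explicit row index. This is a sketch under the assumption that Date and Remark are the only columns taken from the latest row; the variable latest and the result name res are illustrative.

```r
library(data.table)

# Sample data from the question, rebuilt by hand (assumption: Date already parsed).
dt <- data.table(
  ID     = c(1L, 1L, 2L, 2L, 2L, 2L, 2L),
  Name   = c("AAAA", "AAAA", "BBBB", "CCC", "CCC", "CCC", "CCC"),
  Type   = c("First", "First", "First", "Second", "Second", "Third", "Third"),
  Date   = as.Date(c("2009-07-20", "2010-02-03", "2015-03-10", "2009-02-23",
                     "2010-01-25", "2015-04-09", "2016-06-25")),
  Amount = c(100, 200, 250, 300, 400, 500, 700),
  Remark = c("Not want", "want ya", "", "good", "OK Right123", "",
             "Stackoverflow is awesome")
)

res <- dt[, {
  latest <- which.max(Date)   # position of the latest Date within the group
  .(Date = Date[latest], Amount = sum(Amount), Remark = Remark[latest])
}, by = .(ID, Name, Type)]
res
```

which.max() returns only the first position of the maximum, so this always yields exactly one row per group, even when the latest Date is tied.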

If there are more "Amount" columns to be summed

nm1 <- grep("Amount\\d*", names(dt), value = TRUE)
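# Date is wrapped in list(): c() on a bare Date dispatches to c.Date and does not build a list
setcolorder(dt[, setdiff(names(dt), nm1), with = FALSE][dt[, c(list(Date = max(Date)),
       lapply(.SD, sum)), by = .(ID, Name, Type), .SDcols = nm1],
      on = .(ID, Name, Type, Date)], names(dt))[]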
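
To illustrate the case with several amount columns, here is a small sketch with a hypothetical second column, Amount2, which is not in the question's data. It wraps Date in list() so that c() combines two lists, and it anchors the regular expression so that only names of the form Amount, Amount2, ... are selected; dt2, agg and res are illustrative names.

```r
library(data.table)

# Question data plus a hypothetical Amount2 column (assumption for illustration).
dt2 <- data.table(
  ID      = c(1L, 1L, 2L, 2L, 2L, 2L, 2L),
  Name    = c("AAAA", "AAAA", "BBBB", "CCC", "CCC", "CCC", "CCC"),
  Type    = c("First", "First", "First", "Second", "Second", "Third", "Third"),
  Date    = as.Date(c("2009-07-20", "2010-02-03", "2015-03-10", "2009-02-23",
                      "2010-01-25", "2015-04-09", "2016-06-25")),
  Amount  = c(100, 200, 250, 300, 400, 500, 700),
  Amount2 = c(200, 400, 500, 600, 800, 1000, 1400),
  Remark  = c("Not want", "want ya", "", "good", "OK Right123", "",
              "Stackoverflow is awesome")
)

# All columns to be summed.
nm1 <- grep("^Amount\\d*$", names(dt2), value = TRUE)

# Latest Date plus the sum of every amount column, per group.
agg <- dt2[, c(list(Date = max(Date)), lapply(.SD, sum)),
           by = .(ID, Name, Type), .SDcols = nm1]

# Join back to the non-amount columns to pick up Remark, then restore column order.
res <- dt2[, setdiff(names(dt2), nm1), with = FALSE][
  agg, on = .(ID, Name, Type, Date)]
setcolorder(res, names(dt2))
res
```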

Answer 1 (score: 1)

> df
   ID Name   Type       Date Amount                   Remark
1:  1 AAAA  First 03-02-2010    200                  want ya
2:  2  CCC  Third 09-04-2015    500                         
3:  2 BBBB  First 10-03-2015    250                         
4:  1 AAAA  First 20-07-2009    100                 Not want
5:  2  CCC Second 23-02-2009    300                     good
6:  2  CCC Second 25-01-2010    400              OK Right123
7:  2  CCC  Third 25-06-2016    700 Stackoverflow is awesome

> df2=df[,.(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]
> df2
   ID Name   Type       Date Amount
1:  2 BBBB  First 10-03-2015    250
2:  1 AAAA  First 20-07-2009    300
3:  2  CCC Second 25-01-2010    700
4:  2  CCC  Third 25-06-2016   1200


> df[df2,]
   ID Name   Type       Date Amount                   Remark i.ID i.Name i.Type i.Amount
1:  2 BBBB  First 10-03-2015    250                             2   BBBB  First      250
2:  1 AAAA  First 20-07-2009    100                 Not want    1   AAAA  First      300
3:  2  CCC Second 25-01-2010    400              OK Right123    2    CCC Second      700
4:  2  CCC  Third 25-06-2016    700 Stackoverflow is awesome    2    CCC  Third     1200


> df3=df[df2,c("ID","Name","Type","Date","Remark","i.Amount")]
> df3
   ID Name   Type       Date                   Remark i.Amount
1:  2 BBBB  First 10-03-2015                               250
2:  1 AAAA  First 20-07-2009                 Not want      300
3:  2  CCC Second 25-01-2010              OK Right123      700
4:  2  CCC  Third 25-06-2016 Stackoverflow is awesome     1200
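
One thing to watch in the output above: Date appears to be stored as text in dd-mm-yyyy form, so max() compares strings, and the AAAA group ends up with 20-07-2009 ("Not want") instead of the later 2010 row. A possible variant, assuming the dates really are dd-mm-yyyy strings, parses them first and names the join columns explicitly with on =. The object names follow the answer, and the input order of df is an assumption.

```r
library(data.table)

# Same data as the question, with Date as dd-mm-yyyy text (assumed from the output above).
df <- data.table(
  ID     = c(1L, 1L, 2L, 2L, 2L, 2L, 2L),
  Name   = c("AAAA", "AAAA", "BBBB", "CCC", "CCC", "CCC", "CCC"),
  Type   = c("First", "First", "First", "Second", "Second", "Third", "Third"),
  Date   = c("20-07-2009", "03-02-2010", "10-03-2015", "23-02-2009",
             "25-01-2010", "09-04-2015", "25-06-2016"),
  Amount = c(100, 200, 250, 300, 400, 500, 700),
  Remark = c("Not want", "want ya", "", "good", "OK Right123", "",
             "Stackoverflow is awesome")
)

# String comparison picks the wrong "latest" date for the AAAA group.
string_max <- max(c("20-07-2009", "03-02-2010"))

# Parse to Date so that max() compares calendar dates.
df[, Date := as.Date(Date, format = "%d-%m-%Y")]

df2 <- df[, .(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]

# Join on the group columns and Date; i.Amount is the summed Amount from df2.
df3 <- df[df2, on = .(ID, Name, Type, Date),
          .(ID, Name, Type, Date, Remark, Amount = i.Amount)]
df3
```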