How do I select all columns from a data.table in R when using `tail` with group by?

Asked: 2018-07-04 17:39:57

Tags: r

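My data.table has 5 columns: id, created_time, amount, balance, and created_month. I want the last row for each combination of id and created_month, with all 5 columns kept, so I tried grouping by created_month and id.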

Input data.table test

id  created_time amount balance  created_month
 1 1/15/14 10:17      2       1         1/1/14
 1 1/15/14 11:17      2       1         1/1/14
 1 1/15/14 20:17      2       1         1/1/14
 2 1/15/14 11:17      2       1         1/1/14
 2 1/16/14 12:17      2       1         1/1/14
 2 2/16/14 23:17      2       1         2/1/14

I sorted it by id and created_time with

setkeyv(test, c("id","created_time"))

I need to

  1. Convert created_month so that it shows the first day of the month, similar to date_trunc('month', created_month) in SQL.
  2. Sort the values by created_time and get all columns, grouped by id and created_month.

The call below only gives me balance, because my tail call takes a single field

test[, tail(balance, 1L), by = c("id", "created_month")]

I am not sure how to pass multiple fields to tail so that all columns of the original table are shown.

My goal is to get this data.table:

id created_month        created_time amount balance
 1    2014-01-01 2014-01-15 20:17:00      2       1
 2    2014-01-01 2014-01-16 12:17:00      2       1
 2    2014-02-01 2014-02-16 23:17:00      2       1

1 Answer:

Answer 0: (score: 1)

One approach could be

library(data.table)
library(lubridate)

setDT(df)[, created_time := as.POSIXct(created_time, format = "%m/%d/%y %H:%M", tz = "GMT")  # parse the strings as timestamps
          ][, created_month := floor_date(created_time, "month")    # add a column holding the 1st day of created_time's month
            ][order(id, created_time)                               # sort so the latest row comes last within each group
              ][, .SD[.N], .(id, created_month)]                    # fetch the last record of each group

which gives

   id created_month        created_time amount balance
1:  1    2014-01-01 2014-01-15 20:17:00      2       1
2:  2    2014-01-01 2014-01-16 12:17:00      2       1
3:  2    2014-02-01 2014-02-16 23:17:00      2       1
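
The same result can probably also be reached with `tail()` itself, which is closer to the attempt in the question: inside `j`, `.SD` holds all non-grouping columns of the current group, so `tail(.SD, 1L)` returns the last row with every column. A minimal sketch, assuming only the `data.table` package; the table name `dt` and the base-R month truncation (used here instead of `lubridate`) are illustrative choices, not part of the original answer:

```r
library(data.table)

# Hypothetical copy of the question's sample rows
dt <- data.table(
  id = c(1L, 1L, 1L, 2L, 2L, 2L),
  created_time = c("1/15/14 10:17", "1/15/14 11:17", "1/15/14 20:17",
                   "1/15/14 11:17", "1/16/14 12:17", "2/16/14 23:17"),
  amount = 2L,
  balance = 1L
)

# Parse the strings as timestamps, then truncate to the first day of the month
dt[, created_time := as.POSIXct(created_time, format = "%m/%d/%y %H:%M", tz = "GMT")]
dt[, created_month := as.Date(format(created_time, "%Y-%m-01"))]

# Sort so the last row of each group is the one with the latest created_time
setorder(dt, id, created_time)

# .SD = all non-grouping columns of the current group
res <- dt[, tail(.SD, 1L), by = .(id, created_month)]
print(res)
```

Note that `created_month` is a `Date` here rather than a `POSIXct`, so it may need converting if it is later compared with timestamp columns.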


Sample data

df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), created_time = c("1/15/14 10:17", 
"1/15/14 11:17", "1/15/14 20:17", "1/15/14 11:17", "1/16/14 12:17", 
"2/16/14 23:17"), amount = c(2L, 2L, 2L, 2L, 2L, 2L), balance = c(1L, 
1L, 1L, 1L, 1L, 1L)), .Names = c("id", "created_time", "amount", 
"balance"), class = "data.frame", row.names = c(NA, -6L))

#  id  created_time amount balance
#1  1 1/15/14 10:17      2       1
#2  1 1/15/14 11:17      2       1
#3  1 1/15/14 20:17      2       1
#4  2 1/15/14 11:17      2       1
#5  2 1/16/14 12:17      2       1
#6  2 2/16/14 23:17      2       1
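
With this sample data, another option that avoids `.SD` altogether is `unique()` on a data.table, which accepts `by` and `fromLast` arguments and so can keep the last occurrence of each (id, created_month) pair. This is only a sketch under the same assumptions as the sample data above; the names `dt` and `last_rows` are hypothetical, and the month is truncated with base R rather than `lubridate`:

```r
library(data.table)

# Rebuild the sample data as a data.table (same values as df above)
dt <- data.table(
  id = c(1L, 1L, 1L, 2L, 2L, 2L),
  created_time = c("1/15/14 10:17", "1/15/14 11:17", "1/15/14 20:17",
                   "1/15/14 11:17", "1/16/14 12:17", "2/16/14 23:17"),
  amount = 2L,
  balance = 1L
)

dt[, created_time := as.POSIXct(created_time, format = "%m/%d/%y %H:%M", tz = "GMT")]
dt[, created_month := as.Date(format(created_time, "%Y-%m-01"))]

# Sort by time within id so "last occurrence" means "latest row"
setorder(dt, id, created_time)

# Keep the last row of each (id, created_month) pair, all columns included
last_rows <- unique(dt, by = c("id", "created_month"), fromLast = TRUE)
print(last_rows)
```

Unlike the grouped versions, this keeps the columns in their original order instead of moving the grouping columns to the front.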