By ID

Time: 2018-01-26 07:59:28

Tags: r dplyr data.table

I have generated a sequence of hourly timestamps:

intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = 'America/Los_Angeles'), as.POSIXct("2018-01-20 03:00:00", tz = 'America/Los_Angeles'), by="hour")

> intervals
[1] "2018-01-20 00:00:00 PST" "2018-01-20 01:00:00 PST" "2018-01-20 02:00:00 PST"
[4] "2018-01-20 03:00:00 PST" 

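Given a dataset with messy, unevenly spaced timestamps, how can I match the time values in that dataset to the nearest hourly timestamp by id, and drop the other timestamps in between? For example: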

> test
                         time      id     amount
312   2018-01-20 00:02:14 PST       1 54.9508346
8652  2018-01-20 00:54:41 PST       2 30.5557992
13809 2018-01-20 01:19:27 PST       3 90.5459248
586   2018-01-20 00:03:35 PST       1 79.7635973
9077  2018-01-20 00:56:37 PST       2 75.5356406
21546 2018-01-20 02:25:05 PST       3 36.6017705
7275  2018-01-20 00:47:45 PST       1 12.7618139
12768 2018-01-20 01:15:30 PST       2 72.4465838
1172  2018-01-20 00:08:01 PST       3 81.0468155
24106 2018-01-20 03:04:10 PST       1  0.8615881
14464 2018-01-20 01:25:04 PST       2 49.8718743
15344 2018-01-20 01:29:30 PST       3 85.0054113
14255 2018-01-20 01:23:22 PST       1 34.5093891
21565 2018-01-20 02:25:40 PST       2 69.0175725
15602 2018-01-20 01:31:32 PST       3 61.8602426

would produce:

> output
             interval id     amount
1 2018-01-20 01:00:00  1 12.7618139
2          2018-01-20  1 54.9508346
3 2018-01-20 03:00:00  1  0.8615881
4 2018-01-20 01:00:00  2 75.5356400
5 2018-01-20 02:00:00  2 69.0175700
6          2018-01-20  3 81.0468200
7 2018-01-20 01:00:00  3 90.5459200
8 2018-01-20 02:00:00  3 36.6017700

I understand that data.table offers a possible solution using roll = "nearest":

setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]

but how do I find the nearest match in intervals for each id in test and keep the amount column?

Any suggestions would be much appreciated! Here is the sample data:

 dput(test)
structure(list(time = c("2018-01-20 00:02:14 PST", "2018-01-20 00:54:41 PST", 
"2018-01-20 01:19:27 PST", "2018-01-20 00:03:35 PST", "2018-01-20 00:56:37 PST", 
"2018-01-20 02:25:05 PST", "2018-01-20 00:47:45 PST", "2018-01-20 01:15:30 PST", 
"2018-01-20 00:08:01 PST", "2018-01-20 03:04:10 PST", "2018-01-20 01:25:04 PST", 
"2018-01-20 01:29:30 PST", "2018-01-20 01:23:22 PST", "2018-01-20 02:25:40 PST", 
"2018-01-20 01:31:32 PST"), id = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 
1, 2, 3, 1, 2, 3), amount = c(54.9508346011862, 30.5557992309332, 
90.5459248460829, 79.763597343117, 75.5356406327337, 36.6017704829574, 
12.7618139144033, 72.4465838400647, 81.0468154959381, 0.861588073894382, 
49.8718742514029, 85.0054113194346, 34.5093891490251, 69.0175724914297, 
61.8602426256984)), .Names = c("time", "id", "amount"), row.names = c(312L, 
8652L, 13809L, 586L, 9077L, 21546L, 7275L, 12768L, 1172L, 24106L, 
14464L, 15344L, 14255L, 21565L, 15602L), class = "data.frame")

3 Answers:

Answer 0: (score: 5)

Another option is to join inside j of data.table:

library(data.table)

# convert 'test' to a 'data.table' first with 'setDT'
# and convert the 'time' column to a datetime format
# (same time zone as 'intervals', so both sides of the join line up)
setDT(test)[, time := as.POSIXct(time, tz = 'America/Los_Angeles')][]

# perform the join
test[, .SD[.(time = intervals), on = .(time), roll = 'nearest'], by = id]

which gives:

    id                time     amount
 1:  1 2018-01-20 00:00:00 54.9508346
 2:  1 2018-01-20 01:00:00 12.7618139
 3:  1 2018-01-20 02:00:00 34.5093891
 4:  1 2018-01-20 03:00:00  0.8615881
 5:  2 2018-01-20 00:00:00 30.5557992
 6:  2 2018-01-20 01:00:00 75.5356406
 7:  2 2018-01-20 02:00:00 69.0175725
 8:  2 2018-01-20 03:00:00 69.0175725
 9:  3 2018-01-20 00:00:00 81.0468155
10:  3 2018-01-20 01:00:00 90.5459248
11:  3 2018-01-20 02:00:00 36.6017705
12:  3 2018-01-20 03:00:00 36.6017705

With the above approach, some amount values are assigned to multiple time values within the same id. If you don't want that and only want to keep the rows closest to each time, you can refine the approach as follows:


test[, r := rowid(id)
     ][, .SD[.(time = intervals)
             , on = .(time)
             , roll = 'nearest'
             , .(time, amount, r, time_diff = abs(x.time - i.time))
             ][, .SD[which.min(time_diff)], by = r]
       , by = id][, c('r','time_diff') := NULL][]

Answer 1: (score: 1)

Inspired by @DavidAurenburg's solution, a condensed version:

test[, 
    .(amount=amount[which.min(abs(time - round(time, "hour")))]), 
    keyby=.(id, as.character(round(time, "hour")))]
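For readers who want to see the logic without data.table syntax, here is a minimal base R sketch of the same idea: round each timestamp to its nearest full hour, then keep, within each id and hour, the row with the smallest distance to that hour. The toy data frame is a subset of the question's rows, and the UTC time zone and the helper columns hour and diff are assumptions made for illustration only.

```r
# Toy data: a subset of the question's rows (UTC assumed for simplicity)
test <- data.frame(
  time   = as.POSIXct(c("2018-01-20 00:02:14", "2018-01-20 00:03:35",
                        "2018-01-20 00:47:45", "2018-01-20 00:54:41",
                        "2018-01-20 00:56:37"), tz = "UTC"),
  id     = c(1, 1, 1, 2, 2),
  amount = c(54.9508346, 79.7635973, 12.7618139, 30.5557992, 75.5356406)
)

# Nearest full hour for every row, and the distance to it in seconds
test$hour <- as.POSIXct(round(test$time, "hours"))
test$diff <- abs(as.numeric(difftime(test$time, test$hour, units = "secs")))

# Within each (id, hour) group keep only the row closest to the hour
groups  <- split(test, list(test$id, format(test$hour)), drop = TRUE)
nearest <- do.call(rbind, lapply(groups, function(g) g[which.min(g$diff), ]))

# Tidy up: sort by id and hour, keep the columns of interest
nearest <- nearest[order(nearest$id, nearest$hour), c("hour", "id", "amount")]
rownames(nearest) <- NULL
nearest
```

On this subset the sketch keeps three rows (amounts 54.9508346 and 12.7618139 for id 1, and 75.5356406 for id 2), which agrees with the corresponding rows of the desired output in the question. Hours with no observation that rounds to them simply do not appear.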

Previous post, which does not match the output the user requested:

Perhaps you want to include id in your join. Note that with roll = "nearest" you may get matches from data that is several hours away:

output <- test[intervals, on=c("id","time"), roll="nearest"]
setorder(output, id, time)
output
#                    time id     amount
#  1: 2018-01-20 00:00:00  1 54.9508346
#  2: 2018-01-20 01:00:00  1 12.7618139
#  3: 2018-01-20 02:00:00  1 34.5093891
#  4: 2018-01-20 03:00:00  1  0.8615881
#  5: 2018-01-20 00:00:00  2 30.5557992
#  6: 2018-01-20 01:00:00  2 75.5356406
#  7: 2018-01-20 02:00:00  2 69.0175725
#  8: 2018-01-20 03:00:00  2 69.0175725
#  9: 2018-01-20 00:00:00  3 81.0468155
# 10: 2018-01-20 01:00:00  3 90.5459248
# 11: 2018-01-20 02:00:00  3 36.6017705
# 12: 2018-01-20 03:00:00  3 36.6017705
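
One possible way to avoid such far-away matches is to cap the allowed distance between an interval and its nearest observation. The following base R sketch illustrates this for the rows of id 2 only. The 30-minute tolerance max_gap, the UTC time zone, and the names obs, gap_matrix, matched, and matched_close are assumptions for illustration, not part of the answer above.

```r
# Hourly grid and the observations of one id (UTC assumed for simplicity)
intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = "UTC"),
                 as.POSIXct("2018-01-20 03:00:00", tz = "UTC"), by = "hour")
obs <- data.frame(
  time   = as.POSIXct(c("2018-01-20 00:54:41", "2018-01-20 00:56:37",
                        "2018-01-20 01:15:30", "2018-01-20 01:25:04",
                        "2018-01-20 02:25:40"), tz = "UTC"),
  amount = c(30.5557992, 75.5356406, 72.4465838, 49.8718743, 69.0175725)
)

max_gap <- 30 * 60  # assumed tolerance: 30 minutes, in seconds

# Distance in seconds between every interval (rows) and observation (columns)
gap_matrix <- abs(outer(as.numeric(intervals), as.numeric(obs$time), "-"))

# For every interval: index of the nearest observation and its distance
nearest_idx <- apply(gap_matrix, 1, which.min)
nearest_gap <- gap_matrix[cbind(seq_along(intervals), nearest_idx)]

matched <- data.frame(time     = intervals,
                      amount   = obs$amount[nearest_idx],
                      gap_secs = nearest_gap)

# Drop intervals whose nearest observation is further away than the tolerance
matched_close <- matched[matched$gap_secs <= max_gap, ]
matched_close
```

With these values only the 01:00 and 02:00 intervals survive the filter, since the nearest observations to 00:00 and 03:00 are more than 30 minutes away.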

I would like to see a more elegant use of data.table to solve this.

Data:

intervals <- CJ(time=seq(as.POSIXct("2018-01-20 00:00:00"), 
    as.POSIXct("2018-01-20 03:00:00"), 
    by="hour"), id=1:3)

test <- fread("time,id,amount
2018-01-20 00:02:14 PST,1,54.9508346
2018-01-20 00:54:41 PST,2,30.5557992
2018-01-20 01:19:27 PST,3,90.5459248
2018-01-20 00:03:35 PST,1,79.7635973
2018-01-20 00:56:37 PST,2,75.5356406
2018-01-20 02:25:05 PST,3,36.6017705
2018-01-20 00:47:45 PST,1,12.7618139
2018-01-20 01:15:30 PST,2,72.4465838
2018-01-20 00:08:01 PST,3,81.0468155
2018-01-20 03:04:10 PST,1,0.8615881
2018-01-20 01:25:04 PST,2,49.8718743
2018-01-20 01:29:30 PST,3,85.0054113
2018-01-20 01:23:22 PST,1,34.5093891
2018-01-20 02:25:40 PST,2,69.0175725
2018-01-20 01:31:32 PST,3,61.8602426")[,
    time:=as.POSIXct(time)]

Answer 2: (score: 1)

Something like this, using lubridate?

library(lubridate);library(dplyr)
test$time<-ymd_hms(test$time)
test$HTime=round_date(test$time,unit="hour")
test$DiffTime=abs(test$time-test$HTime)
result=test%>%group_by(id,HTime)%>%summarize(amount=amount[DiffTime==min(DiffTime)])
result


 # A tibble: 8 x 3
# Groups: id [?]
     id HTime               amount
  <dbl> <dttm>               <dbl>
1  1.00 2018-01-20 00:00:00 55.0  
2  1.00 2018-01-20 01:00:00 12.8  
3  1.00 2018-01-20 03:00:00  0.862
4  2.00 2018-01-20 01:00:00 75.5  
5  2.00 2018-01-20 02:00:00 69.0  
6  3.00 2018-01-20 00:00:00 81.0  
7  3.00 2018-01-20 01:00:00 90.5  
8  3.00 2018-01-20 02:00:00 36.6