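I'm struggling with a problem related to merging two data.tables (though they could just as well be data.frames) based on non-matching timestamps (POSIXct). Given a timestamp in table A, I would like R to return the entry in table B that immediately precedes that timestamp.
Example:
I have table A, which contains data about activity at certain points in time.
EDIT: This is different data from the original post, chosen to reflect the problem better: I need to do a "lookup" based on both the timestamp and a grouping variable that I call the station ID. Sorry for not being clear in the first place.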
Start.Time Start.Station.ID
1: 2014-04-06 18:24:32 238
2: 2014-04-06 18:20:30 238
3: 2014-04-06 01:04:13 373
4: 2014-04-06 01:03:36 373
5: 2014-04-06 01:03:37 373
6: 2014-04-06 01:03:01 373
7: 2014-04-06 01:02:42 373
8: 2014-04-06 01:02:31 373
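I would like to add a column to table A that gives the "availability" status of that station at that point in time. These statuses can be found in table B.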
status_dt station_id availability
1: 2014-04-06 00:29:02 238 0.9354839
2: 2014-04-06 00:29:02 373 1.0000000
3: 2014-04-06 01:29:03 238 1.0000000
4: 2014-04-06 01:29:03 373 0.6111111
5: 2014-04-06 02:59:03 238 0.9354839
6: 2014-04-06 02:59:03 373 0.6666667
...
41: 2014-04-06 17:59:03 238 0.8387097
42: 2014-04-06 17:59:03 373 0.4444444
43: 2014-04-06 18:59:03 238 0.9032258
44: 2014-04-06 18:59:03 373 0.5000000
45: 2014-04-06 20:29:03 238 0.7741935
status_dt station_id availability
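The timestamps don't match, so for each row of table A I'd like to add the status from the table B observation immediately before that row's timestamp.
The expected result would be, for example, the column 'availability':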
status_dt station_id availability
1: 2014-04-06 18:24:32 238 0.8387097
2: 2014-04-06 18:20:30 238 0.8387097
3: 2014-04-06 01:04:13 373 1.0000000
4: 2014-04-06 01:03:36 373 1.0000000
5: 2014-04-06 01:03:37 373 1.0000000
6: 2014-04-06 01:03:01 373 1.0000000
7: 2014-04-06 01:02:42 373 1.0000000
8: 2014-04-06 01:02:31 373 1.0000000
BodieG's suggestion works if the entries in Start.Station.ID/station_id are unique, but applying it to this data gives
status_dt station_id availability Start.Station.ID
1: 2014-04-06 18:24:32 373 0.4444444 238
2: 2014-04-06 18:20:30 373 0.4444444 238
3: 2014-04-06 01:04:13 373 1.0000000 373
4: 2014-04-06 01:03:36 373 1.0000000 373
5: 2014-04-06 01:03:37 373 1.0000000 373
6: 2014-04-06 01:03:01 373 1.0000000 373
7: 2014-04-06 01:02:42 373 1.0000000 373
8: 2014-04-06 01:02:31 373 1.0000000 373
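The entries in the first two rows are not what I expected (or rather, hoped for): they refer to the "availability" of station 373 rather than 238.
I guess the code just needs to be adjusted to take both the timestamp and the station ID into account, but I'm banging my head against the wall here. I also can't figure out whether the suggested xts package would help, since I obviously have duplicate time steps here...
Again, any hints are much appreciated. Thanks in advance!
For reproducibility:
Table A: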
structure(list(Start.Time = structure(c(1396808672, 1396808430,
1396746253, 1396746216, 1396746217, 1396746181, 1396746162, 1396746151
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Start.Station.ID = c(238,
238, 373, 373, 373, 373, 373, 373)), .Names = c("Start.Time",
"Start.Station.ID"), class = c("data.table", "data.frame"), row.names = c(NA,
-8L))
Table B:
structure(list(status_dt = structure(c(1396744142, 1396744142,
1396747743, 1396747743, 1396753143, 1396753143, 1396754942, 1396754942,
1396756743, 1396756743, 1396758542, 1396758542, 1396760343, 1396760343,
1396765743, 1396765743, 1396767542, 1396767542, 1396772943, 1396772943,
1396778402, 1396778402, 1396781943, 1396781943, 1396785542, 1396785542,
1396787342, 1396787342, 1396790942, 1396790942, 1396794543, 1396794543,
1396798143, 1396798143, 1396799943, 1396799943, 1396801743, 1396801743,
1396805343, 1396805343, 1396807143, 1396807143, 1396810743, 1396810743,
1396816143, 1396816143, 1396817942, 1396817942, 1396821542, 1396821542,
1396826942, 1396826942), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
station_id = c(238, 373, 238, 373, 238, 373, 238, 373, 238,
373, 238, 373, 238, 373, 238, 373, 238, 373, 238, 373, 238,
373, 238, 373, 238, 373, 238, 373, 238, 373, 238, 373, 238,
373, 238, 373, 238, 373, 238, 373, 238, 373, 238, 373, 238,
373, 238, 373, 238, 373, 238, 373), availability = c(0.935483870967742,
1, 1, 0.611111111111111, 0.935483870967742, 0.666666666666667,
0.967741935483871, 0.666666666666667, 0.967741935483871,
0.666666666666667, 0.935483870967742, 0.666666666666667,
0.967741935483871, 0.666666666666667, 0.967741935483871,
0.611111111111111, 0.967741935483871, 0.611111111111111,
1, 0.444444444444444, 0.870967741935484, 0.5, 0.806451612903226,
0.5, 0.774193548387097, 0.388888888888889, 0.709677419354839,
0.388888888888889, 0.67741935483871, 0.333333333333333, 1,
0.5, 0.903225806451613, 0.444444444444444, 0.935483870967742,
0.444444444444444, 0.903225806451613, 0.444444444444444,
0.870967741935484, 0.444444444444444, 0.838709677419355,
0.444444444444444, 0.903225806451613, 0.5, 0.774193548387097,
0.611111111111111, 0.766666666666667, 0.611111111111111,
0.774193548387097, 0.555555555555556, 0.870967741935484,
0.666666666666667)), .Names = c("status_dt", "station_id",
"availability"), class = c("data.table", "data.frame"), row.names = c(NA,
-52L), sorted = "status_dt")
Answer 0 (score: 4)
You can use the roll argument:
setkey(B, status_dt)
B[A, roll=TRUE]
This produces:
status_dt station_id availability Start.Station.ID
1: 2014-04-06 21:07:42 225 0.4864865 225
2: 2014-04-06 21:06:50 225 0.4864865 225
3: 2014-04-06 21:06:49 225 0.4864865 225
4: 2014-04-06 21:06:15 225 0.4864865 225
5: 2014-04-06 21:04:35 225 0.4864865 225
6: 2014-04-06 21:05:33 225 0.4864865 225
7: 2014-04-06 21:04:45 225 0.4864865 225
8: 2014-04-06 21:04:37 225 0.4864865 225
9: 2014-04-06 21:04:35 225 0.4864865 225
10: 2014-04-06 21:01:45 225 0.4864865 225
11: 2014-04-06 21:00:57 225 0.4864865 225
12: 2014-04-06 20:59:04 225 0.4864865 225
13: 2014-04-06 20:58:04 225 0.8648649 225
14: 2014-04-06 20:57:22 225 0.8648649 225
15: 2014-04-06 20:57:24 225 0.8648649 225
16: 2014-04-06 20:56:40 225 0.8648649 225
17: 2014-04-06 20:55:52 225 0.8648649 225
18: 2014-04-06 20:55:25 225 0.8648649 225
19: 2014-04-06 20:55:24 225 0.8648649 225
20: 2014-04-06 20:55:00 225 0.8648649 225
21: 2014-04-06 18:25:30 225 0.9729730 225
22: 2014-04-06 18:25:28 225 0.9729730 225
status_dt station_id availability Start.Station.ID
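This matches your expected output quite closely, except that it has some extra rows which, as far as I can tell, are legitimate given your description of the problem.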
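For the edited data with several stations, a join on time alone has no notion of station, so it can pick up another station's row, as in the question's output. A minimal sketch of one way to handle this, assuming a data.table version that supports the on= argument: match station IDs exactly and roll only on the timestamp. Table B below contains only the rows printed in the question (rows 1-6 and 41-44), and the name res is made up for illustration.

```r
library(data.table)

# Table A: the eight activity rows shown in the question
A <- data.table(
  Start.Time = as.POSIXct(c(
    "2014-04-06 18:24:32", "2014-04-06 18:20:30", "2014-04-06 01:04:13",
    "2014-04-06 01:03:36", "2014-04-06 01:03:37", "2014-04-06 01:03:01",
    "2014-04-06 01:02:42", "2014-04-06 01:02:31"), tz = "UTC"),
  Start.Station.ID = c(238, 238, 373, 373, 373, 373, 373, 373)
)

# Table B: only the status rows printed in the question (assumption: a subset)
B <- data.table(
  status_dt = as.POSIXct(rep(c(
    "2014-04-06 00:29:02", "2014-04-06 01:29:03", "2014-04-06 02:59:03",
    "2014-04-06 17:59:03", "2014-04-06 18:59:03"), each = 2), tz = "UTC"),
  station_id = rep(c(238, 373), times = 5),
  availability = c(0.9354839, 1.0000000, 1.0000000, 0.6111111, 0.9354839,
                   0.6666667, 0.8387097, 0.4444444, 0.9032258, 0.5000000)
)

# Rolling join: station IDs must match exactly; roll = TRUE applies to the
# last join column (the timestamp) and carries the latest earlier status forward
res <- B[A, on = .(station_id = Start.Station.ID, status_dt = Start.Time),
         roll = TRUE]

# The result keeps the row order of A, so the column can be copied across
A[, availability := res$availability]
print(A)
```

The station column is listed first in on= because rolling only applies to the last join column. On older versions without on=, setting a two-column key (station first, time second) on both tables before B[A, roll=TRUE] should behave the same way, at the cost of reordering the tables.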
Answer 1 (score: 2)
R> library(xts)
R> dfA <- as.data.frame(A)
R> a <- xts(dfA[,2], order.by=dfA[,1])
R> dfB <- as.data.frame(B)
R> b <- xts(dfB[,-1], order.by=dfB[,1])
Now that we have two xts objects, we simply merge() them and run na.locf() on the result to fill the NAs with the previous values:
R> na.locf(merge(a, b))
a station_id availability
2014-04-06 17:59:03 NA 225 0.972973
2014-04-06 18:25:28 225 225 0.972973
2014-04-06 18:25:30 225 225 0.972973
2014-04-06 18:59:03 225 225 0.621622
2014-04-06 20:29:03 225 225 0.864865
2014-04-06 20:55:00 225 225 0.864865
2014-04-06 20:55:24 225 225 0.864865
2014-04-06 20:55:25 225 225 0.864865
2014-04-06 20:55:52 225 225 0.864865
2014-04-06 20:56:40 225 225 0.864865
2014-04-06 20:57:22 225 225 0.864865
2014-04-06 20:57:24 225 225 0.864865
2014-04-06 20:58:04 225 225 0.864865
2014-04-06 20:59:02 225 225 0.486486
2014-04-06 20:59:04 225 225 0.486486
2014-04-06 21:00:57 225 225 0.486486
2014-04-06 21:01:45 225 225 0.486486
2014-04-06 21:04:35 225 225 0.486486
2014-04-06 21:04:35 225 225 0.486486
2014-04-06 21:04:37 225 225 0.486486
2014-04-06 21:04:45 225 225 0.486486
2014-04-06 21:05:33 225 225 0.486486
2014-04-06 21:06:15 225 225 0.486486
2014-04-06 21:06:49 225 225 0.486486
2014-04-06 21:06:50 225 225 0.486486
2014-04-06 21:07:42 225 225 0.486486
2014-04-06 21:59:02 225 225 0.162162
2014-04-06 23:29:02 225 225 0.162162
R>
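To make the fill step concrete, here is a rough base R sketch of what "last observation carried forward" does to a single column. The helper name locf is hypothetical and this is not how na.locf() is implemented; it only mimics the behaviour visible in the output above, where a leading NA stays NA.

```r
# Hypothetical helper: carry the last non-NA value forward (base R only)
locf <- function(x) {
  # index of the most recent non-NA element seen so far (0 if none yet)
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  # nothing to carry forward before the first observed value
  idx[idx == 0L] <- NA
  x[idx]
}

filled <- locf(c(NA, 0.97, NA, NA, 0.62, NA))
print(filled)
```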
But there ought to be a data.table answer here too...
Edit: As per the comments, here is the merge restricted to the timestamps of a:
R> na.locf(merge(a, b))[index(a), -1]
station_id availability
2014-04-06 18:25:28 225 0.972973
2014-04-06 18:25:30 225 0.972973
2014-04-06 20:55:00 225 0.864865
2014-04-06 20:55:24 225 0.864865
2014-04-06 20:55:25 225 0.864865
2014-04-06 20:55:52 225 0.864865
2014-04-06 20:56:40 225 0.864865
2014-04-06 20:57:22 225 0.864865
2014-04-06 20:57:24 225 0.864865
2014-04-06 20:58:04 225 0.864865
2014-04-06 20:59:04 225 0.486486
2014-04-06 21:00:57 225 0.486486
2014-04-06 21:01:45 225 0.486486
2014-04-06 21:04:35 225 0.486486
2014-04-06 21:04:35 225 0.486486
2014-04-06 21:04:37 225 0.486486
2014-04-06 21:04:45 225 0.486486
2014-04-06 21:05:33 225 0.486486
2014-04-06 21:06:15 225 0.486486
2014-04-06 21:06:49 225 0.486486
2014-04-06 21:06:50 225 0.486486
2014-04-06 21:07:42 225 0.486486
R>
In this particular case I also dropped the redundant station ID column.
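With several stations and duplicate timestamps, as in the edited question, the lookup has to be done per station. The following is a sketch for plain data.frames using base R's findInterval(), not part of the xts approach above. The function name lookup_prior is made up for illustration, and table B again contains only the rows printed in the question.

```r
# Table A and a subset of table B (the rows printed in the question)
A <- data.frame(
  Start.Time = as.POSIXct(c(
    "2014-04-06 18:24:32", "2014-04-06 18:20:30", "2014-04-06 01:04:13",
    "2014-04-06 01:03:36", "2014-04-06 01:03:37", "2014-04-06 01:03:01",
    "2014-04-06 01:02:42", "2014-04-06 01:02:31"), tz = "UTC"),
  Start.Station.ID = c(238, 238, 373, 373, 373, 373, 373, 373)
)
B <- data.frame(
  status_dt = as.POSIXct(rep(c(
    "2014-04-06 00:29:02", "2014-04-06 01:29:03", "2014-04-06 02:59:03",
    "2014-04-06 17:59:03", "2014-04-06 18:59:03"), each = 2), tz = "UTC"),
  station_id = rep(c(238, 373), times = 5),
  availability = c(0.9354839, 1.0000000, 1.0000000, 0.6111111, 0.9354839,
                   0.6666667, 0.8387097, 0.4444444, 0.9032258, 0.5000000)
)

# Hypothetical helper: for each (time, station) pair, return the availability
# of the latest status row of that station at or before the given time
lookup_prior <- function(times, ids, B) {
  out <- rep(NA_real_, length(times))
  for (id in unique(ids)) {
    rows <- which(ids == id)
    b <- B[B$station_id == id, ]
    b <- b[order(b$status_dt), ]          # findInterval needs sorted breaks
    pos <- findInterval(as.numeric(times[rows]), as.numeric(b$status_dt))
    pos[pos == 0] <- NA                   # no earlier status for this station
    out[rows] <- b$availability[pos]
  }
  out
}

A$availability <- lookup_prior(A$Start.Time, A$Start.Station.ID, B)
print(A)
```

Because each station is handled separately, duplicate timestamps across stations do not interfere with one another, and the row order of A is left unchanged.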