Question

I have two unevenly spaced time series that each measure separate attributes of the same system. The two series' data points are not sampled at the same times, and the series are not the same length. I would like to match each row of series A to the row of B that is closest to it in time. What I have in mind is to add a column to A that contains the index of the closest row in B. Both series have a time column measured in Unix time (e.g. 1459719755).

For example, given two datasets

a  time
2  1459719755
4  1459719772
3  1459719773

b  time
45 1459719756
2  1459719763
13 1459719766
22 1459719774

The first dataset should be updated to

a  time        index
2  1459719755  1
4  1459719772  4
3  1459719773  4

since B[1,]$time has the closest value to A[1,]$time, B[4,]$time has the closest value to A[2,]$time and A[3,]$time.

Is there any convenient way to do this?

Answer 1

Try something like this:

(1+ecdf(bdat$time)(adat$time)*nrow(bdat))
[1] 1 4 4
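For a reproducible version, here is a minimal sketch that builds the question's example data and stores the result as a new column. The data frame names adat and bdat are taken from the expression above; the column name index follows the question. It assumes bdat is already sorted by time, since the computed value is a position among the sorted times.

```r
# Example data from the question
adat <- data.frame(a = c(2, 4, 3),
                   time = c(1459719755, 1459719772, 1459719773))
bdat <- data.frame(b = c(45, 2, 13, 22),
                   time = c(1459719756, 1459719763, 1459719766, 1459719774))

# Assumes bdat is sorted by time, so a position among the sorted
# times is also a row number of bdat
adat$index <- 1 + ecdf(bdat$time)(adat$time) * nrow(bdat)
adat$index
# [1] 1 4 4
```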

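Why does this work? The ecdf function returns another function whose value ranges from 0 to 1. That function returns the "position", within the "probability range" [0,1], of a new value in the distribution of values defined by the first argument. The expression is really just rescaling that function's result to the range [1, nrow(bdat)]. (I think it's fairly elegant.)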
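The intermediate values can be inspected step by step. This is only an illustrative sketch on the question's times; the variable names Fb, p, a_time and b_time are hypothetical. It also shows one edge case worth checking: a time equal to (or later than) the last B time gives a fraction of 1, so the rescaled value is one past the last row.

```r
b_time <- c(1459719756, 1459719763, 1459719766, 1459719774)
a_time <- c(1459719755, 1459719772, 1459719773)

Fb <- ecdf(b_time)   # ecdf() returns a function
p  <- Fb(a_time)     # fraction of b_time values <= each a_time value
p
# [1] 0.00 0.75 0.75

1 + p * length(b_time)
# [1] 1 4 4

# Edge case: a time equal to the last B time has fraction 1
edge <- 1 + Fb(1459719774) * length(b_time)
edge
# [1] 5
```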

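Another approach is to use approxfun on the sorted values given to ecdf, which gets you interpolated positions. These may need to be rounded; using them directly as indices would instead truncate them to integers.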

bdat$time
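A minimal sketch of the approxfun approach, under the assumption that the sorted B times are mapped linearly onto their positions 1..n. The helper name pos_fun and the choice of rule = 2 (which clamps times outside B's range to the first or last position) are assumptions, not taken from the answer.

```r
adat <- data.frame(a = c(2, 4, 3),
                   time = c(1459719755, 1459719772, 1459719773))
bdat <- data.frame(b = c(45, 2, 13, 22),
                   time = c(1459719756, 1459719763, 1459719766, 1459719774))

# Linear interpolation from sorted times to positions 1..n;
# rule = 2 returns the end positions for times outside the range
pos_fun <- approxfun(x = sort(bdat$time), y = seq_along(bdat$time), rule = 2)

pos <- pos_fun(adat$time)
pos
# [1] 1.000 3.750 3.875

round(pos)   # round to the nearest position instead of truncating
# [1] 1 4 4
```

Because the interpolation is linear between neighbouring times, a fractional part above 0.5 means the query time is closer to the later neighbour, so rounding picks the nearer of the two.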

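In both cases you are predicting a sorted value from its "order statistic". In the second case, you should check that ties are handled in the manner you desire.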
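One way to check how ties and edge cases behave is to compare against a direct search. The following is a brute-force sketch, not part of the answer's method: for each time in A it scans every time in B, so it is only practical for small data. The names nearest and tie_idx are hypothetical.

```r
adat <- data.frame(a = c(2, 4, 3),
                   time = c(1459719755, 1459719772, 1459719773))
bdat <- data.frame(b = c(45, 2, 13, 22),
                   time = c(1459719756, 1459719763, 1459719766, 1459719774))

# For each A time, take the row of B with the smallest absolute difference
nearest <- sapply(adat$time, function(t) which.min(abs(bdat$time - t)))
nearest
# [1] 1 4 4

# A time exactly halfway between B rows 3 and 4:
# which.min() returns the first minimum, so the earlier row wins
tie_idx <- which.min(abs(bdat$time - 1459719770))
tie_idx
# [1] 3
```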

Match each row in a table to a row in another table based on the difference between row timestamps
