Question

I want to merge two datasets using a "datetime" column that is present in both (POSIXct format: dd/mm/yyyy hh:mm). Below is sample data from the two datasets:

# Dataset 1 (dts1)

           datetime   count   period   
1  30/03/2011 02:32      27      561
2  30/03/2011 02:42       3      600
3  30/03/2011 02:52       0      574
4  30/03/2011 03:02       1      550
5  30/03/2011 03:12      15      600
6  30/03/2011 03:22       0      597

# Dataset 2 (dts2)

   datetime         dist car   satd      alt
1 30/03/2011 01:59  23.9   1      3     1.76       
2 30/03/2011 02:58  14.7   1      7     6.36       
3 30/03/2011 03:55  10.4   2      9    -0.34      
4 30/03/2011 04:53  35.4   1      3     3.55      
5 30/03/2011 05:52  56.1   1      7    -0.91       
6 30/03/2011 06:48  12.3   1      4     6.58      
7 30/03/2011 07:48  10.7   1      5     4.18

If this were a simple case of merging matching rows in the two data frames, the base merge(data1, data2, by="datetime") or rbind() functions would do.

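My problem is more complicated, however, because the time intervals in the two datasets are not equal. Dataset 1 contains data at exact 10-minute intervals (each row holds information on the 10-minute block that ends at the given date/time), whereas dataset 2 contains data at approximately 1-hour intervals (each row holds information on the 1-hour block that ends at the given date/time).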

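To make things harder, the start times of the rows in the two datasets do not line up (e.g. dts1: 01/03/2013 10:00:00, dts2: 01/03/2012 09:58:12). The dts2 intervals also vary throughout the dataset (by a few minutes either side of 1 hour). I want to link each 10-minute row in dataset 1 to the hour block it falls within in dataset 2 (together with all the associated column values from dts2). Some rows could fit into 2 different hour blocks (e.g. 30/03/2011 03:02), but I only need those rows linked to one of the hour blocks.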

I would like to end up with something like this:

        datetime_dts1 count period     datetime2_dts2  dist  car satd      alt  
1    30/03/2011 02:32    27    561   30/03/2011 02:58  14.7   1     7     6.36      
2    30/03/2011 02:42     3    600   30/03/2011 02:58  14.7   1     7     6.36
3    30/03/2011 02:52     0    574   30/03/2011 02:58  14.7   1     7     6.36
4    30/03/2011 03:02     1    550   30/03/2011 02:58  14.7   1     7     6.36
5    30/03/2011 03:12    15    600   30/03/2011 03:55  10.4   2     9    -0.34
6    30/03/2011 03:22     0    597   30/03/2011 03:55  10.4   2     9    -0.34

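I have searched for an answer to this problem but have not been able to solve it, and my R abilities are not advanced. If anyone could point me in the right direction or provide a solution, I would be very grateful.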

Answer 1

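After first converting your datetime character strings to POSIXt classes, some combination of round and trunc should give you something you can use as the basis of a merge.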

First read in your data and create the corresponding POSIXt datetimes:

dts1 <- structure(list(datetime = structure(1:6,
   .Label = c("30/03/2011 02:32", "30/03/2011 02:42",
   "30/03/2011 02:52", "30/03/2011 03:02", "30/03/2011 03:12",
   "30/03/2011 03:22"), class = "factor"), count = c(27L, 3L,
   0L, 1L, 15L, 0L), period = c(561L, 600L, 574L, 550L, 600L,
   597L)), .Names = c("datetime", "count", "period"),
   class = "data.frame", row.names = c(NA, -6L))
dts2 <- structure(list(datetime = structure(1:7,
    .Label = c("30/03/2011 01:59", "30/03/2011 02:58",
    "30/03/2011 03:55", "30/03/2011 04:53", "30/03/2011 05:52",
    "30/03/2011 06:48", "30/03/2011 07:48"), class = "factor"),
    dist = c(23.9, 14.7, 10.4, 35.4, 56.1, 12.3, 10.7), car =
    c(1L, 1L, 2L, 1L, 1L, 1L, 1L), satd = c(3L, 7L, 9L, 3L, 7L,
    4L, 5L), alt = c(1.76, 6.36, -0.34, 3.55, -0.91, 6.58,
    4.18)), .Names = c("datetime", "dist", "car", "satd",
    "alt"), class = "data.frame", row.names = c(NA, -7L))

# create corresponding POSIXlt vector
# (you could update the 'datetime' columns in-place if you prefer)
datetime1 <- strptime(dts1$datetime, format="%d/%m/%Y %H:%M")
datetime2 <- strptime(dts2$datetime, format="%d/%m/%Y %H:%M")
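
strptime returns POSIXlt, while POSIXct is usually the more convenient class to store in a data frame column. A minimal sketch of the conversion on two of the sample strings; the `tz = "UTC"` setting is an assumption made for illustration, since the question does not state a time zone:

```r
# Two sample strings in the question's dd/mm/yyyy hh:mm layout
x <- c("30/03/2011 02:32", "30/03/2011 03:02")

# tz = "UTC" is an assumption; use the zone the data was recorded in
lt <- strptime(x, format = "%d/%m/%Y %H:%M", tz = "UTC")  # POSIXlt
ct <- as.POSIXct(lt)                                      # POSIXct

class(lt)
class(ct)

# POSIXct stores seconds, so the gap between the two times is in seconds
diff_secs <- as.numeric(ct[2]) - as.numeric(ct[1])
diff_secs   # 1800, i.e. 30 minutes
```

A string that does not match the format comes back as NA, so checking `anyNA(ct)` after parsing is a cheap sanity test.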

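The following code produces a merged table based on the nearest hour in all cases. Inside the merge, it simply prepends a column of rounded times to each data frame, merges on that column (i.e. column number 1), and then uses the -1 index to drop that column at the end: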

# merge based on nearest hour
merge(
    cbind(round(datetime1, "hours"), dts1),
    cbind(round(datetime2, "hours"), dts2),
    by=1, suffixes=c("_dts1", "_dts2")
)[-1]

     datetime_dts1 count period    datetime_dts2 dist car satd  alt
1 30/03/2011 02:32    27    561 30/03/2011 02:58 14.7   1    7 6.36
2 30/03/2011 02:42     3    600 30/03/2011 02:58 14.7   1    7 6.36
3 30/03/2011 02:52     0    574 30/03/2011 02:58 14.7   1    7 6.36
4 30/03/2011 03:02     1    550 30/03/2011 02:58 14.7   1    7 6.36
5 30/03/2011 03:12    15    600 30/03/2011 02:58 14.7   1    7 6.36
6 30/03/2011 03:22     0    597 30/03/2011 02:58 14.7   1    7 6.36

As above, but this time simply truncating to the hour:

merge(
    cbind(trunc(datetime1, "hours"), dts1),
    cbind(trunc(datetime2, "hours"), dts2),
    by=1, suffixes=c("_dts1", "_dts2")
)[-1]

     datetime_dts1 count period    datetime_dts2 dist car satd   alt
1 30/03/2011 02:32    27    561 30/03/2011 02:58 14.7   1    7  6.36
2 30/03/2011 02:42     3    600 30/03/2011 02:58 14.7   1    7  6.36
3 30/03/2011 02:52     0    574 30/03/2011 02:58 14.7   1    7  6.36
4 30/03/2011 03:02     1    550 30/03/2011 03:55 10.4   2    9 -0.34
5 30/03/2011 03:12    15    600 30/03/2011 03:55 10.4   2    9 -0.34
6 30/03/2011 03:22     0    597 30/03/2011 03:55 10.4   2    9 -0.34

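As above, but for dts1, treat a record as belonging to the previous hour until 10 minutes past the hour, by subtracting 10*60 seconds before truncating. This one produces the same output you specified, though without more information I am not sure it is the exact rule you want.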

merge(
    cbind(trunc(datetime1 - 10*60, "hours"), dts1),
    cbind(trunc(datetime2, "hours"), dts2),
    by=1, suffixes=c("_dts1", "_dts2")
)[-1]

     datetime_dts1 count period    datetime_dts2 dist car satd   alt
1 30/03/2011 02:32    27    561 30/03/2011 02:58 14.7   1    7  6.36
2 30/03/2011 02:42     3    600 30/03/2011 02:58 14.7   1    7  6.36
3 30/03/2011 02:52     0    574 30/03/2011 02:58 14.7   1    7  6.36
4 30/03/2011 03:02     1    550 30/03/2011 02:58 14.7   1    7  6.36
5 30/03/2011 03:12    15    600 30/03/2011 03:55 10.4   2    9 -0.34
6 30/03/2011 03:22     0    597 30/03/2011 03:55 10.4   2    9 -0.34

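You can tweak the details of which times you round, which you truncate, and whether you first subtract or add some time, depending on your specific rule.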
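
The truncation variants above differ only in how far the dts1 times are shifted before being cut to the hour. A minimal sketch that folds them into one function; `merge_by_hour`, `shift_secs` and the UTC time zone are hypothetical choices made for this sketch, not part of the code above:

```r
# Hypothetical helper: merge on the hour bucket after shifting the
# first data frame's times by `shift_secs` seconds.
merge_by_hour <- function(d1, d2, shift_secs = 0,
                          fmt = "%d/%m/%Y %H:%M", tz = "UTC") {
  t1 <- as.POSIXct(strptime(as.character(d1$datetime), fmt, tz = tz))
  t2 <- as.POSIXct(strptime(as.character(d2$datetime), fmt, tz = tz))
  # Character keys, so merge() compares plain strings
  d1$.hour <- format(trunc(t1 + shift_secs, "hours"), "%Y-%m-%d %H")
  d2$.hour <- format(trunc(t2, "hours"), "%Y-%m-%d %H")
  out <- merge(d1, d2, by = ".hour", suffixes = c("_dts1", "_dts2"))
  out[names(out) != ".hour"]   # drop the temporary key column
}

# Cut-down versions of the question's data
d1 <- data.frame(datetime = c("30/03/2011 02:52", "30/03/2011 03:02",
                              "30/03/2011 03:12"),
                 count = c(0L, 1L, 15L))
d2 <- data.frame(datetime = c("30/03/2011 02:58", "30/03/2011 03:55"),
                 dist = c(14.7, 10.4))

# shift_secs = 0 reproduces plain truncation; -10 * 60 is the last variant
res <- merge_by_hour(d1, d2, shift_secs = -10 * 60)
res
```

With the 10-minute shift, the 03:02 row lands in the block ending 02:58 and the 03:12 row in the block ending 03:55.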

Edit

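It is not the most elegant, but here is a different approach that accommodates the more complicated conditional rule you described in the comments. It leans heavily on na.locf from the zoo package to first determine the dts2 times that come before and after each dts1 record. With those in hand, it is just a matter of applying the rule to select the desired dts2 time, matching back to the original dts1 table, and then merging.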

library(zoo)

# create ordered list of all datetimes, using names to keep
# track of which ones come from each data frame
alldts <- sort(c(
    setNames(datetime1, rep("dts1", length(datetime1))),
    setNames(datetime2, rep("dts2", length(datetime2)))))
is.dts1 <- names(alldts)=="dts1"

# for each dts1 record, get previous closest dts2 time
dts2.prev <- alldts
dts2.prev[is.dts1] <- NA
dts2.prev <- na.locf(dts2.prev, na.rm=FALSE)[is.dts1]

# for each dts1 record, get next closest dts2 time
dts2.next <- alldts
dts2.next[is.dts1] <- NA
dts2.next <- na.locf(dts2.next, na.rm=FALSE, fromLast=TRUE)[is.dts1]

# for each dts1 record, apply rule to choose dts2 time
use.prev <- !is.na(dts2.prev) & (alldts[is.dts1] - dts2.prev < 5)
dts2.to.use <- ifelse(use.prev, as.character(dts2.prev),
    as.character(dts2.next))

# merge based on chosen dts2 times, prepended as character vector
# for the purpose of merging
merge(
    cbind(.dt=dts2.to.use[match(datetime1, alldts[is.dts1])], dts1),
    cbind(.dt=as.character(datetime2), dts2),
    by=".dt", all.x=TRUE, suffixes=c("_dts1", "_dts2")
)[-1]
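
The same previous/next lookup can also be sketched in base R with findInterval, which avoids the zoo dependency. This is an alternative technique, not the code above. The variable names, the UTC time zone, and reading the threshold as 5 minutes are assumptions of this sketch:

```r
fmt <- "%d/%m/%Y %H:%M"
# tz = "UTC" is an assumption for illustration
t1 <- as.POSIXct(strptime(c("30/03/2011 02:32", "30/03/2011 03:02",
                            "30/03/2011 03:12"), fmt, tz = "UTC"))
# t2 must be sorted in ascending order for findInterval()
t2 <- as.POSIXct(strptime(c("30/03/2011 01:59", "30/03/2011 02:58",
                            "30/03/2011 03:55"), fmt, tz = "UTC"))

# Index of the latest t2 at or before each t1 (0 if there is none)
i_prev <- findInterval(as.numeric(t1), as.numeric(t2))
i_next <- i_prev + 1L          # may run past the end of t2

tol_mins <- 5                  # assumed threshold, in minutes
has_prev <- i_prev > 0
mins_since_prev <- rep(NA_real_, length(t1))
mins_since_prev[has_prev] <- as.numeric(
  difftime(t1[has_prev], t2[i_prev[has_prev]], units = "mins"))

# Rule: use the previous dts2 time if it is under tol_mins old, else the next
use_prev <- has_prev & mins_since_prev < tol_mins
idx <- ifelse(use_prev, i_prev, i_next)
idx[idx > length(t2)] <- NA    # no later dts2 block available
idx
```

Here `idx` holds, for each dts1 time, the row number of the chosen dts2 record, so something like `dts2[idx, ]` would pull the matching rows to bind alongside dts1.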

Answer 2

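data.table's rolling join (the roll argument) seems very well suited to this. It is efficient and can merge on the nearest value (the upper one, the lower one, or either).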

Found on this site, which has left join and right join examples, among others: https://www.r-bloggers.com/understanding-data-table-rolling-joins/

website[, join_time:=session_start_time]
paypal[, join_time:=purchase_time]
setkey(website, name, join_time)
setkey(paypal, name, join_time)
website[paypal, roll = TRUE]
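
Adapted to the question's data, a rolling join might look like the sketch below. It assumes the data.table package is installed. The cut-down tables, the `join_time` column, the UTC time zone and the 5-minute tolerance are all assumptions of this sketch rather than anything stated in the linked article:

```r
library(data.table)

dt1 <- data.table(
  datetime = as.POSIXct(c("2011-03-30 02:32", "2011-03-30 02:52",
                          "2011-03-30 03:02", "2011-03-30 03:12"),
                        tz = "UTC"),
  count = c(27L, 0L, 1L, 15L))
dt2 <- data.table(
  datetime = as.POSIXct(c("2011-03-30 01:59", "2011-03-30 02:58",
                          "2011-03-30 03:55"), tz = "UTC"),
  dist = c(23.9, 14.7, 10.4))

# Separate join columns, so the original timestamps survive the join.
# Shifting dt1 back by 5 minutes (assumed tolerance) lets a row just
# past a block end still attach to that block.
dt1[, join_time := datetime - 5 * 60]
dt2[, join_time := datetime]

# roll = -Inf: for each dt1 row, take the first dt2 row at or after join_time
res <- dt2[dt1, on = "join_time", roll = -Inf]
res
```

In the result, `datetime` holds the dts2 timestamp and `i.datetime` the dts1 timestamp. Dropping the shift (or using `roll = "nearest"`) changes which block the borderline rows attach to, so the rule is worth checking against a few known rows.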

About data.table: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html

How to merge two data frames in R by a common column with mismatched date/time values
