R data.table: join/subset/match by group and by a condition

Time: 2014-02-22 09:03:28

Tags: r match data.table subset

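I am trying to subset/match data by group across two data.tables, but I can't figure out how to do this in R. I have the following data.table, which has a City_ID and a timestamp (column name = Time).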

library(data.table)
timetable <- data.table(City_ID=c("12","9"),
                        Time=c("12-29-2013-22:05:03","12-29-2013-11:59:00")) 

I have a second data.table with several observations per city, each with a timestamp (plus additional data). It looks like this:

DT = data.table(City_ID =c("12","12","12","9","9","9"),
                Time= c("12-29-2013-13:05:13","12-29-2013-22:05:03",
                        "12-28-2013-13:05:13","12-29-2013-11:59:00",
                        "01-30-2013-10:05:03","12-28-2013-13:05:13"), 
                Other=1:6)

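Now I need to find, for each city, the observations in DT whose Time is >= the Time for that city in the other data.table, timetable (which is basically a lookup table). Only those records should be kept, including columns that are not used in the comparison (the column Other in this example). My desired result looks like this: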

desiredresult = data.table(City_ID=c("12","9"),
                           Time= c("12-29-2013-22:05:03","12-29-2013-11:59:00"),
                           Other=c("2","4"))

I tried the following:

setkey(DT, City_ID, Time)  
setkey(timetable, City_ID)  
failedresult = DT[,Time >= timetable[Time], by=City_ID]  
failedresult2 = DT[,Time >= timetable, by=City_ID]  
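BTW: I know it might be better to split the date and the time into separate columns, but that would probably make the example more complicated (when I tested finding the minimum of the timestamps with data.table, it seemed to work).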
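
A caveat on the timestamp format, as a minimal sketch: month-first strings are compared character by character, so `>=` on the raw strings is only chronological by coincidence. The second timestamp below is a hypothetical value chosen to cross a year boundary, and the explicit format string and `tz = "UTC"` are assumptions made for reproducibility.

```r
# Month-first timestamp strings compare character by character, not by date.
a <- "12-29-2013-22:05:03"   # 29 Dec 2013, taken from the example data
b <- "01-30-2014-10:05:03"   # 30 Jan 2014, hypothetical and chronologically later

# Parse both strings into POSIXct so that comparisons follow the calendar.
fmt <- "%m-%d-%Y-%H:%M:%S"
a_time <- as.POSIXct(a, format = fmt, tz = "UTC")
b_time <- as.POSIXct(b, format = fmt, tz = "UTC")

string_says_later <- b >= a            # FALSE: "0" sorts before "1"
parsed_says_later <- b_time >= a_time  # TRUE: 2014 is after 2013
```

Comparisons on the parsed values stay correct across month and year boundaries, which is why converting the Time columns before matching is the safer route.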

1 Answer:

Answer 0: (score: 3)

Here is one way to do this:

# 1) transform the strings to POSIXct objects
#    (%H:%M:%S is spelled out because %X depends on the locale)
DT[ , Time := as.POSIXct(strptime(Time, "%m-%d-%Y-%H:%M:%S"))]
timetable[ , Time := as.POSIXct(strptime(Time, "%m-%d-%Y-%H:%M:%S"))]

# 2) set key
setkey(DT, City_ID)
setkey(timetable, City_ID)

# 3) join tables
DT2 <- DT[timetable]

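# 4) keep the rows whose Time is at or after the reference time from timetable,
#    and only the original columns of DT.
#    Note: the joined timetable column was named Time.1 when this answer was
#    written; recent data.table versions appear to name it i.Time instead, in
#    which case the filter needs to read Time >= i.Time.
DT2[Time >= Time.1, names(DT), with = FALSE]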

#    City_ID                Time Other
# 1:      12 2013-12-29 22:05:03     2
# 2:       9 2013-12-29 11:59:00     4
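
As an alternative to the join-then-filter steps above, the condition can be put into the join itself. This is only a sketch, not part of the original answer: it assumes a data.table version recent enough to support non-equi joins in `on =`, the `x.` prefix in `j`, and `nomatch = NULL`. The explicit format string and `tz = "UTC"` are also assumptions.

```r
library(data.table)

# Example data from the question
timetable <- data.table(City_ID = c("12", "9"),
                        Time = c("12-29-2013-22:05:03", "12-29-2013-11:59:00"))
DT <- data.table(City_ID = c("12", "12", "12", "9", "9", "9"),
                 Time = c("12-29-2013-13:05:13", "12-29-2013-22:05:03",
                          "12-28-2013-13:05:13", "12-29-2013-11:59:00",
                          "01-30-2013-10:05:03", "12-28-2013-13:05:13"),
                 Other = 1:6)

# Parse the timestamps so that >= compares points in time, not strings
fmt <- "%m-%d-%Y-%H:%M:%S"
DT[, Time := as.POSIXct(Time, format = fmt, tz = "UTC")]
timetable[, Time := as.POSIXct(Time, format = fmt, tz = "UTC")]

# Non-equi join: same City_ID, and DT's Time at or after timetable's Time.
# In a non-equi join the joined Time column carries timetable's values,
# so x.Time is used to recover DT's own timestamp.
result <- DT[timetable,
             on = .(City_ID, Time >= Time),
             .(City_ID, Time = x.Time, Other),
             nomatch = NULL]
```

On the sample data this keeps the rows with Other equal to 2 and 4, matching the desired result. Rows come back in the order of timetable, and no intermediate table with a duplicated Time column is needed.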