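Matching and rematching data.tables

Time: 2013-11-05 08:38:14

Tags: r data.table

I need to filter some transaction data and I am confused about how to manage it. Here is a simple example of my data: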

library(data.table)
set.seed(1)
start.date <- as.POSIXct("2011-01-01 09:30:01", tz = "GMT")
dates <- seq(start.date, length = 10, by = "days")
tr_dt <- as.integer(gsub("-", "", as.Date(dates)))
DT <- data.table(TM_STMP = dates, PR = format(rlnorm(10, 2), digits = 2), VOL = rpois(10, 200), TRD_EXCTN_DT = tr_dt, TRD_RPT_DT = tr_dt, ASOF_CD = "")
DT[5] <- DT[2]
DT[6] <- DT[2]
DT[7] <- DT[2]
DT[8] <- DT[2]
DT$TRD_RPT_DT[5] <- 20131109
DT$TRD_RPT_DT[6] <- 20131109
DT$TRD_RPT_DT[7] <- 20131109
DT$TRD_RPT_DT[8] <- 20131109
DT$ASOF_CD[5] <- "R"
DT$ASOF_CD[6] <- "A"
DT$ASOF_CD[7] <- "R"
DT$ASOF_CD[8] <- "A"
DT
                TM_STMP   PR VOL TRD_EXCTN_DT TRD_RPT_DT ASOF_CD
 1: 2011-01-01 09:30:01  3.9 221     20131105   20131105
 2: 2011-01-02 09:30:01  8.9 205     20131106   20131106
 3: 2011-01-03 09:30:01  3.2 191     20131107   20131107
 4: 2011-01-04 09:30:01 36.4 195     20131108   20131108
 5: 2011-01-02 09:30:01  8.9 205     20131106   20131109       R
 6: 2011-01-02 09:30:01  8.9 205     20131106   20131109       A
 7: 2011-01-02 09:30:01  8.9 205     20131106   20131109       R
 8: 2011-01-02 09:30:01  8.9 205     20131106   20131109       A
 9: 2011-01-09 09:30:01 13.1 208     20131113   20131113
10: 2011-01-10 09:30:01  5.4 212     20131114   20131114

What I want to do is:

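1) Take all rows with ASOF_CD == "R" and match them with the ASOF_CD == "" rows based on TM_STMP, PR and TRD_EXCTN_DT (for ASOF_CD == "") < TRD_RPT_DT (for ASOF_CD == "R"). Only one "R" can be matched to one "". This should result in:

               TM_STMP   PR VOL TRD_EXCTN_DT TRD_RPT_DT ASOF_CD
2: 2011-01-02 09:30:01  8.9 205     20110102   20110102
5: 2011-01-02 09:30:01  8.9 205     20110102   20131109       R

2) Delete these matches of "R" and "" from the data.table. The data.table then looks like:

               TM_STMP   PR VOL TRD_EXCTN_DT TRD_RPT_DT ASOF_CD
1: 2011-01-01 09:30:01  3.9 221     20110101   20110101
2: 2011-01-03 09:30:01  3.2 191     20110103   20110103
3: 2011-01-04 09:30:01 36.4 195     20110104   20110104
4: 2011-01-02 09:30:01  8.9 205     20110102   20131109       A
5: 2011-01-02 09:30:01  8.9 205     20110102   20131109       R
6: 2011-01-02 09:30:01  8.9 205     20110102   20131109       A
7: 2011-01-09 09:30:01 13.1 208     20110109   20110109
8: 2011-01-10 09:30:01  5.4 212     20110110   20110110

3) Take all remaining rows with ASOF_CD == "R" and match them with the ASOF_CD == "A" rows based on TM_STMP, PR and TRD_EXCTN_DT (for ASOF_CD == "A") <= TRD_RPT_DT (for ASOF_CD == "R"). Only one "R" can be matched to one "A". The matches are:

               TM_STMP   PR VOL TRD_EXCTN_DT TRD_RPT_DT ASOF_CD
4: 2011-01-02 09:30:01  8.9 205     20110102   20131109       A
5: 2011-01-02 09:30:01  8.9 205     20110102   20131109       R

4) Delete these matches of "R" and "A" from the data.table. The final result is the following data.table:

               TM_STMP   PR VOL TRD_EXCTN_DT TRD_RPT_DT ASOF_CD
1: 2011-01-01 09:30:01  3.9 221     20110101   20110101
2: 2011-01-03 09:30:01  3.2 191     20110103   20110103
3: 2011-01-04 09:30:01 36.4 195     20110104   20110104
4: 2011-01-02 09:30:01  8.9 205     20110102   20131109       A
5: 2011-01-09 09:30:01 13.1 208     20110109   20110109
6: 2011-01-10 09:30:01  5.4 212     20110110   20110110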

I thought about the first task and tried:

setkey(DT, "TM_STMP", "PR", "TRD_EXCTN_DT")
DT[ASOF_CD == ""][DT[ASOF_CD == "R", list(TM_STMP, PR, TRD_RPT_DT)], roll = Inf, nomatch = 0, mult = "first"]

I use the roll = Inf argument to match TRD_EXCTN_DT < TRD_RPT_DT, and mult = "first" to get only one match in DT[ASOF_CD == ""], but this gives me two matches:

               TM_STMP   PR TRD_EXCTN_DT VOL TRD_RPT_DT ASOF_CD
1: 2011-01-02 09:30:01  8.9     20131109 205   20131106
2: 2011-01-02 09:30:01  8.9     20131109 205   20131106

Furthermore, for steps 1) and 2), I do not know how to do the match so that I also get the "R" rows that were matched with the "" rows. Is there an inner-join solution that gives me the first pairs of matching "R" and "" at once, so that I can delete them?

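1 answer:

Answer 0 (score: 1)

Here is a starting point that you can build the rest on. I'm assuming you figured out the key and the roll correctly yourself, and will just use them as they are.

Add some kind of index, e.g. the row number: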

DT[, idx := .I]

# Now set your key and do the merge, but keep track of *all* the matching indices
# and pick one index from each match (not sure if you need nomatch - you'll have
# to experiment about that)
setkey(DT, "TM_STMP", "PR", "TRD_EXCTN_DT")
DT[ASOF_CD == ""][DT[ASOF_CD == "R", list(TM_STMP, PR, TRD_RPT_DT, idx.R = idx)],
                  roll = Inf, allow.cartesian = TRUE][,
                  if (.GRP <= length(idx)) list(idx = idx[.GRP]),
                  by = c(key(DT), "idx.R")]
#               TM_STMP   PR TRD_EXCTN_DT idx.R idx
#1: 2011-01-02 09:30:01  8.9     20131109     3   2

idx.R and idx are then the indices you want to throw away.
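To show how the index could carry through all four steps, here is a minimal sketch on a small toy table. It is not the rolling join above. It swaps in a simpler rank-based pairing: inside each (TM_STMP, PR) group the k-th "R" row is paired with the k-th candidate row, and only pairs that pass the date comparison are kept. The helper name pair_once, its strict argument and the toy values are made up for illustration. Because the date test runs after the pairing, this approach can miss a valid pair when the k-th candidate fails the test but another candidate would pass, so treat it as a starting point only.

```r
library(data.table)

# Toy table shaped like the one in the question; the values are illustrative only.
DT <- data.table(
  TM_STMP      = c("09:30:01", "09:30:02", "09:30:02", "09:30:02", "09:30:02", "09:30:02"),
  PR           = c("3.9", "8.9", "8.9", "8.9", "8.9", "8.9"),
  TRD_EXCTN_DT = c(20110101L, 20110102L, 20110102L, 20110102L, 20110102L, 20110102L),
  TRD_RPT_DT   = c(20110101L, 20110102L, 20131109L, 20131109L, 20131109L, 20131109L),
  ASOF_CD      = c("", "", "R", "A", "R", "A")
)
DT[, idx := .I]  # row index, as suggested in the answer

# Hypothetical helper (not part of data.table): pair the k-th "R" row with the
# k-th row whose ASOF_CD equals `code` within each (TM_STMP, PR) group, then
# keep only the pairs that satisfy the date comparison.
pair_once <- function(dt, code, strict = TRUE) {
  r <- dt[ASOF_CD == "R",  .(TM_STMP, PR, TRD_RPT_DT, idx.R = idx)]
  x <- dt[ASOF_CD == code, .(TM_STMP, PR, TRD_EXCTN_DT, idx)]
  r[, k := seq_len(.N), by = .(TM_STMP, PR)]  # rank of each "R" row in its group
  x[, k := seq_len(.N), by = .(TM_STMP, PR)]  # rank of each candidate row in its group
  m <- merge(x, r, by = c("TM_STMP", "PR", "k"))  # inner join: one "R" per candidate
  if (strict) m[TRD_EXCTN_DT < TRD_RPT_DT] else m[TRD_EXCTN_DT <= TRD_RPT_DT]
}

# Steps 1) and 2): match "R" against "" and drop both sides of each pair.
m1  <- pair_once(DT, "", strict = TRUE)
DT2 <- DT[!idx %in% c(m1$idx, m1$idx.R)]

# Steps 3) and 4): match the remaining "R" against "A" and drop those pairs too.
m2  <- pair_once(DT2, "A", strict = FALSE)
DT3 <- DT2[!idx %in% c(m2$idx, m2$idx.R)]
DT3
```

Keeping idx and idx.R side by side in the match table is what makes the deletion a plain `%in%` filter, whichever join ends up producing the pairs.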