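Extracting information from a hierarchical relationship chain

Time: 2018-11-28 11:31:09

Tags: r dplyr data.table

The data originally come as rows whose IDs are linked by a parent-child relationship, and each row represents a different transaction with its ID. The raw data set I need to analyze looks like this.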

dt.original.data <- structure(
  list(
    msg_seq_nb = c("0005747", "0005747", "0005765", "0005765", "0005783", "0005783", "0008333", "0008333", "0008494",
                   "0008494", "0008556", "0008556", "0008560", "0008560", "0013622", "0013622", "0013797", "0013797", "0024541", "0024541", "0024569"),
    orig_msg_seq_nb = c(NA, NA, "0005747", "0005747", "0005765", "0005765", "0005783", "0005783", "0008333", "0008333", "0008494",
                        "0008494", "0008556", "0008556", "0008560", "0008560", "0013622", "0013622", "0013797", "0013797", "0024541"),
    trc_st = c("T", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R"),
    trd_rpt_dt = structure(c(15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987), class = "Date"),
    trd_rpt_tm = c(34838, 34853, 34853, 34863, 34863, 36231, 36231, 36305, 36305, 36328, 36328, 36330, 36330, 38831, 38831, 38925, 38925, 42984, 42984, 43002, 43002)
  ),
  row.names = c(NA, -21L),
  class = c("data.table", "data.frame")
)

> dt.original.data
    msg_seq_nb orig_msg_seq_nb trc_st trd_rpt_dt trd_rpt_tm
 1:    0005747            <NA>      T 2013-10-09      34838
 2:    0005747            <NA>      C 2013-10-09      34853
 3:    0005765         0005747      R 2013-10-09      34853
 4:    0005765         0005747      C 2013-10-09      34863
 5:    0005783         0005765      R 2013-10-09      34863
 6:    0005783         0005765      C 2013-10-09      36231
 7:    0008333         0005783      R 2013-10-09      36231
 8:    0008333         0005783      C 2013-10-09      36305

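As you can see, all transactions are related hierarchically through the link between orig_msg_seq_nb and msg_seq_nb. I therefore used a recursive join that essentially adds the matching combinations to a single row. This was done using the answer to this question: https://stackoverflow.com/a/53395260/5795592. An excerpt of the result looks like this: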

      msg_seq_nb Initial.Trade.Status Initial.Trd.Rpt.Dt Initial.Trd.Rpt.Tm J2.Msg.Nb J2.Trade.Status J2.Trd.Rpt.Tm
   1:    0005747                    T         2013-10-09              34838   0005765               R         34853
   2:    0005747                    T         2013-10-09              34838   0005765               C         34863
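
The link between orig_msg_seq_nb and msg_seq_nb can be made explicit by reducing the data to one row per message. The following is only a minimal sketch on the first eight rows printed above; the names `ids`, `links`, `roots` and `leaves` are illustrative and not part of the original data or code.

```r
library(data.table)

# First eight rows of the printed data, reduced to the two ID columns
ids <- data.table(
  msg_seq_nb      = c("0005747", "0005747", "0005765", "0005765",
                      "0005783", "0005783", "0008333", "0008333"),
  orig_msg_seq_nb = c(NA, NA, "0005747", "0005747",
                      "0005765", "0005765", "0005783", "0005783")
)

# One row per message: each message refers to its predecessor
links <- unique(ids)

# A chain starts at a message that has no predecessor ...
roots <- links[is.na(orig_msg_seq_nb), msg_seq_nb]

# ... and ends at a message that no other message refers to
leaves <- links[!msg_seq_nb %in% orig_msg_seq_nb, msg_seq_nb]

roots
leaves
```

In this excerpt the chain starts at 0005747 and ends at 0008333, which is the message whose last status the rest of the question is about.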

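I now want to analyze each relationship chain and extract the status at its end. In the sample data above, the last transaction status entered in trc_st, for msg_seq_nb 0008333, is C, so I need to remove every msg_seq_nb in that hierarchy chain.

The final status of the relationship chain (which started out as a series of separate transactions) determines whether the transaction with a given ID stays in the original data set, or whether its initial status has to be updated with the final status, i.e. the status after the last recursion step.

This is somewhat related to this SQL question: https://dba.stackexchange.com/questions/96098/finding-the-end-of-a-relationship-chain-optimally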

1 Answer:

Answer 0 (score: 0)

For the expected result you have provided here, this is what I would do:

library(data.table)

dt.original.data <- structure(
  list(
    msg_seq_nb = c("0005747", "0005747", "0005765", "0005765", "0005783", "0005783", "0008333", "0008333", "0008494",
                   "0008494", "0008556", "0008556", "0008560", "0008560", "0013622", "0013622", "0013797", "0013797", "0024541", "0024541", "0024569"),
    orig_msg_seq_nb = c(NA, NA, "0005747", "0005747", "0005765", "0005765", "0005783", "0005783", "0008333", "0008333", "0008494",
                        "0008494", "0008556", "0008556", "0008560", "0008560", "0013622", "0013622", "0013797", "0013797", "0024541"),
    trc_st = c("T", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R"),
    trd_rpt_dt = structure(c(15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987), class = "Date"),
    trd_rpt_tm = c(34838, 34853, 34853, 34863, 34863, 36231, 36231, 36305, 36305, 36328, 36328, 36330, 36330, 38831, 38831, 38925, 38925, 42984, 42984, 43002, 43002)
  ),
  row.names = c(NA, -21L),
  class = c("data.table", "data.frame")
)

# Rows whose parent (orig_msg_seq_nb) is an initial trade, i.e. a message with trc_st == "T"
final_transition <- dt.original.data[trc_st != "T" & orig_msg_seq_nb %in% dt.original.data[trc_st == "T"]$msg_seq_nb]
# Attach the initial trade's status and report time to each of those rows
final_transition <- merge(dt.original.data[trc_st == "T", c("trc_st", "trd_rpt_tm", "msg_seq_nb")], final_transition, by.x = "msg_seq_nb", by.y = "orig_msg_seq_nb")

col_names <- c("msg_seq_nb", "Initial.Trade.Status", "Initial.Trd.Rpt.Dt", "Initial.Trd.Rpt.Tm", "J2.Msg.Nb", "J2.Trade.Status", "J2.Trd.Rpt.Tm")

# Rename the merged columns to the layout shown in the question, then reorder them.
# msg_seq_nb.y is the follow-up message's own ID; trd_rpt_dt comes from the follow-up row,
# because the left-hand table passed to merge() does not carry that column.
setnames(final_transition,
         c("msg_seq_nb", "trc_st.x", "trd_rpt_dt", "trd_rpt_tm.x", "msg_seq_nb.y", "trc_st.y", "trd_rpt_tm.y"),
         col_names)
setcolorder(final_transition, col_names)

final_transition
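
The join above covers only the first step of a chain: the initial trade and the messages that refer to it directly. As a hedged sketch of how the end of each chain might be found, the code below walks every message back to the root of its chain and then reads the status of the last row per chain. It uses only the first eight rows printed in the question. The names `links`, `parent_of`, `find_root`, `root`, `final` and `remaining` are illustrative, and the rule that a final status of C removes the whole chain is taken from the question's description. It assumes the links contain no cycles.

```r
library(data.table)

# First eight rows of the data printed in the question (one reporting date, so time alone orders them)
dt <- data.table(
  msg_seq_nb      = c("0005747", "0005747", "0005765", "0005765",
                      "0005783", "0005783", "0008333", "0008333"),
  orig_msg_seq_nb = c(NA, NA, "0005747", "0005747",
                      "0005765", "0005765", "0005783", "0005783"),
  trc_st          = c("T", "C", "R", "C", "R", "C", "R", "C"),
  trd_rpt_tm      = c(34838, 34853, 34853, 34863, 34863, 36231, 36231, 36305)
)

# Lookup from each message to its predecessor (NA for the start of a chain)
links <- unique(dt[, .(msg_seq_nb, orig_msg_seq_nb)])
parent_of <- setNames(links$orig_msg_seq_nb, links$msg_seq_nb)

# Climb from a message to the first message of its chain.
# Stops when there is no predecessor or the predecessor is not in the data.
find_root <- function(id) {
  repeat {
    p <- unname(parent_of[id])
    if (is.na(p) || !(p %in% names(parent_of))) return(id)
    id <- p
  }
}

links[, root := vapply(msg_seq_nb, find_root, character(1), USE.NAMES = FALSE)]

# Tag every row with the root of its chain
dt[links, root := i.root, on = "msg_seq_nb"]

# Last row of each chain by report time (ties keep the original row order)
final <- dt[order(root, trd_rpt_tm),
            .(final_msg = msg_seq_nb[.N], final_status = trc_st[.N]),
            by = root]

# Drop every message that belongs to a chain ending in status "C"
remaining <- dt[!root %in% final[final_status == "C", root]]

final
remaining
```

On these eight rows the single chain rooted at 0005747 ends with message 0008333 in status C, so `remaining` is empty. For data spanning several reporting dates, the ordering would also need to include trd_rpt_dt ahead of trd_rpt_tm.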