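Initially, the data is represented by different IDs that have a parent-child relationship, and each row represents a separate trade with its ID. The raw data set I need to analyze looks like this.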
dt.original.data <- structure(list(msg_seq_nb = c("0005747", "0005747", "0005765", "0005765", "0005783", "0005783", "0008333", "0008333", "0008494",
"0008494", "0008556", "0008556", "0008560", "0008560", "0013622", "0013622", "0013797", "0013797", "0024541", "0024541", "0024569"
), orig_msg_seq_nb = c(NA, NA, "0005747", "0005747", "0005765", "0005765", "0005783", "0005783", "0008333", "0008333", "0008494",
"0008494", "0008556", "0008556", "0008560", "0008560", "0013622", "0013622", "0013797", "0013797", "0024541")
, trc_st = c("T","C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R")
, trd_rpt_dt = structure(c(15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987), class = "Date")
, trd_rpt_tm = c(34838, 34853, 34853, 34863, 34863, 36231, 36231, 36305, 36305, 36328, 36328, 36330, 36330, 38831, 38831, 38925, 38925, 42984, 42984, 43002, 43002))
, row.names = c(NA, -21L), class = c("data.table", "data.frame"))
> dt.original.data
msg_seq_nb orig_msg_seq_nb trc_st trd_rpt_dt trd_rpt_tm
1: 0005747 <NA> T 2013-10-09 34838
2: 0005747 <NA> C 2013-10-09 34853
3: 0005765 0005747 R 2013-10-09 34853
4: 0005765 0005747 C 2013-10-09 34863
5: 0005783 0005765 R 2013-10-09 34863
6: 0005783 0005765 C 2013-10-09 36231
7: 0008333 0005783 R 2013-10-09 36231
8: 0008333 0005783 C 2013-10-09 36305
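As you can see, all trades are linked hierarchically through the connection from orig_msg_seq_nb to msg_seq_nb. I therefore used a recursive join that essentially appends the matching combinations to a single row.
This was done with the answer provided at https://stackoverflow.com/a/53395260/5795592
An excerpt of the result looks like this: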
msg_seq_nb Initial.Trade.Status Initial.Trd.Rpt.Dt Initial.Trd.Rpt.Tm J2.Msg.Nb J2.Trade.Status J2.Trd.Rpt.Tm
1: 0005747 T 2013-10-09 34838 0005765 R 34853
2: 0005747 T 2013-10-09 34838 0005765 C 34863
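I now want to analyze each relationship chain and extract the status at its end. In the excerpt of the sample data shown above, the last status entered in trc_st for msg_seq_nb 0008333 is C, so I need to delete every msg_seq_nb in that hierarchy chain.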
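Based on the final status of the relationship chain (which starts out as a series of separate trades), I want to determine whether the trade with that ID stays in the original data set, or whether its initial status has to be updated with the final status, i.e. the status after the last recursive step.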
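To make the goal concrete, here is a minimal sketch in base R of what "the status at the end of a chain" could mean. It assumes that a chain ends at the message ID that no other row references in orig_msg_seq_nb, and that the final status is the one with the largest trd_rpt_tm for that ID. The names trades, chain_end_ids and final_status are illustrative and not part of the original code; the data is the eight-row excerpt printed above.

```r
# First eight rows of the sample data (chain 0005747 -> 0005765 -> 0005783 -> 0008333).
trades <- data.frame(
  msg_seq_nb      = c("0005747", "0005747", "0005765", "0005765",
                      "0005783", "0005783", "0008333", "0008333"),
  orig_msg_seq_nb = c(NA, NA, "0005747", "0005747",
                      "0005765", "0005765", "0005783", "0005783"),
  trc_st          = c("T", "C", "R", "C", "R", "C", "R", "C"),
  trd_rpt_tm      = c(34838, 34853, 34853, 34863, 34863, 36231, 36231, 36305),
  stringsAsFactors = FALSE
)

# Assumption: a message ends a chain when no other row names it as its parent.
chain_end_ids <- setdiff(unique(trades$msg_seq_nb), trades$orig_msg_seq_nb)

# For each chain end, keep only its latest report (largest trd_rpt_tm).
end_rows <- trades[trades$msg_seq_nb %in% chain_end_ids, ]
end_rows <- end_rows[order(end_rows$msg_seq_nb, end_rows$trd_rpt_tm), ]
final_status <- end_rows[!duplicated(end_rows$msg_seq_nb, fromLast = TRUE),
                         c("msg_seq_nb", "trc_st", "trd_rpt_tm")]
final_status
```

For the excerpt this yields a single row for msg_seq_nb 0008333 with trc_st C, which is the status described above.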
This is somewhat related to this SQL question: https://dba.stackexchange.com/questions/96098/finding-the-end-of-a-relationship-chain-optimally
Answer 0 (score: 0)
For the expected result you have given here, this is what I would do:
library(data.table)
dt.original.data <- structure(
list(
msg_seq_nb = c("0005747", "0005747", "0005765", "0005765", "0005783", "0005783", "0008333", "0008333", "0008494",
"0008494", "0008556", "0008556", "0008560", "0008560", "0013622", "0013622", "0013797", "0013797", "0024541", "0024541", "0024569"
), orig_msg_seq_nb = c(NA, NA, "0005747", "0005747", "0005765", "0005765", "0005783", "0005783", "0008333", "0008333", "0008494",
"0008494", "0008556", "0008556", "0008560", "0008560", "0013622", "0013622", "0013797", "0013797", "0024541")
, trc_st = c("T","C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R", "C", "R")
, trd_rpt_dt = structure(c(15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987, 15987), class = "Date")
, trd_rpt_tm = c(34838, 34853, 34853, 34863, 34863, 36231, 36231, 36305, 36305, 36328, 36328, 36330, 36330, 38831, 38831, 38925, 38925, 42984, 42984, 43002, 43002))
, row.names = c(NA, -21L), class = c("data.table", "data.frame"))
# Rows that directly follow an initial trade (status "T")
final_transition <- dt.original.data[trc_st != "T" & orig_msg_seq_nb %in% dt.original.data[trc_st == "T"]$msg_seq_nb]
# Attach the status and report time of the initial trade to each follow-up row
final_transition <- merge(dt.original.data[trc_st == "T", c("trc_st", "trd_rpt_tm", "msg_seq_nb")], final_transition, by.x = "msg_seq_nb", by.y = "orig_msg_seq_nb")
# Rename and reorder the columns so they match the expected output
col_names <- c("msg_seq_nb", "Initial.Trade.Status", "Initial.Trd.Rpt.Dt", "Initial.Trd.Rpt.Tm", "J2.Msg.Nb", "J2.Trade.Status", "J2.Trd.Rpt.Tm")
setnames(final_transition,
c("msg_seq_nb", "trc_st.x", "trd_rpt_dt", "trd_rpt_tm.x", "msg_seq_nb.y", "trc_st.y", "trd_rpt_tm.y"),
col_names)
setcolorder(final_transition, col_names)
final_transition
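The code above only joins each initial T trade to its direct successors. One possible way to follow longer chains, shown here as a sketch and not as a definitive solution, is to label every row with the root ID of its chain and then decide per chain. The sketch assumes that each msg_seq_nb has exactly one parent, that there are no cycles, that the latest report in a chain (by trd_rpt_tm) carries its final status, and that a final status of C means the whole chain is dropped. The names trades, parent, root, chain_root, chain_final and kept are illustrative, and the second chain (0009001, 0009002) is hypothetical data added only to show a chain that survives.

```r
library(data.table)

trades <- data.table(
  msg_seq_nb      = c("0005747", "0005747", "0005765", "0005765",
                      "0005783", "0005783", "0008333", "0008333",
                      "0009001", "0009002"),           # last two ids are hypothetical
  orig_msg_seq_nb = c(NA, NA, "0005747", "0005747",
                      "0005765", "0005765", "0005783", "0005783",
                      NA, "0009001"),
  trc_st          = c("T", "C", "R", "C", "R", "C", "R", "C", "T", "R"),
  trd_rpt_tm      = c(34838, 34853, 34853, 34863, 34863, 36231, 36231, 36305,
                      40000, 40010)
)

# Lookup vector: message id -> parent id (NA for the first trade of a chain).
links  <- unique(trades[, .(msg_seq_nb, orig_msg_seq_nb)])
parent <- setNames(links$orig_msg_seq_nb, links$msg_seq_nb)

# Climb from every id towards the root; the loop is bounded by the number of ids.
root <- setNames(names(parent), names(parent))
for (i in seq_along(parent)) {
  up <- parent[root]                 # parent of the current position, NA at a root
  if (all(is.na(up))) break
  root[!is.na(up)] <- up[!is.na(up)]
}
trades[, chain_root := unname(root[msg_seq_nb])]

# Final status per chain: status of the latest report within the chain.
setorder(trades, chain_root, trd_rpt_tm)
chain_final <- trades[, .(final_msg_seq_nb = msg_seq_nb[.N],
                          final_status     = trc_st[.N]),
                      by = chain_root]

# Assumption: chains that end in "C" are removed entirely.
cancelled_roots <- chain_final[final_status == "C", chain_root]
kept <- trades[!(chain_root %in% cancelled_roots)]
kept
```

With this input, chain_final reports C for the chain rooted at 0005747 and R for the hypothetical chain, so kept contains only the two rows of the hypothetical chain. If the initial status should be updated instead of dropped, chain_final can be merged back onto the T rows by chain_root.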