Here is an MWE of my problem.
Data:
library(data.table)
#dates in %Y-%m-%d
df <- data.table(date=as.Date(c("2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02")), dtm=c(18L, 18L, 18L, 18L, 18L, 18L,46L,46L,74L, 74L,74L,74L,165L, 165L,165L,165L), cval=c(1275L, 1300L, 1300L, 1320L, 1325L, 1325L, 1300L, 1300L, 1300L, 1300L, 1325L, 1325L, 1300L, 1300L, 1325L, 1325L), price_in=c(24.125, 24.625, 35.750, 16.250, 14.500, 50.250, 43.625, 49.125, 58.250, 58.250, 45.375, 70.125, 90.750, 74.750, 77.875, 85.500), price_out=c(26.125, 26.625, 36.625, 17.500, 15.500, 52.250, 45.625, 51.125, 60.000, 60.250, 47.375, 72.125, 92.750, 76.750, 79.875, 87.500), type=c("P", "C", "P", "C", "C", "P", "C", "P", "C", "P", "C", "P", "C", "P", "C", "P"))
df
date dtm cval price_in price_out type
1: 2001-01-02 18 1275 24.125 26.125 P
2: 2001-01-02 18 1300 24.625 26.625 C
3: 2001-01-02 18 1300 35.750 36.625 P
4: 2001-01-02 18 1320 16.250 17.500 C
5: 2001-01-02 18 1325 14.500 15.500 C
6: 2001-01-02 18 1325 50.250 52.250 P
7: 2001-01-02 46 1300 43.625 45.625 C
8: 2001-01-02 46 1300 49.125 51.125 P
9: 2001-01-02 74 1300 58.250 60.000 C
10: 2001-01-02 74 1300 58.250 60.250 P
11: 2001-01-02 74 1325 45.375 47.375 C
12: 2001-01-02 74 1325 70.125 72.125 P
13: 2001-01-02 165 1300 90.750 92.750 C
14: 2001-01-02 165 1300 74.750 76.750 P
15: 2001-01-02 165 1325 77.875 79.875 C
16: 2001-01-02 165 1325 85.500 87.500 P
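What I want to do: within each date, compare every item of type P and C with all other items of the same type and the same dtm but a larger cval.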
For the second item in the sample dataset, this would be:
date dtm cval price_in price_out type
2001-01-02 18 1300 24.625 26.625 C #the item
2001-01-02 18 1320 16.250 17.500 C #same dtm, higher cval
2001-01-02 18 1325 14.500 15.500 C #same dtm, higher cval
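Let cval1 be the cval of the current item, i.e. cval1 = 1300, and cval2 the cvals of the subset of items with a larger cval, i.e. here cval2 = c(1320L, 1325L). I then want to apply a custom exclusion function, for example price_in[cval %in% cval2]-price_out[cval==cval1]-0.5*(cval1-cval2) < 0, and exclude all pairs of items for which it returns TRUE. The same (same procedure, different exclusion criterion) applies to the P items.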
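To make the criterion concrete, here is a minimal sketch that evaluates the example exclusion function for item 2 against the two C items with a larger cval. It rebuilds only the three relevant rows (type C, dtm 18) in a small table named sub, a name introduced here purely for illustration.

```r
library(data.table)

# The three type-C rows with dtm = 18 from the sample data
# (`sub` is a hypothetical helper table, not part of the original data)
sub <- data.table(
  cval      = c(1300L, 1320L, 1325L),
  price_in  = c(24.625, 16.250, 14.500),
  price_out = c(26.625, 17.500, 15.500)
)

cval1 <- 1300L                     # cval of the current item (row 2 of df)
cval2 <- sub[cval > cval1, cval]   # the larger cvals: 1320 and 1325

# Left-hand side of the example criterion, one value per
# (current item, larger-cval item) pair
crit <- sub[cval %in% cval2, price_in] -
  sub[cval == cval1, price_out] -
  0.5 * (cval1 - cval2)

crit       # -0.375  0.375
crit < 0   # TRUE FALSE
```

Only the first pair (cval 1300 and 1320) satisfies the criterion; the pair with cval 1325 does not.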
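Expected output: the original data.table, df, minus the rows excluded in the procedure above. For example, evaluating items 2 and 4 with the example function above returns TRUE: 16.25-26.625-0.5*(1300-1320) = -0.375 < 0. The expected output is therefore df without rows 2 and 4 (note that the pair 2 and 5 does not return TRUE: 14.5-26.625-0.5*(1300-1325) = 0.375 >= 0, so 5 is not excluded):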
date dtm cval price_in price_out type
1: 2001-01-02 18 1275 24.125 26.125 P
3: 2001-01-02 18 1300 35.750 36.625 P
5: 2001-01-02 18 1325 14.500 15.500 C
6: 2001-01-02 18 1325 50.250 52.250 P
7: 2001-01-02 46 1300 43.625 45.625 C
8: 2001-01-02 46 1300 49.125 51.125 P
... ... ... ...
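And so on. Obviously, as is the case for items 7 and 8, an item cannot be excluded if there is no other item with the same characteristics (same date, dtm and type).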
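What I have tried so far: I added an id column with df[,id:=seq_along(date)], then iterated over the dates in a for loop and used vectors to check my custom function. If the resulting vector contained TRUE, I removed the corresponding indices from the data.table.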
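For reference, a loop of this kind might look roughly like the sketch below. It is only an illustration under stated assumptions: it uses the four type-C rows with dtm 18 and 46, applies only the example criterion for C items (the criterion for P items is different and not shown), and skips rows that were already excluded, which is one reading consistent with row 5 surviving in the expected output. The names dt, excl_c and excluded are made up for this sketch.

```r
library(data.table)

# Type-C rows with dtm 18 and 46 from the sample data (rows 2, 4, 5, 7 of df)
dt <- data.table(
  date      = as.Date("2001-01-02"),
  dtm       = c(18L, 18L, 18L, 46L),
  cval      = c(1300L, 1320L, 1325L, 1300L),
  price_in  = c(24.625, 16.250, 14.500, 43.625),
  price_out = c(26.625, 17.500, 15.500, 45.625),
  type      = "C"
)
dt[, id := seq_along(date)]

# Example exclusion criterion for C items (hypothetical helper name)
excl_c <- function(in2, out1, cval1, cval2) {
  in2 - out1 - 0.5 * (cval1 - cval2) < 0
}

excluded <- logical(nrow(dt))   # TRUE marks a row as excluded
for (k in dt$id) {
  if (excluded[k]) next         # assumption: excluded rows are not re-evaluated
  cur <- dt[k]
  # Candidates: same date, dtm and type, larger cval, not yet excluded
  cand <- dt[date == cur$date & dtm == cur$dtm & type == cur$type &
               cval > cur$cval & !excluded[id]]
  if (nrow(cand) == 0L) next
  hit <- excl_c(cand$price_in, cur$price_out, cur$cval, cand$cval)
  # Exclude both members of every pair that meets the criterion
  if (any(hit)) excluded[c(k, cand$id[hit])] <- TRUE
}

res <- dt[!excluded]
res
```

On these four rows the loop keeps only the items with cval 1325 (dtm 18) and cval 1300 (dtm 46), i.e. rows 5 and 7 of df. Filtering the table once per row is what makes this slow on large data.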
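Obviously this approach works, but given the size of my data it runs almost forever. I suspect the solution involves many rolling self-joins on date/dtm subsets, something along the lines of "df[df,roll=Inf,by=.(date,dtm)]" (because I don't think a full rolling self-join is applicable to this case), but I don't understand it well enough.
Question: Is there a way to implement this exclusion procedure with data.table methods? Possibly (but not necessarily) via multiple rolling self-joins?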
Any help is highly appreciated!