I have two tables, policies and claims:
policies<-data.table(policyNumber=c(123,123,124,125),
EFDT=as.Date(c("2012-1-1","2013-1-1","2013-1-1","2013-2-1")),
EXDT=as.Date(c("2013-1-1","2014-1-1","2014-1-1","2014-2-1")))
> policies
policyNumber EFDT EXDT
1: 123 2012-01-01 2013-01-01
2: 123 2013-01-01 2014-01-01
3: 124 2013-01-01 2014-01-01
4: 125 2013-02-01 2014-02-01
claims<-data.table(claimNumber=c(1,2,3,4),
policyNumber=c(123,123,123,124),
lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31")),
claimAmount=c(10,20,20,15))
> claims
claimNumber policyNumber lossDate claimAmount
1: 1 123 2012-02-01 10
2: 2 123 2012-08-15 20
3: 3 123 2013-01-01 20
4: 4 124 2013-10-31 15
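The policies table really contains policy terms, since each row is uniquely identified by a policy number together with an effective date.
I want to merge the two tables in a way that associates claims with policy terms. A claim is associated with a policy term if it has the same policy number and the claim's lossDate falls between the term's effective date and expiration date (the effective date is an inclusive bound and the expiration date is an exclusive bound). How do I merge the tables in this way?
This should be similar to a left outer join. The result should look like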
policyNumber EFDT EXDT claimNumber lossDate claimAmount
1: 123 2012-01-01 2013-01-01 1 2012-02-01 10
2: 123 2012-01-01 2013-01-01 2 2012-08-15 20
3: 123 2013-01-01 2014-01-01 3 2013-01-01 20
4: 124 2013-01-01 2014-01-01 4 2013-10-31 15
5: 125 2013-02-01 2014-02-01 NA <NA> NA
Answer 0 (score: 9)
Version 1 (updated for data.table v1.9.4+)
Try this:
library(data.table)
# Policies table; I've added policyNumber 126:
policies<-data.table(policyNumber=c(123,123,124,125,126),
EFDT=as.Date(c("2012-01-01","2013-01-01","2013-01-01","2013-02-01","2013-02-01")),
EXDT=as.Date(c("2013-01-01","2014-01-01","2014-01-01","2014-02-01","2014-02-01")))
# Claims table; I've added two claims for 126 that are before and after the policy dates:
claims<-data.table(claimNumber=c(1,2,3,4,5,6),
policyNumber=c(123,123,123,124,126,126),
lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31","2012-06-01","2014-03-01")),
claimAmount=c(10,20,20,15,5,25))
# Set the keys for policies and claims so we can join them:
setkey(policies,policyNumber,EFDT)
setkey(claims,policyNumber,lossDate)
# Join the tables using roll
# ans<-policies[claims,list(EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),roll=T][,EFDT:=NULL] ## This worked with earlier versions of data.table, but broke when they updated the by-without-by behavior...
ans<-policies[claims,list(.EFDT=EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),by=.EACHI,roll=T][,`:=`(EFDT=.EFDT, .EFDT=NULL)]
# The claim should have inPolicy==T where EFDT <= lossDate < EXDT (EXDT is an exclusive bound):
ans[lossDate>=EFDT & lossDate<EXDT, inPolicy:=T]
# Set the keys again, but this time we'll join on both dates:
setkey(ans,policyNumber,EFDT,EXDT)
setkey(policies,policyNumber,EFDT,EXDT)
# Union the ans table with policies that don't have any claims:
ans<-rbindlist(list(ans, ans[policies][is.na(claimNumber)]))
ans
# policyNumber EFDT EXDT claimNumber lossDate claimAmount inPolicy
#1: 123 2012-01-01 2013-01-01 1 2012-02-01 10 TRUE
#2: 123 2012-01-01 2013-01-01 2 2012-08-15 20 TRUE
#3: 123 2013-01-01 2014-01-01 3 2013-01-01 20 TRUE
#4: 124 2013-01-01 2014-01-01 4 2013-10-31 15 TRUE
#5: 126 <NA> <NA> 5 2012-06-01 5 FALSE
#6: 126 2013-02-01 2014-02-01 6 2014-03-01 25 FALSE
#7: 125 2013-02-01 2014-02-01 NA <NA> NA NA
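The key step in Version 1 is the rolling join (roll=T). As a minimal sketch of what rolling does, the toy tables below (terms and events are hypothetical names, not from the question) show that each event picks up the most recent term whose start date is on or before the event date, and gets NA when no such term exists:

```r
library(data.table)

# Hypothetical toy data: two terms for one id, and three events.
terms <- data.table(
  id    = c(1, 1),
  start = as.Date(c("2012-01-01", "2013-01-01")),
  term  = c("A", "B")
)
events <- data.table(
  id   = c(1, 1, 1),
  when = as.Date(c("2011-06-01", "2012-08-15", "2013-01-01"))
)

# roll = TRUE carries the last observation forward: an event with no exact
# match on start takes the closest earlier start within the same id.
res <- terms[events, on = .(id, start = when), roll = TRUE]

# In the result, the start column holds the event dates (values from events).
res$term
```

The first event falls before any term starts, so it has nothing to roll forward from. This is why Version 1 still needs the separate inPolicy check against EXDT: rolling only guarantees the start bound.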
Version 2
@Arun suggested using the new foverlaps function in data.table. My attempt below seems harder, not easier, so please let me know how to improve it.
## The foverlaps function requires both tables to have a start and end range, and the "y" table to be keyed
claims[, lossDate2:=lossDate] ## Add a redundant lossDate column to use as the end range for claims
setkey(policies, policyNumber, EFDT, EXDT) ## Set the key for policies ("y" table)
## Find the overlaps, remove the redundant lossDate2 column, and add the inPolicy column:
ans2 <- foverlaps(claims, policies, by.x=c("policyNumber", "lossDate", "lossDate2"))[, `:=`(inPolicy=T, lossDate2=NULL)]
## Update rows where the claim was out of policy:
ans2[is.na(EFDT), inPolicy:=F]
## Remove duplicates (such as policyNumber==123 & claimNumber==3),
## and add policies with no claims (policyNumber==125):
setkey(ans2, policyNumber, claimNumber, lossDate, EFDT) ## order the results
setkey(ans2, policyNumber, claimNumber) ## set the key to identify unique values
ans2 <- rbindlist(list(
unique(ans2, by=key(ans2)), ## select only the unique values (by= is required from v1.9.8, where unique() uses all columns by default)
policies[!.(ans2[, unique(policyNumber)])] ## policies with no claims
), fill=T)
ans2
## policyNumber EFDT EXDT claimNumber lossDate claimAmount inPolicy
## 1: 123 2012-01-01 2013-01-01 1 2012-02-01 10 TRUE
## 2: 123 2012-01-01 2013-01-01 2 2012-08-15 20 TRUE
## 3: 123 2012-01-01 2013-01-01 3 2013-01-01 20 TRUE
## 4: 124 2013-01-01 2014-01-01 4 2013-10-31 15 TRUE
## 5: 126 <NA> <NA> 5 2012-06-01 5 FALSE
## 6: 126 <NA> <NA> 6 2014-03-01 25 FALSE
## 7: 125 2013-02-01 2014-02-01 NA <NA> NA NA
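Note that in the output above, claim 3 (lossDate 2013-01-01) lands in the 2012 term, because the overlap treats EXDT as inclusive. One possible simplification, which is not part of the original answer and assumes data.table v1.9.8 or later, is a non-equi join that states both bounds directly. The sketch below uses the original tables from the question:

```r
library(data.table)

policies <- data.table(
  policyNumber = c(123, 123, 124, 125),
  EFDT = as.Date(c("2012-01-01", "2013-01-01", "2013-01-01", "2013-02-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01", "2014-01-01", "2014-02-01"))
)
claims <- data.table(
  claimNumber  = c(1, 2, 3, 4),
  policyNumber = c(123, 123, 123, 124),
  lossDate     = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01", "2013-10-31")),
  claimAmount  = c(10, 20, 20, 15)
)

# For every policy term (i), find the claims (x) with the same policyNumber
# and EFDT <= lossDate < EXDT. Terms with no claim are kept with NA values.
# The x. and i. prefixes pick the column from claims or policies explicitly,
# because a non-equi join otherwise shows the bound values under x's name.
res <- claims[policies,
  on = .(policyNumber, lossDate >= EFDT, lossDate < EXDT),
  .(policyNumber = i.policyNumber, EFDT = i.EFDT, EXDT = i.EXDT,
    claimNumber, lossDate = x.lossDate, claimAmount)]
res
```

Because the exclusive bound is written as lossDate < EXDT, no helper column is needed to close the interval.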
Version 3
Another version, using foverlaps():
require(data.table) ## 1.9.4+
setDT(claims)[, lossDate2 := lossDate]
setDT(policies)[, EXDTclosed := EXDT-1L]
setkey(claims, policyNumber, lossDate, lossDate2)
foverlaps(policies, claims, by.x=c("policyNumber", "EFDT", "EXDTclosed"))
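foverlaps() requires start and end ranges/intervals. So we duplicate the lossDate column into lossDate2.
Since EXDT needs to be an open bound, we subtract one from it and place the result in a new column, EXDTclosed.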
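To illustrate why subtracting one works, here is a small base R sketch. It assumes dates have daily granularity, so the half-open interval [EFDT, EXDT) contains the same days as the closed interval [EFDT, EXDT - 1]:

```r
EXDT <- as.Date("2013-01-01")
EXDTclosed <- EXDT - 1L            # last day that is still covered
loss <- as.Date("2013-01-01")      # a loss on the expiration date itself

format(EXDTclosed)                 # "2012-12-31"
loss <= EXDT                       # TRUE: a closed test on EXDT would wrongly include it
loss <= EXDTclosed                 # FALSE: correctly excluded from the expiring term
```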
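Now we set the key. foverlaps() requires the last two key columns to be the interval columns, so they are specified last. We also want the overlap join to match on policyNumber first, so it is specified in the key as well.
We need to set the key on claims (check ?foverlaps). We don't have to set the key on policies, but you can if you wish (and then you can skip the by.x argument, since it takes the key by default). Since we don't set the key for policies here, we explicitly specify the corresponding columns in the by.x argument. The overlap type is any by default, which we don't have to change (and is therefore not specified). This results in: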
# policyNumber claimNumber lossDate claimAmount lossDate2 EFDT EXDT EXDTclosed
# 1: 123 1 2012-02-01 10 2012-02-01 2012-01-01 2013-01-01 2012-12-31
# 2: 123 2 2012-08-15 20 2012-08-15 2012-01-01 2013-01-01 2012-12-31
# 3: 123 3 2013-01-01 20 2013-01-01 2013-01-01 2014-01-01 2013-12-31
# 4: 124 4 2013-10-31 15 2013-10-31 2013-01-01 2014-01-01 2013-12-31
# 5: 125 NA <NA> NA <NA> 2013-02-01 2014-02-01 2014-01-31
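The keyed variant mentioned above can be sketched as follows. It is the same join on the question's original data, with a key set on policies so that by.x can be left out:

```r
library(data.table)

policies <- data.table(
  policyNumber = c(123, 123, 124, 125),
  EFDT = as.Date(c("2012-01-01", "2013-01-01", "2013-01-01", "2013-02-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01", "2014-01-01", "2014-02-01"))
)
claims <- data.table(
  claimNumber  = c(1, 2, 3, 4),
  policyNumber = c(123, 123, 123, 124),
  lossDate     = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01", "2013-10-31")),
  claimAmount  = c(10, 20, 20, 15)
)

claims[, lossDate2 := lossDate]        # point interval [lossDate, lossDate]
policies[, EXDTclosed := EXDT - 1L]    # closed end of the policy term

# The interval columns go last in both keys.
setkey(claims, policyNumber, lossDate, lossDate2)
setkey(policies, policyNumber, EFDT, EXDTclosed)

# by.x defaults to key(policies) and by.y to key(claims).
res <- foverlaps(policies, claims)
res
```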
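Answer 1 (score: 1)
I think this does most of what you want. I need to run, so I don't have time to add the policies with no claims and clean up the columns, but I think the difficult part is taken care of: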
setkey(policies, policyNumber, EXDT)
policies[, EXDT2:=EXDT]
policies[claims[, list( policyNumber, lossDate, lossDate, claimNumber, claimAmount)], roll=-Inf]
# policyNumber EXDT EFDT EXDT2 lossDate claimNumber claimAmount
# 1: 123 2012-02-01 2012-01-01 2013-01-01 2012-02-01 1 10
# 2: 123 2012-08-15 2012-01-01 2013-01-01 2012-08-15 2 20
# 3: 123 2013-01-01 2012-01-01 2013-01-01 2013-01-01 3 20
# 4: 124 2013-10-31 2013-01-01 2014-01-01 2013-10-31 4 15
Also note that it is trivial to remove or highlight the claims that fall outside the policy dates in this result.
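As a sketch of that last step, the variation below flags claims outside the policy dates. It differs from the code above in two ways that are assumptions of this sketch: it rolls on a closed expiry column (EXDTclosed, a helper name borrowed from Answer 0) so that a loss on the expiration date goes to the next term, and it uses on= instead of a key. The data, including policy 126 and its two out-of-term claims, comes from Answer 0:

```r
library(data.table)

policies <- data.table(
  policyNumber = c(123, 123, 124, 125, 126),
  EFDT = as.Date(c("2012-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01", "2014-01-01", "2014-02-01", "2014-02-01"))
)
claims <- data.table(
  claimNumber  = c(1, 2, 3, 4, 5, 6),
  policyNumber = c(123, 123, 123, 124, 126, 126),
  lossDate     = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01",
                           "2013-10-31", "2012-06-01", "2014-03-01")),
  claimAmount  = c(10, 20, 20, 15, 5, 25)
)

policies[, EXDTclosed := EXDT - 1L]    # last covered day of each term

# roll = -Inf rolls backwards: each claim takes the term with the nearest
# closed expiry on or after its lossDate (NA if there is none).
res <- policies[claims, on = .(policyNumber, EXDTclosed = lossDate), roll = -Inf]
setnames(res, "EXDTclosed", "lossDate")  # this column now holds the claim dates

# The roll only guarantees lossDate <= last covered day; check the start too.
res[, inPolicy := !is.na(EFDT) & lossDate >= EFDT]
res
```

Filtering with res[inPolicy == TRUE] then drops the out-of-term claims, while keeping the column lets you highlight them instead.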