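I'm trying to solve the following problem with dplyr and have made some progress, but I'm stuck on a few points.
Problem statement
Within each group (grouped by ID), if the current HID differs from the previous HID for the same ID and Interval < 30, the Penalty column should show the value of Amount. In all other cases it should show 0 (that is, when the HID is the same, or when the HID differs but Interval >= 30).
Data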
"ID","DaysToEvent","HID","Interval","Amount"
2197560,16369,"011",29,90105
2197560,16494,"121",29,50526
2197560,16509,"121",29,194568
2197560,16569,"001",31,27236
2197560,16577,"128",29,17309
2197578,14447,"001",29,17276
2197578,14468,"021",29,12661
2197578,14489,"001",31,15015
2197578,14517,"001",29,19000
2197578,14517,"02P",29,19001
2197578,14517,"001",31,19002
2197578,14517,"001",29,19003
2197578,14517,"001",29,19004
My code
mycoredata2009 = read.csv('path/to/abovefile.csv')
CumulativeCumulativeCost = 0;
mycoredata2009 = mycoredata2009 %>%
group_by(ID) %>%
mutate(Penalty = ifelse( ((HID != lag(HID)) & Interval < 30) ,Amount,0)) %>%
mutate(CumulativeCost=cumsum(as.numeric(Penalty))) %>%
CumulativeCumulativeCost = cumsum(as.numeric(CumulativeCost)) %>%
cat(paste("For group with ID==",ID,"CumulativeCost==", CumulativeCost,sep=""))
mycoredata2009 = as.data.frame(mycoredata2009)
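Problems I'm currently facing
However, the code has a few problems:
The Penalty column shows the value of Amount even when the current HID and the previous HID are the same. (It works correctly for the other two conditions.)
The CumulativeCost column, which should be the running total of the Penalty column, always shows NA.
At the end of each group, I want to print the CumulativeCost for the group, and keep inserting the ID and the group's CumulativeCost into a final output data frame.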
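One likely cause of the all-NA CumulativeCost can be sketched in base R: lag() leaves the first row of each group without a previous HID, so Penalty starts with NA, and cumsum() carries that NA into every later position. The vector `penalty` below is a hypothetical stand-in for one ID's Penalty values, not output from the code above.

```r
# Hypothetical Penalty values for one ID; the first is NA because
# there is no previous HID to compare against
penalty <- c(NA, 50526, 0, 0, 17309)

# cumsum() propagates the leading NA to every later position
all_na <- cumsum(penalty)

# Skipping the first element and prepending NA keeps the running total usable
running <- c(NA, cumsum(penalty[-1L]))

print(all_na)
print(running)
```

Here `all_na` is NA throughout, while `running` is NA only in its first position.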
Output received
ID DaysToEvent HID Interval Amount Penalty CumulativeCost
1 2197560 16369 011 29 90105 NA NA
2 2197560 16494 121 29 50526 50526 NA
3 2197560 16509 121 29 194568 194568 NA
4 2197560 16569 001 31 27236 0 NA
5 2197560 16577 128 29 17309 17309 NA
6 2197578 14447 001 29 17276 NA NA
7 2197578 14468 021 29 12661 12661 NA
8 2197578 14489 001 31 15015 0 NA
9 2197578 14517 001 29 19000 19000 NA
10 2197578 14517 02P 29 19001 19001 NA
11 2197578 14517 001 31 19002 0 NA
12 2197578 14517 001 29 19003 19003 NA
13 2197578 14517 001 29 19004 19004 NA
Expected output (calculated by hand)
ID DaysToEvent HID Interval Amount Penalty CumulativeCost
1 2197560 16369 011 29 90105 NA NA
2 2197560 16494 121 29 50526 50526 50526
3 2197560 16509 121 29 194568 0 50526
4 2197560 16569 001 31 27236 0 50526
5 2197560 16577 128 29 17309 17309 67835
6 2197578 14447 001 29 17276 NA NA
7 2197578 14468 021 29 12661 12661 12661
8 2197578 14489 001 31 15015 0 12661
9 2197578 14517 001 29 19000 0 12661
10 2197578 14517 02P 29 19001 19001 31662
11 2197578 14517 001 31 19002 0 31662
12 2197578 14517 001 29 19003 0 31662
13 2197578 14517 001 29 19004 0 31662
Answer 0 (score: 2)
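Based on the expected output, once the 'Penalty' column is created with the logical condition (HID != lag(HID, ...)), we set the first observation of 'Penalty' in each group to NA, take the cumsum of the remaining rows, and prepend NA to it (c(NA, cumsum(...))) to create 'CumulativeCost'.
library(dplyr)
mycoredata2009 %>%
  group_by(ID) %>%
  # default = first(HID) keeps the default the same type as HID;
  # a numeric default such as 0 can fail when HID is character
  mutate(Penalty = ifelse(HID != lag(HID, default = first(HID)) & Interval < 30, Amount, 0),
         # the first row of each group has no previous HID, so mark it NA
         Penalty = ifelse(row_number() == 1L, NA, Penalty),
         # running total over the rows after the first, with NA prepended for row 1
         CumulativeCost = c(NA, cumsum(Penalty[-1L])))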
# ID DaysToEvent HID Interval Amount Penalty CumulativeCost
#1 2197560 16369 011 29 90105 NA NA
#2 2197560 16494 121 29 50526 50526 50526
#3 2197560 16509 121 29 194568 0 50526
#4 2197560 16569 001 31 27236 0 50526
#5 2197560 16577 128 29 17309 17309 67835
#6 2197578 14447 001 29 17276 NA NA
#7 2197578 14468 021 29 12661 12661 12661
#8 2197578 14489 001 31 15015 0 12661
#9 2197578 14517 001 29 19000 0 12661
#10 2197578 14517 02P 29 19001 19001 31662
#11 2197578 14517 001 31 19002 0 31662
#12 2197578 14517 001 29 19003 0 31662
#13 2197578 14517 001 29 19004 0 31662
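The question also asks for one CumulativeCost per ID, printed and collected into a final data frame. A minimal sketch of one way to do that with dplyr follows. The names `res` and `totals` are hypothetical, and `res` holds only a few hand-copied rows standing in for the result above.

```r
library(dplyr)

# Hypothetical stand-in for the result above: a few rows from the question
res <- data.frame(
  ID      = c(2197560, 2197560, 2197560, 2197578, 2197578),
  Penalty = c(NA, 50526, 0, NA, 12661)
)

# One row per ID with the total penalty, ignoring the NA on each group's first row
totals <- res %>%
  group_by(ID) %>%
  summarise(CumulativeCost = sum(Penalty, na.rm = TRUE))

# Print one line per group, as the question asks
for (i in seq_len(nrow(totals))) {
  cat("For group with ID == ", totals$ID[i],
      ", CumulativeCost == ", totals$CumulativeCost[i], "\n", sep = "")
}
```

Summing Penalty with na.rm = TRUE gives the same value as the last CumulativeCost entry of each group, so `totals` can serve as the final output data frame.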
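Or we can remove the ifelse:
mycoredata2009 %>%
  group_by(ID) %>%
  # NA^TRUE is NA and NA^FALSE is 1, so only the first row of each group becomes NA
  mutate(Penalty = NA^(row_number() == 1L) * (HID != lag(HID, default = first(HID)) &
                     Interval < 30) * Amount,
         CumulativeCost = c(NA, cumsum(Penalty[-1L])))
Or using data.table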
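The data.table variant is only named here, so the following is a minimal sketch of how the same logic might look, not the answerer's code. It assumes the data.table package is installed. `dt` and its sample rows are hypothetical stand-ins for mycoredata2009.

```r
library(data.table)

# Hypothetical sample: the first rows of each ID from the question's data
dt <- data.table(
  ID       = c(rep(2197560, 5), rep(2197578, 3)),
  HID      = c("011", "121", "121", "001", "128", "001", "021", "001"),
  Interval = c(29, 29, 29, 31, 29, 29, 29, 31),
  Amount   = c(90105, 50526, 194568, 27236, 17309, 17276, 12661, 15015)
)

# Penalty is Amount when HID changed from the previous row and Interval < 30;
# shift() is data.table's lag, filled with the group's own first HID
dt[, Penalty := ifelse(HID != shift(HID, fill = HID[1L]) & Interval < 30, Amount, 0),
   by = ID]

# The first row of each group has no previous HID, so mark it NA
dt[, Penalty := replace(Penalty, 1L, NA_real_), by = ID]

# Running total over the remaining rows, with NA prepended for the first row
dt[, CumulativeCost := c(NA, cumsum(Penalty[-1L])), by = ID]

print(dt)
```

The three steps are kept separate so each mirrors one step of the dplyr version; they could be merged into a single `:=` call.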