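I am trying to create a new column in a data frame whose value depends on conditions in several other columns of the same data frame. My research involves quantifying the severity of occlusion in the coronary arteries (the arteries of the heart).
The example data frame x is: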
structure(list(Study_number = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3,
3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8,
9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13,
13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17,
17, 17, 18, 18, 18, 18, 19, 19, 19, 19, 20, 20, 20, 20, 21, 21,
21, 21, 22, 22, 22, 22, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25,
25, 25, 26, 26, 26, 26, 27, 27, 27, 28, 28, 28, 28, 29, 29, 29,
29, 30, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 33,
34, 34, 34, 34, 35, 36, 36, 36, 36, 37, 37, 37, 37, 38, 38, 38,
38, 39, 39, 39, 39, 40, 40, 40, 40, 41, 41, 41, 41, 42, 42, 42,
42, 43, 43, 43, 43, 44, 44, 44, 44, 45, 45, 45, 45, 46, 46, 46,
46, 47, 47, 47, 47, 48, 48, 48, 48, 49, 49, 49, 49, 50, 50, 50,
50, 51, 51, 51, 51, 52, 52, 52, 53, 53, 53, 53, 54, 54, 54, 54,
55, 55, 55, 56, 56, 56, 56, 57, 57, 57, 57, 58, 58, 58, 58, 59,
59, 59, 59, 60, 60, 60, 60, 61, 61, 61, 61, 62, 62, 63, 63, 63,
63, 64, 64, 64, 64, 65, 65, 65, 65, 66, 66), Vessel = c(1, 2,
3, 4, 1, 2, 3, 4, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3,
4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1,
2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 1, 2, 3,
4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 1,
2, 3, 4, 2, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1,
2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2,
3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3,
4, 1, 2, 3, 4, 1, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 3, 4, 1, 2,
3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3,
4, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3), Segment = c(3,
9, 7, 8, 2, 9, 7, 8, 9, 7, 8, 3, 9, 6, 11, 3, 9, 6, 8, 2, 9,
9, 15, 2, 9, 7, 8, 2, 9, 6, 8, 2, 9, 2, 9, 7, 8, 3, 9, 9, 11,
1, 9, 7, 8, 2, 9, 6, 8, 2, 9, 7, 11, 1, 9, 6, 12, 2, 9, 7, 11,
2, 9, 6, 15, 2, 9, 6, 8, 2, 9, 7, 8, 3, 9, 7, 11, 2, 9, 6, 11,
2, 9, 7, 8, 1, 9, 6, 11, 2, 9, 8, 11, 2, 9, 7, 8, 2, 9, 7, 11,
9, 7, 11, 2, 9, 6, 11, 3, 9, 7, 11, 2, 9, 6, 11, 2, 9, 7, 8,
1, 9, 6, 11, 4, 9, 7, 3, 9, 7, 8, 9, 2, 9, 7, 8, 2, 9, 7, 11,
1, 9, 7, 14, 2, 9, 7, 11, 2, 9, 6, 12, 2, 9, 6, 11, 2, 9, 7,
8, 2, 9, 9, 8, 2, 9, 7, 12, 2, 9, 7, 11, 1, 9, 7, 8, 2, 9, 7,
15, 2, 9, 6, 11, 2, 9, 6, 8, 3, 9, 10, 14, 2, 9, 6, 11, 1, 6,
11, 1, 9, 6, 8, 1, 9, 7, 11, 2, 8, 12, 2, 9, 7, 8, 1, 9, 7, 11,
0, 9, 6, 12, 1, 9, 7, 8, 0, 9, 6, 11, 0, 9, 7, 8, 9, 7, 3, 9,
7, 8, 2, 9, 7, 11, 21, 9, 6, 11, 9, 7), Severity = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Study_number",
"Vessel", "Segment", "Severity"), row.names = c(NA, -250L), class = c("tbl_df",
"tbl", "data.frame"))
The actual data frame looks like this:
Study_number Vessel Segment Severity
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 0
2 1 2 9 0
3 1 3 7 0
4 1 4 8 0
5 2 1 2 0
6 2 2 9 0
7 2 3 7 0
8 2 4 8 0
9 3 2 9 0
10 3 3 7 1
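Each participant usually has 4 vessels (1-4), although some participants may have only 3. What I want is a new column called "Overall_severe_disease" that flags a participant when any of the following conditions is met:
Vessel 2 has severe disease (i.e., Vessel == 2 and Severity == 1 in the same row); OR
Vessel 3 has severe disease in segment 6 or 7 (i.e., Vessel == 3 and Segment == 6 or 7 and Severity == 1 in that row) and at least one other vessel has severe disease (i.e., the sum of the Severity column == 2); OR
3 or more vessels have severe disease (i.e., the sum of Severity per participant >= 3).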
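For reference, the three rules can be written as a small per-participant check. This is only a sketch: the helper name `is_overall_severe` is hypothetical, the input values are made up rather than taken from the data above, and the second rule uses `>= 2` instead of `== 2`, which I assume is equivalent here because 3 or more severe vessels already satisfy the third rule.

```r
# Hypothetical helper: takes the rows of ONE participant and returns TRUE or FALSE.
is_overall_severe <- function(vessel, segment, severity) {
  n_severe <- sum(severity == 1)  # number of severely diseased vessels

  # Rule 1: vessel 2 has severe disease
  rule1 <- any(vessel == 2 & severity == 1)

  # Rule 2: vessel 3 is severe in segment 6 or 7 AND at least one other vessel is severe
  # (assumption: ">= 2" stands in for "== 2", since ">= 3" is covered by rule 3)
  rule2 <- any(vessel == 3 & segment %in% c(6, 7) & severity == 1) && n_severe >= 2

  # Rule 3: three or more vessels have severe disease
  rule3 <- n_severe >= 3

  rule1 || rule2 || rule3
}

# Made-up participant: only vessel 3 (segment 7) is severe, so no rule applies
is_overall_severe(vessel = 1:4, segment = c(2, 9, 7, 8), severity = c(0, 0, 1, 0))

# Made-up participant: vessel 3 (segment 7) plus vessel 1 are severe, so rule 2 applies
is_overall_severe(vessel = 1:4, segment = c(2, 9, 7, 8), severity = c(1, 0, 1, 0))
```

Each rule collapses to a single TRUE or FALSE per participant, so the result does not depend on the order in which the rules are checked.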
This is how I tried to solve it. First, I created a Vessel_Severity column by pasting the two columns together.
x$Vessel_Severity <- paste(x$Vessel, x$Severity, sep = '-')
The new data frame looks like this:
Study_number Vessel Segment Severity Vessel_Severity
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 1 3 0 1-0
2 1 2 9 0 2-0
3 1 3 7 0 3-0
4 1 4 8 0 4-0
5 2 1 2 0 1-0
6 2 2 9 0 2-0
Then I used the following ddply call from the plyr package to apply nested ifelse conditions to each participant.
library(plyr)
x <- ddply(x, 'Study_number', transform,
Overall_severe_disease = ifelse(Vessel_Severity == '3-1' & Segment %in% c(6,7) & sum(Severity) == 2 , 1,
ifelse(Vessel_Severity == '2-1', 1,
ifelse(sum(Severity) >= 3, 1, 0))))
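After that, I used the following call to assign "Yes" or "No" to the "Overall_severe_disease" column (if any row of a participant has a '1', the whole participant is labelled 'Yes').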
x <- ddply(x, 'Study_number', transform, Overall_severe_disease = ifelse(sum(Overall_severe_disease) >= 1, 'Yes', 'No'))
This approach works, and it gives me 9 unique participants with 'Overall_severe_disease':
length(unique(x$Study_number[x$Overall_severe_disease=='Yes']))
#9
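However, if I change the order of the ifelse calls and put the last condition (ifelse(sum(Severity) >= 3, ...)) at the start of my nested ifelse statement, ddply does not apply the remaining conditions, and I get a clear underestimate (5 unique participants instead of 9):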
x <- ddply(x, 'Study_number', transform,
Overall_severe_disease = ifelse(sum(Severity) >= 3, 1,
ifelse(Vessel_Severity == '2-1', 1,
ifelse(Vessel_Severity == '3-1' & Segment %in% c(6,7) & sum(Severity) == 2 , 1 , 0))))
x <- ddply(x, 'Study_number', transform, Overall_severe_disease = ifelse(sum(Overall_severe_disease) >= 1, 'Yes', 'No'))
length(unique(x$Study_number[x$Overall_severe_disease=='Yes']))
#5
I am puzzled by this behaviour. I would appreciate any advice and clarification.
Answer 0: (score: 0)
In your example, you should replace
x$Vessel_Severity -> paste(x$Vessel, x$Severity, sep = '-')
with
x$Vessel_Severity <- paste(x$Vessel, x$Severity, sep = '-')
Trying to reproduce your example step by step: don't you get 9 and 5 as the answers?
# first example
x$Overall_severe_disease<-0
x <- ddply(x, 'Study_number', transform,
Overall_severe_disease = ifelse(Vessel_Severity == '3-1' & Segment %in% c(6,7) & sum(Severity) == 2 , 1, 0))
sum(x$Overall_severe_disease) #4
x <- ddply(x, 'Study_number', transform,
Overall_severe_disease = ifelse(Vessel_Severity == '3-1' & Segment %in% c(6,7) & sum(Severity) == 2 , 1, ifelse(Vessel_Severity == '2-1', 1,0)))
sum(x$Overall_severe_disease) #4
x <- ddply(x, 'Study_number', transform,
Overall_severe_disease = ifelse(Vessel_Severity == '3-1' & Segment %in% c(6,7) & sum(Severity) == 2 , 1,
ifelse(Vessel_Severity == '2-1', 1,
ifelse(sum(Severity) >= 3, 1, 0))))
sum(x$Overall_severe_disease) #24
res<-tapply(x$Overall_severe_disease,x$Study_number,sum)
length(res[res>0])#9
x <- ddply(x, 'Study_number', transform, Overall_severe_disease = ifelse(sum(Overall_severe_disease) >= 1, 'Yes', 'No'))
length(unique(x$Study_number[x$Overall_severe_disease=='Yes'])) #9
# second example
x <- ddply(x, 'Study_number', transform,
Overall_severe_disease = ifelse(sum(Severity) >= 3, 1, 0))
sum(x$Overall_severe_disease) #20
x <- ddply(x, 'Study_number', transform,
Overall_severe_disease = ifelse(sum(Severity) >= 3, 1, ifelse(Vessel_Severity == '2-1', 1,0)))
sum(x$Overall_severe_disease) #20
x <- ddply(x, 'Study_number', transform,
Overall_severe_disease = ifelse(sum(Severity) >= 3, 1,
ifelse(Vessel_Severity == '2-1', 1,
ifelse(Vessel_Severity == '3-1' & Segment %in% c(6,7) & sum(Severity) == 2 , 1 , 0))))
sum(x$Overall_severe_disease) #20
res<-tapply(x$Overall_severe_disease,x$Study_number,sum)
length(res[res>0])#5
x <- ddply(x, 'Study_number', transform, Overall_severe_disease = ifelse(sum(Overall_severe_disease) >= 1, 'Yes', 'No'))
length(unique(x$Study_number[x$Overall_severe_disease=='Yes'])) #5
So, in the second example, the 4 rows matched by the condition ifelse(Vessel_Severity == '3-1' & Segment %in% c(6,7) & sum(Severity) == 2 , 1, 0) are dropped. This is a good question.
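A likely explanation, as far as I can tell, is that ifelse() returns a value with the same shape as its test. sum(Severity) >= 3 is a single TRUE or FALSE per participant, so when it is the outermost test the whole nested expression collapses to one value, taken from the first row of the group, and transform then recycles that value across the group's rows. The sketch below uses made-up rows for one participant (not taken from the data above) to show the difference; the any()-based variant at the end is just one possible way around it, not necessarily the best one.

```r
# Made-up rows for one participant: vessels 1-4, only vessel 2 is severe
Severity        <- c(0, 1, 0, 0)
Vessel_Severity <- c("1-0", "2-1", "3-0", "4-0")

# Vector test outermost: the result has one element per row
by_row <- ifelse(Vessel_Severity == "2-1", 1,
                 ifelse(sum(Severity) >= 3, 1, 0))
by_row        # 0 1 0 0

# Scalar test outermost: ifelse() returns a single value, taken from the
# first element of the inner result (the row for vessel 1), so the '2-1' row is lost
scalar_first <- ifelse(sum(Severity) >= 3, 1,
                       ifelse(Vessel_Severity == "2-1", 1, 0))
scalar_first  # 0

# One possible workaround: reduce every condition to a single group-level
# logical with any(), so the order of the conditions no longer matters
group_flag <- as.numeric(sum(Severity) >= 3 || any(Vessel_Severity == "2-1"))
group_flag    # 1
```

This would also explain the count of 20 in the second example: the 5 participants with sum(Severity) >= 3 each contribute 4 rows of 1, and every other participant gets the value of their first row, which is 0.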