I have a data frame containing each person's stages, as shown below (this is just a small sample of a very large data set):
df = structure(list(DeceasedDate = c(0.283219178082192, 1.12678843226788,
2.02865296803653, 0.892465753424658, NA, 0.88013698630137, NA
), LastClinicalEventMonthEnd = c(0.244862981988838, 1.03637744165398,
10.9464611555048, 0.763698598427194, 3.35011412354135, 0.677397228564181,
3.83687211440893), FirstYStage = c("N/A", "2", "2", "2", "2",
"2", "3.1"), SecondYStage = c("N/A", "N/A", "2", "N/A", "2",
"N/A", "3.1"), ThirdYStage = c("N/A", "N/A", "2", "N/A", "2",
"N/A", "3.1"), FourthYStage = c("N/A", "N/A", "N/A", "N/A", "2",
"N/A", "3.1"), FifthYStage = c("N/A", "N/A", "N/A", "N/A", "N/A",
"N/A", "N/A")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-7L))
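The five rightmost columns hold a person's stage per year, but they do not yet contain all the information. I need to incorporate the information from the first two columns, whose values are in years, as follows:
If the value in column 1 (DeceasedDate) is less than one year, then FirstYStage should be "Deceased", and so should all the following columns (the person is still dead...); if the value is between 1 and 2, then SecondYStage and all later columns should be "Deceased", and so on.
If the value in column 2 (LastClinicalEventMonthEnd) is less than one year, then SecondYStage should be "EndOfEvents"; if the value is between 1 and 2, then ThirdYStage should be "EndOfEvents", and so on.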
So the expected output in this case should be (note that the end-of-events label is written as "LastEvent" here):
df_updated = structure(list(DeceasedDate = c(0.283219178082192,
1.12678843226788,
2.02865296803653, 0.892465753424658, NA, 0.88013698630137, NA
), LastClinicalEventMonthEnd = c(0.244862981988838, 1.03637744165398,
10.9464611555048, 0.763698598427194, 3.35011412354135, 0.677397228564181,
3.83687211440893), FirstYStage = c("Deceased", "2", "2", "Deceased",
"2", "Deceased", "3.1"), SecondYStage = c("Deceased", "Deceased",
"2", "Deceased", "2", "Deceased", "3.1"), ThirdYStage = c("Deceased",
"Deceased", "Deceased", "Deceased", "2", "Deceased", "3.1"),
FourthYStage = c("Deceased", "Deceased", "Deceased", "Deceased",
"2", "Deceased", "3.1"), FifthYStage = c("Deceased", "Deceased",
"Deceased", "Deceased", "LastEvent", "Deceased", "LastEvent"
)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"
))
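An important point is that "Deceased" takes priority. In other words, if there is a conflict where a stage number exists but "Deceased" contradicts it, "Deceased" wins.
How can I do this in the most efficient way? At the moment I am using if statements, but I don't think that is the best approach.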
Answer 0 (score: 1)
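This is what I would do: reshape the data from wide to long format, update the value column according to the rules, and reshape back to wide format.
As I am more fluent in data.table than in dplyr, here is an implementation in data.table syntax. (Apologies, I will add a dplyr solution if time permits.)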
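library(data.table)
# reshape to long format: one row per person (rn) and stage column
# note: setDT() converts df to a data.table by reference
long <- melt(setDT(df)[, rn := .I], measure.vars = patterns("Stage$"))
long[, year := as.integer(variable)] # factor level = column index = year
# rows where DeceasedDate is NA are not selected by this condition
long[floor(DeceasedDate) < year, value := "Deceased"]
# EndOfEvents applies only to persons without a death date, so "Deceased" keeps priority
long[is.na(DeceasedDate) & floor(LastClinicalEventMonthEnd) + 1 < year, value := "EndOfEvents"]
# reshape back to wide format
dcast(long, rn + DeceasedDate + LastClinicalEventMonthEnd ~ variable)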
   rn DeceasedDate LastClinicalEventMonthEnd FirstYStage SecondYStage ThirdYStage FourthYStage FifthYStage
1:  1    0.2832192                 0.2448630    Deceased     Deceased    Deceased     Deceased    Deceased
2:  2    1.1267884                 1.0363774           2     Deceased    Deceased     Deceased    Deceased
3:  3    2.0286530                10.9464612           2            2    Deceased     Deceased    Deceased
4:  4    0.8924658                 0.7636986    Deceased     Deceased    Deceased     Deceased    Deceased
5:  5           NA                 3.3501141           2            2           2            2 EndOfEvents
6:  6    0.8801370                 0.6773972    Deceased     Deceased    Deceased     Deceased    Deceased
7:  7           NA                 3.8368721         3.1          3.1         3.1          3.1 EndOfEvents
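If the round trip through long format turns out to be costly on a very large table, the same two rules can probably be applied to the stage columns directly. The following is only a sketch in base R and a different technique from the reshaping approach above. It rebuilds the sample data from the question as a plain data.frame called dat; the names dat, stage_cols, stages, year, dead and ended are made up for illustration, and it assumes the stage columns appear in year order.

```r
# Sample data from the question, rebuilt as a plain data.frame (name `dat` is illustrative).
dat <- data.frame(
  DeceasedDate = c(0.283219178082192, 1.12678843226788, 2.02865296803653,
                   0.892465753424658, NA, 0.88013698630137, NA),
  LastClinicalEventMonthEnd = c(0.244862981988838, 1.03637744165398,
                                10.9464611555048, 0.763698598427194,
                                3.35011412354135, 0.677397228564181,
                                3.83687211440893),
  FirstYStage  = c("N/A", "2", "2", "2", "2", "2", "3.1"),
  SecondYStage = c("N/A", "N/A", "2", "N/A", "2", "N/A", "3.1"),
  ThirdYStage  = c("N/A", "N/A", "2", "N/A", "2", "N/A", "3.1"),
  FourthYStage = c("N/A", "N/A", "N/A", "N/A", "2", "N/A", "3.1"),
  FifthYStage  = c("N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"),
  stringsAsFactors = FALSE
)

# Assumption: the stage columns are stored in year order (first, second, ...).
stage_cols <- grep("Stage$", names(dat), value = TRUE)
stages <- as.matrix(dat[stage_cols])  # character matrix, one row per person
year <- col(stages)                   # column index of every cell = year 1..5

# Per-row vectors are recycled down the columns, so each cell is compared
# with the dates of its own row.
dead  <- !is.na(dat$DeceasedDate) & floor(dat$DeceasedDate) < year
ended <- is.na(dat$DeceasedDate) &
  floor(dat$LastClinicalEventMonthEnd) + 1 < year

# Write EndOfEvents first and Deceased last, so that Deceased has priority.
stages[ended] <- "EndOfEvents"
stages[dead]  <- "Deceased"
dat[stage_cols] <- as.data.frame(stages, stringsAsFactors = FALSE)
```

On the sample data this should give the same stage columns as the data.table result above. Whether it is actually faster on the real data would have to be measured.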
As promised, here is a dplyr/tidyr implementation of the same approach:
library(tidyr)
library(dplyr)
df %>%
mutate(rn = row_number()) %>%
gather(key, val, ends_with("Stage"), factor_key = TRUE) %>%
mutate(year = as.integer(key)) %>%
mutate(val = if_else(!is.na(DeceasedDate) & floor(DeceasedDate) < year, "Deceased", val)) %>%
mutate(val = if_else(is.na(DeceasedDate) & floor(LastClinicalEventMonthEnd) + 1 < year, "EndOfEvents", val)) %>%
select(-year) %>%
spread(key, val) %>%
arrange(rn)
  DeceasedDate LastClinicalEventMonthEnd rn FirstYStage SecondYStage ThirdYStage FourthYStage FifthYStage
1    0.2832192                 0.2448630  1    Deceased     Deceased    Deceased     Deceased    Deceased
2    1.1267884                 1.0363774  2           2     Deceased    Deceased     Deceased    Deceased
3    2.0286530                10.9464612  3           2            2    Deceased     Deceased    Deceased
4    0.8924658                 0.7636986  4    Deceased     Deceased    Deceased     Deceased    Deceased
5           NA                 3.3501141  5           2            2           2            2 EndOfEvents
6    0.8801370                 0.6773972  6    Deceased     Deceased    Deceased     Deceased    Deceased
7           NA                 3.8368721  7         3.1          3.1         3.1          3.1 EndOfEvents
Or, without creating the year column:
df %>%
mutate(rn = row_number()) %>%
gather(key, val, ends_with("Stage"), factor_key = TRUE) %>%
mutate(val = if_else(!is.na(DeceasedDate) & floor(DeceasedDate) < as.integer(key),
"Deceased", val)) %>%
mutate(val = if_else(is.na(DeceasedDate) & floor(LastClinicalEventMonthEnd) + 1 < as.integer(key),
"EndOfEvents", val)) %>%
spread(key, val) %>%
arrange(rn)
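In current tidyr releases, gather() and spread() are marked as superseded in favour of pivot_longer() and pivot_wider(). Below is a sketch of the same wide-long-wide approach with those functions, assuming tidyr 1.0.0 or later is installed. It puts both rules into a single case_when() call, where the first matching condition wins, so "Deceased" keeps priority. The names stage_cols and result are introduced here for illustration only, and the sample data is repeated so that the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Sample data from the question.
df <- tibble(
  DeceasedDate = c(0.283219178082192, 1.12678843226788, 2.02865296803653,
                   0.892465753424658, NA, 0.88013698630137, NA),
  LastClinicalEventMonthEnd = c(0.244862981988838, 1.03637744165398,
                                10.9464611555048, 0.763698598427194,
                                3.35011412354135, 0.677397228564181,
                                3.83687211440893),
  FirstYStage  = c("N/A", "2", "2", "2", "2", "2", "3.1"),
  SecondYStage = c("N/A", "N/A", "2", "N/A", "2", "N/A", "3.1"),
  ThirdYStage  = c("N/A", "N/A", "2", "N/A", "2", "N/A", "3.1"),
  FourthYStage = c("N/A", "N/A", "N/A", "N/A", "2", "N/A", "3.1"),
  FifthYStage  = c("N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A")
)

# Assumption: the stage columns are stored in year order.
stage_cols <- grep("Stage$", names(df), value = TRUE)

result <- df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(all_of(stage_cols), names_to = "key", values_to = "val") %>%
  mutate(
    year = match(key, stage_cols),  # position of the column = year index
    val = case_when(
      !is.na(DeceasedDate) & floor(DeceasedDate) < year ~ "Deceased",
      is.na(DeceasedDate) &
        floor(LastClinicalEventMonthEnd) + 1 < year ~ "EndOfEvents",
      TRUE ~ val                    # otherwise keep the recorded stage
    )
  ) %>%
  select(-year) %>%
  pivot_wider(names_from = key, values_from = val) %>%
  arrange(rn)
```

Here result should contain the same stage columns as the gather()/spread() versions, with rn kept as a helper column that can be dropped afterwards with select(-rn).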