我有一个数据集,其中包含IN_TIME_date,OUT_TIME_date,2012-2018年的日期,我想计算每年的住院日,但是,有些患者的IN_TIME_date,OUT_TIME_date不在同一年。我该如何计算?非常感谢。
> dput(as.data.frame(Demography_newdata0129)) structure(list(CASEID = c("023252(1)", "07597558(2)", "07597558(3)", "100520(31)", "100520(32)", "100520(33)", "100520(34)", "10056(1)", "101171(4)", "101171(5)", "101455(2)", "101557(2)", "101571(3)", "101571(4)", "101571(5)", "101571(6)", "10160(5)", "101637(2)", "101893(13)", "101893(15)", "101893(16)", "102807(4)", "102807(5)", "102862(12)"), IN_TIME_date = c("2017-02-25", "2015-10-23", "2016-07-06", "2013-01-23", "2013-03-12", "2013-06-13", "2013-10-08", "2016-02-20", "2015-09-24", "2015-10-19", "2014-05-01", "2015-12-11", "2014-08-26", "2015-07-21", "2016-01-06", "2017-03-20", "2014-04-14", "2017-04-25", "2014-08-10", "2017-02-06", "2017-04-12", "2016-01-19", "2016-06-08", "2012-10-19"), OUT_TIME_date = c("2017-03-02", "2015-12-05", "2016-07-15", "2013-01-28", "2013-03-18", "2013-06-18", "2013-10-15", "2016-02-29", "2015-10-19", "2015-11-02", "2014-05-28", "2016-01-15", "2015-07-21", "2016-01-06", "43179", "2017-12-14", "2014-06-14", "2017-05-09", "2014-08-21", "2017-02-11", "2017-04-20", "2016-01-24", "2016-06-15", "2013-01-25"), LOS = c(5, 43, 9, 5, 6, 5, 7, 9, 25, 14, 27, 35, 329, 169, 804, 269, 61, 14, 11, 5, 8, 5, 7, 98 ),
2012= c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 73),
2013= c(0, 0, 0, 5, 6, 5, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25),
2014= c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 0, 127, 0, 0, 0, 61, 0, 11, 0, 0, 0, 0, 0),
2015= c(0, 43, 0, 0, 0, 0, 0, 0, 25, 14, 0, 20, 201, 163, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
2016= c(0, 0, 9, 0, 0, 0, 0, 9, 0, 0, 0, 15, 0, 5, 360, 0, 0, 0, 0, 0, 0, 5, 7, 0),
2017= c(5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 365, 269, 0, 14, 0, 5, 8, 0, 0, 0),
2018= c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 78, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -24L), class = "data.frame")
答案 0 :(得分:0)
首先,有两条评论:
dput
引发错误。首先,您缺少日期列名称的单勾号;其次,OUT_TIME_date
中的CASEID = 101571(5)
似乎有错误。对于以后的帖子,请仔细检查示例数据(1)不包含R语法错误,并且(2)是正确且具有代表性。这里是利用lubridate
函数处理Date
s的一种选择
library(tidyverse)
library(lubridate)
df %>%
mutate_at(vars(ends_with("date")), ~as.Date(.x, format = "%Y-%m-%d")) %>%
mutate(
LOS = difftime(OUT_TIME_date, IN_TIME_date, units = "days"),
grp = map2(
IN_TIME_date,
OUT_TIME_date,
~year(seq.Date(.x, .y - 1, by = "day")))) %>%
unnest() %>%
group_by_all() %>%
tally() %>%
spread(grp, n, fill = 0) %>%
ungroup() %>%
as.data.frame()
# CASEID IN_TIME_date OUT_TIME_date LOS 2012 2013 2014 2015 2016
#1 023252(1) 2017-02-25 2017-03-02 5 days 0 0 0 0 0
#2 07597558(2) 2015-10-23 2015-12-05 43 days 0 0 0 43 0
#3 07597558(3) 2016-07-06 2016-07-15 9 days 0 0 0 0 9
#4 100520(31) 2013-01-23 2013-01-28 5 days 0 5 0 0 0
#5 100520(32) 2013-03-12 2013-03-18 6 days 0 6 0 0 0
#6 100520(33) 2013-06-13 2013-06-18 5 days 0 5 0 0 0
#7 100520(34) 2013-10-08 2013-10-15 7 days 0 7 0 0 0
#8 10056(1) 2016-02-20 2016-02-29 9 days 0 0 0 0 9
#9 101171(4) 2015-09-24 2015-10-19 25 days 0 0 0 25 0
#10 101171(5) 2015-10-19 2015-11-02 14 days 0 0 0 14 0
#11 101455(2) 2014-05-01 2014-05-28 27 days 0 0 27 0 0
#12 101557(2) 2015-12-11 2016-01-15 35 days 0 0 0 21 14
#13 101571(3) 2014-08-26 2015-07-21 329 days 0 0 128 201 0
#14 101571(4) 2015-07-21 2016-01-06 169 days 0 0 0 164 5
#15 101571(5) 2016-01-06 2018-03-20 804 days 0 0 0 0 361
#16 101571(6) 2017-03-20 2017-12-14 269 days 0 0 0 0 0
#17 10160(5) 2014-04-14 2014-06-14 61 days 0 0 61 0 0
#18 101637(2) 2017-04-25 2017-05-09 14 days 0 0 0 0 0
#19 101893(13) 2014-08-10 2014-08-21 11 days 0 0 11 0 0
#20 101893(15) 2017-02-06 2017-02-11 5 days 0 0 0 0 0
#21 101893(16) 2017-04-12 2017-04-20 8 days 0 0 0 0 0
#22 102807(4) 2016-01-19 2016-01-24 5 days 0 0 0 0 5
#23 102807(5) 2016-06-08 2016-06-15 7 days 0 0 0 0 7
#24 102862(12) 2012-10-19 2013-01-25 98 days 74 24 0 0 0
# 2017 2018
#1 5 0
#2 0 0
#3 0 0
#4 0 0
#5 0 0
#6 0 0
#7 0 0
#8 0 0
#9 0 0
#10 0 0
#11 0 0
#12 0 0
#13 0 0
#14 0 0
#15 365 78
#16 269 0
#17 0 0
#18 14 0
#19 0 0
#20 5 0
#21 8 0
#22 0 0
#23 0 0
#24 0 0
这个想法是根据IN_TIME_date
和OUT_TIME_date
生成一个日期的逐日序列;然后,我们仅从这些序列中提取年份,并计算每个CASEID
的年份。剩下的就是数据的基本整理/整形。
df <- structure(list(CASEID = c("023252(1)", "07597558(2)", "07597558(3)",
"100520(31)", "100520(32)", "100520(33)", "100520(34)", "10056(1)",
"101171(4)", "101171(5)", "101455(2)", "101557(2)", "101571(3)",
"101571(4)", "101571(5)", "101571(6)", "10160(5)", "101637(2)",
"101893(13)", "101893(15)", "101893(16)", "102807(4)", "102807(5)",
"102862(12)"), IN_TIME_date = c("2017-02-25", "2015-10-23", "2016-07-06",
"2013-01-23", "2013-03-12", "2013-06-13", "2013-10-08", "2016-02-20",
"2015-09-24", "2015-10-19", "2014-05-01", "2015-12-11", "2014-08-26",
"2015-07-21", "2016-01-06", "2017-03-20", "2014-04-14", "2017-04-25",
"2014-08-10", "2017-02-06", "2017-04-12", "2016-01-19", "2016-06-08",
"2012-10-19"), OUT_TIME_date = c("2017-03-02", "2015-12-05",
"2016-07-15", "2013-01-28", "2013-03-18", "2013-06-18", "2013-10-15",
"2016-02-29", "2015-10-19", "2015-11-02", "2014-05-28", "2016-01-15",
"2015-07-21", "2016-01-06", "2018-03-20", "2017-12-14", "2014-06-14",
"2017-05-09", "2014-08-21", "2017-02-11", "2017-04-20", "2016-01-24",
"2016-06-15", "2013-01-25")), row.names = c(NA, -24L), class = "data.frame")