Last observation carried forward conditional on multiple columns

Time: 2016-07-28 09:26:29

Tags: r tidyr locf

I have a dataset with this structure:

ID = c(1,1,1,1,2,2,2,3,3,3,3) 
L40 = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA) 
K50 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1) 
df = data.frame(ID, L40, K50)

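When missing values occur in columns L40 and K50, I want to carry forward the last non-missing value in that column, provided that the ID is the same as the previous row's ID and that both L40 and K50 are empty in the current row. I applied the following code: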

library(dplyr)  # provides %>% and group_by()
library(tidyr)  # provides fill()
df2 <- df %>% group_by(ID) %>% fill(L40:K50)

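This does not do what I want. I want the previous non-missing value to be carried to the next row only when all the other columns in that row (apart from ID) are empty. This is the result I am after: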

ID = c(1,1,1,1,2,2,2,3,3,3,3) 
L40 = c(1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA)
K50 = c(NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1)  
df3 = data.frame(ID, L40, K50)

2 answers:

Answer 0: (score: 0)

We can use na.locf

library(data.table)
library(zoo)
# na.rm = FALSE keeps leading NAs, so every column keeps the group's row count
setDT(df)[, if(any(is.na(K50[-1]))) lapply(.SD, na.locf, na.rm = FALSE) else .SD , by = ID]
#   ID L40 K50
#1:  1   1  NA
#2:  1   1  NA
#3:  1   1  NA
#4:  1   1  NA
#5:  2   1  NA
#6:  2  NA   1
#7:  3  NA   1
#8:  3  NA   1
#9:  3  NA   1

An option using dplyr would be

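library(dplyr)
library(zoo)  # provides na.locf
df %>% 
   # ind counts the NA cells in each row (ID is never NA here)
   mutate(ind = rowSums(is.na(.))) %>%
   group_by(ID)  %>%
   # mutate_each()/funs() are deprecated in dplyr; across() is the current equivalent
   mutate(across(L40:K50, ~ if (any(ind > 1)) na.locf(.x, na.rm = FALSE) else .x)) %>%
   select(-ind)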
#      ID   L40   K50
#   <dbl> <dbl> <dbl>
#1     1     1    NA
#2     1     1    NA
#3     1     1    NA
#4     1     1    NA 
#5     2     1    NA
#6     2    NA     1
#7     3    NA     1
#8     3    NA     1
#9     3    NA     1
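
For comparison, the rule in the question can be read as: within each ID, a row whose L40 and K50 are both NA takes the values of the nearest earlier row that has at least one value, and every other row is left alone. Below is a minimal base R sketch under that reading. The helper name carry_empty_rows is hypothetical (it is not part of any package), and the sketch assumes the rows are already in the intended order within each ID.

```r
# Data from the question
ID  <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
L40 <- c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 <- c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df  <- data.frame(ID, L40, K50)

# Hypothetical helper: copy the previous non-empty row into fully empty rows
carry_empty_rows <- function(data, id_col, value_cols) {
  # TRUE where every value column is NA in that row
  empty <- unname(rowSums(!is.na(data[value_cols])) == 0)
  idx <- seq_len(nrow(data))
  # Within each ID: index of the most recent non-empty row so far (0 if none yet)
  src <- ave(ifelse(empty, 0L, idx), data[[id_col]], FUN = cummax)
  # Leading empty rows have nothing to copy from, so they point at themselves
  src[src == 0L] <- idx[src == 0L]
  for (col in value_cols) {
    data[[col]] <- data[[col]][src]
  }
  data
}

out <- carry_empty_rows(df, "ID", c("L40", "K50"))
print(out)
```

On the question's data this should reproduce df3. Unlike a per-column fill, the copy is decided per row, so a row that already has a value in K50 never picks up an older L40.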

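Answer 1: (score: 0)

I played around with this problem for a while and, given my limited knowledge of R, came up with the following workaround. For illustration purposes, I added a date column to the original data frame: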

ID = c(1,1,1,1,2,2,2,3,3,3,3)
date = c(1,2,3,4,1,2,3,1,2,3,4)
L40 = c(1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, 1) 
df = data.frame(ID, date, L40, K50)

Here is what I did:

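library(dplyr)       # %>%, filter(), arrange(), group_by(), mutate(), select(), distinct()
library(tidyr)       # gather()
library(data.table)  # data.table(), setkey(), rolling join

#gather the diagnosis columns in rows and keep only those rows where the patient has the associated diagnosis.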
df1 <- df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1) %>% arrange(ID, date)

#concatenate across rows by ID and date to collect all diagnoses of an ID at a particular date.
df2 <- df1 %>% group_by(ID, date) %>% mutate(diag = paste(diagnos, collapse=" ")) %>% select(-diagnos, -dummy)

#convert into data tables in preparation for join
Dt1 <- data.table(df)
Dt2 <- data.table(df2)

setkey(Dt1, ID, date)
setkey(Dt2, ID, date)

#Each observation in Dt1 is matched with the observation in Dt2 with the same ID and date or, if that particular date is not present,
#by the nearest previous date for that ID:
final <- Dt2[Dt1, roll=TRUE] %>% distinct()

This carries the diagnosis names forward until the next observation that has a diagnosis.
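
To make the rolling-join step easier to follow, here is a small isolated sketch of what roll=TRUE does in data.table. The tables diagnoses and visits and their contents are made up for illustration; they are not the data from this answer.

```r
library(data.table)

# Hypothetical lookup table: dates on which a diagnosis string was recorded
diagnoses <- data.table(
  ID   = c(1, 1, 2),
  date = c(1, 3, 2),
  diag = c("L40", "L40 K50", "K50")
)

# Hypothetical table of all observation dates
visits <- data.table(
  ID   = c(1, 1, 1, 1, 2, 2, 2),
  date = c(1, 2, 3, 4, 1, 2, 3)
)

setkey(diagnoses, ID, date)
setkey(visits, ID, date)

# For each row of visits: exact match on ID, and on date either an exact match
# or the nearest earlier date in diagnoses (last observation carried forward)
rolled <- diagnoses[visits, roll = TRUE]
print(rolled)
```

Rows in visits that precede the first diagnosis for their ID get NA in diag, because there is no earlier value to roll forward.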