Question

I have a dataset with this structure:

ID = c(1,1,1,1,2,2,2,3,3,3,3) 
L40 = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA) 
K50 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1) 
df = data.frame(ID, L40, K50)

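When missing values occur in columns L40 and K50, I want to carry forward the last non-missing value in that column, provided that the ID is the same as in the previous row and that both L40 and K50 are missing in the current row. I tried the following code: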

library(dplyr)  # provides %>% and group_by()
library(tidyr)  # provides fill()
df2 <- df %>% group_by(ID) %>% fill(L40:K50)

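This does not give the result I want. The previous non-missing value should be carried forward to the next row only when all other columns in that row (apart from ID) are missing. This is what I want: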

ID = c(1,1,1,1,2,2,2,3,3,3,3) 
L40 = c(1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA)
K50 = c(NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1)  
df3 = data.frame(ID, L40, K50)

Answer 1

We can use na.locf from the zoo package:

library(data.table)
library(zoo)
setDT(df)[, if(any(is.na(K50[-1]))) lapply(.SD, na.locf) else .SD , by = ID]
#   ID L40 K50
#1:  1   1  NA
#2:  1   1  NA
#3:  1   1  NA
#4:  1   1  NA
#5:  2   1  NA
#6:  2  NA   1
#7:  3  NA   1
#8:  3  NA   1
#9:  3  NA   1
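
A small aside on na.locf, as a sketch with a made-up vector x that is not part of the question's data: by default it drops leading NAs, so the result can be shorter than the input, while na.rm = FALSE keeps the original length. Grouped, column-wise operations generally need the length-preserving form.

```r
library(zoo)

# Hypothetical vector with a leading NA.
x <- c(NA, 1, NA, NA, 2, NA)

# Default (na.rm = TRUE): the leading NA is dropped, so the result is shorter.
short <- na.locf(x)

# na.rm = FALSE keeps the leading NA, so the length matches the input.
full <- na.locf(x, na.rm = FALSE)

length(short)
length(full)
```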

An option using dplyr would be:

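library(dplyr)
library(zoo)  # provides na.locf()
# ind counts the NAs in each row. Note that mutate_each() and funs() are
# deprecated in newer dplyr releases, where across() takes their place.
df %>% 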
   mutate(ind = rowSums(is.na(.))) %>%
   group_by(ID)  %>%
   mutate_each(funs(if(any(ind>1)) na.locf(., na.rm=FALSE) else .), L40:K50) %>%
   select(-ind)
#      ID   L40   K50
#   <dbl> <dbl> <dbl>
#1     1     1    NA
#2     1     1    NA
#3     1     1    NA
#4     1     1    NA 
#5     2     1    NA
#6     2    NA     1
#7     3    NA     1
#8     3    NA     1
#9     3    NA     1
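
For comparison, here is a minimal base R sketch of the row-wise condition described in the question: a row is filled only when every diagnosis column in it is NA, and it then takes all its values from the most recent row of the same ID that has at least one observed value. The helper names cols, observed, src and fill_rows are made up for this illustration, and this is one possible reading of the requirement rather than the method used above.

```r
ID  <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
L40 <- c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 <- c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df  <- data.frame(ID, L40, K50)

cols <- c("L40", "K50")

# TRUE for rows where at least one diagnosis column is observed.
observed <- rowSums(!is.na(df[cols])) > 0

# Row number of the most recent observed row within the same ID
# (0 when the ID has no observed row yet).
src <- ave(ifelse(observed, seq_len(nrow(df)), 0L), df$ID, FUN = cummax)

# Fill only the all-NA rows that have an observed predecessor in their ID.
fill_rows <- !observed & src > 0
df3 <- df
df3[fill_rows, cols] <- df[src[fill_rows], cols]
df3
```

With the question's data this should reproduce the desired df3: row 7 inherits K50 from row 6 while L40 stays NA, and row 8 stays empty because ID 3 has no earlier observation.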

Answer 2

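I played around with this problem for a while and, with my limited knowledge of R, came up with the following workaround. For illustration, I added a date column to the original data frame: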

ID = c(1,1,1,1,2,2,2,3,3,3,3)
date = c(1,2,3,4,1,2,3,1,2,3,4)
L40 = c(1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, 1) 
df = data.frame(ID, date, L40, K50)

Here is what I did:

library(dplyr)
library(tidyr)
library(data.table)

#gather the diagnosis columns in rows and keep only those rows where the patient has the associated diagnosis.
df1 <- df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1) %>% arrange(ID, date)

#concatenate across rows by ID and date to collect all diagnoses of an ID at a particular date.
df2 <- df1 %>% group_by(ID, date) %>% mutate(diag = paste(diagnos, collapse=" ")) %>% select(-diagnos, -dummy)

#convert into data tables in preparation for join
Dt1 <- data.table(df)
Dt2 <- data.table(df2)

setkey(Dt1, ID, date)
setkey(Dt2, ID, date)

#Each observation in Dt1 is matched with the observation in Dt2 with the same ID and date or, if that particular date is not present,
#with the nearest previous date:
final <- Dt2[Dt1, roll=TRUE] %>% distinct()

This carries the name of the diagnosis forward until the next observation that has a diagnosis.
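
To make the rolling join easier to follow, here is a small sketch on toy data. The tables dx and obs and their values are invented for illustration and are not the data above. With roll = TRUE, each row of obs is matched to the dx row with the same ID and date or, when that date is absent, to the nearest earlier one.

```r
library(data.table)

# Hypothetical toy data: diagnoses are recorded on dates 1 and 3 only.
dx  <- data.table(ID = c(1, 1), date = c(1, 3), diag = c("L40", "K50"))
obs <- data.table(ID = c(1, 1, 1, 1), date = c(1, 2, 3, 4))

setkey(dx, ID, date)
setkey(obs, ID, date)

# One result row per row of obs; dates 2 and 4 have no exact match,
# so they take the diagnosis from the nearest earlier date.
rolled <- dx[obs, roll = TRUE]
rolled$diag
```

Here dates 2 and 4 pick up "L40" and "K50" respectively, which is the carry-forward behaviour the join relies on.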
