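My dataset (xs) has ICD-9 diagnosis codes in columns DX1 through DX25, and I want to create a separate pulmonary embolism (PE) column from them. If any of the PE codes appears anywhere in DX1 to DX25, PE should be coded as 1, otherwise 0. I also don't want to use a for loop: the dataset has 30,000 rows and is 7 GB, so looping crashes my computer. At the moment I am doing it with the following code: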
xs$PE = NA
xs[which(xs$DX1%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX2%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX3%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX4%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX5%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX6%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX7%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX8%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX9%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX10%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX11%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX12%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX13%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX14%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX15%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX16%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX17%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX18%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX19%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX20%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX21%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX22%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX23%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX24%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX25%in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs <- xs %>%
  mutate(PE = ifelse(is.na(PE), 0, PE))
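Can anyone help me recode this in a smarter way using dplyr? [I don't want to type 100 lines of code, because I have other diagnoses to code into new columns as well.]
Answer 0 (score: 0)
When I have done similar work in the past, I used a combination of tidyr and dplyr.
My first step is to reshape the data to long format and then look for the diagnoses. The dataset can then be reduced to a single row per id and, if needed, merged back into the original dataset. However, I have not tried this on a dataset as large as yours.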
library(tidyr)
library(dplyr)
dat <- data.frame(id = c(1,2,3),
                  diagnosis1 = c("001", "001", "005"),
                  diagnosis2 = c("002", "002", "002"),
                  diagnosis3 = c("003", "005", "004"),
                  stringsAsFactors = FALSE)
dat
id diagnosis1 diagnosis2 diagnosis3
1 1 001 002 003
2 2 001 002 005
3 3 005 002 004
# reshape long
long_dat <- gather(dat, key = diagnosis_code, value = string, -id)
head(long_dat)
id diagnosis_code string
1 1 diagnosis1 001
2 2 diagnosis1 001
3 3 diagnosis1 005
4 1 diagnosis2 002
5 2 diagnosis2 002
6 3 diagnosis2 002
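In current tidyr, gather() still works but is superseded by pivot_longer(). As a minimal sketch, assuming tidyr 1.0.0 or later, the same reshape could be written as follows. The name long_alt is hypothetical and only used here so that long_dat is not overwritten.

```r
library(tidyr)

# Same toy data as above, repeated so the sketch runs on its own
dat <- data.frame(id = c(1, 2, 3),
                  diagnosis1 = c("001", "001", "005"),
                  diagnosis2 = c("002", "002", "002"),
                  diagnosis3 = c("003", "005", "004"),
                  stringsAsFactors = FALSE)

# Every column except id becomes a (diagnosis_code, string) pair
long_alt <- pivot_longer(dat,
                         cols = -id,
                         names_to = "diagnosis_code",
                         values_to = "string")
```

The result holds the same nine id/code pairs, but the rows are ordered by id rather than by diagnosis column, which does not matter for the grouped summary that follows.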
# flag each long-format row that carries a diagnosis of interest
long_dat <- long_dat %>%
  mutate(has_001 = ifelse(string == "001", 1, 0),
         has_4_or_5 = ifelse(string %in% c("004", "005"), 1, 0)
  )
# reduce to single line per id
long_dat <- long_dat %>%
  group_by(id) %>%
  summarise(has_001 = max(has_001),
            has_4_or_5 = max(has_4_or_5))
long_dat
# A tibble: 3 x 3
id has_001 has_4_or_5
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 1
3 3 0 1
# merge the flags back into the original wide data
dat <- left_join(dat, long_dat, by = "id")
dat
id diagnosis1 diagnosis2 diagnosis3 has_001 has_4_or_5
1 1 001 002 003 1 0
2 2 001 002 005 1 1
3 3 005 002 004 0 1
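If reshaping a 7 GB table to long format is too costly, the same flag can probably be built directly on the wide data. The following is only a sketch, assuming dplyr 1.0.4 or later (where if_any() is available). xs_demo is a hypothetical three-column stand-in for xs, and its non-PE codes are made-up filler values.

```r
library(dplyr)

# Hypothetical toy stand-in for xs; the real data has DX1 to DX25
xs_demo <- data.frame(DX1 = c("41511", "4280", "25000"),
                      DX2 = c("4019", "41519", NA),
                      DX3 = c(NA, "2724", "4019"),
                      stringsAsFactors = FALSE)

# PE codes taken from the question
pe_codes <- c("41511", "41512", "41513", "41519")

# if_any() is TRUE for a row when any selected column matches;
# %in% returns FALSE for NA, so missing diagnoses never count as a hit
xs_demo <- xs_demo %>%
  mutate(PE = as.integer(if_any(DX1:DX3, ~ .x %in% pe_codes)))
```

On the real data the selection would be DX1:DX25, or num_range("DX", 1:25) if the columns are not contiguous. Each further diagnosis then needs only one extra mutate() argument with its own code vector.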
Answer 1 (score: 0)
This is how I coded it.
Create new columns for the different diseases from the ICD-9 codes in DX1 to DX25. 1: DTV [deep vein thrombosis].
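# select the 25 diagnosis columns
y <- xs[paste0('DX', 1:25)]
# replace every cell with TRUE/FALSE: is it one of the deep vein thrombosis codes?
y[] <- as.matrix(y) %in% c("4532", "4533", "45340", "45341", "45342", "45382", "45383", "45384", "45385", "45386", "45387", "45388", "45389", "4539")
# a row is flagged 1 if at least one of its diagnosis columns matched
xs$DTV <- 1 * (rowSums(y) > 0)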
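Since the question mentions several more diagnoses to code, the matrix approach above could be wrapped in a small function and applied to a named list of code sets. This is a sketch only: flag_dx, code_sets and xs_demo are hypothetical names, xs_demo is a two-column toy stand-in for xs, and the DTV code list is shortened for the example.

```r
# Hypothetical helper: 1 if any of the columns `cols` holds one of `codes`
flag_dx <- function(data, codes, cols) {
  m <- as.matrix(data[cols])
  # %in% drops the matrix shape, so restore it before summing by row
  hit <- matrix(m %in% codes, nrow = nrow(m))
  as.integer(rowSums(hit) > 0)
}

# Hypothetical toy stand-in for xs; the real data has DX1 to DX25
xs_demo <- data.frame(DX1 = c("41511", "4280", "4539"),
                      DX2 = c("4019", "45340", NA),
                      stringsAsFactors = FALSE)
dx_cols <- paste0("DX", 1:2)

# One named entry per new column
code_sets <- list(
  PE  = c("41511", "41512", "41513", "41519"),
  DTV = c("4532", "4533", "45340", "4539")  # shortened list for the sketch
)

# Each code set is passed to flag_dx() as its `codes` argument
xs_demo[names(code_sets)] <- lapply(code_sets, flag_dx,
                                    data = xs_demo, cols = dx_cols)
```

Adding another diagnosis then means adding one entry to code_sets rather than another block of near-identical lines.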