Recoding variables using dplyr

Time: 2017-08-21 14:04:43

Tags: r dplyr tidyverse

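In my dataset, columns DX1 through DX25 hold ICD-9 codes, including the codes for pulmonary embolism. I want to create a separate pulmonary embolism (PE) column in my dataset (xs) from those ICD-9 codes: if a PE code appears anywhere in DX1 to DX25, I want to code it as 1, otherwise 0. I also don't want to use a for loop, because my dataset has 30,000 rows and is 7 GB, so it crashes my computer. Currently I'm doing this with the following code: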

xs$PE <- NA

xs[which(xs$DX1 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX2 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX3 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX4 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX5 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX6 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX7 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX8 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX9 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX10 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX11 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX12 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX13 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX14 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX15 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX16 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX17 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX18 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX19 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX20 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX21 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX22 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX23 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX24 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1
xs[which(xs$DX25 %in% c("41511", "41512", "41513", "41519")), "PE"] <- 1

xs <- xs %>%
  mutate(PE = ifelse(is.na(PE), 0, PE))

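Can anyone help me recode this in a smarter way using dplyr? [I don't want to type 100 lines of code, because I have other diagnoses to code and new columns to create.]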
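For reference, the 25 repeated statements above can probably be collapsed into a single dplyr expression. The following is only a minimal sketch, assuming dplyr 1.0.4 or later (where `if_any()` is available); `xs_demo` and `pe_codes` are hypothetical names, and the toy data frame has three DX columns instead of 25.

```r
library(dplyr)

# Toy stand-in for xs (hypothetical data, three DX columns instead of 25)
xs_demo <- data.frame(
  DX1 = c("41511", "4280", "25000"),
  DX2 = c("4019", "41519", NA),
  DX3 = c(NA, "5849", "4019"),
  stringsAsFactors = FALSE
)

# ICD-9 codes for pulmonary embolism, taken from the question
pe_codes <- c("41511", "41512", "41513", "41519")

# if_any() is TRUE for a row when the test holds in at least one selected column;
# %in% returns FALSE for NA, so missing diagnoses never produce NA in PE
xs_demo <- xs_demo %>%
  mutate(PE = as.integer(if_any(matches("^DX[0-9]+$"), ~ .x %in% pe_codes)))

xs_demo$PE
```

Because `PE` is computed directly as 0 or 1, the separate step that replaces `NA` with 0 should no longer be needed.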

2 answers:

Answer 0 (score: 0)

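When I have done similar work in the past, I used a combination of tidyr and dplyr.

My first step is to reshape the data to long format and then look for the diagnoses. The dataset can then be reduced to a single row per id. If needed, that result can be merged back into the original dataset. However, I have not tried this on a dataset as large as yours.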

library(tidyr)
library(dplyr)

dat <- data.frame(id = c(1,2,3), 
                  diagnosis1 = c("001", "001", "005"), 
                  diagnosis2 = c("002", "002", "002"), 
                  diagnosis3 = c("003", "005", "004"),
                  stringsAsFactors = FALSE)

dat
  id diagnosis1 diagnosis2 diagnosis3
1  1        001        002        003
2  2        001        002        005
3  3        005        002        004

# reshape long
long_dat <- gather(dat, key = diagnosis_code, value = string, -id)
head(long_dat)
  id diagnosis_code string
1  1     diagnosis1    001
2  2     diagnosis1    001
3  3     diagnosis1    005
4  1     diagnosis2    002
5  2     diagnosis2    002
6  3     diagnosis2    002

long_dat <- long_dat %>% 
  mutate(has_001 = ifelse(string == "001", 1, 0),
         has_4_or_5 = ifelse(string %in% c("004", "005"), 1, 0)
         )

# reduce to single line per id
long_dat <- long_dat %>%
  group_by(id) %>% 
  summarise(has_001 = max(has_001), 
            has_4_or_5 = max(has_4_or_5))
long_dat
# A tibble: 3 x 3
     id has_001 has_4_or_5
  <dbl>   <dbl>      <dbl>
1     1       1          0
2     2       1          1
3     3       0          1

dat <- left_join(dat, long_dat, by = "id")
dat
  id diagnosis1 diagnosis2 diagnosis3 has_001 has_4_or_5
1  1        001        002        003       1          0
2  2        001        002        005       1          1
3  3        005        002        004       0          1
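The same reshape-then-summarise idea can be applied to the DX columns from the question. This is a rough sketch rather than a tested solution for a 7 GB dataset: it assumes tidyr 1.0.0 or later (for `pivot_longer()`, which supersedes `gather()`) and a unique `id` column, which the question does not mention. `dat2`, `flags`, `position`, and `code` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Toy data with an assumed unique id column and two DX columns
dat2 <- data.frame(id = c(1, 2, 3),
                   DX1 = c("41511", "4280", "25000"),
                   DX2 = c("4019", "41519", "4019"),
                   stringsAsFactors = FALSE)

# ICD-9 codes for pulmonary embolism, taken from the question
pe_codes <- c("41511", "41512", "41513", "41519")

# One row per id and diagnosis position, then collapse back to one row per id
flags <- dat2 %>%
  pivot_longer(cols = starts_with("DX"),
               names_to = "position", values_to = "code") %>%
  group_by(id) %>%
  summarise(PE = as.integer(any(code %in% pe_codes)), .groups = "drop")

# Attach the flag to the original wide data
dat2 <- left_join(dat2, flags, by = "id")
dat2
```

Using `any()` inside `summarise()` skips the intermediate 0/1 column and the `max()` step. Note that the long table has one row per id and DX column, so on the real data it would hold 25 times as many rows as the original.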

Answer 1 (score: 0)

This is how I coded it.

Create new columns for different diseases from the ICD-9 codes [DX1 to DX25]. 1: DTV [deep vein thrombosis]

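# Select the 25 diagnosis columns
y <- xs[paste0('DX', 1:25)]
# Replace every cell with TRUE/FALSE: is it one of the deep vein thrombosis codes?
y[] <- as.matrix(y) %in% c("4532", "4533", "45340", "45341", "45342", "45382", "45383", "45384", "45385", "45386", "45387", "45388", "45389", "4539")
# 1 if any diagnosis column matched in that row, otherwise 0
xs$DTV <- 1 * (rowSums(y) > 0)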
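Since several diagnoses need to be coded, the matrix approach above could be wrapped in a small function and applied to a named list of code sets. The sketch below is only one way to do that: `flag_any`, `code_sets`, `dx_cols`, and `xs_demo` are hypothetical names, and the toy data has two DX columns instead of 25.

```r
# Hypothetical helper: 1 if any of the given columns holds one of the codes, else 0
flag_any <- function(df, cols, codes) {
  m <- as.matrix(df[cols])
  # %in% returns a flat logical vector in column-major order; restore the matrix shape
  hit <- matrix(m %in% codes, nrow = nrow(m))
  as.integer(rowSums(hit) > 0)
}

# Toy stand-in for xs (hypothetical data)
xs_demo <- data.frame(
  DX1 = c("41511", "4532", "25000"),
  DX2 = c("4539", "4019", "4019"),
  stringsAsFactors = FALSE
)

dx_cols <- paste0("DX", 1:2)  # paste0("DX", 1:25) on the real data

# One entry per new column; the codes are copied from the question and this answer
code_sets <- list(
  PE  = c("41511", "41512", "41513", "41519"),
  DTV = c("4532", "4533", "45340", "45341", "45342", "45382", "45383",
          "45384", "45385", "45386", "45387", "45388", "45389", "4539")
)

# Adds one 0/1 column per code set, named after the list entry
xs_demo[names(code_sets)] <- lapply(code_sets, flag_any, df = xs_demo, cols = dx_cols)
xs_demo
```

Adding another diagnosis then only means adding another entry to `code_sets`. The `lapply()` call iterates over the code sets, not over the rows of the data.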