Dummy variable columns based on strings in another column

Time: 2019-07-18 21:13:47

Tags: r function dplyr dummy-variable

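I have a database containing patient numbers and the treatments each patient received. I would like a dummy column for each distinct individual treatment (for example, whether the patient received treatment A, B, C, or D).

The example below is simplified: I actually have more than 20 treatments and thousands of patients, and I can't think of a simple way to do this.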

example <- data.frame(id_number = c(0, 1, 2, 3, 4), 
                      treatment = c("A", "A+B+C+D", "C+B", "B+A", "C"))

I would like something like this:

desired_result <- data.frame(id_number = c(0, 1, 2, 3, 4), 
                             treatment = c("A", "A+B+C+D", "C+B", "B+A","C"),
                             A=c(1,1,0,1,0), 
                             B=c(0,1,1,1,0),
                             C=c(0,1,1,0,1),
                             D=c(0,1,0,0,0))

2 answers:

Answer 0: (score: 3)

A base R version:

example["A"] <- as.numeric(grepl("A", example[,"treatment"]))
example["B"] <- as.numeric(grepl("B", example[,"treatment"]))
example["C"] <- as.numeric(grepl("C", example[,"treatment"]))
example["D"] <- as.numeric(grepl("D", example[,"treatment"]))

example

  id_number treatment A B C D
1         0         A 1 0 0 0
2         1   A+B+C+D 1 1 1 1
3         2       C+B 0 1 1 0
4         3       B+A 1 1 0 0
5         4         C 0 0 1 0

grepl tests each row for the presence of the given pattern, and as.numeric converts the resulting logical TRUE/FALSE to 1/0.
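With 20+ treatments, writing one grepl line per treatment gets tedious, and a pattern such as "A" would also match inside a longer treatment name that happens to contain that letter. The following is a minimal sketch of one way to generalize the idea, assuming the treatments are always separated by a literal "+". The names `parts`, `treatments`, and `trt` are made up for illustration and are not part of the original answer.

```r
# Example data from the question; stringsAsFactors = FALSE keeps
# treatment as character on R versions before 4.0 as well.
example <- data.frame(id_number = c(0, 1, 2, 3, 4),
                      treatment = c("A", "A+B+C+D", "C+B", "B+A", "C"),
                      stringsAsFactors = FALSE)

# Split each treatment string on the literal "+" (fixed = TRUE disables regex).
parts <- strsplit(example$treatment, "+", fixed = TRUE)

# Every distinct treatment found in the data, sorted for a stable column order.
treatments <- sort(unique(unlist(parts)))

# For each treatment, flag the rows whose split parts contain it exactly.
for (trt in treatments) {
  example[[trt]] <- as.numeric(vapply(parts, function(p) trt %in% p, logical(1)))
}

example
```

Because the comparison uses %in% on the split parts rather than a pattern search on the whole string, each treatment name is matched exactly, and the list of columns is derived from the data instead of being typed by hand.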

Answer 1: (score: 2)

One tidyverse possibility:

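library(dplyr)
library(tidyr)

example %>%
 # as.character() guards against treatment being a factor (the default before R 4.0)
 mutate(treatment2 = strsplit(as.character(treatment), "+", fixed = TRUE)) %>%
 unnest(treatment2) %>%              # one row per patient-treatment pair
 spread(treatment2, treatment2) %>%  # one column per treatment, NA where absent
 mutate_at(vars(-id_number, -treatment), ~ (!is.na(.)) * 1)  # present -> 1, NA -> 0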

  id_number treatment A B C D
1         0         A 1 0 0 0
2         1   A+B+C+D 1 1 1 1
3         2       C+B 0 1 1 0
4         3       B+A 1 1 0 0
5         4         C 0 0 1 0

Or:

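example %>%
 # as.character() guards against treatment being a factor (the default before R 4.0)
 mutate(treatment2 = strsplit(as.character(treatment), "+", fixed = TRUE)) %>%
 unnest(treatment2) %>%              # one row per patient-treatment pair
 mutate(val = 1) %>%                 # marker value for "treatment received"
 spread(treatment2, val, fill = 0)   # one column per treatment, 0 where absent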
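
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). As a rough sketch of the same idea with the newer verbs (not part of the original answer, and assuming dplyr and tidyr >= 1.0.0 are installed), separate_rows() can do the splitting and pivot_wider() the reshaping. The names `treatment2`, `val`, and `result` are only illustrative.

```r
library(dplyr)
library(tidyr)

# Example data from the question
example <- data.frame(id_number = c(0, 1, 2, 3, 4),
                      treatment = c("A", "A+B+C+D", "C+B", "B+A", "C"),
                      stringsAsFactors = FALSE)

result <- example %>%
  mutate(treatment2 = treatment) %>%            # keep the original string column
  separate_rows(treatment2, sep = "\\+") %>%    # one row per patient-treatment pair
  mutate(val = 1) %>%                           # marker value for "treatment received"
  pivot_wider(names_from = treatment2,
              values_from = val,
              values_fill = list(val = 0))      # 0 where the treatment is absent

result
```

Note that sep in separate_rows() is a regular expression, so the plus sign has to be escaped. New columns are created in order of first appearance, which for this data happens to be A, B, C, D.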