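I am trying to use R to turn specific string patterns into binary (0/1) indicator columns, for each of three different columns.
Here is what I have: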
have <- structure(list(rep1 = c("china", "na", "bay", "eng", "giad",
"china", "sing", "giad", "na", "china", "china, camp", "guat,camp",
"na", "na", "cis", "trans", "stron, mon"), rep2 = c("china",
"na", "bay", "eng", "giad", "china", "sing", "giad", "na", "china",
"china, camp", "camp", "na", "na", "cis", "trans", "stron, mon"
), rep3 = c("na", "na", "bay", "eng", "giad", "china", "sing",
"giad", "china", "china", "china, camp", "camp", "na", "na",
"cis", "trans", "stron, mon")), row.names = c(NA, -17L), class = c("data.table",
"data.frame"))
This is what I want:
want <- structure(list(rep1 = c("china", "na", "bay", "eng", "giad",
"china", "sing", "giad", "na", "china", "china, camp", "guat,camp",
"na", "na", "cis", "trans", "stron, mon"), rep2 = c("china",
"na", "bay", "eng", "giad", "china", "sing", "giad", "na", "china",
"china, camp", "camp", "na", "na", "cis", "trans", "stron, mon"
), rep3 = c("na", "na", "bay", "eng", "giad", "china", "sing",
"giad", "china", "china", "china, camp", "camp", "na", "na",
"cis", "trans", "stron, mon"), rep1_chi = c(1, 0, 0, 0, 0, 1,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0), rep2_chi = c(1, 0, 0, 0, 0,
1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0), rep3_chi = c(0, 0, 0, 0,
0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0), rep1_bay = c(0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep2_bay = c(0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep3_bay = c(0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep1_gia = c(0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep2_gia = c(0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep3_gia = c(0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep1_sin = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep2_sin = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep3_sin = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA,
-17L))
I was able to build a working solution using ifelse and stringr::str_detect, as follows:
want <- have %>% dplyr::select(rep1, rep2, rep3) %>% mutate(
rep1_chi = ifelse(str_detect(rep1,"chi") == T,1,0),
rep2_chi = ifelse(str_detect(rep2,"chi") == T,1,0),
rep3_chi = ifelse(str_detect(rep3,"chi") == T,1,0),
rep1_bay = ifelse(str_detect(rep1,"bay") == T,1,0),
rep2_bay = ifelse(str_detect(rep2,"bay") == T,1,0),
rep3_bay = ifelse(str_detect(rep3,"bay") == T,1,0),
rep1_gia = ifelse(str_detect(rep1,"gia") == T,1,0),
rep2_gia = ifelse(str_detect(rep2,"gia") == T,1,0),
rep3_gia = ifelse(str_detect(rep3,"gia") == T,1,0),
rep1_sin = ifelse(str_detect(rep1,"sin") == T,1,0),
rep2_sin = ifelse(str_detect(rep2,"sin") == T,1,0),
rep3_sin = ifelse(str_detect(rep3,"sin") == T,1,0))
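My main problem is that this seems very repetitive. Is there a more elegant solution? Given that the "rep" columns are simply numbered 1-3, I suspect there is a better programmatic way to do this.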
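Searching SO, I found an approach using model.matrix that seems to work well when every pattern is needed and only a single column is of interest. I tried to turn it into a function so that multiple columns could be selected, but I would still have to drop the strings for the patterns I am not interested in.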
Answer 0 (score: 2)
Here is an approach using mutate_all. If you only want to do this for specific columns, use mutate_at instead and specify the columns.
library(dplyr)
library(stringr)
mutate_all(have, funs(chi = as.numeric(str_detect(., "chi")),
bay = as.numeric(str_detect(., "bay")),
gia = as.numeric(str_detect(., "gia")),
sin = as.numeric(str_detect(., "sin"))))
A mutate_at example with vars:
want <- have %>% mutate_at(vars(rep1, rep2, rep3), funs(
  chi = as.numeric(str_detect(., "chi")),
  bay = as.numeric(str_detect(., "bay")),
  gia = as.numeric(str_detect(., "gia")),
  sin = as.numeric(str_detect(., "sin"))))
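Since funs() is deprecated in recent dplyr releases, the same idea can be sketched with across(), assuming dplyr 1.0.0 or later. The names have_small, patterns, and detectors below are made up for illustration, and have_small is only a small stand-in built from a few values of the question's data. The resulting column order may differ from the order shown in want.

```r
library(dplyr)
library(stringr)

# Small stand-in for `have` (hypothetical name; values taken from the question)
have_small <- data.frame(
  rep1 = c("china", "na", "bay", "china, camp"),
  rep2 = c("china", "na", "bay", "camp"),
  stringsAsFactors = FALSE
)

patterns <- c("chi", "bay", "gia", "sin")

# Build one detector function per pattern; the list names become column suffixes
detectors <- lapply(patterns, function(p) {
  force(p)  # make sure each closure keeps its own pattern
  function(x) as.numeric(str_detect(x, p))
})
names(detectors) <- patterns

# Apply every detector to every column whose name starts with "rep"
result <- have_small %>%
  mutate(across(starts_with("rep"), detectors, .names = "{.col}_{.fn}"))
```

Adding a new pattern then only requires extending patterns, rather than writing another line per column.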
Answer 1 (score: 1)
Here is some ugly and (performance-wise) inefficient base code in which you do not have to construct the column names yourself:
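library(stringr)

# Patterns to look for (not defined in the original snippet)
pattern <- c("chi", "bay", "gia", "sin")

# Work on a plain data.frame so that [[ ]] indexing returns a vector
want_new <- as.data.frame(have)
colold <- colnames(want_new)
for (p in pattern) {
  # New column names, e.g. "rep1_chi", "rep2_chi", "rep3_chi"
  cname <- paste0(colold, "_", p)
  for (col in cname) {
    # Strip the "_<pattern>" suffix to recover the source column name
    src <- gsub(paste0("_", p), "", col, fixed = TRUE)
    want_new[[col]] <- as.numeric(str_detect(want_new[[src]], p))
  }
}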
This can certainly be improved with some further tweaking.
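One possible tweak, sketched here with base R only, is to enumerate every (column, pattern) pair with expand.grid and compute the flags with grepl instead of nested loops. The names have_small, patterns, combos, flags, and want_small are hypothetical, and have_small is just a small stand-in using a few values from the question.

```r
# Small stand-in for `have` (hypothetical name; values taken from the question)
have_small <- data.frame(
  rep1 = c("china", "na", "giad", "sing"),
  rep2 = c("china", "bay", "giad", "na"),
  stringsAsFactors = FALSE
)
patterns <- c("chi", "bay", "gia", "sin")

# Every (column, pattern) pair; the column varies fastest,
# so the new columns end up grouped by pattern
combos <- expand.grid(col = names(have_small), pat = patterns,
                      stringsAsFactors = FALSE)

# One 0/1 vector per pair; fixed = TRUE treats the pattern as a literal string
flags <- Map(
  function(col, pat) as.numeric(grepl(pat, have_small[[col]], fixed = TRUE)),
  combos$col, combos$pat
)
names(flags) <- paste(combos$col, combos$pat, sep = "_")

want_small <- cbind(have_small, as.data.frame(flags))
```

Because the column names are derived from combos, there is no need to strip suffixes to recover the source column, which was the awkward part of the loop version.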