Suppose each row of a column (letter_strings) holds a variable number of comma-separated strings. For example:
letter_strings
abc, def, ghi, jkl
mno, pqr
stu, vw, xyz
I want to look up each of these strings in a data frame:
letter_strings code
abc YES
def NO
ghi MAYBE
jkl SURE
mno PERHAPS
pqr ALWAYS
stu NEVER
vw NOGO
xyz ABSENT
and get the following corresponding rows in an additional column:
YES, NO, MAYBE, SURE
PERHAPS, ALWAYS
NEVER, NOGO, ABSENT
Is this feasible in R? I really don't know how to approach this problem...
Thanks in advance!
Answer 0 (score: 3)
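1) gsubfn  gsubfn is like gsub except that the replacement may be a list: each match of the regular expression (here "\\w+", i.e. a run of word characters) in the target string is looked up among the names of the list lookup and replaced by the corresponding value in lookup.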
library(gsubfn)
lookup <- with(DF2, as.list(setNames(code, letter_strings)))
transform(DF1, codes = gsubfn("\\w+", lookup, letter_strings))
giving:
letter_strings codes
1 abc, def, ghi, jkl YES, NO, MAYBE, SURE
2 mno, pqr PERHAPS, ALWAYS
3 stu, vw, xyz NEVER, NOGO, ABSENT
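For readers who do not have gsubfn installed, roughly the same lookup-and-replace can be sketched in base R with gregexpr and the replacement form of regmatches. This is an illustrative alternative, not part of the answer above. The sample DF1 and DF2 are rebuilt inline, and the helper name translate_tokens is hypothetical. Tokens that are missing from the lookup are not handled in this sketch.

```r
# Sample data rebuilt inline so the sketch is self-contained.
DF1 <- data.frame(
  letter_strings = c("abc, def, ghi, jkl", "mno, pqr", "stu, vw, xyz"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  letter_strings = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vw", "xyz"),
  code = c("YES", "NO", "MAYBE", "SURE", "PERHAPS", "ALWAYS", "NEVER", "NOGO", "ABSENT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace every run of word characters in x by its
# value in the named character vector lookup.
translate_tokens <- function(x, lookup) {
  m <- gregexpr("\\w+", x)
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(tokens) unname(lookup[tokens])
  )
  x
}

# Named vector: names are the strings to find, values are their codes.
lookup <- setNames(DF2$code, DF2$letter_strings)
codes <- translate_tokens(DF1$letter_strings, lookup)
```

Because only the matched tokens are replaced, the separators (", ") between them are left exactly as they were in the input.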
2) dplyr/tidyr  Convert DF1 to long form, join it with DF2 and then reshape it back to the original form:
library(dplyr)
library(tidyr)
DF1 %>%
mutate(id = 1:n()) %>%
separate_rows(letter_strings) %>%
left_join(DF2) %>%
group_by(id) %>%
summarise(letter_string = toString(letter_strings), codes = toString(code)) %>%
ungroup %>%
select(-id)
giving:
Joining, by = "letter_strings"
# A tibble: 3 x 2
letter_string codes
<chr> <chr>
1 abc, def, ghi, jkl YES, NO, MAYBE, SURE
2 mno, pqr PERHAPS, ALWAYS
3 stu, vw, xyz NEVER, NOGO, ABSENT
3) strsplit/merge/aggregate  Use strsplit to split the strings in DF1 and stack to convert them to the long form st. Then merge st with DF2 and aggregate back to the original form. No packages are used.
s <- strsplit(DF1$letter_strings, ", ")
st <- stack(setNames(s, seq_along(s)))
m <- merge(st, DF2, by = 1, all.x = TRUE, all.y = FALSE)
aggregate(. ~ ind, m, toString)[-1]
giving:
values code
1 abc, def, ghi, jkl YES, NO, MAYBE, SURE
2 mno, pqr PERHAPS, ALWAYS
3 stu, vw, xyz NEVER, NOGO, ABSENT
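One caveat with (3): merge sorts its result on the key column by default, so the order of codes within each row follows the sorted keys. That coincides with the original order here only because the sample strings happen to be alphabetical. If the order within each row matters, a base R sketch built on match keeps it. This is a swapped-in technique, not the approach above, and the helper name lookup_codes is hypothetical. The sample DF1 below is deliberately not alphabetical.

```r
# Lookup table from the question, rebuilt inline.
DF2 <- data.frame(
  letter_strings = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vw", "xyz"),
  code = c("YES", "NO", "MAYBE", "SURE", "PERHAPS", "ALWAYS", "NEVER", "NOGO", "ABSENT"),
  stringsAsFactors = FALSE
)
# Assumed test input whose tokens are NOT in alphabetical order.
DF1 <- data.frame(
  letter_strings = c("jkl, abc", "vw, stu, mno"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each string on ", ", find each token's
# position in keys with match(), and paste the matching values back.
lookup_codes <- function(strings, keys, values) {
  vapply(
    strsplit(strings, ", ", fixed = TRUE),
    function(tokens) toString(values[match(tokens, keys)]),
    character(1)
  )
}

DF1$codes <- lookup_codes(DF1$letter_strings, DF2$letter_strings, DF2$code)
```

Since match returns positions in the order of its first argument, the codes come out in the same order as the tokens in each row.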
3a) magrittr  This can be expressed using magrittr:
library(magrittr)
DF1 %>%
"$"("letter_strings") %>%
strsplit(", ") %>%
setNames(seq_along(.)) %>%
stack %>%
merge(DF2, by = 1, all.x = TRUE, all.y = FALSE) %>%
aggregate(. ~ ind, ., toString) %>%
"["(-1)
The same computation as (3) can also be written more compactly without a pipe:
s <- stack(setNames(strsplit(DF1$letter_strings, ", "), 1:nrow(DF1)))
m <- merge(s, DF2, by = 1, all.x = TRUE, all.y = FALSE)
aggregate(. ~ ind, m, toString)[-1]
4) data.table  Note that in the comments below, @Uwe provides a data.table version of the approach used in (2) and (3): convert to long form, join, and convert back.
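The comment itself is not reproduced here, so the following is only a minimal sketch of what a data.table version of the long-form/join/collapse idea might look like. It is not @Uwe's code. It assumes the data.table package is installed; the intermediate names DT1, DT2, long and res are illustrative.

```r
library(data.table)

# Sample data rebuilt inline so the sketch is self-contained.
DF1 <- data.frame(
  letter_strings = c("abc, def, ghi, jkl", "mno, pqr", "stu, vw, xyz"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  letter_strings = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vw", "xyz"),
  code = c("YES", "NO", "MAYBE", "SURE", "PERHAPS", "ALWAYS", "NEVER", "NOGO", "ABSENT"),
  stringsAsFactors = FALSE
)

DT1 <- as.data.table(DF1)
DT1[, id := .I]                      # row id to regroup on later
DT2 <- as.data.table(DF2)

# Long form: one row per (id, token).
long <- DT1[, .(letter_strings = unlist(strsplit(letter_strings, ", ", fixed = TRUE))),
            by = id]

# Update join: add the code column to long by reference, keeping row order.
long[DT2, on = "letter_strings", code := i.code]

# Collapse back to one row per original row.
res <- long[, .(letter_strings = toString(letter_strings),
                codes = toString(code)),
            by = id][, .(letter_strings, codes)]
```

The update join modifies long in place instead of producing a reordered result, so the tokens stay in their original order within each id.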
Note: The input in reproducible form is:
Lines1 <- "
letter_strings
abc, def, ghi, jkl
mno, pqr
stu, vw, xyz"
DF1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, sep = ";")
Lines2 <- "
letter_strings code
abc YES
def NO
ghi MAYBE
jkl SURE
mno PERHAPS
pqr ALWAYS
stu NEVER
vw NOGO
xyz ABSENT"
DF2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE)
Answer 1 (score: 0)
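If there are not too many letter strings, you can do this with gsub in a loop. Here letter_strings is the character vector to translate and df is the lookup data frame with columns letter_strings and code.
# Start from the original strings and substitute one lookup entry per pass
Temp = letter_strings
for(i in seq_len(nrow(df))) {
  Temp = gsub(df$letter_strings[i], df$code[i], Temp)
}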
Temp
[1] "YES, NO, MAYBE, SURE" "PERHAPS, ALWAYS" "NEVER, NOGO, ABSENT"
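A plain gsub pattern also matches inside longer tokens, so a key such as "ab" would be replaced inside "abc". The sketch below wraps each key in word boundaries (\\b) to avoid that. It assumes the keys consist only of word characters (no regex metacharacters); the function name translate_bounded and the small test vectors are hypothetical.

```r
# Hypothetical helper: loop over the lookup entries as in the answer above,
# but only replace a key when it appears as a whole word.
translate_bounded <- function(x, keys, values) {
  for (i in seq_along(keys)) {
    pattern <- paste0("\\b", keys[i], "\\b")
    x <- gsub(pattern, values[i], x, perl = TRUE)
  }
  x
}

# Assumed toy data in which one key is a prefix of another.
keys   <- c("ab", "abc")
values <- c("X", "Y")
x      <- "abc, ab"

# Unbounded loop for comparison: "ab" is replaced inside "abc".
naive <- x
for (i in seq_along(keys)) naive <- gsub(keys[i], values[i], naive)

bounded <- translate_bounded(x, keys, values)
```

With this toy input, naive ends up as "Xc, X" while bounded is "Y, X". A replacement value that itself equals a later key would still be replaced again, which the loop approach does not guard against.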