I have two data frames, df1 and df2:
df1<- structure(list(sample_1 = structure(c(7L, 6L, 5L, 1L, 2L, 4L,
3L), .Label = c("P41182;Q9HCP0", "Q09472", "Q9Y6H1;Q5T1J5", "Q9Y6I3",
"Q9Y6Q9", "Q9Y6U3", "Q9Y6W5"), class = "factor"), sample_2 = structure(c(7L,
6L, 4L, 3L, 2L, 5L, 1L), .Label = c("O15143", "P31908", "P3R117",
"P41356;P54612;A41PH2", "P54112", "P61809;Q92831", "Q16835"), class = "factor")), .Names = c("sample_1",
"sample_2"), class = "data.frame", row.names = c(NA, -7L))
df2<- structure(list(subunits..UniProt.IDs. = structure(c(4L, 6L, 5L,
12L, 3L, 9L, 14L, 16L, 15L, 11L, 13L, 8L, 1L, 2L, 10L, 7L), .Label = c("O55102,Q9CWG9,Q5U5M8,Q8VED2,Q91WZ8,Q8R015,Q9R0C0,Q9Z266",
"P30561,O08915,P07901,P11499", "P30561,P53762", "P41182,P56524",
"P41182,Q8WUI4", "P41182,Q9UQL6", "P61160,P61158,O15143,O15144,O15145,P59998,O15511",
"P78537,Q6QNY1,Q6QNY0,Q9NUP1,Q96EV8,Q8TDH9,Q9UL45,O95295", "Q15021,Q9BPX3,Q15003,O95347,Q9NTJ3",
"Q8WMR7,(P67776,P11493),(P54612,P54613)", "Q91VB4,P59438,Q8BLY7",
"Q92793,Q09472,Q9Y6Q9,Q92831", "Q92828,Q13227,O15379,O75376,O60907,Q9BZK7",
"Q92902,Q9NQG7", "Q92903,Q96NY9", "Q969F9,Q9UPZ3,Q86YV9"), class = "factor")), .Names = "subunits..UniProt.IDs.", class = "data.frame", row.names = c(NA,
-16L))
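I want to look at each semicolon-separated string in df1 and, if it matches one of the comma-separated strings in df2, remove it. So my output would look like this: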
       sample_1      sample_2
1        Q9Y6W5        Q16835
2        Q9Y6U3        P61809
3               P41356;A41PH2
4        Q9HCP0        P3R117
5                      P31908
6        Q9Y6I3        P54112
7 Q9Y6H1;Q5T1J5
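In sample_1, the strings in rows 3, 4 and 5 each matched one of the strings in df2, so those matching strings were removed.
In sample_2, rows 2, 3 and 7 contained strings matching ones in df2, so those matching strings were removed.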
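Answer 0 (score: 2)
First off, you should almost certainly rearrange your data so that it is tidy, i.e. with one column per variable and one row per observation. Without knowing what the data is or how it is related, though, I can't do that for you. So the only way forward is to hack on what are effectively list columns: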
library(dplyr)
# For each column,
df1 %>% mutate_each(funs(
    # convert to character,
    as.character(.) %>%
        # split each string into a list of strings to evaluate,
        strsplit(';') %>%
        # loop over the items in each list,
        lapply(function(x){
            # replacing any in a similarly split and unlisted df2 with NA,
            ifelse(x %in% unlist(strsplit(as.character(df2[,1]), '[(),]+')),
                   NA_character_, x)
        }) %>%
        # then loop over them again,
        sapply(function(x){
            # removing NAs where there are non-NA strings.
            ifelse(all(is.na(x)), list(NA_character_), list(x[!is.na(x)]))
        })))
#         sample_1       sample_2
# 1         Q9Y6W5         Q16835
# 2         Q9Y6U3         P61809
# 3             NA P41356, A41PH2
# 4         Q9HCP0         P3R117
# 5             NA         P31908
# 6         Q9Y6I3         P54112
# 7 Q9Y6H1, Q5T1J5             NA
If you want to collapse the list columns back into strings, you can use paste, but really, list columns are more useful.
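As a minimal sketch of that paste step, assuming a list shaped like one of the list columns above (the toy values and the helper name collapse_ids are made up for illustration), each element can be collapsed into one semicolon-separated string while all-NA elements stay NA:

```r
# Toy list standing in for one list column (hypothetical values)
sample_2 <- list(c("P41356", "A41PH2"), "P31908", NA_character_)

# Hypothetical helper: join the IDs of one element with ';',
# but keep an all-NA element as a single NA instead of the string "NA"
collapse_ids <- function(x) {
  if (all(is.na(x))) NA_character_ else paste(x, collapse = ";")
}

# vapply guarantees exactly one string per element
collapsed <- vapply(sample_2, collapse_ids, character(1))
collapsed
```

The explicit NA check matters because paste would otherwise turn a missing value into the literal text "NA".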
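Edit
If your data is big enough that making this faster is worthwhile, take the df2 part out of the chain and store it separately so that you don't recompute it on every iteration. Here is a version built with purrr, which works with lists instead of data.frames and can be faster than mutate_each for non-trivial functions. Edit as necessary.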
library(purrr)
df2_unlisted <- df2 %>% map(as.character) %>%    # convert; unnecessary if stringsAsFactors = FALSE
    map(strsplit, '[(),]') %>%    # split
    unlist()    # unlist to vector
df1 %>% map(as.character) %>%    # convert; unnecessary if stringsAsFactors = FALSE
    map(strsplit, ';') %>%    # split
    # modify_depth() is the newer name for at_depth(), which older purrr releases used
    modify_depth(2, ~.x[!.x %in% df2_unlisted]) %>%    # subset out unwanted
    modify_depth(2, ~if(is_empty(.x)) NA_character_ else .x) %>%    # insert NA for chr(0)
    as_data_frame() %>% data.frame()    # for printing
The results are identical.
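For readers without those packages, the same split-and-filter idea can be sketched in base R. This is only an illustration under assumptions: the vectors samples and complexes are small made-up stand-ins for a df1 column and df2[,1], and the lookup is built once up front, as suggested above.

```r
# Toy inputs (hypothetical, much smaller than the real df1 / df2)
samples   <- c("P41182;Q9HCP0", "Q09472", "Q9Y6I3")
complexes <- c("P41182,P56524", "Q92793,Q09472,(P54612,P54613)")

# Build the lookup vector once, outside any loop
lookup <- unlist(strsplit(complexes, "[(),]+"))
lookup <- lookup[nzchar(lookup)]  # drop empty pieces left by leading delimiters

# Split each sample string on ';' and keep only IDs absent from the lookup
kept <- lapply(strsplit(samples, ";", fixed = TRUE),
               function(ids) ids[!ids %in% lookup])

# Elements that lost every ID become NA rather than character(0)
kept[lengths(kept) == 0] <- list(NA_character_)
kept
```

Because the comparison uses %in% on whole IDs, an ID is only dropped when it matches a lookup entry exactly.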
Answer 1 (score: 2)
First, you can collect all the possible strings to remove:
toRmv <- unique(unlist(strsplit(as.character(df2[,1]), ",", fixed = TRUE)))
toRmv <- gsub("\\W", "", toRmv, perl = TRUE)
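Then remove them. I like the stringi package here because it can replace multiple strings with an empty string in one call, using the handy vectorize_all argument set to FALSE.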
library(stringi)
df1[] <- lapply(df1, stri_replace_all_fixed,
pattern = toRmv, replacement = "", vectorize_all = FALSE)
df1
#       sample_1       sample_2
#1        Q9Y6W5         Q16835
#2        Q9Y6U3        P61809;
#3               P41356;;A41PH2
#4       ;Q9HCP0         P3R117
#5                       P31908
#6        Q9Y6I3         P54112
#7 Q9Y6H1;Q5T1J5
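To make the effect of vectorize_all = FALSE more concrete, here is a rough base R approximation on made-up toy data: every pattern is applied in turn to every string. It is a sketch of the behaviour, not how stringi implements it, and the names x and to_remove are illustrative only.

```r
# Toy data (hypothetical)
x         <- c("P41182;Q9HCP0", "Q09472", "Q9Y6I3")
to_remove <- c("P41182", "Q09472")

# Start from x and strip each pattern from all strings, one pattern at a time;
# fixed = TRUE treats the patterns as literal text, not regular expressions
stripped <- Reduce(function(s, p) gsub(p, "", s, fixed = TRUE),
                   to_remove, x)
stripped
```

Note that literal substring replacement would also cut a pattern out of a longer ID that happens to contain it, so this approach assumes no ID is a substring of another.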
Now all that is left is to remove leading semicolons (^;), trailing semicolons (;$) and repeated semicolons ((?<=;);):
df1[] <- lapply(df1, gsub, pattern = "^;|;$|(?<=;);", replacement = "", perl = TRUE)
df1
#       sample_1      sample_2
#1        Q9Y6W5        Q16835
#2        Q9Y6U3        P61809
#3               P41356;A41PH2
#4        Q9HCP0        P3R117
#5                      P31908
#6        Q9Y6I3        P54112
#7 Q9Y6H1;Q5T1J5
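The three alternatives in that pattern can be checked in isolation on a few hand-written strings (toy input, not taken from df1). The lookbehind (?<=;) needs perl = TRUE and removes a semicolon only when another semicolon comes directly before it:

```r
# One string per case: leading ';', trailing ';', doubled ';', nothing to fix
x <- c(";Q9HCP0", "P61809;", "P41356;;A41PH2", "Q9Y6H1;Q5T1J5")

cleaned <- gsub("^;|;$|(?<=;);", "", x, perl = TRUE)
cleaned
```

The single separator inside "Q9Y6H1;Q5T1J5" is left alone because it is neither at an edge nor preceded by another semicolon.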
As requested in the comments, here it is in function form. I haven't tested this part, so feel free to test and adjust it:
stringRemove <- function(removeFrom, toRemove) {
  library(stringi)
  toRemove <- unique(unlist(strsplit(as.character(toRemove), ",", fixed = TRUE)))
  toRemove <- gsub("\\W", "", toRemove, perl = TRUE)
  removeFrom[] <- lapply(removeFrom, stri_replace_all_fixed,
                         pattern = toRemove, replacement = "", vectorize_all = FALSE)
  removeFrom[] <- lapply(removeFrom, gsub,
                         pattern = "^;|;$|(?<=;);", replacement = "", perl = TRUE)
  removeFrom
}
# use it
stringRemove(removeFrom = df1, toRemove = df2[,1])