Find and remove matching substrings from two data frames

Time: 2016-06-26 16:30:49

Tags: r string

I have two data frames, df1 and df2:

df1<- structure(list(sample_1 = structure(c(7L, 6L, 5L, 1L, 2L, 4L, 
3L), .Label = c("P41182;Q9HCP0", "Q09472", "Q9Y6H1;Q5T1J5", "Q9Y6I3", 
"Q9Y6Q9", "Q9Y6U3", "Q9Y6W5"), class = "factor"), sample_2 = structure(c(7L, 
6L, 4L, 3L, 2L, 5L, 1L), .Label = c("O15143", "P31908", "P3R117", 
"P41356;P54612;A41PH2", "P54112", "P61809;Q92831", "Q16835"), class = "factor")), .Names = c("sample_1", 
"sample_2"), class = "data.frame", row.names = c(NA, -7L))


df2<- structure(list(subunits..UniProt.IDs. = structure(c(4L, 6L, 5L, 
12L, 3L, 9L, 14L, 16L, 15L, 11L, 13L, 8L, 1L, 2L, 10L, 7L), .Label = c("O55102,Q9CWG9,Q5U5M8,Q8VED2,Q91WZ8,Q8R015,Q9R0C0,Q9Z266", 
"P30561,O08915,P07901,P11499", "P30561,P53762", "P41182,P56524", 
"P41182,Q8WUI4", "P41182,Q9UQL6", "P61160,P61158,O15143,O15144,O15145,P59998,O15511", 
"P78537,Q6QNY1,Q6QNY0,Q9NUP1,Q96EV8,Q8TDH9,Q9UL45,O95295", "Q15021,Q9BPX3,Q15003,O95347,Q9NTJ3", 
"Q8WMR7,(P67776,P11493),(P54612,P54613)", "Q91VB4,P59438,Q8BLY7", 
"Q92793,Q09472,Q9Y6Q9,Q92831", "Q92828,Q13227,O15379,O75376,O60907,Q9BZK7", 
"Q92902,Q9NQG7", "Q92903,Q96NY9", "Q969F9,Q9UPZ3,Q86YV9"), class = "factor")), .Names = "subunits..UniProt.IDs.", class = "data.frame", row.names = c(NA, 
-16L))

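I want to look at each semicolon-separated string in df1 and, if it matches one of the comma-separated strings in df2, remove it. So my output would look like this: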

       sample_1                  sample_2
1        Q9Y6W5                   Q16835
2        Q9Y6U3                   P61809
3                          P41356;A41PH2
4        Q9HCP0                   P3R117
5                                 P31908
6        Q9Y6I3                   P54112
7 Q9Y6H1;Q5T1J5                   

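In sample_1, rows 3, 4 and 5 contain strings that match a string in df2, and those matching strings are removed.
In sample_2, rows 2, 3 and 7 contain strings that match a string in df2, and those matching strings are removed.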

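2 answers:

Answer 0 (score: 2)

First, you should almost certainly rearrange your data so that it is tidy, i.e. one column per variable and one row per observation, but without knowing what the data is or how it is related, I can't do that for you. So the only way is to hack it, effectively with list columns: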

library(dplyr)

        # For each column,
df1 %>% mutate_each(funs(
    # convert to character,
    as.character(.) %>% 
        # split each string into a list of strings to evaluate,
        strsplit(';') %>% 
        # loop over the items in each list,
        lapply(function(x){
            # replacing any in a similarly split and unlisted df2 with NA,
            ifelse(x %in% unlist(strsplit(as.character(df2[,1]), '[(),]+')), 
                NA_character_, x)
        }) %>% 
        # then loop over them again,
        sapply(function(x){
            # removing NAs where there are non-NA strings.
            ifelse(all(is.na(x)), list(NA_character_), list(x[!is.na(x)]))
        })))

#         sample_1       sample_2
# 1         Q9Y6W5         Q16835
# 2         Q9Y6U3         P61809
# 3             NA P41356, A41PH2
# 4         Q9HCP0         P3R117
# 5             NA         P31908
# 6         Q9Y6I3         P54112
# 7 Q9Y6H1, Q5T1J5             NA

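If you want to collapse the list columns back into strings, you can use paste, but really, list columns are more useful.

Edit

If your data is large enough that making this faster is worthwhile, take the splitting and unlisting of df2 out of the chain and store it separately, so that you are not recomputing it on every iteration. Here is a version built with purrr, which works on lists rather than data.frames and can be faster than mutate_each for non-trivial functions. Edit as necessary.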
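As a minimal sketch of that collapsing step, assuming a list column shaped like the output above: the `ids` list below is made-up illustration data rather than the real result, and mapping all-NA elements to an empty string is an assumption about the desired output.

```r
# Hypothetical list column: each element is a character vector of IDs,
# with NA marking rows where every ID was removed.
ids <- list(c("Q9Y6H1", "Q5T1J5"), NA_character_, "Q9Y6W5")

# Collapse each element back into one semicolon-separated string;
# elements that are entirely NA become the empty string.
collapsed <- vapply(ids, function(x) {
    if (all(is.na(x))) "" else paste(x, collapse = ";")
}, character(1))

collapsed
```

The same function could be applied to each column with lapply if a plain character data frame is wanted at the end.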

library(purrr)

df2_unlisted <- df2 %>% map(as.character) %>%    # convert; unnecessary if stringsAsFactors = FALSE
                    map(strsplit, '[(),]') %>%    # split
                    unlist()    # unlist to vector

df1 %>% map(as.character) %>%    # convert; unnecessary if stringsAsFactors = FALSE
    map(strsplit, ';') %>%    # split
    modify_depth(2, ~.x[!.x %in% df2_unlisted]) %>%    # subset out unwanted (modify_depth() supersedes at_depth())
    modify_depth(2, ~if(is_empty(.x)) NA_character_ else .x) %>%    # insert NA for chr(0)
    tibble::as_tibble() %>% data.frame()    # for printing

The results are identical.

Answer 1 (score: 2)

First, you can collect all the possible strings to remove:
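The idea of computing the df2 lookup once and reusing it does not depend on purrr. The following is a rough base R sketch of the same approach, using small stand-in data frames (`df1_demo` and `df2_demo` are hypothetical names, filled with a few values from the question) rather than the full data:

```r
# Small stand-ins for df1 and df2 (hypothetical names, values from the question).
df1_demo <- data.frame(sample_1 = c("Q9Y6W5", "P41182;Q9HCP0", "Q09472"),
                       stringsAsFactors = FALSE)
df2_demo <- data.frame(ids = c("P41182,P56524", "Q92793,Q09472,Q9Y6Q9,Q92831"),
                       stringsAsFactors = FALSE)

# Compute the lookup vector once, outside the per-column loop.
lookup <- unique(unlist(strsplit(df2_demo$ids, "[(),]+")))

# For every column, split on ";" and keep only the IDs not in the lookup;
# rows where nothing is left get NA.
cleaned <- lapply(df1_demo, function(col) {
    lapply(strsplit(as.character(col), ";", fixed = TRUE), function(x) {
        kept <- x[!x %in% lookup]
        if (length(kept) == 0) NA_character_ else kept
    })
})

cleaned$sample_1
```

Each element of `cleaned` is a list column in the same sense as above, so it can be kept as a list or collapsed back into strings afterwards.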

toRmv <- unique(unlist(strsplit(as.character(df2[,1]), ",", fixed = TRUE)))
toRmv <- gsub("\\W", "", toRmv, perl = TRUE)

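Then remove them. I like the stringi package here because, with its handy vectorize_all argument set to FALSE, it can replace multiple strings with the empty string in one call.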

library(stringi) 
df1[] <- lapply(df1, stri_replace_all_fixed,
    pattern = toRmv, replacement = "", vectorize_all = FALSE)
df1
#       sample_1       sample_2
#1        Q9Y6W5         Q16835
#2        Q9Y6U3        P61809;
#3               P41356;;A41PH2
#4       ;Q9HCP0         P3R117
#5                       P31908
#6        Q9Y6I3         P54112
#7 Q9Y6H1;Q5T1J5 
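To see what vectorize_all = FALSE does in isolation, here is a small sketch on a two-element vector. The names `x`, `ids` and `res` are only for illustration, and the values are taken from the question.

```r
library(stringi)

# Two strings to clean and two IDs to strip out (values from the question).
x <- c("P41182;Q9HCP0", "Q09472")
ids <- c("P41182", "Q09472")

# vectorize_all = FALSE: every pattern is applied, in turn, to every string,
# instead of pairing the i-th pattern with the i-th string.
res <- stri_replace_all_fixed(x, pattern = ids, replacement = "",
                              vectorize_all = FALSE)
res
```

Note that fixed matching is substring matching, so an ID would also be stripped if it happened to occur inside a longer ID. If that can happen in your data, splitting on the separator and comparing whole IDs is the safer route.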

Now all that remains is to remove leading semicolons (^;), trailing semicolons (;$) and repeated semicolons ((?<=;);):

df1[] <- lapply(df1, gsub, pattern = "^;|;$|(?<=;);", replacement = "", perl = TRUE)
df1
#       sample_1      sample_2
#1        Q9Y6W5        Q16835
#2        Q9Y6U3        P61809
#3               P41356;A41PH2
#4        Q9HCP0        P3R117
#5                      P31908
#6        Q9Y6I3        P54112
#7 Q9Y6H1;Q5T1J5            
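The cleanup pattern can be tried on a plain character vector to see what each alternative does. This is only a sketch: the vector `x` is illustrative, and the second pattern is a suggested variant, not part of the answer above.

```r
# Strings as they might look after the replacement step.
x <- c("P61809;", "P41356;;A41PH2", ";Q9HCP0", "Q9Y6H1;Q5T1J5", "")

# ^;       a semicolon at the start of the string
# ;$       a semicolon at the end of the string
# (?<=;);  a semicolon directly preceded by another semicolon (needs perl = TRUE)
cleaned <- gsub("^;|;$|(?<=;);", "", x, perl = TRUE)

# Edge case: when the last two IDs are both removed, the string ends in ";;"
# and the pattern above leaves one semicolon behind.
leftover <- gsub("^;|;$|(?<=;);", "", "Q9Y6I3;;", perl = TRUE)

# A variant that matches runs of semicolons handles that case as well.
fixed <- gsub("^;+|;+$|(?<=;);+", "", "Q9Y6I3;;", perl = TRUE)
```

The sample data in the question never produces a trailing double semicolon, so both patterns give the same result there.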

As requested in the comments, here it is in function form. I have not tested this part, so feel free to test and adjust it:

stringRemove <- function(removeFrom, toRemove) {

    library(stringi)
    toRemove <- unique(unlist(strsplit(as.character(toRemove), ",", fixed = TRUE)))
    toRemove <- gsub("\\W", "", toRemove, perl = TRUE)

    removeFrom[] <- lapply(removeFrom, stri_replace_all_fixed,
            pattern = toRemove, replacement = "", vectorize_all = FALSE)
    removeFrom[] <- lapply(removeFrom, gsub,
         pattern = "^;|;$|(?<=;);", replacement = "", perl = TRUE)
    removeFrom
}

# use it
stringRemove(removeFrom = df1, toRemove = df2[,1])