我有两个数据框,我想进行匹配和合并。 最初,我使用inner_join并合并,但意识到匹配部分未正确匹配。
我发现了一个似乎朝着正确方向How to merge two data frame based on partial string match with R?的例子。建议使用此代码一个答案:
activeOpacity
但是它没有达到要求。问题是要用作模式匹配的数据集,其字符串比我要匹配的字符串长,因此没有任何匹配。让我显示数据的子集:
idx2 <- sapply(df_mouse_human$Protein.IDs, grep, df_mouse$Protein.IDs)
idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]])))
merged <- cbind(df_mouse_human[unlist(idx1),,drop=F], df_mouse[unlist(idx2),,drop=F])
因此,我想将dput(droplevels(df_mouse))
structure(list(Protein.IDs = c("Q8CBM2;A2AL85;Q8BSY0", "A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8",
"A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6", "Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15379-2;P15379-3;P15379-6;P15379-11;P15379-5;P15379-10;P15379-9;P15379-4;P15379-8;P15379-7;P15379;P15379-12;P15379-13",
"A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78", "A2AUR7;Q9D031;Q01730"
), Replicate = c(2L, 2L, 2L, 2L, 2L, 2L), Ratio.H.L.normalized.01 = c(NaN,
NaN, NaN, NaN, NaN, NaN), Ratio.H.L.normalized.02 = c(NaN, NaN,
NaN, NaN, NaN, NaN), Ratio.H.L.normalized.03 = c(NaN, NaN, NaN,
NaN, NaN, NaN)), .Names = c("Protein.IDs", "Replicate", "Ratio.H.L.normalized.01",
"Ratio.H.L.normalized.02", "Ratio.H.L.normalized.03"), row.names = 12:17, class = "data.frame")
dput(droplevels(df_mouse_human))
structure(list(Human = c("Q8WZ42", "Q8NF91", "Q9UPN3", "Q96RW7",
"Q8WXG9", "P20929", "Q5T4S7", "O14686", "Q2LD37", "Q92736"),
Protein.IDs = c("A2ASS6", "Q6ZWR6", "Q9QXZ0", "D3YXG0", "Q8VHN7",
"E9Q1W3", "A2AN08", "Q6PDK2", "A2AAE1", "E9Q401")), .Names = c("Human",
"Protein.IDs"), row.names = c(NA, 10L), class = "data.frame")
中的Protein.ID与它们在df_mouse
中存在的位置进行匹配。在示例数据中,我尝试将A2ASS6; E9Q8N1; E9Q8K5; A2ASS6-2; A2AT70; F7CR78与条目A2ASS6匹配。如果我以其他方式进行操作,则效果很好,但是有没有一种方式,如果模式的一部分与查询匹配,它将返回TRUE?
我的长期目标是匹配和合并数据,以使df_mouse带有匹配的人类蛋白质ID的新列,如果没有匹配,我将用原始的鼠标ID字符串替换NA值
谢谢
答案 0 :(得分:2)
我通常在部分匹配中使用的一种方法是减少更复杂的字段,使其看起来更简单。有时,这仅涉及删除多余的字符(例如,如果“仅匹配前四个字符”,那么我将从substr(idcol, 1, 4)
中创建一个新的索引列并加入该列),但是在这种情况下,它涉及到破坏一个串成多个。
这涉及将每个用分号分隔的id与大字符串相关联,从而使此中间帧比原始数据更高(有时更高)。
(出于展示性/美学的考虑,我正在修改df1
以删除其他不变列,并为了“其他数据”而添加行号列。)
我正在使用dplyr
和tidyr
,所以:
library(dplyr)
library(tidyr)
df1 <- select(df1, Protein.IDs) %>%
mutate(other = row_number())
首先,我将6行帧分解为更大的帧:
df1ids <- tbl_df(df1) %>%
select(Protein.IDs) %>%
mutate(eachID = strsplit(Protein.IDs, ";")) %>%
unnest()
df1ids
# # A tibble: 46 x 2
# Protein.IDs eachID
# <chr> <chr>
# 1 Q8CBM2;A2AL85;Q8BSY0 Q8CBM2
# 2 Q8CBM2;A2AL85;Q8BSY0 A2AL85
# 3 Q8CBM2;A2AL85;Q8BSY0 Q8BSY0
# 4 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH3
# 5 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH5
# 6 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH4
# 7 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 Q6X893
# 8 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 Q6X893-2
# 9 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH8
# 10 A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 A2AMW0
# # ... with 36 more rows
注意三列的第一行现在变成三列的三列。我们将使用"eachID"
来加入。
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
filter(complete.cases(.)) %>%
select(Human, Protein.IDs) %>%
right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 6 x 3
# Human Protein.IDs other
# <chr> <chr> <int>
# 1 <NA> Q8CBM2;A2AL85;Q8BSY0 1
# 2 <NA> A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 2
# 3 <NA> A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 3
# 4 <NA> Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15~ 4
# 5 Q8WZ42 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 6 <NA> A2AUR7;Q9D031;Q01730 6
如果您碰巧每个Human
都有多个Proteins.IDs
行,那么情况会有所变化。
df2$Protein.IDs[2] <- "E9Q8K5"
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
filter(complete.cases(.)) %>%
select(Human, Protein.IDs) %>%
right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 7 x 3
# Human Protein.IDs other
# <chr> <chr> <int>
# 1 <NA> Q8CBM2;A2AL85;Q8BSY0 1
# 2 <NA> A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 2
# 3 <NA> A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 3
# 4 <NA> Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15~ 4
# 5 Q8WZ42 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 6 Q8NF91 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 7 <NA> A2AUR7;Q9D031;Q01730 6
请注意,您现在如何拥有other
5的两个副本?可能不是您想要的。但是,如果您打算继续使用以分号分隔的主题,则:
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
filter(complete.cases(.)) %>%
group_by(Protein.IDs) %>%
summarize(Human = paste(Human, collapse = ";")) %>%
select(Human, Protein.IDs) %>%
right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 6 x 3
# Human Protein.IDs other
# <chr> <chr> <int>
# 1 <NA> Q8CBM2;A2AL85;Q8BSY0 1
# 2 <NA> A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 2
# 3 <NA> A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 3
# 4 <NA> Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM~ 4
# 5 Q8WZ42;Q8N~ A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 6 <NA> A2AUR7;Q9D031;Q01730 6
答案 1 :(得分:1)
@ r2evans提出了一个很好的问题,即如何处理多个匹配项。回答完该问题后,我可能需要编辑答案,但这是一种快速解决方案。首先,我们拆分可能的ID的字符串,然后查看在另一个数据框中匹配的ID,然后加入匹配项的行索引。
library(tidyverse)
df_mouse %>% mutate(all_id = str_split(Protein.IDs, ";"),
row = map(all_id, ~.x %in% df_mouse_human$Protein.IDs %>% which())) %>%
unnest(row) %>%
list(., df_mouse_human %>% rownames_to_column("row") %>% mutate(row = as.numeric(row))) %>%
reduce(left_join, by = "row")
#> Protein.IDs.x Replicate
#> 1 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 2
#> Ratio.H.L.normalized.01 Ratio.H.L.normalized.02 Ratio.H.L.normalized.03
#> 1 NaN NaN NaN
#> row Human Protein.IDs.y
#> 1 1 Q8WZ42 A2ASS6