Question

I am trying to compare 2 data frames in R:

Keggs <- c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008")
names <- c("Acaryochloris", "Proteobacteria", "Parvibaculum", "Alphaproteobacteria", "Rhodospirillum", "Magnetospirillum", "Coraliomargarita", "Bacteria")
family <- c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes")
species <- c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico", "Rhodospirillum")
res <- data.frame(Keggs, names)
result <- data.frame(family, species)

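Now, what I want to do is compare each string in result$species against res$names.

If there is a match, I want it to return the string from result$family in the same row, together with the matching string from res$Keggs, as a separate data frame.

The final result would then look like this: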

> df3
Keggs family
K003  Rhizo
K004  Proteos
K005  Bacteroidetes

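I have searched for how to compare data.frames in R, and the closest I found was: compare df1 column 1 to all columns in df2 returning the index of df2

But that returns T/F, and the res df has 2 columns.

In my search I came across the match() and merge() functions in base R. I am working with a "res" df of 11,000,000 rows, and my "result" df has fewer than 1,000 rows. The match documentation, match(x, table, ...), states under table: "Long vectors are not supported". So I don't think match() or merge() is the most elegant approach, given the sheer size of my actual df's. I tried a loop, but my looping skills are limited and I abandoned it.

I would be very grateful for any insight into this puzzle.

Thank you in advance, Purrsia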

Answer 1

You can try the tidyverse functions:

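library(dplyr)

# Keep only rows of res whose names value appears in result$species,
# then retain the two columns of interest
df3 <- res %>%
  inner_join(result, by = c("names" = "species")) %>%
  select(Keggs, family)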

which gives

  Keggs        family
1  K003         Rhizo
2  K004       Proteos
3  K005 Bacteroidetes
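
For comparison, here is a minimal base R sketch of the same lookup using match(), since the question raised a concern about it. The "Long vectors are not supported" note applies to the table argument, which here is the small result$species column, and a long vector in R means more than 2^31 - 1 elements, so 11,000,000 rows is well below that limit. The variable names idx and keep are only illustrative, and the sketch assumes each species appears at most once in result, because match() returns only the first hit.

```r
# Example data from the question
res <- data.frame(
  Keggs = c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008"),
  names = c("Acaryochloris", "Proteobacteria", "Parvibaculum",
            "Alphaproteobacteria", "Rhodospirillum", "Magnetospirillum",
            "Coraliomargarita", "Bacteria"),
  stringsAsFactors = FALSE
)
result <- data.frame(
  family  = c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes"),
  species = c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico",
              "Rhodospirillum"),
  stringsAsFactors = FALSE
)

# For every row of res, the position of its name in result$species (NA if absent)
idx  <- match(res$names, result$species)
keep <- !is.na(idx)

# Combine the Keggs of the matching rows with the family found at each position
df3 <- data.frame(
  Keggs  = res$Keggs[keep],
  family = result$family[idx[keep]],
  stringsAsFactors = FALSE
)
print(df3)
#   Keggs        family
# 1  K003         Rhizo
# 2  K004       Proteos
# 3  K005 Bacteroidetes
```

If a species can occur more than once in result, merge(res, result, by.x = "names", by.y = "species") would return every matching pair instead.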

Answer 2

We can use data.table

library(data.table)
na.omit(setDT(res)[result, on = c("names" = "species")])[, names := NULL][]
#   Keggs        family
#1:  K004       Proteos
#2:  K003         Rhizo
#3:  K005 Bacteroidetes
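
A possible variant of the join above, offered as a sketch rather than a replacement: setDT() converts res to a data.table by reference, and na.omit() drops any row containing an NA in any column, not only the unmatched ones. The version below works on copies made with as.data.table() and uses nomatch = 0L to discard unmatched rows during the join itself. The names res_dt and result_dt are illustrative.

```r
library(data.table)

# Example data from the question
res <- data.frame(
  Keggs = c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008"),
  names = c("Acaryochloris", "Proteobacteria", "Parvibaculum",
            "Alphaproteobacteria", "Rhodospirillum", "Magnetospirillum",
            "Coraliomargarita", "Bacteria"),
  stringsAsFactors = FALSE
)
result <- data.frame(
  family  = c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes"),
  species = c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico",
              "Rhodospirillum"),
  stringsAsFactors = FALSE
)

# Copies, so the original data frames are left untouched
res_dt    <- as.data.table(res)
result_dt <- as.data.table(result)

# Join res_dt$names to result_dt$species; nomatch = 0L drops rows of
# result_dt that have no partner in res_dt, then keep the two columns
df3 <- res_dt[result_dt, on = c(names = "species"), nomatch = 0L][, .(Keggs, family)]
print(df3)
```

The rows come back in the order of result, as in the output above.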

Comparing 2 data frames in R: searching df2$V2 for the strings in df1$V2, and in df2$V1
