Matching two columns in a data frame to multiple columns in another data frame, returning the first matching column

Time: 2018-01-17 02:32:44

Tags: r statistics bioinformatics

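I am trying to match two columns of one data frame against another data frame. The value I want returned is the name of the first column in the second data frame in which both of the initial columns match.

For example, I want to take the following data frame: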

Fasta<-c("X1","X1","X2","X2","X3","X3")
Species<-c("Kiwi","Chicken","Weta","Cricket","Tuatara","Gecko")
testdata<-as.data.frame(cbind(Fasta,Species))
testdata<-aggregate(Species ~ Fasta, testdata, I)

Fasta    Species1  Species2
X1       Kiwi      Chicken
X2       Weta      Cricket
X3       Tuatara   Gecko
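As a side note on the reshaping step, here is a minimal sketch of one way to get explicit Species1 and Species2 columns in base R. It assumes every Fasta id has exactly two species; the names `long`, `pairs`, and `wide` are made up for illustration and are not part of the original post.

```r
Fasta   <- c("X1", "X1", "X2", "X2", "X3", "X3")
Species <- c("Kiwi", "Chicken", "Weta", "Cricket", "Tuatara", "Gecko")
long <- data.frame(Fasta, Species, stringsAsFactors = FALSE)

# Group the species by Fasta id: a named list with one character vector per id
pairs <- split(long$Species, long$Fasta)

# Take the first and second species of each group (assumes exactly two per id)
wide <- data.frame(
  Fasta    = names(pairs),
  Species1 = unname(vapply(pairs, `[`, character(1), 1)),
  Species2 = unname(vapply(pairs, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
```

Building the columns explicitly avoids the matrix-valued Species column that aggregate tends to produce when the function returns more than one value per group.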

Here is my second data frame:

Species<-c("Kiwi","Chicken","Weta","Cricket","Frog","Gecko")
Genus<-c("Orn","Norn","Genus2","Genus2","Spec","NoSpec")
Order<-c("Bird","Bird","Order2","Order2","Norder","Geckn")
Kingdom<-rep("Animal",each=6)
lookup<-data.frame(cbind(Species,Genus,Order,Kingdom))

Species  Genus   Order   Kingdom
Kiwi     Orn     Bird    Animal
Chicken  Norn    Bird    Animal
Weta     Genus2  Order2  Animal
Cricket  Genus2  Order2  Animal
Frog     Spec    Norder  Animal
Gecko    NoSpec  Geckn   Animal

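I want to find the first column in the second data frame in which Species1 and Species2 match, and return its name. Ideally, this would give me the following output: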

Fasta   Species1    Species2    MatchLevel
X1      Kiwi        Chicken     Order
X2      Weta        Cricket     Genus
X3      Tuatara     Gecko       Kingdom

I am open to having the data in a different format.

1 Answer:

Answer 0: (score: 0)

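This function exploits the nested nature of taxa (i.e., if two species belong to the same genus, they must also be in the same order, and so on). Two species in the same genus score 3, because all three taxonomic levels match; they score 2 if they share an order but not a genus, and 1 if they share only a kingdom. It is also possible for there to be no match.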
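To make the scoring concrete, here is a small sketch for a single pair, using the Kiwi and Chicken rows from the lookup table written out as named vectors. The variable names are illustrative only. Because the ranks are nested, counting the matching ranks and finding the first matching rank point to the same level.

```r
# Taxonomy of the two species, ordered from most to least specific rank
kiwi    <- c(Genus = "Orn",  Order = "Bird", Kingdom = "Animal")
chicken <- c(Genus = "Norn", Order = "Bird", Kingdom = "Animal")

# Rank-by-rank comparison; the result keeps the rank names
same <- kiwi == chicken

# Number of matching ranks (the score described above)
score <- sum(same)

# Name of the first (most specific) rank on which the two species agree
level <- names(same)[which(same)[1]]

# Under the nesting assumption, indexing by the score gives the same rank
level_from_score <- c("Kingdom", "Order", "Genus")[score]
```

If the table were not strictly nested (for example, a shared genus but different orders because of a data entry error), the two approaches could disagree, so the count-based shortcut is only as reliable as the lookup table.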

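match2species <- function(a, b, lookup_table = lookup) {
  # Look up the taxonomy row for each species
  sp_a <- lookup_table[lookup_table$Species == a, ]
  sp_b <- lookup_table[lookup_table$Species == b, ]

  # A species missing from the lookup table (e.g. Tuatara) cannot be compared
  if (nrow(sp_a) != 1 || nrow(sp_b) != 1) return('No match')

  # Count matching ranks, ignoring the Species column itself
  matches <- sum(sp_a[-1] == sp_b[-1])
  ifelse(matches > 0, c('Kingdom','Order','Genus')[matches], 'No match')
}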

The function can be called for any pair of species in the data frame.

> match2species('Chicken','Kiwi')
[1] "Order"
> match2species('Weta','Cricket')
[1] "Genus"
> match2species('Frog','Gecko')
[1] "Kingdom"
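To fill a MatchLevel column for every row of the question's first table, one option is to apply a pairwise function with mapply. The following is only a sketch: `match_level` and `pairs` are hypothetical names, the function is a variant of the one above that returns "No match" when a species is absent from the lookup table, and the wide table is typed in by hand rather than derived from the original aggregate call.

```r
lookup <- data.frame(
  Species = c("Kiwi", "Chicken", "Weta", "Cricket", "Frog", "Gecko"),
  Genus   = c("Orn", "Norn", "Genus2", "Genus2", "Spec", "NoSpec"),
  Order   = c("Bird", "Bird", "Order2", "Order2", "Norder", "Geckn"),
  Kingdom = rep("Animal", 6),
  stringsAsFactors = FALSE
)

# Wide table from the question, entered by hand (assumption)
pairs <- data.frame(
  Fasta    = c("X1", "X2", "X3"),
  Species1 = c("Kiwi", "Weta", "Tuatara"),
  Species2 = c("Chicken", "Cricket", "Gecko"),
  stringsAsFactors = FALSE
)

# Hypothetical variant of match2species with a guard for unknown species
match_level <- function(a, b, lookup_table = lookup) {
  sp_a <- lookup_table[lookup_table$Species == a, ]
  sp_b <- lookup_table[lookup_table$Species == b, ]
  if (nrow(sp_a) != 1 || nrow(sp_b) != 1) return("No match")
  matches <- sum(sp_a[-1] == sp_b[-1])  # count of matching ranks
  if (matches > 0) c("Kingdom", "Order", "Genus")[matches] else "No match"
}

# Apply the function to each (Species1, Species2) pair, row by row
pairs$MatchLevel <- mapply(match_level, pairs$Species1, pairs$Species2,
                           USE.NAMES = FALSE)
print(pairs)
```

With this data, X1 comes out as Order and X2 as Genus. X3 comes out as "No match" rather than the Kingdom shown in the question, because Tuatara has no row in the lookup table; adding a Tuatara row would be needed to get a rank for that pair.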