R: Matching common rows (IDs) in two data frames

Asked: 2018-08-20 22:14:48

Tags: r merge

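I am trying to find the common gene IDs in two data frames. Both have the same unique identifiers in their rows (column A). Ideally, I would create a new data frame that keeps the row names and has only the gene expression data in the columns. Here is a sample of my data (the columns of interest are col 1, which is the identifier, and cols 4:9, which I need to compare):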

RefSeq. ID       C1      C2      C3      C4      C5      C6      
NP_000005   8.57345 8.45938 8.68941 8.35913 8.48177 8.44560 
NP_000010   8.32595 8.19273 8.10708 8.48156 7.99014 8.24859 

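What I would like to do is match on the RefSeq. ID column, pairing up like unique identifiers for each row. I will then compare C1-C6 between the two data frames.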

I can at least see the matches using the following line of code:

> x008[, 1] %in% x007[, 1]

But it just returns a series of FALSE/TRUE results, one for each comparison. I then tried the following two lines of code, both to no avail!?!

> mydata <- merge(x008, x007, by=c("RefSeq. ID"))
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

> match(x008$RefSeq. ID, x007$RefSeq. ID)
Error: unexpected symbol in "match(x008$RefSeq. ID"

1 Answer:

Answer 0: (score: 2)

I can't quite reproduce your problem. The following works:

merge(df1, df2, by = "RefSeq. ID")
#  RefSeq. ID UniProt.x  Protein.Name.x    C1.x    C2.x    C3.x UniProt.y
#1  NP_000005    P01023 Alpha-2-macrogl 8.57345 8.45938 8.68941    P01023
#2  NP_000021    P21549  Serine--pyruva 9.67506 9.04974 8.92981    P21549
# Protein.Name.y     C1.y     C2.y     C3.y
#1 Alpha-2-macrogl 18.57345 18.45938 18.68941
#2  Serine--pyruva 19.67506 19.04974 18.92981

Note that "RefSeq. ID" must be a unique column in both data.frames.
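
One possible cause of the 'by' error, offered as a guess rather than a diagnosis: if the files were read with the default check.names = TRUE, the header "RefSeq. ID" would have been turned into a syntactic name, so merge() finds no column under the original spelling. The sketch below uses small hypothetical stand-ins for x007 and x008 (the id column, its values, and the single C1 column are assumptions for illustration) to show how to inspect the names, restore the key, check uniqueness, and then merge.

```r
# Hypothetical stand-ins for x007 and x008 (the real data is not shown in full).
x007 <- data.frame(id = c("NP_000005", "NP_000010"), C1 = c(8.57345, 8.32595))
x008 <- data.frame(id = c("NP_000005", "NP_000021"), C1 = c(18.57345, 19.67506))

key <- "RefSeq. ID"
mangled <- make.names(key)   # the syntactic name that check.names = TRUE would produce
names(x007)[1] <- mangled
names(x008)[1] <- mangled

# Inspect the real column names first: is the key there under its original spelling?
key_present <- key %in% names(x007)

# Restore the intended name in both data frames.
names(x007)[names(x007) == mangled] <- key
names(x008)[names(x008) == mangled] <- key

# Confirm the IDs are unique in each data frame before merging.
stopifnot(anyDuplicated(x007[[key]]) == 0, anyDuplicated(x008[[key]]) == 0)

# Inner join on the key; shared column names get the default .x / .y suffixes.
merged <- merge(x008, x007, by = key)
print(merged)
```

Running names(x007) and names(x008) on the real data is the quickest way to tell whether this applies to your case.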


Sample data

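# read.table() converts the quoted header names to syntactic ones
# (check.names = TRUE by default), so the key column is renamed afterwards.
df1 <- read.table(text =
    "'RefSeq. ID'  UniProt 'Protein Name'    C1      C2      C3
NP_000005   P01023  Alpha-2-macrogl 8.57345 8.45938 8.68941
NP_000010   P24752  Acetyl-CoA      8.32595 8.19273 8.10708
NP_000021   P21549  Serine--pyruva  9.67506 9.04974 8.92981", header = TRUE)
names(df1)[1] <- "RefSeq. ID"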

df2 <- read.table(text =
    "'RefSeq. ID'  UniProt 'Protein Name'    C1      C2      C3
NP_000005   P01023  Alpha-2-macrogl 18.57345 18.45938 18.68941
NP_000021   P21549  Serine--pyruva  19.67506 19.04974 18.92981", header = TRUE)
names(df2)[1] <- "RefSeq. ID"
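
The %in% and match() attempts from the question can also be made to work. A column name containing a space has to be wrapped in backticks after $ (or accessed with [[ ]]), which is why match(x008$RefSeq. ID, ...) failed to parse. The following is a minimal sketch, assuming cut-down versions of the two data frames with a single expression column; the names in_both, idx, and expr are introduced here for illustration only.

```r
# Hypothetical miniature versions of the two data frames (one expression column each).
df1 <- data.frame(`RefSeq. ID` = c("NP_000005", "NP_000010", "NP_000021"),
                  C1 = c(8.57345, 8.32595, 9.67506),
                  check.names = FALSE, stringsAsFactors = FALSE)
df2 <- data.frame(`RefSeq. ID` = c("NP_000005", "NP_000021"),
                  C1 = c(18.57345, 19.67506),
                  check.names = FALSE, stringsAsFactors = FALSE)

# %in% returns one TRUE/FALSE per row of df1; use it as a row index
# to keep only the rows whose ID also occurs in df2.
in_both <- df1$`RefSeq. ID` %in% df2$`RefSeq. ID`
common1 <- df1[in_both, ]

# match() gives, for each ID in df1, its row position in df2 (NA if absent).
idx <- match(df1[["RefSeq. ID"]], df2[["RefSeq. ID"]])

# Expression values side by side, with the shared IDs as row names.
expr <- data.frame(C1.df1 = common1$C1,
                   C1.df2 = df2$C1[idx[in_both]],
                   row.names = common1[["RefSeq. ID"]])
print(expr)
```

The logical vector from %in% is not a result in itself but a row selector. Using match() as well keeps the df2 values aligned with the df1 rows even when the two data frames list the IDs in a different order.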