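I am trying to find the gene IDs that two data frames have in common. Both have the same unique identifiers in their rows (column A). Ideally, I would create a new data frame that keeps the identifiers as row names and holds only the gene expression data in its columns. Here is a sample of my data (the columns of interest are col 1, which is the identifier, and col 4:9, which I need to compare):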
RefSeq. ID C1 C2 C3 C4 C5 C6
NP_000005 8.57345 8.45938 8.68941 8.35913 8.48177 8.44560
NP_000010 8.32595 8.19273 8.10708 8.48156 7.99014 8.24859
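What I would like to do is match on the RefSeq. ID column, pairing up rows that share the same unique identifier. I will then compare C1-C6 between the two data frames.
I can at least see the matches using the following line of code: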
> x008[, 1] %in% x007[, 1]
But that only returns a series of FALSE/TRUE results, one per row. I then tried the following two lines of code, both to no avail:
> mydata <- merge(x008, x007, by=c("RefSeq. ID"))
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
and
> match(x008$RefSeq. ID, x007$RefSeq. ID)
Error: unexpected symbol in "match(x008$RefSeq. ID"
Answer 0 (score: 2)
I can't quite reproduce your problem. The following works:
merge(df1, df2, by = "RefSeq. ID")
# RefSeq. ID UniProt.x Protein.Name.x C1.x C2.x C3.x UniProt.y
#1 NP_000005 P01023 Alpha-2-macrogl 8.57345 8.45938 8.68941 P01023
#2 NP_000021 P21549 Serine--pyruva 9.67506 9.04974 8.92981 P21549
# Protein.Name.y C1.y C2.y C3.y
#1 Alpha-2-macrogl 18.57345 18.45938 18.68941
#2 Serine--pyruva 19.67506 19.04974 18.92981
"RefSeq. ID" must be a unique column in both data.frames.
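The error in the question ('by' must specify a uniquely valid column) means that the name passed to `by` does not identify exactly one column in one of the data frames. One possible cause, which is an assumption here because the question does not show how the data were read, is that `read.table`/`read.csv` rewrote the header: with the default `check.names = TRUE`, "RefSeq. ID" becomes "RefSeq..ID". A minimal sketch with a hypothetical data frame `demo`:

```r
# `demo` and `demo_raw` are illustrative only; they are not from the question.
txt <- "'RefSeq. ID' C1
NP_000005 8.57345
NP_000010 8.32595"

# Default behaviour: the header is passed through make.names().
demo <- read.table(text = txt, header = TRUE)
names(demo)                     # "RefSeq..ID" "C1"
"RefSeq. ID" %in% names(demo)   # FALSE, so by = "RefSeq. ID" cannot be matched

# check.names = FALSE keeps the header exactly as written in the file.
demo_raw <- read.table(text = txt, header = TRUE, check.names = FALSE)
names(demo_raw)                 # "RefSeq. ID" "C1"
```

Running `names()` on both data frames before merging should show whether the key column is really called what you expect.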
# Sample data
df1 <- read.table(text =
"'RefSeq. ID' UniProt 'Protein Name' C1 C2 C3
NP_000005 P01023 Alpha-2-macrogl 8.57345 8.45938 8.68941
NP_000010 P24752 Acetyl-CoA 8.32595 8.19273 8.10708
NP_000021 P21549 Serine--pyruva 9.67506 9.04974 8.92981", header = TRUE)
# read.table() turns the header into "RefSeq..ID"; restore the original name
names(df1)[1] <- "RefSeq. ID"
df2 <- read.table(text =
"'RefSeq. ID' UniProt 'Protein Name' C1 C2 C3
NP_000005 P01023 Alpha-2-macrogl 18.57345 18.45938 18.68941
NP_000021 P21549 Serine--pyruva 19.67506 19.04974 18.92981", header = TRUE)
names(df2)[1] <- "RefSeq. ID"
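As for the other two attempts in the question: `x008$RefSeq. ID` is a syntax error because the column name contains a space, and `%in%` only gives a logical vector that still has to be used as a row index. A rough sketch of both fixes follows. The data frames `x007` and `x008` here are small stand-ins built from the sample values above, not the asker's real data, and the `.x007`/`.x008` suffixes are an arbitrary choice.

```r
# Stand-in data; check.names = FALSE keeps the space in "RefSeq. ID".
x007 <- data.frame(`RefSeq. ID` = c("NP_000005", "NP_000010", "NP_000021"),
                   C1 = c(8.57345, 8.32595, 9.67506),
                   check.names = FALSE, stringsAsFactors = FALSE)
x008 <- data.frame(`RefSeq. ID` = c("NP_000005", "NP_000021"),
                   C1 = c(18.57345, 19.67506),
                   check.names = FALSE, stringsAsFactors = FALSE)

# A name with a space needs backticks after `$`, or [[ ]] with a string.
idx <- match(x008$`RefSeq. ID`, x007[["RefSeq. ID"]])

# Use the logical vector from %in% as a row index to keep the shared IDs.
common <- x008[x008[["RefSeq. ID"]] %in% x007[["RefSeq. ID"]], ]

# merge() keeps only IDs present in both; suffixes mark each column's source.
both <- merge(x007, x008, by = "RefSeq. ID", suffixes = c(".x007", ".x008"))

# Optionally move the IDs into the row names so only expression columns remain.
rownames(both) <- both[["RefSeq. ID"]]
both[["RefSeq. ID"]] <- NULL
```

Moving the IDs into the row names only works if they are unique after the merge, which matches the requirement stated in the answer.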