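I'm building a reference table to make it easy to convert between identifiers and different versions of those identifiers. I have a merged table with several columns, two of which, ipi_id.x and ipi_id.y, hold different versions of the IPI ID from a database. The test df is below: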
protein_id ipi_id.x numbers ensembl_id hgnc_number hgnc_symbol entrez_id ipi_id.y uniprot
1 ATP6 IPI00552036 3 ENSG00000198899 7414 MT-ATP6 4508 IPI00954924 P00846
2 ATP6 IPI00552036 3 ENSG00000198899 7414 MT-ATP6 4508 IPI00743734 P00846
3 ATP6 IPI00552036 3 ENSG00000198899 7414 MT-ATP6 4508 IPI00654820 P00846
4 COX2 IPI00916440 1 ENSG00000198712 7421 MT-CO2 4513 IPI00930721 P00403
5 COX2 IPI00916440 1 ENSG00000198712 7421 MT-CO2 4513 IPI00017510 P00403
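The two columns ipi_id.x and ipi_id.y hold different versioned identifiers for the same entry. I'd like them in a single column, with rows added that carry the rest of the information, so that every ipi_id gets its own row. The resulting df would look like this: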
protein_id ipi_id numbers ensembl_id hgnc_number hgnc_symbol entrez_id uniprot
1 ATP6 IPI00552036 3 ENSG00000198899 7414 MT-ATP6 4508 P00846
2 ATP6 IPI00552036 3 ENSG00000198899 7414 MT-ATP6 4508 P00846
3 ATP6 IPI00552036 3 ENSG00000198899 7414 MT-ATP6 4508 P00846
4 ATP6 IPI00954924 3 ENSG00000198899 7414 MT-ATP6 4508 P00846
5 ATP6 IPI00743734 3 ENSG00000198899 7414 MT-ATP6 4508 P00846
6 ATP6 IPI00654820 3 ENSG00000198899 7414 MT-ATP6 4508 P00846
7 COX2 IPI00916440 1 ENSG00000198712 7421 MT-CO2 4513 P00403
8 COX2 IPI00916440 1 ENSG00000198712 7421 MT-CO2 4513 P00403
9 COX2 IPI00930721 1 ENSG00000198712 7421 MT-CO2 4513 P00403
10 COX2 IPI00017510 1 ENSG00000198712 7421 MT-CO2 4513 P00403
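I did this by duplicating the data frame, dropping the .x column from one copy and the .y column from the other, renaming the remaining column, then using rbind to put the two copies back together and unique() to remove duplicate rows.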
df2 <- df
#remove ipi_id.X IPI ids from one DF
df$ipi_id.x <- NULL
colnames(df)[7] <- "ipi_id"
#remove ipi_id.y IPI ids from the other DF
df2$ipi_id.y <- NULL
colnames(df2)[2] <- c("ipi_id")
#combine the dataframes
df3 <- rbind(df, df2)
df3 <- unique(df3)
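This is clunky, and I assume there is a better way with tidyr or reshape2, but I haven't found a working example, and my clunky method does work. Is there a better way? A way to do it in one line?
Also, if my tags are poor, please let me know for future posts.
Here is my df in a form that can be pasted in as input: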
df <- structure(list(
protein_id = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("ATP6", "COX2"), class = "factor"),
ipi_id.x = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("IPI00552036", "IPI00916440"), class = "factor"),
numbers = c(3L, 3L, 3L, 1L, 1L),
ensembl_id = structure(c(2L, 2L, 2L, 1L, 1L), .Label = c("ENSG00000198712", "ENSG00000198899"), class = "factor"),
hgnc_number = c(7414L, 7414L, 7414L, 7421L, 7421L), hgnc_symbol = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("MT-ATP6", "MT-CO2"), class = "factor"),
entrez_id = c(4508L, 4508L, 4508L, 4513L, 4513L), ipi_id.y = structure(c(5L, 3L, 2L, 4L, 1L), .Label = c("IPI00017510", "IPI00654820", "IPI00743734", "IPI00930721", "IPI00954924"), class = "factor"),
uniprot = structure(c(2L, 2L, 2L, 1L, 1L), .Label = c("P00403", "P00846"), class = "factor")),
.Names = c("protein_id", "ipi_id.x", "numbers", "ensembl_id", "hgnc_number", "hgnc_symbol", "entrez_id", "ipi_id.y", "uniprot"), class = "data.frame",
row.names = c(NA, -5L))
Answer 0 (score: 0)
Here you go:
library(tidyr)  # provides unite(), separate_rows() and the %>% pipe
df %>%
  unite(ipi_id, ipi_id.x, ipi_id.y, sep = "_") %>%
  separate_rows(ipi_id, sep = "_")
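What does it do?
unite pastes ipi_id.x and ipi_id.y together into a single column, separated by "_", and removes the original variables ipi_id.x and ipi_id.y. Then we use tidyr's separate_rows, which does exactly what you asked for: it takes a column as input, splits its values at "_" and duplicates the row as needed.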
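As an alternative that skips the paste-and-split round trip, the same reshaping can probably be done with pivot_longer. This is only a sketch: it assumes tidyr 1.0.0 or later, and toy is a made-up stand-in for the question's df with three of its columns stored as character vectors. The helper column name source is also just illustrative.

```r
library(tidyr)

# Hypothetical stand-in for the question's df (three of its columns only)
toy <- data.frame(
  protein_id = c("ATP6", "ATP6", "COX2"),
  ipi_id.x   = c("IPI00552036", "IPI00552036", "IPI00916440"),
  ipi_id.y   = c("IPI00954924", "IPI00743734", "IPI00930721"),
  stringsAsFactors = FALSE
)

# Stack both versioned-ID columns into one ipi_id column;
# "source" records which original column each value came from
long <- pivot_longer(toy, cols = c(ipi_id.x, ipi_id.y),
                     names_to = "source", values_to = "ipi_id")

# Drop the helper column, then remove repeated rows
long$source <- NULL
long <- unique(long)
```

The final unique() mirrors the de-duplication step in the question's rbind approach. The unite/separate_rows pipeline above keeps repeated rows unless the same step is added to it.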