A more direct way to tidy a data frame with two similar columns?

Time: 2017-06-29 09:32:31

Tags: r dataframe merge duplicates multiple-columns

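I am building a reference table that makes it easy to convert between identifiers and different versions of those identifiers. I have a merged table with several columns, including two columns, ipi_id.x and ipi_id.y, that hold multiple versions of the IPI ID taken from a database. A test df is shown below: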

    protein_id    ipi_id.x numbers      ensembl_id hgnc_number hgnc_symbol entrez_id    ipi_id.y uniprot
    1       ATP6 IPI00552036       3 ENSG00000198899        7414     MT-ATP6      4508 IPI00954924  P00846
    2       ATP6 IPI00552036       3 ENSG00000198899        7414     MT-ATP6      4508 IPI00743734  P00846
    3       ATP6 IPI00552036       3 ENSG00000198899        7414     MT-ATP6      4508 IPI00654820  P00846
    4       COX2 IPI00916440       1 ENSG00000198712        7421      MT-CO2      4513 IPI00930721  P00403
    5       COX2 IPI00916440       1 ENSG00000198712        7421      MT-CO2      4513 IPI00017510  P00403

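The two columns ipi_id.x and ipi_id.y hold differently versioned identifiers for the same entry. I want them in a single column, with rows added that carry over the rest of the information, so that each ipi_id gets its own row. The resulting df would look like this: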

    protein_id    ipi_id   numbers      ensembl_id hgnc_number hgnc_symbol entrez_id   uniprot
    1       ATP6 IPI00552036       3 ENSG00000198899        7414     MT-ATP6      4508   P00846
    2       ATP6 IPI00552036       3 ENSG00000198899        7414     MT-ATP6      4508   P00846
    3       ATP6 IPI00552036       3 ENSG00000198899        7414     MT-ATP6      4508   P00846
    4       ATP6 IPI00954924       3 ENSG00000198899        7414     MT-ATP6      4508   P00846
    5       ATP6 IPI00743734       3 ENSG00000198899        7414     MT-ATP6      4508   P00846
    6       ATP6 IPI00654820       3 ENSG00000198899        7414     MT-ATP6      4508   P00846
    7       COX2 IPI00916440       1 ENSG00000198712        7421      MT-CO2      4513   P00403
    8       COX2 IPI00916440       1 ENSG00000198712        7421      MT-CO2      4513   P00403
    9       COX2 IPI00930721       1 ENSG00000198712        7421      MT-CO2      4513   P00403
    10      COX2 IPI00017510       1 ENSG00000198712        7421      MT-CO2      4513   P00403 

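I currently do this by duplicating the data frame, dropping the .x column from one copy and the .y column from the other, renaming the remaining column to ipi_id, then using rbind to put the two data frames back together and unique() to remove duplicate rows.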

    df2 <- df
    # drop ipi_id.x from the first copy, so ipi_id.y becomes column 7
    df$ipi_id.x <- NULL
    colnames(df)[7] <- "ipi_id"
    # drop ipi_id.y from the second copy; ipi_id.x stays in column 2
    df2$ipi_id.y <- NULL
    colnames(df2)[2] <- "ipi_id"
    # combine the data frames (rbind matches columns by name)
    df3 <- rbind(df, df2)
    # remove duplicate rows
    df3 <- unique(df3)

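This is clunky, and I assume there is a better way using tidyr or reshape2, but I have not found a working example, and my clunky method does work. Is there a better way? A way to do it in one line?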

Also, if my tags are poor, please let me know for future posts.

Here is a dput version of my df:

    df <- structure(list(
    protein_id = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("ATP6", "COX2"), class = "factor"), 
    ipi_id.x = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("IPI00552036", "IPI00916440"), class = "factor"), 
    numbers = c(3L, 3L, 3L, 1L, 1L), 
    ensembl_id = structure(c(2L, 2L, 2L, 1L, 1L), .Label = c("ENSG00000198712", "ENSG00000198899"), class = "factor"), 
    hgnc_number = c(7414L, 7414L, 7414L, 7421L, 7421L), hgnc_symbol = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("MT-ATP6", "MT-CO2"), class = "factor"), 
    entrez_id = c(4508L, 4508L, 4508L, 4513L, 4513L), ipi_id.y = structure(c(5L, 3L, 2L, 4L, 1L), .Label = c("IPI00017510", "IPI00654820", "IPI00743734", "IPI00930721", "IPI00954924"), class = "factor"), 
    uniprot = structure(c(2L, 2L, 2L, 1L, 1L), .Label = c("P00403", "P00846"), class = "factor")),
    .Names = c("protein_id", "ipi_id.x", "numbers", "ensembl_id", "hgnc_number", "hgnc_symbol", "entrez_id", "ipi_id.y", "uniprot"), class = "data.frame", 
    row.names = c(NA, -5L))

1 answer:

Answer 0 (score: 0)

Here you go:

    library(tidyr)  # provides unite() and separate_rows(), and re-exports %>%

    df %>%
      unite(ipi_id, ipi_id.x, ipi_id.y, sep = "_") %>%
      separate_rows(ipi_id, sep = "_")

What does it do?

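unite puts ipi_id.x and ipi_id.y into a single column, separated by "_", and removes the original variables ipi_id.x and ipi_id.y. We then use tidyr's separate_rows, which does exactly what you want: it takes a column as input, splits the values in it at "_", and duplicates the row where necessary.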
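
To make the pipeline easier to try out, here is a minimal, self-contained sketch. It rests on a few assumptions that are not in the original post: the data frame is a hypothetical four-column subset of the question's df, stored as character rather than factor; distinct() from dplyr is added to mimic the unique() step in the question; and the second variant uses pivot_longer(), which needs tidyr 1.0.0 or later and is a swapped-in alternative, not part of the answer above. Note that the unite/separate_rows route only works if "_" never occurs inside an identifier, which holds for the sample data.

```r
library(tidyr)  # unite(), separate_rows(), pivot_longer(); re-exports %>%
library(dplyr)  # distinct(), select()

# Hypothetical reduced version of the question's data (illustration only).
df <- data.frame(
  protein_id = c("ATP6", "ATP6", "ATP6", "COX2", "COX2"),
  ipi_id.x   = c("IPI00552036", "IPI00552036", "IPI00552036",
                 "IPI00916440", "IPI00916440"),
  ipi_id.y   = c("IPI00954924", "IPI00743734", "IPI00654820",
                 "IPI00930721", "IPI00017510"),
  uniprot    = c("P00846", "P00846", "P00846", "P00403", "P00403"),
  stringsAsFactors = FALSE
)

# The answer's approach, followed by distinct() to drop repeated rows.
via_unite <- df %>%
  unite(ipi_id, ipi_id.x, ipi_id.y, sep = "_") %>%
  separate_rows(ipi_id, sep = "_") %>%
  distinct()

# Alternative sketch: stack the two columns directly, then discard the
# helper column that records which source column each value came from.
via_pivot <- df %>%
  pivot_longer(c(ipi_id.x, ipi_id.y),
               names_to = "source", values_to = "ipi_id") %>%
  select(-source) %>%
  distinct()
```

Both results should contain one row per distinct protein_id/ipi_id pair, although the column order differs: unite() places ipi_id where ipi_id.x used to be, while pivot_longer() appends it at the end. The pivot_longer() variant does not depend on a separator character, so it may be the safer choice if the identifiers could ever contain "_".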