Consolidating duplicate rows in a large data frame

Time: 2017-07-06 15:45:48

Tags: r dataframe dplyr aggregate

Basically, I have a data frame df:

         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F
Pathway8    A         G           NA           NA           E
Pathway9    A         G           Z            H            F
Pathway6    A         G           Z            H            E
Pathway2    A         G           D            NA           F
Pathway5    A         G           D            NA           E
Pathway1    A         D           K            NA           F
Pathway7    A         B           C            D            F
Pathway4    A         B           C            D            E

Now I want to consolidate the rows so that the result looks like this:

newdf
      Beginning1    Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F, E
Pathway9    A         G           Z            H            F, E
Pathway2    A         G           D            NA           F, E
Pathway1    A         D           K            NA           F
Pathway4    A         B           C            D            F, E

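This is a continuation of a previous question of mine (Consolidating duplicate rows in a dataframe). The answer there works for this dataset, but on my larger dataset it does not seem to combine the values. For example, here are the first few rows of the output (after adapting the code given by @Matt Jewett, or using the explanation provided in Concatenate strings by group with dplyr):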

          Beginning1    Protein2    Protein3    Protein4    Biomarker1
Pathway1    Smoothened    Gl-1                              Osteopontin
Pathway2    Smoothened    Gl-1      BMP2                    Osteopontin
Pathway3    Smoothened    Gl-1      BMP2                    DLX5
Pathway4    Smoothened    Gl-1      BMP2                    Osteopontin

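As you can see, there are two problems. First, the Biomarker1 column does not appear to be aggregated. Second, several rows are duplicated. I have hit a wall looking for a solution, so any suggestions you can think of would be greatly appreciated!

Thank you very much for your help!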

1 answer:

Answer 0: (score: 1)

Using data.table

Simple enough:
library(data.table)

dat <- fread("Pathway Beginning1 Protein2    Protein3    Protein4    Biomarker1
             Pathway3    A         G           NA           NA           F
             Pathway8    A         G           NA           NA           E
             Pathway9    A         G           Z            H            F
             Pathway6    A         G           Z            H            E
             Pathway2    A         G           D            NA           F
             Pathway5    A         G           D            NA           E
             Pathway1    A         D           K            NA           F
             Pathway7    A         B           C            D            F
             Pathway4    A         B           C            D            E")

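# Collapse rows that share the same protein chain: keep the first Pathway
# label of each group and join that group's biomarkers into one string
dat_collapse <- dat[, .(Pathway = Pathway[1],
                        Biomarker1 = paste0(Biomarker1, collapse = ", ")),
                    by = .(Beginning1, Protein2, Protein3, Protein4)]

# Restore the original column order
setcolorder(dat_collapse, names(dat))
dat_collapse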

Result:

    Pathway Beginning1 Protein2 Protein3 Protein4 Biomarker1
1: Pathway3          A        G       NA       NA       F, E
2: Pathway9          A        G        Z        H       F, E
3: Pathway2          A        G        D       NA       F, E
4: Pathway1          A        D        K       NA          F
5: Pathway7          A        B        C        D       F, E
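
On the larger dataset, one possible reason the rows fail to merge is that cells which look identical actually differ, for example through stray whitespace or blank strings where the sample data has NA. This is only a guess, since the larger data is not shown. The sketch below uses a small made-up table named big (a hypothetical stand-in that mimics the output in the question, not the real data). It cleans the grouping columns first and then collapses with unique(), so a biomarker that occurs more than once in a group is listed only once.

```r
library(data.table)

# Hypothetical stand-in for the larger dataset (assumption, not real data):
# blank cells instead of NA, one trailing space, and one repeated row.
big <- data.table(
  Pathway    = c("Pathway1", "Pathway2", "Pathway3", "Pathway4"),
  Beginning1 = c("Smoothened", "Smoothened", "Smoothened ", "Smoothened"),
  Protein2   = c("Gl-1", "Gl-1", "Gl-1", "Gl-1"),
  Protein3   = c("", "BMP2", "BMP2", "BMP2"),
  Protein4   = c("", "", "", ""),
  Biomarker1 = c("Osteopontin", "Osteopontin", "DLX5", "Osteopontin")
)

key_cols <- c("Beginning1", "Protein2", "Protein3", "Protein4")

# Normalise the grouping columns: trim whitespace and turn blanks into NA,
# so that rows which only look the same end up in the same group
for (col in key_cols) {
  cleaned <- trimws(big[[col]])
  cleaned[!is.na(cleaned) & cleaned == ""] <- NA_character_
  set(big, j = col, value = cleaned)
}

# Same collapse as above, but unique() drops repeated biomarker values
big_collapse <- big[, .(Pathway = Pathway[1],
                        Biomarker1 = paste(unique(Biomarker1), collapse = ", ")),
                    by = key_cols]

setcolorder(big_collapse, names(big))
big_collapse
```

With this made-up input, the four rows collapse to two. If the real data still shows duplicates after such a cleanup, it may be worth checking that the Pathway column is not part of the grouping, because every Pathway label is unique and grouping on it would keep each row separate.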