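R: aggregating strings over multiple columns

Time: 2014-07-18 09:47:55

Tags: r string aggregate

I am trying to aggregate several rows of a data frame, column by column. I managed to use aggregate on one column \o/ but I don't understand how to use it on several columns. Here is a sample of my data: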

Gene_Title                                                  ID_Affymetrix   GB_Acc.x    Gene_Symbol.x   Entrez  ID_Agl  GB_Acc.y    Gene_Symbol.y   Unigene     Ensembl              Chr_location
trafficking protein particle complex 4                      1429632_at      AK005276    Trappc4         60409   10239   NM_021789   Trappc4         Mm.29814    ENSMUST00000170082   chr9:44211918-44211859
aldo-keto reductase family 1, member B3 (aldose reductase)  1437133_x_at    AV127085    Akr1b3          11677   22      NM_009658   Akr1b3          Mm.451      ENSMUST00000166583   chr6:34253982-34253930
sodium channel, voltage-gated, type I, alpha                1450120_at      AV336781    Scn1a           20265   58      NM_018733   Scn1a           Mm.439704   ENSMUST00000094951   chr2:66173557-66173498
sodium channel, voltage-gated, type I, alpha                1450121_at      AV336781    Scn1a           20265   58      NM_018733   Scn1a           Mm.439704   ENSMUST00000094951   chr2:66173557-66173498
aldo-keto reductase family 1, member B3 (aldose reductase)  1456590_x_at    BB469763    Akr1b3          11677   22      NM_009658   Akr1b3          Mm.451      ENSMUST00000166583   chr6:34253982-34253930
dolichol-phosphate (beta-D) mannosyltransferase 2           1415675_at      BC008256    Dpm2            13481   33459   NM_010073   Dpm2            Mm.22001    ENSMUST00000150419   chr2:32428766-32428825
proline rich 13                                             1423686_a_at    BC016234    Prr13           66151   4       NM_025385   Prr13           Mm.393955   ENSMUST00000164688  chr15:102291090-102291149
transmembrane protein 2                                     1424711_at      BC019745    Tmem2           83921   23      NM_031997   Tmem2           Mm.329776   ENSMUST00000096194   chr19:21930192-21930251
transmembrane protein 2                                     1451458_at      BC019745    Tmem2           83921   23      NM_031997   Tmem2           Mm.329776   ENSMUST00000096194   chr19:21930192-21930251
lipase, endothelial                                         1450188_s_at    BC020991    Lipg            16891   52      NM_010720   Lipg            Mm.299647   ENSMUST00000066532  chr18:75099688-75099629
lipase, endothelial                                         1421261_at      BC020991    Lipg            16891   52      NM_010720   Lipg            Mm.299647   ENSMUST00000066532   chr18:75099688-75099629
lipase, endothelial                                         1421262_at      BC020991    Lipg            16891   52      NM_010720   Lipg            Mm.299647   ENSMUST00000066532   chr18:75099688-75099629
coatomer protein complex, subunit gamma                     1415670_at      BC024686    Copg            54161   25829   NM_017477   Copg            Mm.258785   ENSMUST00000113607   chr6:87862890-87862949
coatomer protein complex, subunit gamma                     1416017_at      BC024686    Copg            54161   25829   NM_017477   Copg            Mm.258785   ENSMUST00000113607   chr6:87862890-87862949
leucine rich repeat containing 1                            1452411_at      BG966295    Lrrc1           214345  29      NM_172528   Lrrc1           Mm.28534    ENSMUST00000049755   chr9:77278998-77278939
aldo-keto reductase family 1, member B3 (aldose reductase)  1448319_at      NM_009658   Akr1b3          11677   22      NM_009658   Akr1b3          Mm.451      ENSMUST00000166583   chr6:34253982-34253930
ATPase, H+ transporting, lysosomal V0 subunit D1            1415671_at      NM_013477   Atp6v0d1        11972   11826   NM_013477   Atp6v0d1        Mm.17708    ENSMUST00000013304   chr8:108048837-108048778
golgi autoantigen, golgin subfamily a, 7                    1415672_at      NM_020585   Golga7          57437   54944   NM_020585   Golga7          Mm.196269   ENSMUST00000121783   chr8:24351978-24351919
trafficking protein particle complex 4                      1415674_a_at    NM_021789   Trappc4         60409   10239   NM_021789   Trappc4         Mm.29814    ENSMUST00000170082   chr9:44211918-44211859  
phosphoserine phosphatase                                   1415673_at      NM_133900   Psph            100678  57142   NM_133900   Psph            Mm.271784   ENSMUST00000031399   chr5:130271500-130271441

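Some Gene_Title (and Gene_Symbol) values appear several times, but with different IDs (Affymetrix or Agilent) or with different GB_Acc values. In general, I would like only one row per gene, with the distinct values collected in the ID, GB_Acc, or other columns. Here is what I did for the Affymetrix ID: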

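# join all values of a group into one comma-separated string
f <- function(x) paste(x, collapse = ",")
# every column on the right of ~ is a grouping key
tab4 <- aggregate(ID_Affymetrix ~ GB_Acc.x + Gene_Title + GB_Acc.y + Gene_Symbol.x + Entrez + Unigene + Ensembl + Chr_location + ID_Agl, data = tab3, FUN = f)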

GB_Acc.x    Gene_Title                                                  GB_Acc.y    Gene_Symbol.x   Entrez  Unigene     Ensembl             Chr_location                ID_Agl  ID_Affymetrix
BC016234    proline rich 13                                             NM_025385   Prr13           66151   Mm.393955   ENSMUST00000164688  chr15:102291090-102291149   4       1423686_a_at
AV127085    aldo-keto reductase family 1, member B3 (aldose reductase)  NM_009658   Akr1b3          11677   Mm.451      ENSMUST00000166583  chr6:34253982-34253930      22      1437133_x_at
BB469763    aldo-keto reductase family 1, member B3 (aldose reductase)  NM_009658   Akr1b3          11677   Mm.451      ENSMUST00000166583  chr6:34253982-34253930      22      1456590_x_at
NM_009658   aldo-keto reductase family 1, member B3 (aldose reductase)  NM_009658   Akr1b3          11677   Mm.451      ENSMUST00000166583  chr6:34253982-34253930      22      1448319_at
BC019745    transmembrane protein 2                                     NM_031997   Tmem2           83921   Mm.329776   ENSMUST00000096194  chr19:21930192-21930251     23      1424711_at,1451458_at
BG966295    leucine rich repeat containing 1                            NM_172528   Lrrc1           214345  Mm.28534    ENSMUST00000049755  chr9:77278998-77278939      29      1452411_at
BC020991    lipase, endothelial                                         NM_010720   Lipg            16891   Mm.299647   ENSMUST00000066532  chr18:75099688-75099629     52      1450188_s_at,1421261_at,1421262_at
AV336781    sodium channel, voltage-gated, type I, alpha                NM_018733   Scn1a           20265   Mm.439704   ENSMUST00000094951  chr2:66173557-66173498      58      1450120_at,1450121_at
AK005276    trafficking protein particle complex 4                      NM_021789   Trappc4         60409   Mm.29814    ENSMUST00000170082  chr9:44211918-44211859      10239   1429632_at
NM_021789   trafficking protein particle complex 4                      NM_021789   Trappc4         60409   Mm.29814    ENSMUST00000170082  chr9:44211918-44211859      10239   1415674_a_at
NM_013477   ATPase, H+ transporting, lysosomal V0 subunit D1            NM_013477   Atp6v0d1        11972   Mm.17708    ENSMUST00000013304  chr8:108048837-108048778    11826   1415671_at
BC024686    coatomer protein complex, subunit gamma                     NM_017477   Copg            54161   Mm.258785   ENSMUST00000113607  chr6:87862890-87862949      25829   1415670_at,1416017_at
BC008256    dolichol-phosphate (beta-D) mannosyltransferase 2           NM_010073   Dpm2            13481   Mm.22001    ENSMUST00000150419  chr2:32428766-32428825      33459   1415675_at
NM_020585   golgi autoantigen, golgin subfamily a, 7                    NM_020585   Golga7          57437   Mm.196269   ENSMUST00000121783  chr8:24351978-24351919      54944   1415672_at
NM_133900   phosphoserine phosphatase                                   NM_133900   Psph            100678  Mm.271784   ENSMUST00000031399  chr5:130271500-130271441    57142   1415673_at

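As you can see, for Tmem2, Copg, Lipg and Scn1a I now have several ID_Affymetrix values on the same row. For these genes, that column is the only one that differs. But for Akr1b3 and Trappc4 there are also differences in the GB_Acc.x column, so they still occupy several rows.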
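A minimal sketch, on made-up toy rows rather than the real tab3, of why Akr1b3 and Trappc4 stay split: in the formula interface of aggregate, every column on the right of ~ acts as a grouping key, so rows that differ in GB_Acc.x fall into different groups. The data frame toy and its shortened column names (Gene, GB, ID) are assumptions for illustration only.

```r
# Toy data (hypothetical, modeled on two genes from the table above)
toy <- data.frame(
  Gene = c("Tmem2", "Tmem2", "Trappc4", "Trappc4"),
  GB   = c("BC019745", "BC019745", "AK005276", "NM_021789"),
  ID   = c("1424711_at", "1451458_at", "1429632_at", "1415674_a_at"),
  stringsAsFactors = FALSE
)

# Join all values of a group into one comma-separated string
f <- function(x) paste(x, collapse = ",")

# GB is a grouping key here, so Trappc4 is split into two rows
by_gene_and_gb <- aggregate(ID ~ Gene + GB, data = toy, FUN = f)

# Grouping on Gene alone gives one row per gene
by_gene <- aggregate(ID ~ Gene, data = toy, FUN = f)

print(by_gene_and_gb)
print(by_gene)
```

Dropping a column from the right-hand side removes it from the result, which is why collapsing several columns at once needs a different call than this one.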

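So, in general, I would like to aggregate every column (except Gene_Title and Gene_Symbol, which are normally always the same for a given gene) and end up with something like: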

Gene_Title                               Gene_Symbol  GB_Acc              ID_Affy                   ...
trafficking protein particle complex 4   Trappc4      AK005276,NM_021789  1429632_at,1415674_a_at   ...

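If anyone has any ideas, I'd be glad to hear them.

Thanks!

EDIT: Here is dput(head(mydata,20)). There are some errors at the end, but I don't know this function or what it is for.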

structure(list(Gene_Title = structure(c(1L, 1L, 1L, 2L, 3L, 3L, 
4L, 5L, 6L, 7L, 7L, 7L, 8L, 9L, 10L, 10L, 11L, 11L, 12L, 12L), .Label = c("aldo-keto reductase family 1, member B3 (aldose reductase)", 
"ATPase, H+ transporting, lysosomal V0 subunit D1", "coatomer protein complex, subunit gamma", 
"dolichol-phosphate (beta-D) mannosyltransferase 2", "golgi autoantigen, golgin subfamily a, 7", 
"leucine rich repeat containing 1", "lipase, endothelial", "phosphoserine phosphatase", 
"proline rich 13", "sodium channel, voltage-gated, type I, alpha", 
"trafficking protein particle complex 4", "transmembrane protein 2"
), class = "factor"), ID_Affymetrix = structure(c(13L, 14L, 20L, 
2L, 1L, 7L, 6L, 3L, 19L, 17L, 8L, 9L, 4L, 10L, 15L, 16L, 5L, 
12L, 11L, 18L), .Label = c("1415670_at", "1415671_at", "1415672_at", 
"1415673_at", "1415674_a_at", "1415675_at", "1416017_at", "1421261_at", 
"1421262_at", "1423686_a_at", "1424711_at", "1429632_at", "1437133_x_at", 
"1448319_at", "1450120_at", "1450121_at", "1450188_s_at", "1451458_at", 
"1452411_at", "1456590_x_at"), class = "factor"), GB_Acc.x = structure(c(2L, 
11L, 4L, 12L, 9L, 9L, 5L, 13L, 10L, 8L, 8L, 8L, 15L, 6L, 3L, 
3L, 14L, 1L, 7L, 7L), .Label = c("AK005276", "AV127085", "AV336781", 
"BB469763", "BC008256", "BC016234", "BC019745", "BC020991", "BC024686", 
"BG966295", "NM_009658", "NM_013477", "NM_020585", "NM_021789", 
"NM_133900"), class = "factor"), Gene_Symbol.x = structure(c(1L, 
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L, 
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg", 
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a", 
"Tmem2", "Trappc4"), class = "factor"), Entrez = c(11677L, 11677L, 
11677L, 11972L, 54161L, 54161L, 13481L, 57437L, 214345L, 16891L, 
16891L, 16891L, 100678L, 66151L, 20265L, 20265L, 60409L, 60409L, 
83921L, 83921L), ID_Agl = c(22L, 22L, 22L, 11826L, 25829L, 25829L, 
33459L, 54944L, 29L, 52L, 52L, 52L, 57142L, 4L, 58L, 58L, 10239L, 
10239L, 23L, 23L), GB_Acc.y = structure(c(1L, 1L, 1L, 4L, 5L, 
5L, 2L, 7L, 12L, 3L, 3L, 3L, 11L, 9L, 6L, 6L, 8L, 8L, 10L, 10L
), .Label = c("NM_009658", "NM_010073", "NM_010720", "NM_013477", 
"NM_017477", "NM_018733", "NM_020585", "NM_021789", "NM_025385", 
"NM_031997", "NM_133900", "NM_172528"), class = "factor"), Gene_Symbol.y = structure(c(1L, 
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L, 
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg", 
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a", 
"Tmem2", "Trappc4"), class = "factor"), Unigene = structure(c(12L, 
12L, 12L, 1L, 4L, 4L, 3L, 2L, 6L, 8L, 8L, 8L, 5L, 10L, 11L, 11L, 
7L, 7L, 9L, 9L), .Label = c("Mm.17708", "Mm.196269", "Mm.22001", 
"Mm.258785", "Mm.271784", "Mm.28534", "Mm.29814", "Mm.299647", 
"Mm.329776", "Mm.393955", "Mm.439704", "Mm.451"), class = "factor"), 
Ensembl = structure(c(11L, 11L, 11L, 1L, 7L, 7L, 9L, 8L, 
3L, 4L, 4L, 4L, 2L, 10L, 5L, 5L, 12L, 12L, 6L, 6L), .Label = c("ENSMUST00000013304", 
"ENSMUST00000031399", "ENSMUST00000049755", "ENSMUST00000066532", 
"ENSMUST00000094951", "ENSMUST00000096194", "ENSMUST00000113607", 
"ENSMUST00000121783", "ENSMUST00000150419", "ENSMUST00000164688", 
"ENSMUST00000166583", "ENSMUST00000170082"), class = "factor"), 
Chr_location = structure(c(7L, 7L, 7L, 9L, 8L, 8L, 4L, 10L, 
12L, 2L, 2L, 2L, 6L, 1L, 5L, 5L, 11L, 11L, 3L, 3L), .Label = c("chr15:102291090-102291149", 
"chr18:75099688-75099629", "chr19:21930192-21930251", "chr2:32428766-32428825", 
"chr2:66173557-66173498", "chr5:130271500-130271441", "chr6:34253982-34253930", 
"chr6:87862890-87862949", "chr8:108048837-108048778", "chr8:24351978-24351919", 
"chr9:44211918-44211859", "chr9:77278998-77278939"), class = "factor")), .Names = c("Gene_Title", 
"ID_Affymetrix", "GB_Acc.x", "Gene_Symbol.x", "Entrez", "ID_Agl", 
"GB_Acc.y", "Gene_Symbol.y", "Unigene", "Ensembl", "Chr_location"
), row.names = c(NA, 20L), class = "data.frame")
Erreur dans `?`(dput(head(tab3, 20)), dput(head(tab3, 20))) : 
c("pas de documentation de type ‘c(1, 1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9, 10, 10, 11, 11, 12, 12)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)", "pas de documentation de type ‘c(13, 14, 20, 2, 1, 7, 6, 3, 19, 17, 8, 9, 4, 10, 15, 16, 5, 12, 11, 18)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)", "pas de documentation de type ‘c(2, 11, 4, 12, 9, 9, 5, 13, 10, 8, 8, 8, 15, 6, 3, 3, 14, 1, 7, 7)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)", 
"pas de documentation de type ‘c(1, 1, 1, 2, 3, 3, 4, 5, 7, 6, 6, 6, 9, 8, 10, 10, 12, 12, 11, 11)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)", "pas de documentation de type ‘c(11677, 11677, 11677, 11972, 54161, 54161, 13481, 57437, 214345, 16891, 16891, 16891, 100678, 66151, 20265, 20265, 60409, 60409, 83921, 83921)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'a

2 answers:

Answer 0 (score: 3)

Maybe this is what you are looking for?

library(dplyr)

dfcollapsed <- df %>%               # replace df with the name of your data frame
   group_by(Gene_Title, Gene_Symbol) %>% 
   summarise_each(funs(paste(., collapse = ",")))

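I haven't tested it with your data because I couldn't copy and paste it into my session.

Update

In your data you have two columns, Gene_Symbol.x and Gene_Symbol.y, which were probably created during a merge. I assume they contain the same information, so you can adapt the code to: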

dfcollapsed <- df %>%               # replace df with the name of your data frame
   group_by(Gene_Title, Gene_Symbol.x) %>% 
   summarise_each(funs(paste(., collapse = ",")), -Gene_Symbol.y)
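The assumption that the two symbol columns carry the same information can be checked before one of them is dropped. The sketch below uses two made-up tables (affy and agl are hypothetical names, not from the question) to show how merge produces the .x and .y suffixes, and one way to compare the resulting columns:

```r
# Two hypothetical annotation tables that both carry a Gene_Symbol column
affy <- data.frame(Entrez = c(60409L, 83921L),
                   Gene_Symbol = c("Trappc4", "Tmem2"),
                   stringsAsFactors = FALSE)
agl  <- data.frame(Entrez = c(60409L, 83921L),
                   Gene_Symbol = c("Trappc4", "Tmem2"),
                   stringsAsFactors = FALSE)

# Merging on Entrez only: the shared non-key column gets .x / .y suffixes
merged <- merge(affy, agl, by = "Entrez")
print(names(merged))

# TRUE only if both symbol columns agree on every row
# (as.character() makes the comparison work for factors too)
same_symbols <- identical(as.character(merged$Gene_Symbol.x),
                          as.character(merged$Gene_Symbol.y))
print(same_symbols)
```

On the real data, the same identical() comparison on df$Gene_Symbol.x and df$Gene_Symbol.y would tell you whether dropping Gene_Symbol.y loses anything.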

Or, if you only want unique entries in each column (as in @juba's answer), you could write:

dfcollapsed <- df %>%               # replace df with the name of your data frame
   group_by(Gene_Title, Gene_Symbol.x) %>% 
   summarise_each(funs(paste(unique(.), collapse = ",")), -Gene_Symbol.y)
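Note that summarise_each() and funs() have since been deprecated in dplyr. Assuming dplyr 1.0.0 or later is installed, a roughly equivalent sketch uses across(). The toy data frame below is made up for illustration and only mirrors a few columns of the question's table:

```r
library(dplyr)

# Toy data (hypothetical), with the .x / .y symbol columns left by a merge
df <- data.frame(
  Gene_Title    = c("transmembrane protein 2", "transmembrane protein 2",
                    "trafficking protein particle complex 4",
                    "trafficking protein particle complex 4"),
  Gene_Symbol.x = c("Tmem2", "Tmem2", "Trappc4", "Trappc4"),
  Gene_Symbol.y = c("Tmem2", "Tmem2", "Trappc4", "Trappc4"),
  ID_Affymetrix = c("1424711_at", "1451458_at", "1429632_at", "1415674_a_at"),
  GB_Acc.x      = c("BC019745", "BC019745", "AK005276", "NM_021789"),
  stringsAsFactors = FALSE
)

dfcollapsed <- df %>%
  group_by(Gene_Title, Gene_Symbol.x) %>%
  # collapse the unique values of every non-grouping column except Gene_Symbol.y
  summarise(across(-Gene_Symbol.y, ~ paste(unique(.x), collapse = ",")),
            .groups = "drop")

print(dfcollapsed)
```

Inside a grouped summarise(), across() never touches the grouping columns, so only Gene_Symbol.y has to be excluded explicitly.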

Hope that helps.

Answer 1 (score: 1)

Maybe with aggregate, as follows:

f <- function(v) {paste(unique(v), collapse=", ")}
aggregate(tab3, list(tab3$Gene_Title, tab3$Gene_Symbol.x), f)
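As written, the call above also passes the two grouping columns through f, and because the list of groups is unnamed, the result gains extra Group.1 and Group.2 columns. A possible variant, sketched on made-up toy rows (the data frame toy and the helper names keys and vals are assumptions), names the grouping columns and aggregates only the remaining ones:

```r
# Toy data (hypothetical), mirroring a few columns of the question's table
toy <- data.frame(
  Gene_Title    = c("transmembrane protein 2", "transmembrane protein 2",
                    "trafficking protein particle complex 4",
                    "trafficking protein particle complex 4"),
  Gene_Symbol.x = c("Tmem2", "Tmem2", "Trappc4", "Trappc4"),
  ID_Affymetrix = c("1424711_at", "1451458_at", "1429632_at", "1415674_a_at"),
  GB_Acc.x      = c("BC019745", "BC019745", "AK005276", "NM_021789"),
  stringsAsFactors = FALSE
)

# Collapse the unique values of a group into one string
f <- function(v) paste(unique(v), collapse = ", ")

keys <- c("Gene_Title", "Gene_Symbol.x")   # grouping columns
vals <- setdiff(names(toy), keys)          # columns to collapse

# A data frame is a named list, so toy[keys] keeps the key names in the result
collapsed <- aggregate(toy[vals], by = toy[keys], FUN = f)

print(collapsed)
```

With the real data, Gene_Symbol.y could be left out of vals in the same way if it only duplicates Gene_Symbol.x.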