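I am trying to aggregate multiple rows by column in a data frame. I managed to use aggregate for one column \o/ but I don't understand how to use it for several columns. Here is a sample of my data: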
Gene_Title ID_Affymetrix GB_Acc.x Gene_Symbol.x Entrez ID_Agl GB_Acc.y Gene_Symbol.y Unigene Ensembl Chr_location
trafficking protein particle complex 4 1429632_at AK005276 Trappc4 60409 10239 NM_021789 Trappc4 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859
aldo-keto reductase family 1, member B3 (aldose reductase) 1437133_x_at AV127085 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
sodium channel, voltage-gated, type I, alpha 1450120_at AV336781 Scn1a 20265 58 NM_018733 Scn1a Mm.439704 ENSMUST00000094951 chr2:66173557-66173498
sodium channel, voltage-gated, type I, alpha 1450121_at AV336781 Scn1a 20265 58 NM_018733 Scn1a Mm.439704 ENSMUST00000094951 chr2:66173557-66173498
aldo-keto reductase family 1, member B3 (aldose reductase) 1456590_x_at BB469763 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
dolichol-phosphate (beta-D) mannosyltransferase 2 1415675_at BC008256 Dpm2 13481 33459 NM_010073 Dpm2 Mm.22001 ENSMUST00000150419 chr2:32428766-32428825
proline rich 13 1423686_a_at BC016234 Prr13 66151 4 NM_025385 Prr13 Mm.393955 ENSMUST00000164688 chr15:102291090-102291149
transmembrane protein 2 1424711_at BC019745 Tmem2 83921 23 NM_031997 Tmem2 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251
transmembrane protein 2 1451458_at BC019745 Tmem2 83921 23 NM_031997 Tmem2 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251
lipase, endothelial 1450188_s_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
lipase, endothelial 1421261_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
lipase, endothelial 1421262_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
coatomer protein complex, subunit gamma 1415670_at BC024686 Copg 54161 25829 NM_017477 Copg Mm.258785 ENSMUST00000113607 chr6:87862890-87862949
coatomer protein complex, subunit gamma 1416017_at BC024686 Copg 54161 25829 NM_017477 Copg Mm.258785 ENSMUST00000113607 chr6:87862890-87862949
leucine rich repeat containing 1 1452411_at BG966295 Lrrc1 214345 29 NM_172528 Lrrc1 Mm.28534 ENSMUST00000049755 chr9:77278998-77278939
aldo-keto reductase family 1, member B3 (aldose reductase) 1448319_at NM_009658 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
ATPase, H+ transporting, lysosomal V0 subunit D1 1415671_at NM_013477 Atp6v0d1 11972 11826 NM_013477 Atp6v0d1 Mm.17708 ENSMUST00000013304 chr8:108048837-108048778
golgi autoantigen, golgin subfamily a, 7 1415672_at NM_020585 Golga7 57437 54944 NM_020585 Golga7 Mm.196269 ENSMUST00000121783 chr8:24351978-24351919
trafficking protein particle complex 4 1415674_a_at NM_021789 Trappc4 60409 10239 NM_021789 Trappc4 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859
phosphoserine phosphatase 1415673_at NM_133900 Psph 100678 57142 NM_133900 Psph Mm.271784 ENSMUST00000031399 chr5:130271500-130271441
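Some Gene_Title (and Gene_Symbol) values appear several times, but with different IDs (Affymetrix or Agilent) or a different GB_Acc. In general, I would like a single row per gene, with the distinct values collected in the ID, GB_Acc or other columns. Here is what I did for the Affymetrix ID: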
>f=function(x){return(paste(x,collapse=","))}
>tab4=aggregate(ID_Affymetrix ~ GB_Acc.x+ Gene_Title+GB_Acc.y+Gene_Symbol.x+Entrez+Unigene+Ensembl+Chr_location+ID_Agl,data=tab3,f)
GB_Acc.x Gene_Title GB_Acc.y Gene_Symbol.x Entrez Unigene Ensembl Chr_location ID_Agl ID_Affymetrix
BC016234 proline rich 13 NM_025385 Prr13 66151 Mm.393955 ENSMUST00000164688 chr15:102291090-102291149 4 1423686_a_at
AV127085 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1437133_x_at
BB469763 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1456590_x_at
NM_009658 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1448319_at
BC019745 transmembrane protein 2 NM_031997 Tmem2 83921 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251 23 1424711_at,1451458_at
BG966295 leucine rich repeat containing 1 NM_172528 Lrrc1 214345 Mm.28534 ENSMUST00000049755 chr9:77278998-77278939 29 1452411_at
BC020991 lipase, endothelial NM_010720 Lipg 16891 Mm.299647 ENSMUST00000066532 chr18:75099688-75099629 52 1450188_s_at,1421261_at,1421262_at
AV336781 sodium channel, voltage-gated, type I, alpha NM_018733 Scn1a 20265 Mm.439704 ENSMUST00000094951 chr2:66173557-66173498 58 1450120_at,1450121_at
AK005276 trafficking protein particle complex 4 NM_021789 Trappc4 60409 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859 10239 1429632_at
NM_021789 trafficking protein particle complex 4 NM_021789 Trappc4 60409 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859 10239 1415674_a_at
NM_013477 ATPase, H+ transporting, lysosomal V0 subunit D1 NM_013477 Atp6v0d1 11972 Mm.17708 ENSMUST00000013304 chr8:108048837-108048778 11826 1415671_at
BC024686 coatomer protein complex, subunit gamma NM_017477 Copg 54161 Mm.258785 ENSMUST00000113607 chr6:87862890-87862949 25829 1415670_at,1416017_at
BC008256 dolichol-phosphate (beta-D) mannosyltransferase 2 NM_010073 Dpm2 13481 Mm.22001 ENSMUST00000150419 chr2:32428766-32428825 33459 1415675_at
NM_020585 golgi autoantigen, golgin subfamily a, 7 NM_020585 Golga7 57437 Mm.196269 ENSMUST00000121783 chr8:24351978-24351919 54944 1415672_at
NM_133900 phosphoserine phosphatase NM_133900 Psph 100678 Mm.271784 ENSMUST00000031399 chr5:130271500-130271441 57142 1415673_at
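As you can see, for Tmem2, Copg, Lipg and Scn1a I now have several ID_Affymetrix values on the same row. For these genes, that column is the only one that differs. But for Akr1b3 and Trappc4 there are also differences in the GB_Acc.x column, so they still span several rows.
So, in the general case, I would like to aggregate every column (except Gene_Title and Gene_Symbol, which are normally always the same for a given gene) and end up with something like:
Gene_Title Gene_Symbol GB_Acc ID_Affy ...
trafficking protein particle complex 4 Trappc4 AK005276,NM_021789 1429632_at,1415674_a_at ...
If anyone has any ideas,
thanks!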
Edit: here is dput(head(mydata, 20)). There are some errors at the end, but I don't know this function or what it is for.
structure(list(Gene_Title = structure(c(1L, 1L, 1L, 2L, 3L, 3L,
4L, 5L, 6L, 7L, 7L, 7L, 8L, 9L, 10L, 10L, 11L, 11L, 12L, 12L), .Label = c("aldo-keto reductase family 1, member B3 (aldose reductase)",
"ATPase, H+ transporting, lysosomal V0 subunit D1", "coatomer protein complex, subunit gamma",
"dolichol-phosphate (beta-D) mannosyltransferase 2", "golgi autoantigen, golgin subfamily a, 7",
"leucine rich repeat containing 1", "lipase, endothelial", "phosphoserine phosphatase",
"proline rich 13", "sodium channel, voltage-gated, type I, alpha",
"trafficking protein particle complex 4", "transmembrane protein 2"
), class = "factor"), ID_Affymetrix = structure(c(13L, 14L, 20L,
2L, 1L, 7L, 6L, 3L, 19L, 17L, 8L, 9L, 4L, 10L, 15L, 16L, 5L,
12L, 11L, 18L), .Label = c("1415670_at", "1415671_at", "1415672_at",
"1415673_at", "1415674_a_at", "1415675_at", "1416017_at", "1421261_at",
"1421262_at", "1423686_a_at", "1424711_at", "1429632_at", "1437133_x_at",
"1448319_at", "1450120_at", "1450121_at", "1450188_s_at", "1451458_at",
"1452411_at", "1456590_x_at"), class = "factor"), GB_Acc.x = structure(c(2L,
11L, 4L, 12L, 9L, 9L, 5L, 13L, 10L, 8L, 8L, 8L, 15L, 6L, 3L,
3L, 14L, 1L, 7L, 7L), .Label = c("AK005276", "AV127085", "AV336781",
"BB469763", "BC008256", "BC016234", "BC019745", "BC020991", "BC024686",
"BG966295", "NM_009658", "NM_013477", "NM_020585", "NM_021789",
"NM_133900"), class = "factor"), Gene_Symbol.x = structure(c(1L,
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L,
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg",
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a",
"Tmem2", "Trappc4"), class = "factor"), Entrez = c(11677L, 11677L,
11677L, 11972L, 54161L, 54161L, 13481L, 57437L, 214345L, 16891L,
16891L, 16891L, 100678L, 66151L, 20265L, 20265L, 60409L, 60409L,
83921L, 83921L), ID_Agl = c(22L, 22L, 22L, 11826L, 25829L, 25829L,
33459L, 54944L, 29L, 52L, 52L, 52L, 57142L, 4L, 58L, 58L, 10239L,
10239L, 23L, 23L), GB_Acc.y = structure(c(1L, 1L, 1L, 4L, 5L,
5L, 2L, 7L, 12L, 3L, 3L, 3L, 11L, 9L, 6L, 6L, 8L, 8L, 10L, 10L
), .Label = c("NM_009658", "NM_010073", "NM_010720", "NM_013477",
"NM_017477", "NM_018733", "NM_020585", "NM_021789", "NM_025385",
"NM_031997", "NM_133900", "NM_172528"), class = "factor"), Gene_Symbol.y = structure(c(1L,
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L,
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg",
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a",
"Tmem2", "Trappc4"), class = "factor"), Unigene = structure(c(12L,
12L, 12L, 1L, 4L, 4L, 3L, 2L, 6L, 8L, 8L, 8L, 5L, 10L, 11L, 11L,
7L, 7L, 9L, 9L), .Label = c("Mm.17708", "Mm.196269", "Mm.22001",
"Mm.258785", "Mm.271784", "Mm.28534", "Mm.29814", "Mm.299647",
"Mm.329776", "Mm.393955", "Mm.439704", "Mm.451"), class = "factor"),
Ensembl = structure(c(11L, 11L, 11L, 1L, 7L, 7L, 9L, 8L,
3L, 4L, 4L, 4L, 2L, 10L, 5L, 5L, 12L, 12L, 6L, 6L), .Label = c("ENSMUST00000013304",
"ENSMUST00000031399", "ENSMUST00000049755", "ENSMUST00000066532",
"ENSMUST00000094951", "ENSMUST00000096194", "ENSMUST00000113607",
"ENSMUST00000121783", "ENSMUST00000150419", "ENSMUST00000164688",
"ENSMUST00000166583", "ENSMUST00000170082"), class = "factor"),
Chr_location = structure(c(7L, 7L, 7L, 9L, 8L, 8L, 4L, 10L,
12L, 2L, 2L, 2L, 6L, 1L, 5L, 5L, 11L, 11L, 3L, 3L), .Label = c("chr15:102291090-102291149",
"chr18:75099688-75099629", "chr19:21930192-21930251", "chr2:32428766-32428825",
"chr2:66173557-66173498", "chr5:130271500-130271441", "chr6:34253982-34253930",
"chr6:87862890-87862949", "chr8:108048837-108048778", "chr8:24351978-24351919",
"chr9:44211918-44211859", "chr9:77278998-77278939"), class = "factor")), .Names = c("Gene_Title",
"ID_Affymetrix", "GB_Acc.x", "Gene_Symbol.x", "Entrez", "ID_Agl",
"GB_Acc.y", "Gene_Symbol.y", "Unigene", "Ensembl", "Chr_location"
), row.names = c(NA, 20L), class = "data.frame")
Erreur dans `?`(dput(head(tab3, 20)), dput(head(tab3, 20))) :
c("pas de documentation de type ‘c(1, 1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9, 10, 10, 11, 11, 12, 12)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)", "pas de documentation de type ‘c(13, 14, 20, 2, 1, 7, 6, 3, 19, 17, 8, 9, 4, 10, 15, 16, 5, 12, 11, 18)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)", "pas de documentation de type ‘c(2, 11, 4, 12, 9, 9, 5, 13, 10, 8, 8, 8, 15, 6, 3, 3, 14, 1, 7, 7)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)",
"pas de documentation de type ‘c(1, 1, 1, 2, 3, 3, 4, 5, 7, 6, 6, 6, 9, 8, 10, 10, 12, 12, 11, 11)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)", "pas de documentation de type ‘c(11677, 11677, 11677, 11972, 54161, 54161, 13481, 57437, 214345, 16891, 16891, 16891, 100678, 66151, 20265, 20265, 60409, 60409, 83921, 83921)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'a
Answer 0 (score: 3)
Maybe this is what you are looking for?
library(dplyr)
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol) %>%
summarise_each(funs(paste(., collapse = ",")))
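I did not test it with your data because I could not copy and paste it into my session.
Your data has two columns, Gene_Symbol.x and Gene_Symbol.y, which were probably created during a merge. I assume they contain the same information, so you can adjust the code to: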
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol.x) %>%
summarise_each(funs(paste(., collapse = ",")), -Gene_Symbol.y)
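The assumption that Gene_Symbol.x and Gene_Symbol.y carry the same information can be checked before dropping one of them. The sketch below uses two small made-up tables (affy and agl, with hypothetical values) to show how merge produces the .x/.y suffixes when both inputs share a non-key column name, and how to compare the two resulting columns:

```r
# Hypothetical annotation tables that share the non-key column Gene_Symbol
affy <- data.frame(Entrez = c(16891L, 60409L),
                   Gene_Symbol = c("Lipg", "Trappc4"),
                   stringsAsFactors = FALSE)
agl  <- data.frame(Entrez = c(16891L, 60409L),
                   Gene_Symbol = c("Lipg", "Trappc4"),
                   stringsAsFactors = FALSE)

# merge() disambiguates the shared column name with the suffixes .x and .y
merged <- merge(affy, agl, by = "Entrez")
merged_names <- names(merged)

# TRUE when both columns agree on every row (as.character also handles factors)
same_symbols <- all(as.character(merged$Gene_Symbol.x) ==
                    as.character(merged$Gene_Symbol.y))
```

On the data from the question, the same comparison would be run on tab3$Gene_Symbol.x and tab3$Gene_Symbol.y; if it returns FALSE, both columns should probably be kept and collapsed like the others.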
Or, if you only want unique entries in each column (as in @juba's answer), you can write:
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol.x) %>%
summarise_each(funs(paste(unique(.), collapse = ",")), -Gene_Symbol.y)
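In dplyr 1.0.0 and later, summarise_each() and funs() are deprecated in favour of across(). The following is a sketch of the same idea in that style, not a tested replacement on the real data: the toy data frame df and its values are made up, and only the column names follow the question.

```r
library(dplyr)

# Toy data with the same column names as in the question (values shortened)
df <- data.frame(
  Gene_Title    = c("lipase, endothelial", "lipase, endothelial",
                    "trafficking protein particle complex 4",
                    "trafficking protein particle complex 4"),
  Gene_Symbol.x = c("Lipg", "Lipg", "Trappc4", "Trappc4"),
  Gene_Symbol.y = c("Lipg", "Lipg", "Trappc4", "Trappc4"),
  GB_Acc.x      = c("BC020991", "BC020991", "AK005276", "NM_021789"),
  ID_Affymetrix = c("1450188_s_at", "1421261_at", "1429632_at", "1415674_a_at"),
  stringsAsFactors = FALSE
)

dfcollapsed <- df %>%
  select(-Gene_Symbol.y) %>%                 # drop the duplicated symbol column
  group_by(Gene_Title, Gene_Symbol.x) %>%    # one output row per gene
  summarise(
    # collapse the unique values of every non-grouping column
    across(everything(), ~ paste(unique(.x), collapse = ",")),
    .groups = "drop"
  )
```

Inside a grouped summarise, everything() only covers the non-grouping columns, so Gene_Title and Gene_Symbol.x are kept as they are.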
Hope that helps.
Answer 1 (score: 1)
Perhaps the following, using aggregate:
f <- function(v) {paste(unique(v), collapse=", ")}
aggregate(tab3, list(tab3$Gene_Title, tab3$Gene_Symbol.x), f)
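To see what this does, here is a small self-contained sketch on made-up data (tab and its values are hypothetical, loosely based on the question). Unlike the formula call in the question, only the gene symbol is used for grouping, so rows that differ in GB_Acc end up in the same group and every other column is collapsed:

```r
# Toy data: Trappc4 has two different accessions, Lipg has one repeated twice
tab <- data.frame(
  Gene_Symbol   = c("Trappc4", "Trappc4", "Lipg", "Lipg"),
  GB_Acc        = c("AK005276", "NM_021789", "BC020991", "BC020991"),
  ID_Affymetrix = c("1429632_at", "1415674_a_at", "1450188_s_at", "1421261_at"),
  stringsAsFactors = FALSE
)

# Keep each distinct value once and join them with commas
collapse_unique <- function(v) paste(unique(v), collapse = ",")

# Aggregate only the non-grouping columns; naming the list names the group column
res <- aggregate(tab[c("GB_Acc", "ID_Affymetrix")],
                 by = list(Gene_Symbol = tab$Gene_Symbol),
                 FUN = collapse_unique)
```

Passing only the columns to be collapsed as the first argument avoids the extra Group.1 and Group.2 columns that appear when the whole data frame is aggregated with an unnamed list.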