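I have a very long data frame with nearly 56 columns. One of those columns has many different values, while the rest of the data varies only with the first column, ID. Here is an example: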
ID chrom left right ref_seq var_type zygosity transcript_name
0 chr1 1590327 1590328 a SNP Hom NM_033486
0 chr1 1590327 1590328 a SNP Hom NM_033487
0 chr1 1590327 1590328 a SNP Hom NM_033488
0 chr1 1590327 1590328 a SNP Hom NM_033489
0 chr1 1590327 1590328 a SNP Hom NM_033492
0 chr1 1590327 1590328 a SNP Hom NM_033493
1 chr1 1590526 1590527 g SNP Hom NM_033486
1 chr1 1590526 1590527 g SNP Hom NM_033487
1 chr1 1590526 1590527 g SNP Hom NM_033488
1 chr1 1590526 1590527 g SNP Hom NM_033489
1 chr1 1590526 1590527 g SNP Hom NM_033492
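The desired result is to concatenate the values that differ within each ID into a comma-separated string, keeping each ID only once, as shown below: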
ID chrom left right ref_seq var_type zygosity transcript_name
0 chr1 1590327 1590328 a SNP Hom NM_033486,NM_033487,NM_033488,NM_033489,NM_033492,NM_033493
1 chr1 1590526 1590527 g SNP Hom NM_033486,NM_033487,NM_033488,NM_033489,NM_033492
I have searched for similar questions, but the following solutions have not worked so far; instead, they give me a zero-row data frame.
Answer 0 (score: 4)
One way with data.table:
library(data.table)
# setDT() converts the data.frame to a data.table by reference (df itself is modified)
# 'by' lists the grouping columns; paste(..., collapse = ', ') joins the names within each group
setDT(df)[, list(transcript_name = paste(transcript_name, collapse = ', ')),
by = c('ID', 'chrom', 'left', 'right', 'ref_seq', 'var_type', 'zygosity')]
# ID chrom left right ref_seq var_type zygosity transcript_name
#1: 0 chr1 1590327 1590328 a SNP Hom NM_033486, NM_033487, NM_033488, NM_033489, NM_033492, NM_033493
#2: 1 chr1 1590526 1590527 g SNP Hom NM_033486, NM_033487, NM_033488, NM_033489, NM_033492
Data
df <- read.table(header = TRUE, text = 'ID chrom left right ref_seq var_type zygosity transcript_name
0 chr1 1590327 1590328 a SNP Hom NM_033486
0 chr1 1590327 1590328 a SNP Hom NM_033487
0 chr1 1590327 1590328 a SNP Hom NM_033488
0 chr1 1590327 1590328 a SNP Hom NM_033489
0 chr1 1590327 1590328 a SNP Hom NM_033492
0 chr1 1590327 1590328 a SNP Hom NM_033493
1 chr1 1590526 1590527 g SNP Hom NM_033486
1 chr1 1590526 1590527 g SNP Hom NM_033487
1 chr1 1590526 1590527 g SNP Hom NM_033488
1 chr1 1590526 1590527 g SNP Hom NM_033489
1 chr1 1590526 1590527 g SNP Hom NM_033492')
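With roughly 56 columns, typing every grouping column into `by` gets tedious. A minimal sketch of one way around that, assuming every column other than transcript_name is constant within an ID (as in the example), is to build the grouping vector with setdiff(). The three-column data frame and the names group_cols, dt and collapsed are stand-ins for illustration, not part of the original answer:

```r
library(data.table)

# Small stand-in for the real data (only three of the columns shown above)
df <- data.frame(
  ID = c(0, 0, 1),
  chrom = "chr1",
  transcript_name = c("NM_033486", "NM_033487", "NM_033486"),
  stringsAsFactors = FALSE
)

# Every column except the one being collapsed becomes a grouping column
group_cols <- setdiff(names(df), "transcript_name")

# as.data.table() makes a copy, so df itself is left unchanged
dt <- as.data.table(df)
collapsed <- dt[, .(transcript_name = paste(transcript_name, collapse = ",")),
                by = group_cols]
print(collapsed)
```

If any of the other columns does vary within an ID, this yields more than one row for that ID, so it is worth checking the number of rows against the number of distinct IDs.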
Answer 1 (score: 4)
Another solution using base R:
aggregate(transcript_name ~ ., data = df, FUN = paste, collapse = ",")
Thanks to @Sotos & @LyzandeR for the collapse suggestion.
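One caveat with the formula interface: aggregate() defaults to na.action = na.omit, so any row with an NA in any column used by the formula is dropped before grouping. With `~ .` that means every column. This may be one reason a wide data frame ends up with fewer rows than expected (the question does not say whether its data contain NAs, so this is only a possibility). A small sketch with made-up data, where the NA values in ref_seq are hypothetical:

```r
df <- data.frame(
  ID = c(0, 0, 1, 1),
  ref_seq = c("a", "a", NA, NA),  # hypothetical: ID 1 has a missing value
  transcript_name = c("NM_033486", "NM_033487", "NM_033486", "NM_033487"),
  stringsAsFactors = FALSE
)

# '~ .' groups by every other column; the rows for ID 1 contain an NA
# and are removed by the default na.action = na.omit
res_all <- aggregate(transcript_name ~ ., data = df, FUN = paste, collapse = ",")
print(nrow(res_all))

# Naming only complete grouping columns keeps both IDs
res_id <- aggregate(transcript_name ~ ID, data = df, FUN = paste, collapse = ",")
print(res_id)
```

Grouping by ID alone drops the other columns from the result; they can be merged back from the de-duplicated original if needed.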